AWS Big Data Blog
Getting started with AWS Lake Formation
AWS Lake Formation enables you to set up a secure data lake. A data lake is a centralized, curated, and secured repository storing all your structured and unstructured data, at any scale. You can store your data as-is, without first having to structure it. And you can run different types of analytics to better guide decision-making—from dashboards and visualizations to big data processing, real-time analytics, and machine learning.
The challenges of data lakes
The main challenge to data lake administration stems from the storage of raw data without content oversight. To make the data in your lake usable, you need defined mechanisms for cataloging and securing that data.
Lake Formation provides the mechanisms to implement governance, semantic consistency, and access controls over your data lake. Lake Formation makes your data more usable for analytics and machine learning, providing better value to your business.
Lake Formation allows you to control access to your data lake and audit who accesses the data. The AWS Glue Data Catalog integrates these data access policies, ensuring compliance regardless of the data's origin.
Walkthrough
In this walkthrough, I show you how to build and use a data lake:
- Create a data lake administrator.
- Register an Amazon S3 path.
- Create a database.
- Grant permissions.
- Crawl the data with AWS Glue to create the metadata and table.
- Grant access to the table data.
- Query the data using Amazon Athena.
- Add a new user with restricted access and verify the results.
Prerequisites
You need the following resources for this walkthrough:
- An AWS account
- An IAM user with the AWSLakeFormationDataAdmin policy. For more information, see IAM Access Policies.
- An S3 bucket named datalake-yourname-region, in the US East (N. Virginia) Region
- A folder named zipcode within your new S3 bucket
You must also download the sample dataset. For this walkthrough, I use a table of City of New York statistics. The data is available on the DATA.GOV site, in the City of New York Demographics Statistics by Zip table. Upload the file to the /zipcode folder in your S3 bucket.
You have set up the S3 bucket and put the dataset in place. Now, set up your data lake with Lake Formation.
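If you prefer to script this prerequisite, the following is a minimal boto3 sketch (not part of the original walkthrough); the bucket name, Region, and local file name are placeholders for the values you chose above.

```python
import boto3

# Placeholders: substitute the bucket you created and your downloaded file name.
BUCKET = "datalake-yourname-region"
LOCAL_FILE = "Demographic_Statistics_By_Zip_Code.csv"  # assumed local file name

s3 = boto3.client("s3", region_name="us-east-1")

# Upload the dataset into the zipcode/ folder of the bucket.
s3.upload_file(LOCAL_FILE, BUCKET, f"zipcode/{LOCAL_FILE}")
```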
Step 1: Create a data lake administrator
First, designate yourself as a data lake administrator, which allows you to access any Lake Formation resource.
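If you want to do this programmatically instead of in the console, here is a minimal boto3 sketch; the account ID and user name are placeholders for your own values.

```python
import boto3

lf = boto3.client("lakeformation")

# Placeholder ARN for the IAM user you are designating as data lake administrator.
ADMIN_ARN = "arn:aws:iam::111122223333:user/datalakeadmin"

# This call replaces the current settings, so include every admin you want to keep.
lf.put_data_lake_settings(
    DataLakeSettings={"DataLakeAdmins": [{"DataLakePrincipalIdentifier": ADMIN_ARN}]}
)
```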
Step 2: Register an Amazon S3 path
Next, register an Amazon S3 path to contain your data in the data lake.
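A corresponding boto3 sketch, assuming the placeholder bucket name from the prerequisites:

```python
import boto3

lf = boto3.client("lakeformation")

# Registering with the service-linked role lets Lake Formation manage access
# to this S3 location on your behalf.
lf.register_resource(
    ResourceArn="arn:aws:s3:::datalake-yourname-region",
    UseServiceLinkedRole=True,
)
```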
Step 3: Create a database
Next, create a database in the AWS Glue Data Catalog to contain the zipcode table definitions, using the following settings (a scripted equivalent follows the list):
- For Database, enter zipcode-db.
- For Location, enter s3://datalake-yourname-region/zipcode.
- For New tables in this database, do not select Grant All to Everyone.
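Here is the same database creation as a boto3 sketch; the LocationUri uses the placeholder bucket name from the prerequisites.

```python
import boto3

glue = boto3.client("glue")

# Create the Glue Data Catalog database that will hold the zipcode table.
glue.create_database(
    DatabaseInput={
        "Name": "zipcode-db",
        "LocationUri": "s3://datalake-yourname-region/zipcode",
    }
)
```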
Step 4: Grant permissions
Next, grant permissions for AWS Glue to use the zipcode-db database. For IAM role, select your user and AWSGlueServiceRoleDefault.
Then grant your user and AWSServiceRoleForLakeFormationDataAccess permissions to use your data lake through a data location (both kinds of grant are also sketched after the list):
- For IAM role, choose your user and AWSServiceRoleForLakeFormationDataAccess.
- For Storage locations, enter s3://datalake-yourname-region.
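Both grants can also be made through the API. The sketch below uses placeholder ARNs and account ID; repeat each call for every principal named in this step.

```python
import boto3

lf = boto3.client("lakeformation")
ACCOUNT_ID = "111122223333"  # placeholder AWS account ID

# Let the crawler's role (and your user) create and manage tables in zipcode-db.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": f"arn:aws:iam::{ACCOUNT_ID}:role/AWSGlueServiceRoleDefault"
    },
    Resource={"Database": {"Name": "zipcode-db"}},
    Permissions=["CREATE_TABLE", "ALTER", "DROP"],
)

# Grant data location access on the registered S3 path.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": f"arn:aws:iam::{ACCOUNT_ID}:user/datalakeadmin"
    },
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::datalake-yourname-region"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)
```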
Step 5: Crawl the data with AWS Glue to create the metadata and table
In this step, a crawler connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in your AWS Glue Data Catalog.
Create a table using an AWS Glue crawler. Use the following configuration settings:
- Crawler name: zipcodecrawler.
- Data stores: Select this field.
- Choose a data store: Select S3.
- Specified path: Select this field.
- Include path: s3://datalake-yourname-region/zipcode.
- Add another data store: Choose No.
- Choose an existing IAM role: Select this field.
- IAM role: Select AWSGlueServiceRoleDefault.
- Run on demand: Select this field.
- Database: Select zipcode-db.
Choose Run it now?, and wait for the crawler to finish before moving to the next step.
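The same crawler can be created and started with boto3, assuming the placeholder bucket name and the role used above:

```python
import boto3

glue = boto3.client("glue")

# Define the crawler with the settings from the list above.
glue.create_crawler(
    Name="zipcodecrawler",
    Role="AWSGlueServiceRoleDefault",
    DatabaseName="zipcode-db",
    Targets={"S3Targets": [{"Path": "s3://datalake-yourname-region/zipcode"}]},
)

# Equivalent to choosing "Run it now?" in the console.
glue.start_crawler(Name="zipcodecrawler")
```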
Step 6: Grant access to the table data
Set up your AWS Glue Data Catalog permissions to allow others to manage the data. Use the Lake Formation console to grant and revoke access to tables in the database (the equivalent API call is sketched after these steps).
- In the navigation pane, choose Tables.
- Choose Grant.
- Provide the following information:
- For IAM role, select your user and AWSGlueServiceRoleDefault.
- For Table permissions, choose Select all.
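The console grant corresponds to a call like the following; the principal ARN is a placeholder, and ALL stands in for selecting every table permission.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant full table permissions on zipcode to the Glue service role (placeholder ARN).
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AWSGlueServiceRoleDefault"
    },
    Resource={"Table": {"DatabaseName": "zipcode-db", "Name": "zipcode"}},
    Permissions=["ALL"],  # corresponds to "Select all" in the console
)
```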
Step 7: Query the data with Athena
Next, query the data in the data lake using Athena.
- In the Athena console, choose Query Editor and select the zipcode-db database.
- Choose Tables and select the zipcode table.
- Choose Table Options (three vertical dots to the right of the table name).
- Select Preview table.
Athena issues the following query:
SELECT * FROM "zipcode-db"."zipcode" limit 10;
As you can see from the following screenshot, the datalakeadmin user can see all of the data.
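You can also run the same preview query through the Athena API; the output location below is a placeholder path that Athena needs for query results.

```python
import boto3

athena = boto3.client("athena")

# Run the preview query; Athena writes results to the placeholder output location.
response = athena.start_query_execution(
    QueryString='SELECT * FROM "zipcode-db"."zipcode" limit 10;',
    ResultConfiguration={
        "OutputLocation": "s3://datalake-yourname-region/athena-results/"
    },
)
print(response["QueryExecutionId"])
```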
Step 8: Add a new user with restricted access and verify the results
This step shows how you, as the data lake administrator, can set up a user with restricted access to specific columns.
In the IAM console, create an IAM user called user1 with administrative rights, and attach the AWSLakeFormationDataAdmin policy. For more information, see Adding and Removing IAM Identity Permissions.
In the Lake Formation console, grant permissions to user1 with the following configuration settings (an equivalent grant is sketched after the list):
- Database: Select zipcode-db.
- Table: Select zipcode.
- Columns: Choose Include columns.
- Include columns: Choose Jurisdiction name and Count participants.
- Table permissions: Choose Select.
- Grantable permissions: Choose Select.
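The same column-restricted grant looks like this through the API; the user ARN is a placeholder, and the column names assume the lowercase, underscored names the crawler typically generates from the CSV headers.

```python
import boto3

lf = boto3.client("lakeformation")

# Grant user1 SELECT on only two columns of the zipcode table.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/user1"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "zipcode-db",
            "Name": "zipcode",
            # Assumed crawler-generated column names.
            "ColumnNames": ["jurisdiction_name", "count_participants"],
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=["SELECT"],
)
```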
To verify the results of the restricted permissions, repeat step 7 while logged in as user1. As the following screenshot shows, user1 can see only the columns that the datalakeadmin user granted them permission to view.
Conclusion
This post showed you how to build a secure data lake using Lake Formation. It provides the mechanisms to implement governance, semantic consistency, and access controls, making your data more usable for analytics and machine learning.
About the Author
Gordon Heinrich is a solutions architect working with global system integrators. He works with AWS partners and customers to provide architectural guidance on building data lakes and using AWS machine learning services. In his spare time, he enjoys spending time with his family, skiing, hiking, and mountain biking in Colorado.