AWS Big Data Blog
Discovering metadata with AWS Lake Formation: Part 1
Data lakes are an increasingly popular way to create a single repository to store and analyze both structured and unstructured data. AWS Lake Formation makes it easy for you to set up, secure, and manage data lakes. This post walks you through the creation and exploration of a data lake using Lake Formation:
- Creating the data lake
o Adding data to your data lake
o Creating catalog databases
o Adding tables from Amazon S3 to catalog databases
- Editing and adding metadata within the catalog
o Editing standard metadata
o Adding custom metadata
For this post, you need the following:
- An AWS account.
- An AWS Identity and Access Management (IAM) user with access to Amazon S3, AWS Glue, and AWS Lake Formation.
Create the data lake
In the AWS Lake Formation console, in the left navigation pane, choose Register and ingest, Data lake locations. Select a single S3 bucket to house several independent data sources in your data lake. For more information, see What is AWS Lake Formation?
Add data to your data lake
Now that you have an S3 bucket configured as a storage resource for Lake Formation, you must add data to your data lake. You can add data to your data lake’s S3 bucket storage resource using AWS SDKs, AWS CLI, the S3 console, or a Lake Formation blueprint.
With Lake Formation, you can discover and set up the ingestion of your source data. When you add a workflow that loads or updates the data lake, you can choose a blueprint or template of the type of importer to add. Lake Formation provides several blueprints on the Lake Formation console for common source data types to simplify the creation of workflows. Workflows point to your data source and target and specify the frequency that they run.
For this post, use the AWS CLI to download sample data and then upload it to your S3 storage backend. Other import methods, such as Lake Formation data importers, are outside the scope of this post.
Sample from the following two datasets provided on the Registry of Open Data on AWS:
Make two copies of the Amazon customer reviews dataset in your data lake. You can use these to simulate “production” and “test” datasets and learn how to target one or both when searching your metadata catalog.
To demonstrate the flexibility of an AWS data lake, add both CSV and Parquet datasets to your data lake. In both cases, use the following naming convention for your S3 objects:
Add Amazon customer reviews to your data lake
AWS hosts a registry to help people share and discover a variety of datasets. For this post, copy a subset of the Amazon customer reviews dataset into your data lake. You don’t have to copy the complete reviews dataset, only the smaller 226-MB portion of watch reviews. You need two copies of this data in your data lake to simulate separate “production” and “test” databases.
- If you have not already, install and configure the AWS CLI with IAM user access keys that include permission to read from S3 and write to your Lake Formation S3 bucket.
- Copy the source to your data lake:
- In the S3 console, confirm that your S3 bucket now contains your two Amazon reviews datasets.
- Inspect the contents of the folders. The datasets are in Parquet format.
Add New York taxi ride history to your data lake
Much as you did with the Amazon customer reviews dataset, copy a small subset of New York taxi ride history from the Registry of Open Data on AWS into your data lake:
- Copy the source data to your data lake:
- In the S3 console, validate that your S3 bucket contains CSV data for NY taxi trips.
Create catalog databases
You have created an S3 bucket to act as your data lake storage backend and added data to the bucket. However, this data is not readily available in Lake Formation until you catalog the data.
Lake Formation maintains a Hive-compatible data catalog of data within your data lake. Before you can catalog data within your S3 storage backend or use Lake Formation data importers (discussed later) to push data to S3, you must first create a database.
A Lake Formation database is a logical construct to which you later add tables. Each table contains a mapping to one or more objects in S3 that, collectively, represent that table. Tables also contain basic metadata including but not limited to file format, S3 location, column headings, and column types. Lake Formation users can also optionally define arbitrary key-value pairs for tables and columns to better describe the data and act as query-able attributes for data discovery.
You can create one or more databases and populate their tables either manually in the console, programmatically using the AWS SDKs or AWS CLI, or automatically by defining AWS Glue crawlers.
For this post, you must define three logical databases:
Then, use the cataloging process to map to the two datasets that you previously uploaded to your S3 storage backend. Remember, you intentionally created two copies of the Amazon reviews dataset to simulate both a production and test database in your data lake.
Now, create your databases. First, configure IAM users and roles as administrators within Lake Formation.
Catalog permissions are permissions that the selected IAM principal can use directly. Grantable permissions are those that the IAM principal can grant to other IAM principals later.
For example, you might want to give your database administrator (DBA) the ability to create databases, by granting permissions to the catalog. However, you can prevent the DBA from accidentally giving this access to your developers by not enabling the grantable permission.
Now that you’ve granted necessary permissions, you can proceed to create your database within the catalog.
- For Name, enter
- For Location, enter
- For Description, enter a brief, meaningful description.
- Keep the Default permissions for newly created tables set as is. You can learn more about restricting permissions to your database in Part 2 of this post series.
Repeat the process for the other two databases:
After completing these steps, you should have three databases in your catalog:
Add tables from S3 to your catalog databases
In the previous section, you created three databases in your Lake Formation catalog. However, these catalog databases are empty and do not yet provide information about the specific tables, schema, file formats, or object paths in S3. To add this information, use one of the following two methods:
- Manually define your tables in the catalog using the console, SDKs, or AWS CLI.
- Use an AWS Glue crawler to search S3 and automatically add discovered tables to your catalog.
For this post, create and manually run one AWS Glue crawler for each of your three datasets in S3 and databases in the Lake Formation data catalog. A detailed walkthrough is outside the scope of this post. For guidance, see Working with Crawlers on the AWS Glue Console.
As you proceed, please bear the following in mind:
- Create one crawler for each of your three datasets. You should be able to accept most of the default crawler settings. However, the S3 path for your crawlers should read:
Before you run the crawlers to populate your catalog, you must assign them an IAM role. The role grants them permission to read from your data lake’s S3 bucket, write crawler logs to Amazon CloudWatch, and update your data catalog. Regardless of whether you create a new role, or use an existing role, make a note of the IAM role name. You need this information for the next step.
In addition to permissions defined within IAM, you must also explicitly grant IAM principals (roles or users) the ability to modify your Data Catalog from within Lake Formation itself. Conceptually, this is similar to the concept of bucket policies in S3 used with IAM. In the Lake Formation console, under Permissions, choose Data permissions.
Grant your AWS Glue crawlers the ability to modify your Data Catalog. Configure the following fields:
- For IAM users and roles, select the IAM roles that you previously used for your AWS Glue crawlers.
- For Database, select the amazon-reviews-prod, amazon-reviews-test, and ny-taxi databases.
- For Database permissions, select all permissions.
- Leave all Grantable permissions unselected.
After your AWS Glue crawlers have permission to modify your Lake Formation data catalog, return to the AWS Glue console and manually run your three crawlers. After a few minutes, the crawlers should complete their runs. Each should add one table to your data catalog:
Verify that your catalog was updated. In the Lake Formation console, under Data catalog, choose Tables, and view the three new tables added to the corresponding data lake databases, as shown in the following screenshot.
Edit and add metadata within the catalog
The AWS Glue crawlers populate standard metadata about the tables they discover in S3, including (but not limited to) attributes such as object location, file format, column headings, and column types.
However, you can manually edit standard metadata or add additional custom metadata to the catalog to make it easier to search and improve the overall value that it provides. In the following section, I walk through several examples of editing and adding to metadata.
Edit standard metadata
The AWS Glue crawlers infer the name of columns from the first line of CSV file. To view the auto-populated column names for the ny_taxi table, look at the table properties:
- Under Data catalog, choose Tables.
- Select ny_taxi and scroll down to the Schema section.
- Choose Edit Schema. Your data columns’ names must consistently use snake casing, which means using the ‘_’ character between words. Change all of the id columns to match the rest of the columns. If you look at the first row of the raw data, you notice that there is inconsistent naming used. Rather than changing those files, you can manually change the metadata.
- Select the
vendoridrow and choose Edit. Make your changes to include the snake casing and choose Save.
- Repeat the following steps for
pulocationid. After you make these changes, choose Save as new version.
- Under Data catalog, choose Tables. If you search for
pulocationid, no results should return.
- Search for the new column name,
pu_location_id. This search should return the expected result, the trip_data table from the ny-taxi database.
Add custom metadata
Now, try adding a couple of custom table properties to help organize your tables. The first table property to add is an environment variable to help you to determine whether a table is for development, testing, or production. The second table property to add is a department variable, which allows you to group tables by a department.
- In the Lake Formation console, under Data catalog, choose Databases.
- Select the
ny-taxidatabase and choose View tables.
- Select the
trip_datatable and choose Edit Table.
- Under Table properties, choose Add. Set the value of environment to dev and the value of department to research1. Choose Save.
- Under Data catalog, choose Tables. In the search bar, type “research,” and press Enter. No results return because there isn’t a table with the table property value of research. However, searching for research1 should return the trip_data table.
- Go back to your table properties for trip-data and update the department property from research1 to research. After you’ve made the edit, the trip-data table appears when entering “research” as a keyword in the table search:
Congratulations: You have successfully created and edited your first data lake using the Lake Formation. You used the service to secure and ingest data into an S3 data lake, catalog the data, and customize the metadata of the data sources. In part 2 of this series, I show you how to discover your data by using the metadata search capabilities of Lake Formation.
About the Authors
Julia Soscia is a solutions architect at Amazon Web Services based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Big Data and Analytics.
Eric Weinberg is a systems development engineer on the AWS Envision Engineering team. He has 15 years of experience building and designing software applications.
Francesco Marelli is a senior solutions architect at Amazon Web Services. He has more than twenty years experience in Analytics and Data Management.
Mat Werber is a solutions architect on the AWS Community SA Team. He is responsible for providing architectural guidance across the full AWS stack with a focus on Serverless, Redshift, DynamoDB, and RDS. He also has an audit background in IT governance, risk, and controls.