AWS Big Data Blog
Discover metadata with AWS Lake Formation: Part 2
January 2024: This post was reviewed and marked as outdated. The solution presented in this post no longer works and has been replaced with the latest features from AWS Lake Formation.
Data lakes are an increasingly popular way to aggregate, store, and analyze both structured and unstructured data. AWS Lake Formation makes it easy for you to set up, secure, and manage your data lakes.
In Part 1 of this post series, you learned how to create and explore a data lake using Lake Formation. This post walks you through data discovery using the metadata search capabilities of Lake Formation in the console, and metadata search results restricted by column permissions.
Prerequisites
For this post, you need the following:
- An AWS account.
- An AWS Identity and Access Management (IAM) user with access to Amazon S3, AWS Glue, and AWS Lake Formation.
Metadata search in the console
In this post, we demonstrate the catalog search capabilities offered by the Lake Formation console:
- Search by classification
- Search by keyword
- Search by tag: attribute
- Multiple filter searches
Search by classification
Using the metadata catalog search capabilities, search across all three tables within your data lake. Two share the name amazon_reviews but belong separately to your simulated “prod” and “test” databases, and the third is trip_data.
- In the Lake Formation console, under Data catalog, choose Tables.
- In the search bar, under Resource Attributes, choose Classification, type CSV, and press Enter. You should see only the trip_data table, which you formatted as CSV in your data lake. The amazon_reviews tables do not appear because they are in Parquet format.
- In the Name column, choose trip_data. Under Table details, you can see that the classification CSV is correctly identified by the metadata search filter.
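The classification search above can also be approximated outside the console. The following is a minimal sketch, assuming the AWS Glue SearchTables API (boto3 `search_tables`) is an acceptable stand-in for the console search bar. The filter key `classification` and the helper names are assumptions made for illustration, not something this post confirms.

```python
def build_classification_search(classification, max_results=50):
    """Build keyword arguments for glue.search_tables() that filter on classification."""
    return {
        # Assumption: "classification" is accepted as a filter key by SearchTables.
        "Filters": [{"Key": "classification", "Value": classification}],
        "MaxResults": max_results,
    }


def search_table_names(request, region_name=None):
    """Run the search and return (database, table) pairs. Requires AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    glue = boto3.client("glue", region_name=region_name)
    names = []
    while True:
        page = glue.search_tables(**request)
        names += [(t["DatabaseName"], t["Name"]) for t in page.get("TableList", [])]
        token = page.get("NextToken")
        if not token:
            return names
        # Ask for the next page with the same filters.
        request = {**request, "NextToken": token}
```

With credentials configured, `search_table_names(build_classification_search("csv"))` would be expected to list only tables classified as CSV, subject to the caller's Lake Formation permissions.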
Search by keyword
Next, search across your entire data lake filtering metadata by keyword.
- To refresh the list of tables, under Data catalog, choose Tables again.
- In the search bar, type star_rating and press Enter. Now that you have applied the filter, you should see only the amazon_reviews tables because they both contain a column named star_rating.
- By choosing either of the two tables, you can scroll down to the Schema section, and confirm that they contain a star_rating column.
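A keyword search and the schema check that follows it could be sketched as below. This assumes the `SearchText` parameter of the Glue SearchTables API behaves like the console's free-text search, which this post does not confirm. The function names are hypothetical. The schema check relies only on the `StorageDescriptor.Columns` list that Glue returns for a table.

```python
def build_keyword_search(keyword, max_results=50):
    """Build keyword arguments for glue.search_tables() using free-text search."""
    return {"SearchText": keyword, "MaxResults": max_results}


def has_column(table, column_name):
    """Return True if a Glue table description lists the given column."""
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    return any(col.get("Name") == column_name for col in columns)


def confirm_keyword_hits(tables, column_name):
    """Keep only the names of tables whose schema really contains the column."""
    return [t["Name"] for t in tables if has_column(t, column_name)]
```

With credentials configured, the `TableList` returned by `boto3.client("glue").search_tables(**build_keyword_search("star_rating"))` can be passed to `confirm_keyword_hits` to mirror the manual check in the Schema section.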
Search by tag: attribute
Next, search across your data lake and filter results by metadata tags and their attribute value.
- To refresh the list of tables, under Data catalog, choose Tables.
- In the search bar, type department:research and press Enter. Now that you have applied the filter, you should see only the trip_data table because this is the only table containing the value of ‘research’ in the table property of ‘department’.
- Select the trip_data table. Under Table details, you can see the tag: attribute of department | research listed under Table properties.
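The tag: attribute match can be illustrated with a small local check. This sketch assumes that the table properties shown in the console correspond to the `Parameters` map of a Glue table description, as returned by `get_table`. The parsing helper is hypothetical and only imitates the `key:value` syntax typed into the search bar.

```python
def parse_tag_attribute(query):
    """Split a 'key:value' search string into its two parts."""
    key, sep, value = query.partition(":")
    key, value = key.strip(), value.strip()
    if not sep or not key or not value:
        raise ValueError(f"expected 'key:value', got {query!r}")
    return key, value


def matches_property(table, query):
    """Return True if the table's Parameters map holds the requested key and value."""
    key, value = parse_tag_attribute(query)
    return table.get("Parameters", {}).get(key) == value


def fetch_table(database_name, table_name, region_name=None):
    """Fetch one table description from the Data Catalog. Requires AWS credentials."""
    import boto3  # imported lazily so the helpers above work without boto3 installed

    glue = boto3.client("glue", region_name=region_name)
    return glue.get_table(DatabaseName=database_name, Name=table_name)["Table"]
```

For example, `matches_property(fetch_table("ny-taxi", "trip_data"), "department:research")` would be expected to return True if the property is set as described above.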
Multiple filter searches
Finally, try searching across your entire data lake using multiple filters at one time.
- To refresh the list of tables, under Data catalog, choose Tables.
- In the search bar, choose Location, type S3, and press Enter. For this post, all of your catalog tables are in S3, so all three tables display.
- In the search bar, choose Classification, type parquet, and press Enter. You should see only the amazon_reviews tables because they are the only tables stored in S3 in Parquet format.
- Choose either of the displayed amazon_reviews tables. Under Table details, you can see that the following is true:
- Location: S3
- Classification: parquet
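Combining the two filters amounts to a logical AND over table metadata. The sketch below applies that logic locally to table descriptions fetched with the Glue `get_tables` paginator. It assumes that the location lives in `StorageDescriptor.Location` and the classification in `Parameters["classification"]`. The function names and the `s3://` prefix test are illustrative choices, not the console's implementation.

```python
def is_match(table, location_prefix, classification):
    """Return True if the table satisfies both the location and classification filters."""
    location = table.get("StorageDescriptor", {}).get("Location", "")
    table_class = table.get("Parameters", {}).get("classification", "")
    return (
        location.lower().startswith(location_prefix.lower())
        and table_class.lower() == classification.lower()
    )


def filter_tables(tables, location_prefix="s3://", classification="parquet"):
    """Return the names of tables that pass every filter."""
    return [t["Name"] for t in tables if is_match(t, location_prefix, classification)]


def list_tables(database_name, region_name=None):
    """Fetch all table descriptions in one database. Requires AWS credentials."""
    import boto3  # imported lazily so the filters above work without boto3 installed

    glue = boto3.client("glue", region_name=region_name)
    tables = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database_name):
        tables.extend(page["TableList"])
    return tables
```

Applying one filter at a time, as in the console steps, gives the same result as calling `filter_tables` once, because each added filter only narrows the previous result set.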
Metadata search results restricted by column permissions
The metadata search capabilities return results based on the permissions specified within Lake Formation. If a user or a role does not have permission to a particular database, table, or column, that element doesn’t appear in that user’s search results.
To demonstrate this, first create an IAM user, dataResearcher, with AWS Management Console access. Make sure to store the password somewhere safe. Next, create an IAM group with the AWSLakeFormationDataAdmin policy attached. Finally, add the dataResearcher user to this new group. You use this user later to sign in to your account with the group’s limited access.
In Part 1 of this series, you allowed Everyone to view the tables that the AWS Glue crawlers created. Now, revoke those permissions for the ny-taxi database.
- In the Lake Formation console, under Permissions, choose Data permissions.
- Scroll down or search until you see the Everyone record for the trip_data table.
- Select the record and choose Revoke, and then choose Revoke again to confirm.
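For reference, the same revocation might look roughly like this with the Lake Formation API. This is a sketch under two assumptions that the post does not state: that the console's Everyone entry corresponds to the `IAM_ALLOWED_PRINCIPALS` principal identifier, and that the permission held on the table is `ALL`. The function names are hypothetical.

```python
# Assumption: the console's "Everyone" record maps to this principal identifier.
EVERYONE = "IAM_ALLOWED_PRINCIPALS"


def build_revoke_everyone(database, table, permissions=("ALL",)):
    """Build keyword arguments for lakeformation.revoke_permissions()."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": EVERYONE},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        # Assumption: ALL is the permission currently granted on the table.
        "Permissions": list(permissions),
    }


def apply_revoke(request, region_name=None):
    """Send the revoke request. Requires AWS credentials with data lake admin rights."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    boto3.client("lakeformation", region_name=region_name).revoke_permissions(**request)
```

A call such as `apply_revoke(build_revoke_everyone("ny-taxi", "trip_data"))` would then correspond to the console steps above.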
Now, your dataResearcher IAM user cannot see the ny-taxi database or the trip_data table. Resolve this issue by setting up Lake Formation permissions.
- Under Permissions, choose Data permissions, Grant.
- Select the dataResearcher user, the ny-taxi database, and the trip_data table.
- Under Table permissions, check Select and choose Grant.
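The grant in the preceding steps could be expressed through the Lake Formation `grant_permissions` call, as in the following sketch. The account ID is a placeholder that you would supply, and the helper names are hypothetical. The request shape (a principal identified by IAM user ARN, a `Table` resource, and the `SELECT` permission) follows the boto3 Lake Formation client.

```python
def iam_user_arn(account_id, user_name):
    """Build the ARN of an IAM user in the standard aws partition."""
    return f"arn:aws:iam::{account_id}:user/{user_name}"


def build_table_select_grant(account_id, user_name, database, table):
    """Build keyword arguments for lakeformation.grant_permissions()."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": iam_user_arn(account_id, user_name)},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }


def apply_grant(request, region_name=None):
    """Send the grant request. Requires AWS credentials with data lake admin rights."""
    import boto3  # imported lazily so the builders above work without boto3 installed

    boto3.client("lakeformation", region_name=region_name).grant_permissions(**request)
```

Building the request separately from sending it makes it easy to review exactly which principal, table, and permission are involved before anything changes in the account.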
- Log out of the console and sign back in using the dataResearcher IAM user that you created earlier.
- In the Lake Formation console, choose Tables, select the trip_data table, and look at its properties.
The dataResearcher user currently has visibility across all of the table’s columns. However, you don’t want to allow this user to see the pickup or drop-off locations, as those are potential privacy risks. Remove these columns from the dataResearcher user’s permissions.
- Log out of the dataResearcher user and log back in with your administrative account.
- In the Lake Formation console, under Permissions, choose Data permissions.
- Select the dataResearcher record and choose Revoke.
- On the Revoke page, under Column, choose Include Columns and then choose the vendor_id, passenger_count, trip_distance, and total_amount columns.
- Under Table permissions, check Select. These settings revoke all permissions of the dataResearcher user to the trip_data table except those selected in the window. In other words, the dataResearcher user can only Select (view) the four selected columns.
- Choose Revoke.
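One way to reach the same end state through the API is to revoke the table-wide SELECT and then grant SELECT on only the four columns, using a `TableWithColumns` resource. This is a sketch of an equivalent outcome, not a transcription of what the console does internally. The function name is hypothetical, and the principal ARN is a parameter that you would supply.

```python
VISIBLE_COLUMNS = ["vendor_id", "passenger_count", "trip_distance", "total_amount"]


def build_column_restriction(principal_arn, database, table, columns):
    """Return (revoke_request, grant_request) that narrow SELECT to the given columns."""
    principal = {"DataLakePrincipalIdentifier": principal_arn}
    # Step 1: remove the table-wide SELECT granted earlier.
    revoke_request = {
        "Principal": principal,
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": ["SELECT"],
    }
    # Step 2: grant SELECT again, this time only on the listed columns.
    grant_request = {
        "Principal": principal,
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }
    return revoke_request, grant_request


def apply_column_restriction(revoke_request, grant_request, region_name=None):
    """Send both requests in order. Requires AWS credentials with data lake admin rights."""
    import boto3  # imported lazily so the builder above works without boto3 installed

    client = boto3.client("lakeformation", region_name=region_name)
    client.revoke_permissions(**revoke_request)
    client.grant_permissions(**grant_request)
```

Columns that are not listed, such as the pickup and drop-off location columns, are then outside the user's SELECT permission and should no longer appear in that user's search results.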
- Log back in as the dataResearcher user.
- In the Lake Formation console, choose Data catalog, Tables. Search for vendor_id and press Enter. The trip_data table appears in the search results.
- Search for pu_location_id. This returns no results because you revoked permissions to this column.
Conclusion
Congratulations: You have learned how to use the metadata search capabilities of Lake Formation. By defining specific user permissions, Lake Formation allowed you to grant and revoke access to metadata in the Data Catalog as well as the underlying data stored in S3. Therefore, you can discover your data sources across your entire AWS environment using a single pane of glass. To learn more, see AWS Lake Formation.
About the Authors
Julia Soscia is a solutions architect at Amazon Web Services based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Big Data and Analytics.
Eric Weinberg is a systems development engineer on the AWS Envision Engineering team. He has 15 years of experience building and designing software applications.
Francesco Marelli is a senior solutions architect at Amazon Web Services. He has more than twenty years of experience in Analytics and Data Management.
Mat Werber is a solutions architect on the AWS Community SA Team. He is responsible for providing architectural guidance across the full AWS stack with a focus on Serverless, Redshift, DynamoDB, and RDS. He also has an audit background in IT governance, risk, and controls.
Audit History
Last reviewed and updated in January 2024 by Francesco Marelli | Principal Solutions Architect