AWS Big Data Blog
Crawl Delta Lake tables using AWS Glue crawlers
June 2023: This post was reviewed and updated for accuracy.
In recent evolution in data lake technologies, it became popular to bring ACID (atomicity, consistency, isolation, and durability) transactions on Amazon Simple Storage Service (Amazon S3). You can achieve that by introducing open-source data lake formats such as Apache Hudi, Apache Iceberg, and Delta Lake. Delta Lake is one of the common open-source data lake formats.
Delta Lake is available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. To utilize Delta Lake from Glue Spark jobs, follow this blog series. To utilize Delta Lake from Athena and Redshift Spectrum, you need to have specific table definitions on the AWS Glue Data Catalog, and there is an extra step to make it queryable from Athena and Redshift Spectrum.
One of the key capabilities of Delta Lake and other data lake formats is reading consistent snapshot using ACID transactions. Even when there are many concurrent writes, you can guarantee consistent version of the tables at the specific point in time without retrieving intermediate or incomplete state. It is highly demanded capability especially in complex data pipelines.
AWS Glue crawlers are designed to populate table definitions on the Data Catalog based on data dynamically. This year, AWS Glue crawlers started supporting Delta Lake. It simplifies those use cases by creating table definitions of Delta tables dynamically, populating the metadata from the Delta Lake transaction logs, and creating the manifest files in Amazon S3 for Athena and Redshift Spectrum to consume. With Delta lake crawler, you can easily read consistent snapshot from Athena and Redshift Spectrum. AWS Glue crawler integration with Delta Lake also supports AWS Lake Formation access control. You can grant Lake Formation permissions on the Delta tables created by the crawler to AWS principals that then query through Athena and Redshift Spectrum to access data in Delta tables.
This post demonstrates how AWS Glue crawlers work with Delta tables, and describes typical use cases to query Delta tables.
How AWS Glue Crawler works with Delta Lake
Delta Lake provides an abstraction known as a Delta table that encapsulates all metadata, data files, and partitions under a transaction log. Delta Lake stores the metadata and schema within the distributed file system rather than in a central data catalog.
To access data using the Delta Lake protocol, Redshift Spectrum and Athena need a manifest file that lists all files that are associated to a particular Delta table, along with the table metadata populated in the AWS Glue Data Catalog. Traditionally, this manifest file creation required running a GENERATE symlink_format_manifest
query on Apache Spark.
The AWS Glue crawler populates the metadata from the Delta Lake transaction log into the Data Catalog, and creates the manifest files in Amazon S3 for different query engines to consume. To simplify access to Delta tables, the crawler provides an option to select a Delta Lake data store, which encapsulates all parameters that are required for crawling. For each Delta Lake data store, the crawler scans the Delta table’s transaction log to detect metadata. It populates the _symlink_manifest
folder with the manifest files that are partitioned by the partition keys, based on configuration parameters that you choose.
Crawl Delta Lake tables using AWS Glue Crawler
In this tutorial, let’s go through how to crawl delta tables using AWS Glue Crawler.
Prerequisites
Complete the following prerequisite steps for this tutorial:
- Install and configure the AWS Command Line Interface (AWS CLI).
- Create your S3 bucket if you don’t have one.
- Create an AWS Identity and Access Management (IAM) role for your AWS Glue crawler if you don’t have one. For instructions, refer to Create an IAM role for AWS Glue.
- Run the following command to copy the sample Delta table into your S3 bucket (replace
your_s3_bucket
with your S3 bucket name):
Create a Delta Lake crawler
You can create a Delta Lake crawler via the AWS Glue console, the AWS Glue SDK, or the AWS CLI. In the SDK, specify a DeltaTarget
with the following configurations:
- DeltaTables – A list of Amazon S3
DeltaPath
values where the Delta tables are located. (Note that each path must be the parent of a_delta_log
folder). - WriteManifest – A Boolean value indicating whether or not the crawler should write the manifest files for each
DeltaPath
. - ConnectionName – An optional connection name stored in the Data Catalog that the crawler should use to access Delta tables backed by a VPC.
To create your crawler on the AWS Glue console, complete the following steps:
- On the AWS Glue console, choose Crawlers in the navigation pane.
- Choose Create crawler.
- For Name, enter
delta-lake-crawler
, and choose Next. - For Data source configuration, chooseNot yet.
- For Data source, choose Add a data source.
- For Data source, select Delta Lake.
- For Include delta lake table paths, enter
s3://your_s3_bucket/data/sample_delta_table/
. - Select Enable write manifest, then choose Add a Delta Lake data source. Choose Next.
- For IAM role, under Existing IAM role, choose your IAM role, then choose Next.
- For Target database, choose Add database, then Create a database page is shown up.
For Name, enterdelta_lake
, then choose Create database. Then come back to the previous page. For Target database, click the reload button, and selectdelta_lake
database. - For Frequency under Crawler schedule, choose On demand, then choose Next.
- Review your configuration, and choose Create crawler. You can trigger the crawler to run manually via the AWS Glue console, or through the SDK or AWS CLI using the
StartCrawl
API. You could also schedule a trigger via the AWS Glue console. For this post, we run the crawler via the AWS Glue console. - Select
delta-lake-crawler
, and choose Run. - Wait for the crawler to complete.
After the crawler runs, it writes a single manifest table in the Data Catalog for each DeltaPath
under its configuration that has a valid Delta table. The manifest table uses the format SymlinkTextInputFormat
and the manifest location s3://your_s3_bucket/data/sample_delta_table/_symlink_format_manifest/
.
You can see the Delta table definition on the AWS Glue console. The table location points to the preceding manifest location.
The table definition also includes an AdditionalLocations
field, which is an array that points to the location s3://your_s3_bucket/data/sample_delta_table/
. You can access this additional field through the following AWS CLI command:
Query Delta tables
After you create the manifest table, AWS query engines such as Athena and Redshift Spectrum are able to query against the files by reading the manifest file locations to filter which data files to query in the Delta table.
Query from Athena
Athena users need to point their catalog to the AWS Glue Data Catalog. Open the Athena console in the same Region as where your table is registered in the Data Catalog, and confirm that the data source is set to AwsDataCatalog
.
Now you’re ready to run queries on Athena. To access your Delta table, run the following query:
The following screenshot shows our output.
Query from Redshift Spectrum
Redshift Spectrum requires an external schema pointing to the database in which the Delta table was created.
To query with Redshift Spectrum, complete the following steps:
- Create an IAM role for an Amazon Redshift cluster with the following configuration:
- For permissions, use
arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
, or your custom policy for reading your S3 bucket. - Use the following trust relationship:
To lean more, visit IAM policies for Amazon Redshift Spectrum.
- For permissions, use
- Launch a new Redshift cluster with the following configuration:
- Choose dc2.large, 1 node.
- Configure the IAM role you created in step 1.
- Connect with the Amazon Redshift query editor v2.For instructions, see Querying a database using the query editor v2.
- Create an external schema for the
delta_lake
database to use in Redshift Spectrum (replace <your IAM role ARN> with your IAM role ARN): - Run the following SQL against
spectrum.sample_delta_table
:
The following screenshot shows our output.
Limitations of Delta crawlers and manifest tables
When the data or schema in a Delta table is updated, the manifest tables in the AWS Glue Data Catalog may become out of sync. It means that you can still query the manifest table and get the consistent result, but the result of the table is at the previous point of time. To get the latest result, you must update the manifest tables using the crawler or manually update the manifest table through the AWS Glue SDK or AWS CLI. When you want to keep the manifest table up-to-date, you can run Delta Lake crawlers on a schedule (for example, once an hour).
When the Delta table data is updated, the manifest files under the _symlink_manifest
folder of your Delta tables may also become out of sync, in which case you need to rerun a crawler with writeManifest
enabled.
Use of Delta tables in EMR and Glue Spark
The delta lake crawler is designed for use in Athena, Redshift Spectrum, and other engines compatible with parquet-based manifest tables. For EMR Spark or Glue Spark jobs, you do not need to create a manifest table by running the delta lake crawler, instead, you can read from and write to delta table directly using Delta Lake library. You can follow this blog series to understand how to process Delta tables on Glue Spark jobs.
Secure Delta tables using Lake Formation permissions
Manifest tables created by the Delta Lake crawler support Lake Formation access control, including cell-level security. It allows Data Lake administrators to filter specific rows and columns for certain users of their manifest tables. Through the use of CreateDataCellFilter
and GrantPermissions
APIs, you can grant row and column filters to the Delta manifest table. You can query the Delta manifest table from Athena and Redshift Spectrum with the use of these filters configured on the Delta manifest tables.
To learn more about Lake Formation cell-level security, refer to the following blog posts:
- Effective data lakes using AWS Lake Formation, Part 4: Implementing cell-level and row-level security
- Effective data lakes using AWS Lake Formation, Part 5: Securing data lakes with row-level access control
Clean up
Now to the final step, cleaning up the resources:
- Delete the Amazon Redshift cluster.
- Delete your data under your S3 path:
s3://your_s3_bucket/data/sample_delta_table/
. - Delete the AWS Glue crawler
delta-lake-crawler
. - Delete the AWS Glue database
delta_lake
.
Conclusion
This post demonstrated how to crawl Delta tables using an AWS Glue crawler, and how to query against the crawled tables from Athena and Redshift Spectrum. With AWS Glue crawlers, the manifest files are automatically created, so you can easily integrate Delta tables with Athena and Redshift Spectrum without manual effort in maintaining manifest files. It also enables you to manage cell-level security on the Delta tables using Lake Formation permissions.
Let’s start using Glue crawlers for your own Delta tables. If you have comments or feedback, please feel free to leave them in the comments.
About the authors
Kyle Duong is a Software Development Engineer on the AWS Glue and AWS Lake Formation team. He is passionate about building big data technologies and distributed systems. In his free time, he enjoys cycling or playing basketball.
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. This summer, he enjoyed goldfish scooping with his children.