AWS Big Data Blog
Introducing Apache Hudi support with AWS Glue crawlers
Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers tackle complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance. Data engineers use Apache Hudi for streaming workloads as well as to create efficient incremental data pipelines. Hudi provides tables, transactions, efficient upserts and deletes, advanced indexes, streaming ingestion services, data clustering and compaction optimizations, and concurrency control, all while keeping your data in open source file formats. Hudi’s advanced performance optimizations make analytical workloads faster with any of the popular query engines, including Apache Spark, Presto, Trino, and Apache Hive.
Many AWS customers have adopted Apache Hudi on their data lakes built on top of Amazon S3 using AWS Glue, a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. The AWS Glue crawler is a component of AWS Glue that automatically creates table metadata from data content, without requiring you to define the metadata manually.
AWS Glue crawlers now support Apache Hudi tables, simplifying the adoption of the AWS Glue Data Catalog as the catalog for Hudi tables. One typical use case is to register Hudi tables that do not yet have a catalog table definition. Another typical use case is migration from other Hudi catalogs, such as a Hive metastore. In either case, you can create and schedule an AWS Glue crawler and provide one or more Amazon S3 paths where the Hudi table files are located. You have the option to provide the maximum depth of Amazon S3 paths that the crawler can traverse. With each run, the crawler inspects each of the S3 paths and catalogs schema information, such as new tables, deletes, and updates to table schemas, in the AWS Glue Data Catalog. Crawlers also inspect partition information, add newly added partitions, and update the latest metadata file location in the AWS Glue Data Catalog so that AWS analytical engines can use it directly.
This post demonstrates how this new capability to crawl Hudi tables works.
How AWS Glue crawler works with Hudi tables
Hudi tables fall into two categories, each with specific implications:
- Copy on write (CoW) – Data is stored in a columnar format (Parquet), and each update creates a new version of files during a write.
- Merge on read (MoR) – Data is stored using a combination of columnar (Parquet) and row-based (Avro) formats. Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files.
With CoW datasets, each time there is an update to a record, the file that contains the record is rewritten with the updated values. With a MoR dataset, each time there is an update, Hudi writes only the row for the changed record. MoR is better suited for write- or change-heavy workloads with fewer reads. CoW is better suited for read-heavy workloads on data that changes less frequently.
Hudi provides three query types for accessing the data:
- Snapshot queries – Queries that see the latest snapshot of the table as of a given commit or compaction action. For MoR tables, snapshot queries expose the most recent state of the table by merging the base and delta files of the latest file slice at the time of the query.
- Incremental queries – Queries see only new data written to the table since a given commit or compaction. This effectively provides change streams to enable incremental data pipelines.
- Read optimized queries – For MoR tables, queries see the latest compacted data. For CoW tables, queries see the latest committed data.
For copy-on-write tables, crawlers create a single table in the AWS Glue Data Catalog with the ReadOptimized Serde org.apache.hudi.hadoop.HoodieParquetInputFormat.
For merge-on-read tables, crawlers create two tables in the AWS Glue Data Catalog for the same table location:
- A table with the suffix _ro, which uses the ReadOptimized Serde org.apache.hudi.hadoop.HoodieParquetInputFormat
- A table with the suffix _rt, which uses the RealTime Serde, allowing for snapshot queries: org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat
During each crawl, for each Hudi path provided, crawlers make an Amazon S3 list API call, filter based on the .hoodie folders, and find the most recent metadata file under that Hudi table metadata folder.
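As an illustration of the timeline the crawler relies on (a sketch, not the crawler's internal implementation), you can list the .hoodie folder yourself; the newest commit file there is the metadata the crawler records:

```bash
# List the Hudi timeline files. The most recent .commit (CoW) or
# .deltacommit (MoR) file reflects the latest table metadata.
aws s3 ls s3://your_s3_bucket/data/sample_hudi_cow_table/.hoodie/
```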
Crawl a Hudi CoW table using AWS Glue crawler
In this section, let’s go through how to crawl a Hudi CoW table using AWS Glue crawlers.
Prerequisites
Here are the prerequisites for this tutorial:
- Install and configure the AWS Command Line Interface (AWS CLI).
- Create your S3 bucket if you do not have one.
- Create your IAM role for AWS Glue if you do not have one. You need s3:GetObject for s3://your_s3_bucket/data/sample_hudi_cow_table/.
- Run the following command to copy the sample Hudi table into your S3 bucket. (Replace your_s3_bucket with your S3 bucket name.)
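A minimal sketch of the copy command follows; the source path is a placeholder for the published sample data location, which is not reproduced in this excerpt:

```bash
# Copy the sample CoW table into your bucket. The source path below is a
# placeholder; substitute the sample data location from the original post.
aws s3 cp --recursive \
  s3://<sample-data-bucket>/<sample_hudi_cow_table_prefix>/ \
  s3://your_s3_bucket/data/sample_hudi_cow_table/
```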
This walkthrough has you copy sample data, but you can easily create your own Hudi tables using AWS Glue. Learn more in Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 2: AWS Glue Studio Visual Editor.
Create a Hudi crawler
In this walkthrough, you create the crawler through the console. Complete the following steps to create a Hudi crawler:
- On the AWS Glue console, choose Crawlers.
- Choose Create crawler.
- For Name, enter hudi_cow_crawler. Choose Next.
- Under Data source configuration, choose Add data source.
- For Data source, choose Hudi.
- For Include hudi table paths, enter s3://your_s3_bucket/data/sample_hudi_cow_table/. (Replace your_s3_bucket with your S3 bucket name.)
- Choose Add Hudi data source.
- Choose Next.
- For Existing IAM role, choose your IAM role, then choose Next.
- For Target database, choose Add database. In the Add database dialog, for Database name, enter hudi_crawler_blog, then choose Create. Choose Next.
- Choose Create crawler.
Now a new Hudi crawler has been successfully created. The crawler can be run through the console, or through the SDK or AWS CLI using the StartCrawler API. It can also be scheduled through the console to run at specific times. In this walkthrough, run the crawler through the console.
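For reference, the equivalent AWS CLI call is a one-liner, using the crawler name created above:

```bash
# Trigger the crawler outside the console via the StartCrawler API.
aws glue start-crawler --name hudi_cow_crawler
```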
- Choose Run crawler.
- Wait for the crawler to complete.
After the crawler has run, you can see the Hudi table definition in the AWS Glue console:
You have successfully crawled the Hudi CoW table with data on Amazon S3 and created an AWS Glue Data Catalog table with the schema populated. After you create the table definition in the AWS Glue Data Catalog, AWS analytics services such as Amazon Athena can query the Hudi table.
Complete the following steps to start queries on Athena:
- Open the Amazon Athena console.
- Run the following query.
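A minimal query like the following works, assuming the crawler registered the table as sample_hudi_cow_table (named after the S3 folder):

```sql
-- Validate the crawled CoW table registered by the crawler.
SELECT * FROM "hudi_crawler_blog"."sample_hudi_cow_table" LIMIT 10;
```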
The following screenshot shows our output:
Crawl a Hudi MoR table using AWS Glue crawler with AWS Lake Formation data permissions
In this section, let’s go through how to crawl a Hudi MoR table using AWS Glue crawlers. This time, you use AWS Lake Formation data permissions for crawling Amazon S3 data sources instead of IAM and Amazon S3 permissions. This is optional, but it simplifies permission configuration when your data lake is managed by Lake Formation permissions.
Prerequisites
Here are the prerequisites for this tutorial:
- Install and configure the AWS Command Line Interface (AWS CLI).
- Create your S3 bucket if you do not have one.
- Create your IAM role for AWS Glue if you do not have one. You need lakeformation:GetDataAccess, but you do not need s3:GetObject for s3://your_s3_bucket/data/sample_hudi_mor_table/ because we use Lake Formation data permissions to access the files.
- Run the following command to copy the sample Hudi table into your S3 bucket. (Replace your_s3_bucket with your S3 bucket name.)
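As before, a sketch of the copy command with the source path left as a placeholder for the sample data location:

```bash
# Copy the sample MoR table into your bucket. The source path below is a
# placeholder; substitute the sample data location from the original post.
aws s3 cp --recursive \
  s3://<sample-data-bucket>/<sample_hudi_mor_table_prefix>/ \
  s3://your_s3_bucket/data/sample_hudi_mor_table/
```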
In addition, complete the following steps to update the AWS Glue Data Catalog settings so that Lake Formation permissions, instead of IAM-based access control, are used to control catalog resources:
- Sign in to the Lake Formation console as a data lake administrator.
- If this is the first time accessing the Lake Formation console, add yourself as the data lake administrator.
- Under Administration, choose Data catalog settings.
- For Default permissions for newly created databases and tables, deselect Use only IAM access control for new databases and Use only IAM access control for new tables in new databases.
- For Cross account version setting, choose Version 3.
- Choose Save.
The next step is to register your S3 bucket in Lake Formation data lake locations:
- On the Lake Formation console, choose Data lake locations, and choose Register location.
- For Amazon S3 path, enter s3://your_s3_bucket/. (Replace your_s3_bucket with your S3 bucket name.)
- Choose Register location.
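If you script this step, a sketch of the equivalent AWS CLI call follows, assuming the Lake Formation service-linked role for data access:

```bash
# Register the bucket as a Lake Formation data lake location.
aws lakeformation register-resource \
  --resource-arn arn:aws:s3:::your_s3_bucket \
  --use-service-linked-role
```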
Then, grant the AWS Glue crawler role access to the data location so that the crawler can use Lake Formation permissions to access the data and create tables in the location:
- On the Lake Formation console, choose Data locations and choose Grant.
- For IAM users and roles, select the IAM role you used for the crawler.
- For Storage location, enter s3://your_s3_bucket/data/. (Replace your_s3_bucket with your S3 bucket name.)
- Choose Grant.
Then, grant the crawler role permission to create tables under the database hudi_crawler_blog:
- On the Lake Formation console, choose Data lake permissions.
- Choose Grant.
- For Principals, choose IAM users and roles, and choose the crawler role.
- For LF tags or catalog resources, choose Named data catalog resources.
- For Database, choose the database hudi_crawler_blog.
- Under Database permissions, select Create table.
- Choose Grant.
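A scripted equivalent of this grant looks like the following sketch; the role ARN is a placeholder for your crawler role:

```bash
# Grant the crawler role CREATE_TABLE on the target database.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/YourGlueCrawlerRole \
  --permissions CREATE_TABLE \
  --resource '{"Database": {"Name": "hudi_crawler_blog"}}'
```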
Create a Hudi crawler with Lake Formation data permissions
Complete the following steps to create a Hudi crawler:
- On the AWS Glue console, choose Crawlers.
- Choose Create crawler.
- For Name, enter hudi_mor_crawler. Choose Next.
- Under Data source configuration, choose Add data source.
- For Data source, choose Hudi.
- For Include hudi table paths, enter s3://your_s3_bucket/data/sample_hudi_mor_table/. (Replace your_s3_bucket with your S3 bucket name.)
- Choose Add Hudi data source.
- Choose Next.
- For Existing IAM role, choose your IAM role.
- Under Lake Formation configuration – optional, select Use Lake Formation credentials for crawling S3 data source.
- Choose Next.
- For Target database, choose hudi_crawler_blog. Choose Next.
- Choose Create crawler.
Now a new Hudi crawler has been successfully created. The crawler uses Lake Formation credentials for crawling Amazon S3 files. Let’s run the new crawler:
- Choose Run crawler.
- Wait for the crawler to complete.
After the crawler has run, you can see two tables for the Hudi table definition in the AWS Glue console:
- sample_hudi_mor_table_ro (read optimized table)
- sample_hudi_mor_table_rt (real time table)
You registered the data lake bucket with Lake Formation and enabled crawling access to the data lake using Lake Formation permissions. You have successfully crawled the Hudi MoR table with data on Amazon S3 and created AWS Glue Data Catalog tables with the schemas populated. After you create the table definitions in the AWS Glue Data Catalog, AWS analytics services such as Amazon Athena can query the Hudi table.
Complete the following steps to start queries on Athena:
- Open the Amazon Athena console.
- Run the following query.
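For example, you can query the read optimized table, assuming it was registered as sample_hudi_mor_table_ro:

```sql
-- Read optimized view: latest compacted data only.
SELECT * FROM "hudi_crawler_blog"."sample_hudi_mor_table_ro" LIMIT 10;
```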
The following screenshot shows our output:
- Run the following query.
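And similarly against the real time table, which merges base and delta files at query time:

```sql
-- Real time (snapshot) view: base files merged with delta log files.
SELECT * FROM "hudi_crawler_blog"."sample_hudi_mor_table_rt" LIMIT 10;
```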
The following screenshot shows our output:
Fine-grained access control using AWS Lake Formation permissions
To apply fine-grained access control on the Hudi table, you can benefit from AWS Lake Formation permissions. Lake Formation permissions allow you to restrict access to specific tables, columns, or rows and then query the Hudi tables through Amazon Athena with fine-grained access control. Let’s configure Lake Formation permissions for the Hudi MoR table.
Prerequisites
Here are the prerequisites for this tutorial:
- Complete the previous section Crawl a Hudi MoR table using AWS Glue crawler with AWS Lake Formation data permissions.
- Create an IAM user DataAnalyst with the AWS managed policy AmazonAthenaFullAccess attached.
Create a Lake Formation data cell filter
Let’s first set up a filter for the MoR read optimized table.
- Sign in to the Lake Formation console as a data lake administrator.
- Choose Data filters.
- Choose Create new filter.
- For Data filter name, enter exclude_product_price.
- For Target database, choose the database hudi_crawler_blog.
- For Target table, choose the table sample_hudi_mor_table_ro.
- For Column-level access, select Exclude columns, and choose the column price.
- For Row filter expression, enter true.
- Choose Create filter.
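You can also create the same filter programmatically through the Lake Formation CreateDataCellsFilter API; the following AWS CLI sketch uses a placeholder account ID for TableCatalogId:

```bash
# Create a data cell filter that hides the price column.
aws lakeformation create-data-cells-filter --table-data '{
  "TableCatalogId": "123456789012",
  "DatabaseName": "hudi_crawler_blog",
  "TableName": "sample_hudi_mor_table_ro",
  "Name": "exclude_product_price",
  "RowFilter": {"FilterExpression": "true"},
  "ColumnWildcard": {"ExcludedColumnNames": ["price"]}
}'
```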
Grant Lake Formation permissions to the DataAnalyst user
Complete the following steps to grant Lake Formation permissions to the DataAnalyst user:
- On the Lake Formation console, choose Data lake permissions.
- Choose Grant.
- For Principals, choose IAM users and roles, and choose the user DataAnalyst.
- For LF tags or catalog resources, choose Named data catalog resources.
- For Database, choose the database hudi_crawler_blog.
- For Table – optional, choose the table sample_hudi_mor_table_ro.
- For Data filters – optional, select exclude_product_price.
- For Data filter permissions, select Select.
- Choose Grant.
You granted the DataAnalyst user Lake Formation permissions on the database hudi_crawler_blog and the table sample_hudi_mor_table_ro, excluding the column price. Now let’s validate the user’s access to the data using Athena.
- Sign in to the Athena console as the DataAnalyst user.
- In the query editor, run the following query:
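A minimal query like the following is enough to confirm the filter, assuming the table names from the previous section:

```sql
-- The price column is excluded by the data cell filter, so only the
-- permitted columns are returned.
SELECT * FROM "hudi_crawler_blog"."sample_hudi_mor_table_ro" LIMIT 10;
```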
The following screenshot shows our output:
You have now validated that the column price is not shown, while the other columns product_id, product_name, update_at, and category are shown.
Clean up
To avoid unwanted charges to your AWS account, delete the following AWS resources:
- Delete the AWS Glue database hudi_crawler_blog.
- Delete the AWS Glue crawlers hudi_cow_crawler and hudi_mor_crawler.
- Delete the Amazon S3 files under s3://your_s3_bucket/data/sample_hudi_cow_table/ and s3://your_s3_bucket/data/sample_hudi_mor_table/.
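If you prefer the AWS CLI for cleanup, a sketch follows (replace your_s3_bucket with your S3 bucket name):

```bash
# Remove the Data Catalog database and crawlers created in this post.
aws glue delete-database --name hudi_crawler_blog
aws glue delete-crawler --name hudi_cow_crawler
aws glue delete-crawler --name hudi_mor_crawler

# Remove the sample data from your bucket.
aws s3 rm --recursive s3://your_s3_bucket/data/sample_hudi_cow_table/
aws s3 rm --recursive s3://your_s3_bucket/data/sample_hudi_mor_table/
```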
Conclusion
This post demonstrated how AWS Glue crawlers work with Hudi tables. With Hudi crawler support, you can quickly move to using the AWS Glue Data Catalog as your primary Hudi table catalog, and start building your serverless transactional data lake with Hudi, AWS Glue, the AWS Glue Data Catalog, and Lake Formation fine-grained access controls for the tables and formats supported by AWS analytical engines.
About the authors
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team, based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.
Kyle Duong is a Software Development Engineer on the AWS Glue and Lake Formation team. He is passionate about building big data technologies and distributed systems.
Sandeep Adwankar is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.