Posted On: Jul 21, 2023

AWS Glue Crawlers now support Apache Hudi tables, allowing customers to query data in Apache Hudi tables directly from AWS analytics services such as Amazon Athena. Apache Hudi is an open-source table format that brings database and data warehouse capabilities to the data lake. It helps data engineers manage continuously evolving data sets while maintaining query performance.

Previously, Amazon Athena users who wanted to query Apache Hudi tables had to manually create a table in the Glue Data Catalog and update it after partition changes to keep query results current. With today’s launch, users can automatically register Apache Hudi tables in the Glue Catalog by running a Glue Crawler. Glue Crawlers support partitioned and non-partitioned Copy on Write (CoW) and Merge on Read (MoR) Hudi tables. Users can then query Glue Catalog Hudi tables across various analytics services and apply Lake Formation fine-grained permissions. With Glue Crawlers, users can also migrate data from other Hudi catalogs to the Glue Catalog.

To get started, users create, run, or schedule a Glue Crawler and provide one or more Amazon S3 paths to Hudi tables. With each run, the crawler extracts schema and partition information, then updates the Glue Catalog with schema changes, partition changes, and the latest Hudi metadata file location.
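
As an illustration of the steps above, the following Python sketch builds a request for the Glue CreateCrawler API with Hudi targets and, in a separate function, submits it through boto3. It is a minimal example under stated assumptions, not a reference implementation: the crawler name, IAM role ARN, database name, bucket path, and schedule shown in the usage note are hypothetical placeholders, and the helper function names are invented for this example.

```python
from typing import Any, Dict, List, Optional


def build_hudi_crawler_request(
    name: str,
    role_arn: str,
    database_name: str,
    hudi_paths: List[str],
    schedule: Optional[str] = None,
    max_traversal_depth: Optional[int] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for Glue CreateCrawler with Hudi targets.

    All argument values are supplied by the caller; nothing here is
    specific to a real account or bucket.
    """
    if not hudi_paths:
        raise ValueError("at least one S3 path to a Hudi table is required")
    for path in hudi_paths:
        if not path.startswith("s3://"):
            raise ValueError(f"not an S3 path: {path}")

    # One Hudi target can carry several S3 paths.
    hudi_target: Dict[str, Any] = {"Paths": list(hudi_paths)}
    if max_traversal_depth is not None:
        # Optional limit on how deep the crawler searches for Hudi metadata.
        hudi_target["MaximumTraversalDepth"] = max_traversal_depth

    request: Dict[str, Any] = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database_name,
        "Targets": {"HudiTargets": [hudi_target]},
    }
    if schedule is not None:
        # Glue expects a cron expression, e.g. "cron(0 2 * * ? *)".
        request["Schedule"] = schedule
    return request


def create_and_start_crawler(request: Dict[str, Any], region_name: str) -> None:
    """Create the crawler and trigger one run (requires AWS credentials)."""
    import boto3  # imported here so building a request needs no AWS SDK

    glue = boto3.client("glue", region_name=region_name)
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```

A call such as build_hudi_crawler_request("hudi-crawler", "arn:aws:iam::123456789012:role/GlueCrawlerRole", "hudi_db", ["s3://example-bucket/hudi/orders/"], schedule="cron(0 2 * * ? *)") returns a dictionary that can be passed to create_and_start_crawler along with a region name. Keeping request construction separate from the API call lets the configuration be reviewed or tested without touching an AWS account.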

AWS Glue Crawler’s support for Hudi tables is available in all commercial regions where AWS Glue is available; see the AWS Region Table. To learn more, visit the AWS Glue Crawler documentation.