Posted On: May 9, 2023

You can now use Amazon Athena to query tables created with Apache Hudi 0.12.2 which includes support for improved scalability of queries accessing data sets in Amazon S3 data lake. The updated integration enables you to use Athena to query Hudi 0.12.2 tables managed via Amazon EMR, Apache Spark, Apache Hive or other compatible services.

Apache Hudi is an open-source data management framework used to simplify incremental data processing in S3 data lakes. Hudi provides record-level data processing that can help you simplify development of Change Data Capture (CDC) pipelines, comply with GDPR-driven updates and deletions, and better manage streaming data from sensors or devices that require data insertion and event updates. The 0.12.2 release includes support for Metadata Tables which are designed to eliminate the requirement for the “list files” operation in order to better support efficient scaling over larger data sets. The Metadata Table will instead proactively maintain the list of files and remove the need for recursive file listing operations to avoid running into request limits in the case of storage systems like Amazon S3.

Apache Hudi 0.12.2 support is available in Athena engine version 3 and is available in supported regions. To learn more about the new Apache Hudi 0.12.2 support in Athena, see Querying Hudi data sets in the Athena user documentation.