Apache Hudi on Amazon EMR
Why Apache Hudi on EMR?
Learn more
Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. This framework more efficiently manages business requirements like data lifecycle and improves data quality. Hudi enables you to manage data at the record-level in Amazon S3 data lakes to simplify Change Data Capture (CDC) and streaming data ingestion and helps to handle data privacy use cases requiring record level updates and deletes. Data sets managed by Hudi are stored in S3 using open storage formats, while integrations with Presto, Apache Hive, Apache Spark, and AWS Glue Data Catalog give you near real-time access to updated data using familiar tools.
Hudi is supported in Amazon EMR and is automatically installed when you choose Spark, Hive, or Presto when deploying your EMR cluster. Using Hudi, you can handle either read-heavy or write-heavy use cases, and Hudi will manage the underlying data stored on S3 using Apache Parquet and Apache Avro. Data sets managed by Hudi are accessible not only from Spark (and PySpark) but also other engines such as Hive and Presto. Native integration with AWS Database Migration Service also provides another source for data as it changes.
Use cases
Record-level insert, update, and delete for privacy regulations and simplified pipelines
Simplified file management and near real-time data access
Streaming IoT and ingestion pipelines need to handle data insertion and update events without creating many small files that can cause performance issues for analytics. Data engineers need tools that enable them to use upserts to efficiently handle streaming data ingestion, automate and optimize storage, and enable analysts to query new data immediately. Previously, you had to build custom solutions that monitor and re-write many small files into fewer large files, and manage orchestration and monitoring. Apache Hudi will automatically track changes and merge files so they remain optimally sized.
A common use case is making data from Enterprise Data Warehouses (EDW) and Operational Data Stores (ODS) available for SQL query engines like Apache Hive and Presto for processing and analytics. With Hudi, individual changes are can be processed much more granularly, reducing overhead. You can query S3 data sets directly to view and provide your users with a near real-time view of your data.