Support record-level insert, update, and delete on Amazon S3 with Amazon EMR

Posted on: Nov 15, 2019

Amazon EMR release 5.28.0 now supports Apache Hudi (Incubating). Data engineers using Amazon EMR for data pipeline development and data processing can now use Apache Hudi to simplify incremental data management and data privacy use cases that require record-level insert, update, and delete operations. Apache Hudi enables Amazon S3-based data lakes to comply with data privacy laws, consume real-time streams and change data capture logs, reinstate late-arriving data, and track change history and roll back changes. Apache Hudi is open source and stores data on Amazon S3 in vendor-neutral, open-source formats such as Apache Parquet and Apache Avro.

Apache Hudi is an open-source data management framework used to simplify incremental data processing and data pipeline development. Apache Hudi enables you to manage data at the record level in Amazon S3 to simplify Change Data Capture (CDC) and streaming data ingestion, and provides a framework to handle data privacy use cases requiring record-level updates and deletes. Data sets managed by Apache Hudi are stored in S3 using open storage formats, and integrations with Presto, Apache Hive, Apache Spark, and AWS Glue Data Catalog give you near real-time access to updated data using familiar tools.
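
To make "record-level" concrete, here is a minimal conceptual sketch in plain Python of what applying a batch of change records to a keyed data set involves. It is not Apache Hudi code and says nothing about how Hudi is implemented; the function name `apply_changes` and the field names `id`, `ts`, and `_deleted` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable

Record = Dict[str, object]


def apply_changes(
    table: Dict[object, Record],
    changes: Iterable[Record],
    key_field: str = "id",          # hypothetical record key column
    ts_field: str = "ts",           # hypothetical ordering column
    delete_flag: str = "_deleted",  # hypothetical delete marker
) -> Dict[object, Record]:
    """Return a new table with inserts, updates, and deletes applied by key."""
    result = dict(table)  # shallow copy, so the input table is left untouched
    for change in changes:
        key = change[key_field]
        if change.get(delete_flag):
            # Record-level delete: drop the row with this key if it exists.
            result.pop(key, None)
            continue
        current = result.get(key)
        # Insert when the key is new; update only when the change is at least
        # as recent as the stored row, so stale late arrivals are ignored.
        if current is None or change[ts_field] >= current[ts_field]:
            result[key] = {k: v for k, v in change.items() if k != delete_flag}
    return result
```

In this sketch, a change carrying an unseen key is an insert, a change for an existing key with a newer timestamp is an update, and a change flagged as deleted removes the record. This is the kind of bookkeeping a CDC pipeline would otherwise have to manage by hand over files in S3.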

Apache Hudi is natively supported in Amazon EMR, and is automatically installed if you choose Apache Spark, Hive, or Presto when deploying your EMR cluster. Using Apache Hudi, you can create data sets that are optimized for either read-heavy or write-heavy use cases, and Apache Hudi will manage the underlying data stored on S3 in Apache Parquet and Apache Avro formats.
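
As a rough illustration of choosing between the read-heavy and write-heavy layouts, the sketch below builds a dictionary of Hudi write options in Python. It rests on several assumptions not stated in this announcement: that read-heavy corresponds to Hudi's `COPY_ON_WRITE` table type and write-heavy to `MERGE_ON_READ`, and that the option keys match current Apache Hudi documentation. Keys may differ in the Hudi version bundled with EMR 5.28.0, so check the EMR Release Guide before relying on them. The helper name `hudi_write_options` and all table and column names are hypothetical.

```python
from typing import Dict

# Assumed mapping from workload profile to Hudi table type.
TABLE_TYPES = {"read-heavy": "COPY_ON_WRITE", "write-heavy": "MERGE_ON_READ"}

# Write operations this sketch accepts; availability depends on Hudi version.
OPERATIONS = {"insert", "upsert", "delete"}


def hudi_write_options(
    table_name: str,
    record_key: str,
    precombine_field: str,
    partition_field: str,
    workload: str = "read-heavy",
    operation: str = "upsert",
) -> Dict[str, str]:
    """Build Hudi write options (keys assumed from current Hudi docs)."""
    if workload not in TABLE_TYPES:
        raise ValueError(f"unknown workload: {workload!r}")
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": TABLE_TYPES[workload],
        "hoodie.datasource.write.operation": operation,
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to pick the latest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }


# Hypothetical use from a Spark job on the cluster (not executed here):
#   opts = hudi_write_options("orders", "order_id", "updated_at", "order_date")
#   df.write.format("org.apache.hudi").options(**opts).mode("append").save(
#       "s3://your-bucket/path/to/orders")  # placeholder location
```

Keeping the options in one helper makes the table type an explicit, reviewable choice rather than a string scattered across jobs.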

Amazon EMR release 5.28.0 with Apache Hudi is now available in US East (N. Virginia & Ohio), US West (Oregon), South America (São Paulo), Europe (Ireland & Stockholm), AWS GovCloud (US-East & US-West), and AWS (Beijing Region) operated by Sinnet, with more regions being added in the coming weeks.

You can stay up to date on Amazon EMR releases by subscribing to the feed for EMR release notes. Use the icon at the top of the EMR Release Guide to link the feed URL directly to your favorite feed reader.

To get a closer look at using Apache Hudi with EMR, please attend our re:Invent session and workshop.

Additional Links:
AWS News Blog: New - Insert, Update, Delete Data on S3 with Amazon EMR and Apache Hudi