Apache Hudi (Incubating) is an open-source data management framework used to simplify incremental data processing and data pipeline development. Apache Hudi enables you to manage data at the record level in Amazon S3 to simplify Change Data Capture (CDC) and streaming data ingestion, and provides a framework to handle data privacy use cases requiring record-level updates and deletes. Data sets managed by Apache Hudi are stored in S3 using open storage formats, and integrations with Presto, Apache Hive, Apache Spark, and AWS Glue Data Catalog give you near real-time access to updated data using familiar tools.
Apache Hudi is natively supported in Amazon EMR, and is installed automatically when you deploy an EMR cluster with Spark, Hive, or Presto. Using Apache Hudi, you can create data sets that are optimized for either read-heavy or write-heavy use cases, and Apache Hudi will manage the underlying data on S3, storing it in Apache Parquet and Apache Avro formats. Data sets managed by Apache Hudi are accessible from Apache Spark, Apache Hive, and Presto.
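In practice, writing a Hudi data set from Spark comes down to a handful of `hoodie.*` options passed to the DataFrame writer. The sketch below shows a typical option map; the table name, field names, and S3 path are illustrative placeholders, and the commented-out write call assumes a SparkSession with the Hudi libraries on the classpath (as on EMR).

```python
# Hudi writer options for an upsert into an S3-backed table.
# The table name, key fields, and path below are placeholder assumptions.
hudi_options = {
    "hoodie.table.name": "customer_events",                    # assumed table name
    "hoodie.datasource.write.recordkey.field": "event_id",     # unique record key
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "updated_at",  # newest record wins
    "hoodie.datasource.write.operation": "upsert",             # insert / upsert / delete
}

# With a SparkSession available, the write itself would look like:
# df.write.format("hudi") \
#     .options(**hudi_options) \
#     .mode("append") \
#     .save("s3://my-bucket/hudi/customer_events")  # placeholder S3 path
```

The precombine field tells Hudi which record to keep when two writes share the same record key, which is what makes the upsert semantics deterministic.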
Features and benefits
Record-level Insert, Update, and Delete
Due to recent privacy regulations, companies across many industries need to perform record-level updates and deletions when their users choose to exercise their right to be forgotten or change their consent as to how their data can be used. Previously, you had to create custom data management and ingestion solutions to track individual changes, and manually manage the process of recreating data sets to include those changes, often rewriting large data sets just to incorporate a small number of changes. With Apache Hudi on Amazon EMR, you can use familiar insert, update, upsert, and delete operations on data managed by Apache Hudi, and Apache Hudi will handle tracking transactions and changes to individual files on S3. You only have to tell Apache Hudi which data you want to change, and Apache Hudi takes care of the rest.
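The record-level semantics can be pictured as a keyed store: an upsert inserts or replaces a record by key, and a delete removes it by key, with the framework rewriting only the affected files. The pure-Python sketch below models those semantics only; it is not Hudi's implementation or API.

```python
# Toy model of record-level upsert and delete on a keyed data set.
# Hudi applies these operations to Parquet/Avro files on S3; here we use a dict.
def upsert(table, records, key="id"):
    """Insert new records and update existing ones, matched by key."""
    for rec in records:
        table[rec[key]] = rec
    return table

def delete(table, keys):
    """Remove records by key, e.g. for a right-to-be-forgotten request."""
    for k in keys:
        table.pop(k, None)
    return table

table = {}
upsert(table, [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
upsert(table, [{"id": 2, "name": "b2"}])  # same key: update in place
delete(table, [1])                        # record-level delete
```

The point of the model is that callers describe *which* records change; locating and rewriting the underlying files is the framework's job.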
Simplified File Management
Streaming IoT and other event-based ingestion pipelines need the ability to handle data insertion and update events without accruing a large number of small files on S3 that can cause performance issues with downstream analytical processing. Data engineers need tools that enable them to use upserts to efficiently handle streaming data ingestion, automate the management of files in S3, optimize data storage to produce a small number of larger files, and give their analysts the ability to query new data immediately. Previously, you had to build custom solutions that monitor and rewrite many small files into fewer large files, and manage the orchestration and monitoring of those solutions. With Apache Hudi on Amazon EMR, the data files on S3 are managed for you: simply configure an optimal file size, and Apache Hudi will track changes and merge files so they remain optimally sized.
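The small-file problem and its fix can be sketched as simple bin packing: many small files are merged until each output file approaches a configured target size. The greedy sketch below is illustrative only, not Hudi's actual file-sizing algorithm.

```python
def merge_small_files(file_sizes_mb, target_mb=128):
    """Greedily merge small files into outputs of roughly target_mb each."""
    merged, current = [], 0
    for size in sorted(file_sizes_mb):
        # Close the current output file once adding another would exceed the target.
        if current + size > target_mb and current > 0:
            merged.append(current)
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged

# 300 one-megabyte event files collapse into a few near-target-size files.
merged = merge_small_files([1] * 300, target_mb=128)  # -> [128, 128, 44]
```

In Hudi the target is a configurable file size, and the merge happens as part of writes and compaction rather than as a separate custom job you orchestrate yourself.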
Near Real-Time Data Access
A common use case is making data from Enterprise Data Warehouses (EDW) and Operational Data Stores (ODS) available in Amazon S3 so it can be used with SQL query engines like Apache Hive and Presto for data processing and analytics. Today, you can choose from multiple solutions to track and ingest database change logs, but you still have to create custom solutions to apply those changes and to build and refresh materialized data sets for analytics. With Apache Hudi, individual changes are applied and managed with transactional upserts, simplifying the application of new or updated records, and the integrations with Apache Hive and Presto mean you no longer need a custom solution to apply them. You can query S3 data sets directly to view the latest data in near real time, and use Apache Hudi to manage changes and give your users a near real-time view of your data.
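Beyond the latest snapshot view, Hudi also supports incremental queries that return only the records changed since a given commit, which is how downstream jobs can consume just the deltas. The option map below sketches such a read from Spark; the instant timestamp and S3 path are placeholder assumptions.

```python
# Hudi read options for an incremental query: pull only the records
# committed after a given instant time. The timestamp below is a placeholder.
incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20190801000000",  # assumed commit time
}

# With Spark on EMR, the incremental read would look like:
# changes = spark.read.format("hudi") \
#     .options(**incremental_options) \
#     .load("s3://my-bucket/hudi/customer_events")  # placeholder S3 path
```

A snapshot query (the default query type) returns the full latest view instead, which is what Hive and Presto see when querying the synced table.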
Simplified Data Pipeline Development
A common challenge when creating data pipelines is dealing with late-arriving or incorrect data, which requires the data to be restated and existing data sets updated to incorporate the new or corrected records. Finding the right files to update, applying the changes, and then viewing the data before and after the updates is challenging, and previously required customers to create their own frameworks or conventions. With Apache Hudi, late-arriving data can be “upserted” into an existing data set, relying on the framework to insert or update records based on their presence in the data set. When changes are made, Apache Hudi manages the process of finding the appropriate files in S3 and rewriting them to incorporate the changed records. Apache Hudi also allows you to view your data set at specific points in time, helping you understand what difference the late-arriving data made to calculations and analytics. Each change to a data set is tracked as a commit and can be easily rolled back, allowing you to find specific changes to a data set and “undo” them.
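Commit-based point-in-time views and rollback can be pictured as an append-only log of change sets: the table as of commit t is the replay of all commits up to t, and rollback simply drops the latest commit. The toy model below illustrates that idea; it is not Hudi's timeline implementation.

```python
# Toy commit timeline: each commit is a dict of key -> record (None means delete).
commits = []

def commit(changes):
    """Record one atomic change set on the timeline."""
    commits.append(changes)

def snapshot(as_of=None):
    """Replay commits up to (and including) index as_of for a point-in-time view."""
    view = {}
    upto = len(commits) if as_of is None else as_of + 1
    for changes in commits[:upto]:
        for key, rec in changes.items():
            if rec is None:
                view.pop(key, None)
            else:
                view[key] = rec
    return view

def rollback():
    """Undo the most recent commit."""
    commits.pop()

commit({1: "v1", 2: "v1"})
commit({2: "v2"})            # late-arriving correction lands as its own commit
before = snapshot(as_of=0)   # view before the correction was applied
after = snapshot()           # latest view, correction included
rollback()                   # "undo" the correction
```

Comparing `before` and `after` is exactly the before/after inspection the text describes, and `rollback` shows why tracking each change as a discrete commit makes undoing it cheap.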