Apache Hudi is an open-source data management framework that simplifies incremental data processing and data pipeline development. It helps you manage business requirements such as data lifecycle more efficiently and improves data quality. Hudi enables you to manage data at the record level in Amazon S3 data lakes, which simplifies Change Data Capture (CDC) and streaming data ingestion and helps with data privacy use cases that require record-level updates and deletes. Data sets managed by Hudi are stored in S3 using open storage formats, while integrations with Presto, Apache Hive, Apache Spark, and AWS Glue Data Catalog give you near real-time access to updated data using familiar tools.
Hudi is supported in Amazon EMR and is automatically installed when you choose Spark, Hive, or Presto when deploying your EMR cluster. Using Hudi, you can handle either read-heavy or write-heavy use cases, and Hudi manages the underlying data stored on S3 using Apache Parquet and Apache Avro. Data sets managed by Hudi are accessible not only from Spark (and PySpark) but also from other engines such as Hive and Presto. Native integration with AWS Database Migration Service (DMS) provides another source of data as it changes.
Use cases
Record-level insert, update, and delete for privacy regulations and simplified pipelines
Due to recent privacy regulations like GDPR and CCPA, companies across many industries need to perform record-level updates and deletions to honor people's "right to be forgotten" or changes to their consent about how their data can be used. Previously, you had to create custom data management and ingestion solutions to track individual changes and rewrite large data sets for just a few changes. With Apache Hudi on EMR, you can use familiar insert, update, upsert, and delete operations, and Hudi tracks transactions and makes granular changes on S3, which simplifies your data pipelines.
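As a rough illustration of the operations described above, the following Python sketch builds the option map that the Hudi Spark data source reads when writing a DataFrame. The table name, field names, and S3 path are hypothetical placeholders, and the helper functions are not part of Hudi. The sketch assumes a PySpark DataFrame is supplied by the caller and that the Hudi Spark bundle is available on the cluster. An update is expressed as an upsert on an existing record key.

```python
from typing import Dict

# Operations this sketch accepts; an update is an "upsert" on an existing key.
ALLOWED_OPERATIONS = {"insert", "upsert", "delete"}


def hudi_write_options(
    table_name: str,
    record_key: str,
    precombine_field: str,
    partition_field: str,
    operation: str = "upsert",
    table_type: str = "COPY_ON_WRITE",
) -> Dict[str, str]:
    """Build Hudi data source write options (all argument values are caller-supplied)."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    if table_type not in {"COPY_ON_WRITE", "MERGE_ON_READ"}:
        raise ValueError(f"unsupported table type: {table_type}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.operation": operation,
        "hoodie.datasource.write.table.type": table_type,
    }


def write_hudi(df, options: Dict[str, str], base_path: str) -> None:
    """Write a PySpark DataFrame to a Hudi table at base_path.

    `df` is assumed to be a pyspark.sql.DataFrame; "append" mode lets Hudi
    apply the requested operation to the existing table instead of replacing it.
    """
    df.write.format("hudi").options(**options).mode("append").save(base_path)


# Hypothetical usage: remove the rows in `to_forget_df` by record key.
# opts = hudi_write_options("customers", "customer_id", "updated_at", "country", "delete")
# write_hudi(to_forget_df, opts, "s3://example-bucket/hudi/customers/")
```

Choosing `COPY_ON_WRITE` generally suits read-heavy tables, while `MERGE_ON_READ` generally suits write-heavy ones; which one fits depends on your workload.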
Simplified file management and near real-time data access
Streaming IoT and ingestion pipelines need to handle data insertion and update events without creating many small files that can cause performance issues for analytics. Data engineers need tools that let them use upserts to handle streaming data ingestion efficiently, automate and optimize storage, and enable analysts to query new data immediately. Previously, you had to build custom solutions that monitor and rewrite many small files into fewer large files, and manage orchestration and monitoring yourself. Apache Hudi automatically tracks changes and merges files so they remain optimally sized.
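File sizing in Hudi is driven by write configuration. The sketch below, which assumes the Parquet sizing keys `hoodie.parquet.max.file.size` and `hoodie.parquet.small.file.limit` (both expressed in bytes), builds the corresponding options from sizes given in megabytes. The helper name and the example sizes are hypothetical; suitable values depend on your workload.

```python
from typing import Dict

BYTES_PER_MB = 1024 * 1024


def file_sizing_options(max_file_mb: int, small_file_limit_mb: int) -> Dict[str, str]:
    """Build Hudi file-sizing options from sizes in megabytes.

    Files below the small-file limit are candidates to receive new records,
    so the limit must be positive and must not exceed the target maximum size.
    """
    if small_file_limit_mb <= 0 or max_file_mb <= 0:
        raise ValueError("sizes must be positive")
    if small_file_limit_mb > max_file_mb:
        raise ValueError("small-file limit must not exceed the maximum file size")
    return {
        "hoodie.parquet.max.file.size": str(max_file_mb * BYTES_PER_MB),
        "hoodie.parquet.small.file.limit": str(small_file_limit_mb * BYTES_PER_MB),
    }


# Hypothetical usage: merge these into the write options passed to the
# Hudi data source, for example {**write_options, **file_sizing_options(120, 100)}.
```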
A common use case is making data from Enterprise Data Warehouses (EDW) and Operational Data Stores (ODS) available to SQL query engines like Apache Hive and Presto for processing and analytics. With Hudi, individual changes can be processed much more granularly, reducing overhead. You can query S3 data sets directly to give your users a near real-time view of your data.
Simplified CDC data pipeline development
A common challenge when creating data pipelines is dealing with CDC. Late-arriving or incorrect data requires the data to be rewritten for updated records. Finding the right files to update, applying the changes, and then viewing the data is challenging and requires customers to create their own frameworks or conventions. With Hudi, late-arriving data can be "upserted" into an existing data set. When changes are made, Hudi finds the appropriate files in S3 and rewrites them to incorporate the changes. Hudi also allows you to view your data set at specific points in time. Each change to a data set is tracked and can be easily rolled back, should you need to "undo" it. Integration with AWS Database Migration Service (DMS) can also simplify loading data.
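The behavior described above (upserting late-arriving records, viewing the data set at a point in time, and rolling back a change) can be modeled with a small in-memory sketch. This is a conceptual toy, not Hudi's implementation: the `ToyTable` class, its field names, and its full-snapshot-per-commit storage are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Record = Dict[str, object]


@dataclass
class ToyTable:
    """In-memory stand-in for a keyed table with a commit timeline."""

    key_field: str
    ordering_field: str
    # commits[i] holds the full table state after commit i; commit 0 is empty.
    commits: List[Dict[object, Record]] = field(default_factory=lambda: [{}])

    def upsert(self, batch: List[Record]) -> int:
        """Apply a batch as one commit and return the new commit number."""
        state = dict(self.commits[-1])  # start from the latest snapshot
        for rec in batch:
            key = rec[self.key_field]
            current = state.get(key)
            # Insert new keys; replace existing ones only if the incoming
            # record is at least as new according to the ordering field.
            if current is None or rec[self.ordering_field] >= current[self.ordering_field]:
                state[key] = rec
        self.commits.append(state)
        return len(self.commits) - 1

    def as_of(self, commit: int) -> Dict[object, Record]:
        """Return the table state as it was after the given commit."""
        return dict(self.commits[commit])

    def rollback(self) -> None:
        """Undo the most recent commit, if there is one."""
        if len(self.commits) > 1:
            self.commits.pop()


table = ToyTable(key_field="id", ordering_field="ts")
first = table.upsert([{"id": 1, "ts": 10, "v": "a"}, {"id": 2, "ts": 10, "v": "b"}])
# The second batch contains a stale update for id 1, a newer one for id 2,
# and a late-arriving record for a new key, id 3.
second = table.upsert([
    {"id": 1, "ts": 5, "v": "stale"},
    {"id": 2, "ts": 20, "v": "b2"},
    {"id": 3, "ts": 7, "v": "late"},
])
```

In this toy model, `table.as_of(first)` still shows the data set before the second batch arrived, and `table.rollback()` discards the second commit entirely.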