AWS Big Data Blog

Choosing an open table format for your transactional data lake on AWS

August 2023: This post was updated to include Apache Iceberg support in Amazon Redshift.

Disclaimer: Due to rapid advancements in AWS service support for open table formats, recent developments might not yet be reflected in this post. For the latest information on AWS service support for open table formats, refer to the official AWS service documentation.

A modern data architecture enables companies to ingest virtually any type of data through automated pipelines into a data lake, which provides highly durable and cost-effective object storage at petabyte or exabyte scale. This data is then projected into analytics services such as data warehouses, search systems, stream processors, query editors, notebooks, and machine learning (ML) models through direct access, real-time, and batch workflows. Data in customers' data lakes is used to fulfill a multitude of use cases, from real-time fraud detection for financial services companies and inventory and real-time marketing campaigns for retailers to flight and hotel room availability for the hospitality industry. Across all use cases, permissions, data governance, and data protection are table stakes, and customers require a high level of control over data security, encryption, and lifecycle management.

This post shows how open-source transactional table formats (or open table formats) can help you solve advanced use cases around performance, cost, governance, and privacy in your data lakes. We also provide insights into the features and capabilities of the most common open table formats available to support various use cases.

You can use this post for guidance when looking to select an open table format for your data lake workloads, facilitating the decision-making process and potentially narrowing down the available options. The content of this post is based on the latest open-source releases of the reviewed formats at the time of writing: Apache Hudi v0.13.0, Apache Iceberg 1.2.0, and Delta Lake 2.3.0.

Advanced use cases in modern data lakes

Data lakes offer one of the best options for cost, scalability, and flexibility to store data, allowing you to retain large volumes of structured and unstructured data at a low cost, and to use this data for different types of analytics workloads—from business intelligence reporting to big data processing, real-time analytics, and ML—to help guide better decisions.

Despite these capabilities, data lakes are not databases, and object storage does not provide support for ACID processing semantics, which you may require to effectively optimize and manage your data at scale across hundreds or thousands of users using a multitude of different technologies. For example:

  • Performing efficient record-level updates and deletes as data changes in your business
  • Managing query performance as tables grow to millions of files and hundreds of thousands of partitions
  • Ensuring data consistency across multiple concurrent writers and readers
  • Preventing data corruption from write operations failing partway through
  • Evolving table schemas over time without (partially) rewriting datasets

These challenges have become particularly prevalent in use cases such as CDC (change data capture) from relational database sources, privacy regulations requiring deletion of data, and streaming data ingestion, which can result in many small files. Typical data lake file formats such as CSV, JSON, Parquet, or ORC only allow for writes of entire files, making the aforementioned requirements hard to implement, time-consuming, and costly.

To help overcome these challenges, open table formats provide additional database-like functionality that simplifies the optimization and management overhead of data lakes, while still supporting storage on cost-effective systems like Amazon Simple Storage Service (Amazon S3). These features include:

  • ACID transactions – Allowing a write to completely succeed or be rolled back in its entirety
  • Record-level operations – Allowing for single rows to be inserted, updated, or deleted
  • Indexes – Improving performance in addition to data lake techniques like partitioning
  • Concurrency control – Allowing for multiple processes to read and write the same data at the same time
  • Schema evolution – Allowing for columns of a table to be added or modified over the life of a table
  • Time travel – Enabling you to query data as of a point in time in the past
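
As a brief illustration of record-level operations and time travel, the following minimal PySpark sketch queries earlier versions of hypothetical tables in each format. The table names, paths, and timestamps are placeholders, and the examples assume the corresponding Spark extensions and catalogs are already configured (a configuration sketch appears later in this post); verify the exact syntax against the documentation for the format and engine versions you use.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel-examples").getOrCreate()

# Delta Lake: query an earlier version of a hypothetical table by version number.
spark.sql("SELECT * FROM delta_db.orders VERSION AS OF 42").show()

# Apache Iceberg: query a hypothetical table as of a timestamp.
spark.sql(
    "SELECT * FROM glue_catalog.iceberg_db.orders "
    "TIMESTAMP AS OF '2023-06-01 00:00:00'"
).show()

# Apache Hudi: time travel is exposed as a DataFrame read option.
(spark.read.format("hudi")
    .option("as.of.instant", "2023-06-01 00:00:00")
    .load("s3://example-bucket/hudi/orders/")  # hypothetical table path
    .show())
```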

In general, open table formats implement these features by storing multiple versions of a single record across many underlying files, and they use a tracking and indexing mechanism that allows an analytics engine to see or modify the correct version of the records it is accessing. When records are updated or deleted, the changed information is stored in new files, and during an operation the files for a given record are retrieved and reconciled by the open table format software. This is a powerful architecture that is used in many transactional systems, but in data lakes, it can have side effects that have to be addressed to help you align with performance and compliance requirements. For instance, when data is deleted from an open table format, in some cases only a delete marker is stored, with the original data retained until a compaction or vacuum operation performs a hard deletion. For updates, previous versions of the old values of a record may be retained until a similar process is run. This can mean that data that should be deleted isn't, or that you store a significantly larger number of files than you intend to, increasing storage cost and slowing down read performance. Regular compaction and vacuuming must be run, either as part of the way the open table format works, or separately as a maintenance procedure.
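
To make the delete-then-clean-up pattern concrete, here is a minimal, hedged PySpark sketch using Delta Lake: a record-level delete writes new files and metadata, and a subsequent vacuum physically removes the unreferenced files. The S3 path and predicate are hypothetical; Hudi (cleaning) and Iceberg (snapshot expiration plus orphan file removal) expose equivalent maintenance operations.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("delete-and-vacuum").getOrCreate()

# Hypothetical Delta Lake table location on Amazon S3.
customers = DeltaTable.forPath(spark, "s3://example-bucket/delta/customers/")

# Record-level delete: new files and transaction metadata are written, but the
# older files that still contain the deleted rows are kept for time travel.
customers.delete("customer_id = 'c-123'")

# Physically remove files no longer referenced by the table and older than the
# retention window (168 hours = 7 days), completing the hard delete.
customers.vacuum(168)
```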

The three most common and prevalent open table formats are Apache Hudi, Apache Iceberg, and Delta Lake. AWS supports all three of these open table formats, and in this post, we review the features and capabilities of each, how they can be used to implement the most common transactional data lake use cases, and which features and capabilities are available in AWS's analytics services. Innovation around these table formats is happening at an extremely rapid pace, and there are likely features available in these table formats that aren't covered here. All due care has been taken to provide the correct information as of the time of writing, but we also expect this information to change quickly, and we'll update this post frequently to contain the most accurate information. Also, this post focuses only on the open-source versions of the covered table formats, and it doesn't cover experimental or preview features. Extensions or proprietary features available from individual third-party vendors are also out of scope.

How to use this post

We encourage you to use the high-level guidance in this post with the mapping of functional fit and supported integrations for your use cases. Combine both aspects to identify what table format is likely a good fit for a specific use case, and then prioritize your proof of concept efforts accordingly. Most organizations have a variety of workloads that can benefit from an open table format, but today no single table format is a “one size fits all.” You may wish to select a specific open table format on a case-by-case basis to get the best performance and features for your requirements, or you may wish to standardize on a single format and understand the trade-offs that you may encounter as your use cases evolve.

This post doesn’t promote a single table format for any given use case. The functional evaluations are only intended to help speed up your decision-making process by highlighting key features and attention points for each table format with each use case. It is crucial that you perform testing to ensure that a table format meets your specific use case requirements.

This post is not intended to provide detailed technical guidance (for example, best practices) or benchmarking of the specific table formats; these are available in AWS technical guides and in benchmarks from the open-source community, respectively.

Choosing an open table format

When choosing an open table format for your data lake, we believe that there are two critical aspects that should be evaluated:

  • Functional fit – Does the table format offer the features required to efficiently implement your use case with the required performance? Although they all offer common features, each table format has a different underlying technical design and may support unique features. Each format can handle a range of use cases, but each also offers specific advantages or trade-offs, and may be more efficient in certain scenarios as a result of its design.
  • Supported integrations – Does the table format integrate seamlessly with your data environment? When evaluating a table format, it's important to consider the supported engine integrations across dimensions such as read/write support, data catalog integration, and supported access control tools that you have in your organization. This applies both to integration with AWS services and to integration with third-party tools.

General features and considerations

The following table summarizes general features and considerations for each table format that you may want to take into account, regardless of your use case. In addition, it's also important to consider other aspects, such as the complexity of the table format and the in-house skills available.

Primary API
  • Apache Hudi – Spark DataFrame
  • Apache Iceberg – SQL
  • Delta Lake – Spark DataFrame and SQL
Write modes
  • Apache Hudi – Copy On Write and Merge On Read
  • Apache Iceberg – Copy On Write and Merge On Read
  • Delta Lake – Copy On Write approach only
Supported data file formats
  • Apache Hudi – Parquet, ORC, HFile
  • Apache Iceberg – Parquet, ORC, Avro
  • Delta Lake – Parquet
File layout management
  • Compaction to reorganize data (sort) and merge small files together
Query optimization
S3 optimizations
  • Metadata reduces file listing operations
Table maintenance
  • Apache Hudi – Automatic within writer; separate processes also supported
  • Apache Iceberg – Separate processes
  • Delta Lake – Separate processes
Time travel
Schema evolution
Operations
  • Apache Hudi – Hudi CLI for table management, troubleshooting, and table inspection
  • No out-of-the-box options
Monitoring
  • Apache Iceberg – No out-of-the-box options that are integrated with AWS services
  • Delta Lake – No out-of-the-box options that are integrated with AWS services
Data Encryption
  • Server-side encryption on Amazon S3 supported
Configuration Options
  • Apache Hudi – High configurability: extensive configuration options for customizing read/write behavior (such as index type or merge logic) and automatically performed maintenance and optimizations (such as file sizing, compaction, and cleaning)
  • Apache Iceberg – Medium configurability: configuration options for basic read/write behavior (Merge On Read or Copy On Write operation modes)
  • Delta Lake – Low configurability: limited configuration options for table properties (for example, indexed columns)
Other
  • Apache Hudi – Savepoints allow you to restore tables to a previous version without having to retain the entire history of files
  • Apache Iceberg – Iceberg supports S3 Access Points in Spark, allowing you to implement failover across AWS Regions using a combination of S3 access points, S3 cross-Region replication, and the Iceberg Register Table API
  • Delta Lake – Shallow clones allow you to efficiently run tests or experiments on Delta tables in production, without creating copies of the dataset or affecting the original table
AWS Analytics Services Support*
Service | Apache Hudi | Apache Iceberg | Delta Lake
Amazon EMR | Read and write | Read and write | Read and write
AWS Glue for Apache Spark | Read and write | Read and write | Read and write
Amazon Athena (SQL) | Read | Read and write | Read
Amazon Athena (Spark) | Read and write | Read and write | Read and write
Amazon Redshift (Spectrum) | Read | Read | Read
AWS Glue Data Catalog | Yes | Yes | Yes

The primary API listed for each format is the Spark API with the most extensive feature support for that table format.
* For table format support in third-party tools, consult the official documentation for the respective tool.
Amazon Redshift only supports Delta Symlink tables (see Creating external tables for data managed in Delta Lake for more information).
Refer to Working with other AWS services in the Lake Formation documentation for an overview of table format support when using Lake Formation with other AWS services.
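
Whichever format you choose, your Spark environment (for example, on Amazon EMR or AWS Glue) must be configured with that format's extensions and catalog. The following sketch shows a typical Iceberg configuration backed by the AWS Glue Data Catalog, with the equivalent Hudi and Delta Lake extension classes noted in comments; the catalog name, warehouse bucket, and table are assumptions, and you should confirm the required packages and job parameters (such as AWS Glue's --datalake-formats option) in the service documentation.

```python
from pyspark.sql import SparkSession

# Minimal sketch: an Iceberg catalog named "glue_catalog" backed by the AWS Glue
# Data Catalog, with data stored under a hypothetical warehouse bucket. The
# format JARs are assumed to be on the classpath (for example, via the EMR
# release configuration or the AWS Glue --datalake-formats job parameter).
spark = (
    SparkSession.builder.appName("iceberg-glue-catalog")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://example-bucket/iceberg-warehouse/")
    # For Hudi, you would instead register org.apache.spark.sql.hudi.HoodieSparkSessionExtension;
    # for Delta Lake, io.delta.sql.DeltaSparkSessionExtension together with
    # org.apache.spark.sql.delta.catalog.DeltaCatalog as spark_catalog.
    .getOrCreate()
)

# Hypothetical database and table registered in the AWS Glue Data Catalog.
spark.sql("SELECT * FROM glue_catalog.analytics_db.orders LIMIT 10").show()
```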

Functional fit for common use cases

Now let’s dive deep into specific use cases to understand the capabilities of each open table format.

Getting data into your data lake

In this section, we discuss the capabilities of each open table format for streaming ingestion, batch load and change data capture (CDC) use cases.

Streaming ingestion

Streaming ingestion allows you to write changes from a queue, topic, or stream into your data lake. Although your specific requirements may vary based on the type of use case, streaming data ingestion typically requires the following features:

  • Low-latency writes – Supporting record-level inserts, updates, and deletes, for example to support late-arriving data
  • File size management – Enabling you to create files that are sized for optimal read performance (rather than creating one or more files per streaming batch, which can result in millions of tiny files)
  • Support for concurrent readers and writers – Including schema changes and table maintenance
  • Automatic table management services – Enabling you to maintain consistent read performance

In this section, we talk about streaming ingestion where records are just inserted into files, and you aren’t trying to update or delete previous records based on changes. A typical example of this is time series data (for example sensor readings), where each event is added as a new record to the dataset. The following table summarizes the features.

Functional fit
Considerations
  • Apache Hudi – Hudi's default configurations are tailored for upserts and need to be tuned for append-only streaming workloads. For example, Hudi's automatic file sizing in the writer minimizes the operational effort and complexity required to maintain read performance over time, but can add a performance overhead at write time. If write speed is of critical importance, it can be beneficial to turn off Hudi's file sizing, write new data files for each batch (or micro-batch), and then run clustering later to create better-sized files for read performance (using a similar approach as Iceberg or Delta).
  • Apache Iceberg – Iceberg doesn't optimize file sizes or run automatic table services (for example, compaction or clustering) when writing, so streaming ingestion will create many small data and metadata files. Frequent table maintenance needs to be performed to prevent read performance from degrading over time.
  • Delta Lake – Delta doesn't optimize file sizes or run automatic table services (for example, compaction or clustering) when writing, so streaming ingestion will create many small data and metadata files. Frequent table maintenance needs to be performed to prevent read performance from degrading over time.
Supported AWS integrations
  • Apache Hudi – Amazon EMR (Spark Structured Streaming (streaming sink and forEachBatch), Flink, Hudi DeltaStreamer); AWS Glue (Spark Structured Streaming, Hudi DeltaStreamer); Amazon Kinesis Data Analytics; Amazon Managed Streaming for Apache Kafka (MSK Connect)
Conclusion
  • Apache Hudi – Good functional fit for append-only streaming when configuration tuning for append-only workloads is acceptable.
  • Apache Iceberg – Good fit for append-only streaming with larger micro-batch windows, and when the operational overhead of table management is acceptable.
  • Delta Lake – Good fit for append-only streaming with larger micro-batch windows, and when the operational overhead of table management is acceptable.
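
As one possible implementation of append-only streaming ingestion, the following hedged PySpark sketch writes sensor events from a Kafka (or Amazon MSK) topic into a Hudi table using the insert operation. The brokers, topic, schema, S3 paths, and the option to relax small-file handling are assumptions to adapt to your environment; a similar pattern applies to Iceberg and Delta Lake sinks.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("append-only-streaming").getOrCreate()

schema = StructType([
    StructField("event_id", StringType()),
    StructField("sensor_id", StringType()),
    StructField("reading", DoubleType()),
    StructField("event_time", TimestampType()),
])

# Hypothetical source: JSON sensor readings on a Kafka/Amazon MSK topic
# (assumes the Spark Kafka connector is available on the cluster).
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "sensor-readings")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("r"))
    .select("r.*")
)

# Append-only Hudi streaming sink: the "insert" operation avoids the upsert path.
# Assumption: hoodie.parquet.small.file.limit=0 disables small-file handling when
# write latency matters more than file sizing (run clustering separately instead).
hudi_options = {
    "hoodie.table.name": "sensor_readings",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.precombine.field": "event_time",
    "hoodie.datasource.write.partitionpath.field": "sensor_id",
    "hoodie.parquet.small.file.limit": "0",
}

query = (
    events.writeStream.format("hudi")
    .options(**hudi_options)
    .outputMode("append")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/sensor_readings/")
    .start("s3://example-bucket/hudi/sensor_readings/")  # hypothetical table path
)
query.awaitTermination()
```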

When streaming data with updates and deletes into a data lake, a key priority is to have fast upserts and deletes by being able to efficiently identify impacted files to be updated.

Functional fit
  • Apache Iceberg – Iceberg offers a Merge On Read strategy to enable fast writes. Streaming upserts into Iceberg tables are natively supported with Flink, and Spark can implement streaming ingestion with updates and deletes using a micro-batch approach with MERGE INTO. Using column statistics, Iceberg offers efficient updates on tables that are sorted on a "key" column.
  • Delta Lake – Streaming ingestion with updates and deletes into OSS Delta Lake tables can be implemented using a micro-batch approach with MERGE INTO. Using data skipping with column statistics, Delta offers efficient updates on tables that are sorted on a "key" column.
Considerations
  • Apache Hudi – Hudi's automatic optimizations in the writer (for example, file sizing) add performance overhead at write time. Reading from Merge On Read tables is generally slower than from Copy On Write tables due to log files; frequent compaction can be used to optimize read performance.
  • Apache Iceberg – Iceberg uses a MERGE INTO approach (a join) for upserting data. This is more resource intensive and less performant for streaming data ingestion with frequent commits on (large unsorted) tables, because full table or partition scans would be performed on unsorted tables. Iceberg does not optimize file sizes or run automatic table services (for example, compaction) when writing, so streaming ingestion will create many small data and metadata files, and frequent table maintenance needs to be performed to prevent read performance from degrading over time. Reading from tables using the Merge On Read approach is generally slower than from tables using only the Copy On Write approach due to delete files; frequent compaction can be used to optimize read performance. Iceberg Merge On Read currently does not support dynamic file pruning using its column statistics during merges and updates, which impacts write performance and results in full table joins.
  • Delta Lake – Delta uses a Copy On Write strategy that is not optimized for fast (streaming) writes, because it rewrites entire files for record updates. Delta uses a MERGE INTO approach (a join), which is more resource intensive (less performant) and not suited for streaming data ingestion with frequent commits on large unsorted tables, because full table or partition scans would be performed on unsorted tables. No automatic file sizing is performed; separate table management processes are required (which can impact writes).
Supported AWS integrations
  • Apache Hudi – Amazon EMR (Spark Structured Streaming (streaming sink and forEachBatch), Flink, Hudi DeltaStreamer); AWS Glue (Spark Structured Streaming (streaming sink and forEachBatch), Hudi DeltaStreamer); Amazon Kinesis Data Analytics; Amazon Managed Streaming for Apache Kafka (MSK Connect)
  • Apache Iceberg – Amazon EMR (Spark Structured Streaming (only forEachBatch), Flink); AWS Glue (Spark Structured Streaming (only forEachBatch)); Amazon Kinesis Data Analytics
  • Delta Lake – Amazon EMR (Spark Structured Streaming (only forEachBatch)); AWS Glue (Spark Structured Streaming (only forEachBatch)); Amazon Kinesis Data Analytics
Conclusion
  • Apache Hudi – Good fit for lower-latency streaming with updates and deletes thanks to native support for streaming upserts, indexes for upserts, and automatic file sizing and compaction.
  • Apache Iceberg – Good fit for streaming with larger micro-batch windows and when the operational overhead of table management is acceptable.
  • Delta Lake – Can be used for streaming data ingestion with updates and deletes if latency is not a concern, because a Copy On Write strategy may not deliver the write performance required by low-latency streaming use cases.
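
The micro-batch MERGE INTO pattern that the Iceberg and Delta Lake rows describe can be sketched as follows, here targeting a hypothetical Iceberg table. The rate source stands in for a real change stream, and the catalog, table, and columns are placeholders; deduplicating each micro-batch on the key before merging avoids applying stale changes.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.functions import col, expr

spark = SparkSession.builder.appName("streaming-upserts").getOrCreate()

# Stand-in change stream; in practice this would come from Kafka/Amazon MSK.
changes = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .select(
        (col("value") % 1000).alias("order_id"),
        expr("CASE WHEN value % 2 = 0 THEN 'SHIPPED' ELSE 'OPEN' END").alias("status"),
        col("timestamp").alias("updated_at"),
    )
)

def upsert_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Keep only the latest change per key within the micro-batch, then merge it
    # into the hypothetical Iceberg target table.
    batch_df.createOrReplaceTempView("order_changes")
    batch_df.sparkSession.sql("""
        MERGE INTO glue_catalog.sales.orders AS t
        USING (
            SELECT order_id, status, updated_at FROM (
                SELECT *, ROW_NUMBER() OVER (
                    PARTITION BY order_id ORDER BY updated_at DESC) AS rn
                FROM order_changes
            ) ranked WHERE rn = 1
        ) AS s
        ON t.order_id = s.order_id
        WHEN MATCHED THEN UPDATE SET t.status = s.status, t.updated_at = s.updated_at
        WHEN NOT MATCHED THEN INSERT (order_id, status, updated_at)
            VALUES (s.order_id, s.status, s.updated_at)
    """)

query = (
    changes.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "s3://example-bucket/checkpoints/orders_upserts/")
    .start()
)
query.awaitTermination()
```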

Change data capture

Change data capture (CDC) refers to the process of identifying and capturing changes made to data in a database and then delivering those changes in real time to a downstream process or system—in this case, delivering CDC data from databases into Amazon S3.

In addition to the aforementioned general streaming requirements, the following are key requirements for efficient CDC processing:

  • Efficient record-level updates and deletes – With the ability to efficiently identify files to be modified (which is important to support late-arriving data).
  • Native support for CDC – With the following options:
    • CDC record support in the table format – The table format understands how to process CDC-generated records and no custom preprocessing is required for writing CDC records to the table.
    • CDC tools natively supporting the table format – CDC tools understand how to process CDC-generated records and apply them to the target tables. In this case, the CDC engine writes to the target table without another engine in between.

Without support for either of these two CDC options, processing and applying CDC records correctly to a target table requires custom code. Each CDC tool likely has its own record format (or payload); for example, Debezium and AWS Database Migration Service (AWS DMS) each have their own specific record formats that need to be transformed differently. This must be considered when you are operating CDC at scale across many tables.

All three table formats allow you to implement CDC from a source database into a target table. The difference for CDC with each format lies mainly in the ease of implementing CDC pipelines and supported integrations.

Functional fit
  • Apache Hudi – Hudi's DeltaStreamer utility provides a no-code/low-code option to efficiently ingest CDC records from different sources into Hudi tables. Upserts using indexes allow you to quickly identify the target files for updates, without having to perform a full table join. Unique record keys and deduplication natively enforce source databases' primary keys and prevent duplicates in the data lake. Out-of-order records are handled via the pre-combine feature. Native support (through record payload formats) is offered for CDC formats like AWS DMS and Debezium, eliminating the need to write custom CDC preprocessing logic in the writer application to correctly interpret and apply CDC records to the target table; writing CDC records to Hudi tables is as simple as writing any other records to a Hudi table. Partial updates are supported, so the CDC payload format does not need to include all record columns.
  • Apache Iceberg – Flink CDC offers a convenient way to set up CDC from upstream data sources into Iceberg tables. It supports upsert mode and can interpret CDC formats such as Debezium natively. Using column statistics, Iceberg offers efficient updates on tables that are sorted on a "key" column.
  • Delta Lake – CDC into Delta tables can be implemented using third-party tools or using Spark with custom processing logic. Using data skipping with column statistics, Delta offers efficient updates on tables that are sorted on a "key" column.
Considerations
  • Apache Hudi – Natively supported payload formats can be found in the Hudi code repo. For other formats, consider creating a custom payload or adding custom logic to the writer application to correctly process and apply CDC records of that format to target Hudi tables.
  • Apache Iceberg – Iceberg uses a MERGE INTO approach (a join) for upserting data, which is more resource intensive and less performant, particularly on large unsorted tables where a MERGE INTO operation could require a full table scan. Regular compaction should be implemented to maintain sort order over time and prevent MERGE INTO performance from degrading. Iceberg has no native support for CDC payload formats (for example, AWS DMS or Debezium); when using engines other than Flink CDC (such as Spark), custom logic needs to be added to the writer application in order to correctly process and apply CDC records to target Iceberg tables (for example, deduplication or ordering based on operation). Deduplication to enforce primary key constraints needs to be handled in the Iceberg writer application. There is no support for handling out-of-order records.
  • Delta Lake – Delta does not use indexes for upserts, but uses a MERGE INTO approach instead (a join), which is more resource intensive and less performant on large unsorted tables because those would require full table or partition scans. Regular clustering should be implemented to maintain sort order over time and prevent MERGE INTO performance from degrading. Delta Lake has no native support for CDC payload formats (for example, AWS DMS or Debezium); when using Spark for ingestion, custom logic needs to be added to the writer application in order to correctly process and apply CDC records to target Delta tables (for example, deduplication or ordering based on operation). Record updates on unsorted Delta tables result in full table or partition scans. There is no support for handling out-of-order records.
Natively supported CDC formats
  • Apache Hudi – AWS DMS, Debezium
  • Apache Iceberg – None
  • Delta Lake – None
CDC tool integrations
  • Apache Hudi – DeltaStreamer, Flink CDC, Debezium
  • Apache Iceberg – Flink CDC, Debezium
  • Delta Lake – Debezium
Conclusion
All three formats can implement CDC workloads. Apache Hudi offers the best overall technical fit for CDC workloads as well as the most options for efficient CDC pipeline design: no-code/low-code with DeltaStreamer, third-party CDC tools offering native Hudi integration, or a Spark/Flink engine using CDC record payloads offered in Hudi.
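
As a rough sketch of applying CDC records with Spark, the following PySpark example upserts an AWS DMS output batch into a hypothetical Hudi table using a DMS-aware record payload. The S3 locations, key and precombine columns, and the payload class name are assumptions to verify against your Hudi version and DMS task configuration; DeltaStreamer offers a no-code alternative for the same pattern.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dms-cdc-to-hudi").getOrCreate()

# Hypothetical location where an AWS DMS task lands full-load + CDC Parquet
# files for a single source table.
cdc_batch = spark.read.parquet("s3://example-bucket/dms-output/sales/orders/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "order_id",     # source primary key
    "hoodie.datasource.write.precombine.field": "updated_at",  # resolves out-of-order records
    # Assumption: this DMS-aware payload class name matches your Hudi version; it
    # interprets the DMS "Op" column so source deletes are applied as deletes.
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.AWSDmsAvroPayload",
}

(cdc_batch.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://example-bucket/hudi/orders/"))  # hypothetical Hudi table path
```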

Batch loads

If your use case requires only periodic writes but frequent reads, you may want to use batch loads and optimize for read performance.

Batch loading data with updates and deletes is perhaps the simplest use case to implement with any of the three table formats. Batch loads typically don’t require low latency, allowing them to benefit from the operational simplicity of a Copy On Write strategy. With Copy On Write, data files are rewritten to apply updates and add new records, minimizing the complexity of having to run compaction or optimization table services on the table.

Functional fit
  • Apache Hudi – Copy On Write is supported. Automatic file sizing while writing is supported, including optimizing previously written small files by adding new records to them. Multiple index types are provided to optimize update performance for different workload patterns.
  • Apache Iceberg – Copy On Write is supported. File size management is performed within each incoming data batch (but it is not possible to optimize previously written data files by adding new records to them).
  • Delta Lake – Copy On Write is supported. File size can be indirectly managed within each data batch by setting the maximum number of records per file (but it is not possible to optimize previously written data files by adding new records to them).
Considerations
  • Apache Hudi – Configuring Hudi according to your workload pattern is imperative for good performance (see Apache Hudi on AWS for guidance).
  • Apache Iceberg – Data deduplication needs to be handled in the writer application. If a single data batch does not contain sufficient data to reach a target file size, compaction can be performed to merge smaller files together afterwards. Ensuring data is sorted on a "key" column is imperative for good update performance; regular sorting compaction should be considered to maintain sorted data over time.
  • Delta Lake – Data deduplication needs to be handled in the writer application. If a single data batch does not contain sufficient data to reach a target file size, compaction can be performed to merge smaller files together afterwards. Ensuring data is sorted on a "key" column is imperative for good update performance; regular clustering should be considered to maintain sorted data over time.
Supported AWS integrations
  • Apache Hudi – Amazon EMR (Spark); AWS Glue (Spark)
  • Apache Iceberg – Amazon EMR (Spark, Presto, Trino, Hive); AWS Glue (Spark); Amazon Athena (SQL)
  • Delta Lake – Amazon EMR (Spark, Trino); AWS Glue (Spark)
Conclusion
All three formats are well suited for batch loads. Apache Hudi supports the most configuration options and may increase the effort to get started, but provides lower operational effort due to automatic table management. On the other hand, Iceberg and Delta are simpler to get started with, but require some operational overhead for table maintenance.
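
A batch upsert with a Copy On Write format can be as simple as the following hedged sketch, which merges a daily extract into a hypothetical Delta Lake table using the DeltaTable API; the paths, key column, and schema are placeholders, and the same logic can be expressed as a SQL MERGE INTO against Hudi or Iceberg tables.

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.appName("batch-upsert").getOrCreate()

# Hypothetical daily extract containing the latest state of each customer record.
daily_batch = spark.read.parquet("s3://example-bucket/staging/customers/dt=2023-06-01/")

target = DeltaTable.forPath(spark, "s3://example-bucket/delta/customers/")

# Copy On Write batch upsert: data files containing matched records are rewritten.
(target.alias("t")
    .merge(daily_batch.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
# Follow-up compaction or clustering, if needed, is covered in the read
# performance section later in this post.
```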

Working with open table formats

In this section, we discuss the capabilities of each open table format for common use cases when working with open table formats: optimizing read performance, incremental data processing and processing deletes to comply with privacy regulations.

Optimizing read performance

The preceding sections primarily focused on write performance for specific use cases. Now let’s explore how each open table format can support optimal read performance. Although there are some cases where data is optimized purely for writes, read performance is typically a very important dimension on which you should evaluate an open table format.

Open table format features that improve query performance include the following:

  • Indexes, (column) statistics, and other metadata – Improve query planning and file pruning, resulting in reduced data scanned
  • File layout optimization – Enables better query performance through:
    • File size management – Properly sized files provide better query performance
    • Data colocation (through clustering) according to query patterns – Reduces the amount of data scanned by queries
Functional fit
  • Apache Hudi – Automatic file sizing when writing results in good file sizes for read performance, and on Merge On Read tables, automatic compaction and clustering improve read performance. Metadata tables eliminate slow S3 file listing operations, and column statistics in the metadata table can be used for better file pruning in query planning (the data skipping feature). Data can be clustered for better data colocation with hierarchical sorting or z-ordering.
  • Apache Iceberg – Hidden partitioning prevents unintentional full table scans by users, without requiring them to specify partition columns explicitly. Column and partition statistics in manifest files speed up query planning and file pruning, and eliminate S3 file listing operations. An optimized file layout for S3 object storage using random prefixes is supported, which minimizes the chance of S3 throttling. Data can be clustered for better data colocation with hierarchical sorting or z-ordering.
  • Delta Lake – File size can be indirectly managed within each data batch by setting the maximum number of records per file (but previously written data files are not optimized by adding new records to existing files). Generated columns avoid full table scans. Data skipping is automatically used in Spark. Data can be clustered for better data colocation using z-ordering.
Considerations
  • Apache Hudi – Data skipping using metadata column stats has to be supported in the query engine (currently only in Apache Spark). Snapshot queries on Merge On Read tables have higher query latencies than on Copy On Write tables; this latency impact can be reduced by increasing the compaction frequency.
  • Apache Iceberg – Separate table maintenance needs to be performed to maintain read performance over time. Reading from tables using a Merge On Read approach is generally slower than from tables using only a Copy On Write approach due to delete files; frequent compaction can be used to optimize read performance.
  • Delta Lake – Currently, only Apache Spark and the Python clients can use data skipping. Separate table maintenance needs to be performed to maintain read performance over time.
Optimization & Maintenance Processes
  • Apache Hudi – Compaction of log files in Merge On Read tables can be run as part of the writing application or as a separate job using Spark on Amazon EMR or AWS Glue; compaction does not interfere with other jobs or queries. Clustering runs as part of the writing application or in a separate job using Spark on Amazon EMR or AWS Glue, because clustering can interfere with other transactions. See Apache Hudi on AWS for guidance.
  • Delta Lake – The compaction API in Delta Lake can group small files or cluster data, and it can interfere with other transactions. This process has to be scheduled separately by the user on a time or event basis. Spark can be used to perform compaction in services like Amazon EMR or AWS Glue.
Conclusion For achieving good read performance, it’s important that your query engine supports the optimization features offered by the table formats. When using Spark, all three formats provide good read performance when properly configured. When using Trino (and therefore Athena as well), Iceberg will likely provide better query performance because the data skipping feature of Hudi and Delta is not supported in the Trino engine. Make sure to evaluate this feature support for your query engine of choice.
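
The table maintenance that the considerations above refer to is typically scheduled as a separate Spark job. The following hedged sketch shows representative operations: an Iceberg file rewrite with z-order sorting and snapshot expiration, and a Delta Lake OPTIMIZE with Z-ordering. Catalog, table, path, and column names are hypothetical, and Hudi normally runs comparable services (compaction, clustering, cleaning) from the writer or as separate jobs configured through hoodie.* options.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-maintenance").getOrCreate()

# Apache Iceberg: compact small files and co-locate data on frequently filtered
# columns (hypothetical catalog, table, and column names).
spark.sql("""
    CALL glue_catalog.system.rewrite_data_files(
        table => 'sales.orders',
        strategy => 'sort',
        sort_order => 'zorder(customer_id, order_date)')
""")

# Apache Iceberg: expire old snapshots so obsolete data and metadata files can
# be physically removed.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'sales.orders',
        older_than => TIMESTAMP '2023-06-01 00:00:00')
""")

# Delta Lake: equivalent compaction plus Z-ordering on a path-based table.
spark.sql("OPTIMIZE delta.`s3://example-bucket/delta/orders/` ZORDER BY (customer_id)")

# Apache Hudi typically runs comparable services (compaction, clustering,
# cleaning) inline in the writer or as separate jobs driven by hoodie.* options.
```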

Incremental processing of data on the data lake

At a high level, incremental data processing is the movement of new or fresh data from a source to a destination. To implement incremental extract, transform, and load (ETL) workloads efficiently, we need to be able to retrieve only the data records that have been changed or added since a certain point in time (incrementally) so we don’t need to reprocess unnecessary data (such as entire partitions). When your data source is an open table format table, we can take advantage of incremental queries to facilitate more efficient reads in these table formats.

Functional fit
  • Apache Hudi – Full incremental pipelines can be built using Hudi's incremental queries, which capture record-level changes on a Hudi table (including updates and deletes) without the need to store and manage change data records. Hudi's DeltaStreamer utility offers simple no-code/low-code options to build incremental Hudi pipelines.
  • Apache Iceberg – Iceberg incremental queries can only read new records (no updates) from upstream Iceberg tables and replicate them to downstream tables. Incremental pipelines with record-level changes (including updates and deletes) can be implemented using the changelog view procedure.
  • Delta Lake – Full incremental pipelines can be built using Delta's Change Data Feed (CDF) feature, which captures record-level changes (including updates and deletes) using change data records.
Considerations
  • Apache Hudi – The ETL engine used needs to support Hudi's incremental query type.
  • Apache Iceberg – A view has to be created to incrementally read data between two table snapshots containing updates and deletes, and a new view has to be created (or recreated) for reading changes from new snapshots.
  • Delta Lake – Record-level changes can only be captured from the moment CDF is turned on. CDF stores change data records on storage, so a storage overhead is incurred.
Supported AWS integrations
  • Apache Hudi – Incremental queries are supported in Amazon EMR (Spark, Flink, Hive, Hudi DeltaStreamer), AWS Glue (Spark, Hudi DeltaStreamer), and Amazon Kinesis Data Analytics.
  • Apache Iceberg – Incremental queries are supported in Amazon EMR (Spark, Flink), AWS Glue (Spark), and Amazon Kinesis Data Analytics. The changelog (CDC) view is supported in Amazon EMR (Spark) and AWS Glue (Spark).
  • Delta Lake – CDF is supported in Amazon EMR (Spark) and AWS Glue (Spark).
Conclusion
  • Apache Hudi – Best functional fit for incremental ETL pipelines using a variety of engines, without any storage overhead.
  • Apache Iceberg – Good fit for implementing incremental pipelines using Spark if the overhead of creating views is acceptable.
  • Delta Lake – Good fit for implementing incremental pipelines using Spark if the additional storage overhead is acceptable.
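
To illustrate the difference in incremental read APIs, the following hedged PySpark sketch reads changes from a hypothetical Hudi table with an incremental query and from a hypothetical Delta Lake table with Change Data Feed; paths, commit times, and versions are placeholders, and Iceberg offers a changelog view Spark procedure for a comparable pattern.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-reads").getOrCreate()

# Apache Hudi incremental query: read only records changed after a given commit
# time (hypothetical path and commit timestamp).
hudi_increments = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20230601000000")
    .load("s3://example-bucket/hudi/orders/")
)

# Delta Lake Change Data Feed: read record-level changes between two versions.
# Requires the table property delta.enableChangeDataFeed = true at write time.
delta_changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 10)
    .option("endingVersion", 15)
    .load("s3://example-bucket/delta/orders/")
)

# Apache Iceberg exposes a create_changelog_view Spark procedure for a similar,
# view-based pattern (not shown here).
hudi_increments.show()
delta_changes.show()
```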

Processing deletes to comply with privacy regulations

Due to privacy regulations like the General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA), companies across many industries need to perform record-level deletes on their data lake for “right to be forgotten” or to correctly store changes to consent on how their customers’ data can be used.

The ability to perform record-level deletes without rewriting entire (or large parts of) datasets is the main requirement for this use case. For compliance regulations, it’s important to perform hard deletes (deleting records from the table and physically removing them from Amazon S3).

Functional fit
  • Apache Hudi – Hard deletes are performed by Hudi's automatic cleaner service.
  • Apache Iceberg – Hard deletes can be implemented as a separate process.
  • Delta Lake – Hard deletes can be implemented as a separate process.
Considerations
  • Apache Hudi – The Hudi cleaner needs to be configured according to compliance requirements to automatically remove older file versions in time (within a compliance window); otherwise, time travel or rollback operations could recover deleted records.
  • Apache Iceberg – Previous snapshots need to be (manually) expired after the delete operation; otherwise, time travel operations could recover deleted records.
  • Delta Lake – The vacuum operation needs to be run after the delete; otherwise, time travel operations could recover deleted records.
Conclusion
This use case can be implemented using all three formats, and in each case, you must ensure that your configuration or background pipelines implement the cleanup procedures required to meet your data retention requirements.
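
For example, a compliance delete on an Iceberg table could look like the following hedged sketch: a record-level DELETE followed by snapshot expiration so the removed records can no longer be reached through time travel. Table, catalog, and predicate values are hypothetical; for Delta Lake the equivalent follow-up is a vacuum (shown earlier in this post), and for Hudi the cleaner retention settings (for example, hoodie.cleaner.commits.retained) control how quickly older file versions are physically removed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compliance-deletes").getOrCreate()

# Apache Iceberg: record-level delete on a hypothetical table.
spark.sql("DELETE FROM glue_catalog.crm.customers WHERE customer_id = 'c-123'")

# Expire snapshots taken before the delete so the removed records can no longer
# be recovered through time travel; expired data files become eligible for removal.
spark.sql("""
    CALL glue_catalog.system.expire_snapshots(
        table => 'crm.customers',
        older_than => TIMESTAMP '2023-06-08 00:00:00')
""")

# Delta Lake: run VACUUM after the delete (see the earlier sketch). Apache Hudi's
# cleaner retention (for example, hoodie.cleaner.commits.retained) determines how
# quickly older file versions containing the deleted records are removed.
```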

Conclusion

Today, no single table format is the best fit for all use cases, and each format has its own unique strengths for specific requirements. It’s important to determine which requirements and use cases are most crucial and select the table format that best meets those needs.

To speed up the selection process of the right table format for your workload, we recommend the following actions:

  • Identify what table format is likely a good fit for your workload using the high-level guidance provided in this post
  • Perform a proof of concept with the identified table format from the previous step to validate its fit for your specific workload and requirements

Keep in mind that these open table formats are open source and rapidly evolve with new features and enhanced or new integrations, so it can be valuable to also take into consideration product roadmaps when deciding on the format for your workloads.

AWS will continue to innovate on behalf of our customers to support these powerful table formats and to help you be successful with your advanced use cases for analytics in the cloud. For more support on building transactional data lakes on AWS, get in touch with your AWS Account Team or AWS Support, or review the following resources:


About the Authors

Shana Schipers is an Analytics Specialist Solutions Architect at AWS, focusing on big data. She supports customers worldwide in building transactional data lakes using open table formats like Apache Hudi, Apache Iceberg and Delta Lake on AWS.

Ian Meyers is a Director of Product Management for AWS Analytics Services. He works with many of AWS' largest customers on emerging technology needs, and leads several data and analytics initiatives within AWS, including support for Data Mesh.


Carlos Rodrigues is a Big Data Specialist Solutions Architect at AWS. He helps customers worldwide build transactional data lakes on AWS using open table formats like Apache Hudi and Apache Iceberg.