What is Apache Iceberg?
Apache Iceberg is a distributed, community-driven, Apache 2.0-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it is fast, efficient, and reliable at any scale and keeps records of how datasets change over time. Apache Iceberg offers easy integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more.
What is a transactional data lake?
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. A data transaction is a series of data exchanges that are conducted in a single operation. For example, when a customer withdraws money from a bank account, the bank conducts several data exchanges in one transaction, including verifying that the account has a sufficient balance, verifying identity, and debiting the withdrawal from the account. A transactional data lake is a type of data lake that not only stores data at scale but also supports transactional operations, ensuring that data is accurate and consistent and letting you track how data and data structures change over time. These properties are collectively known as Atomicity, Consistency, Isolation, and Durability (ACID):
- Atomicity guarantees that each transaction is a single event that either succeeds or fails completely; there is no halfway state.
- Consistency ensures that all data written is valid according to the defined rules of the data lake, ensuring that data is accurate and reliable.
- Isolation ensures multiple transactions can occur at the same time without interfering with each other, ensuring that each transaction executes independently.
- Durability means that data is not lost or corrupted once a transaction is committed. Data can be recovered in the event of a system failure, such as a power outage.
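The atomicity property above can be sketched with a toy version of the bank-withdrawal example: changes are staged on a copy and committed only if every step succeeds, so a failed step leaves no half-applied state. This is purely an illustration of the ACID concept, not how Iceberg (or any database) implements it.

```python
# Toy sketch of an all-or-nothing (atomic) withdrawal.
# Illustrative only; not Iceberg's actual transaction mechanism.

class InsufficientFunds(Exception):
    pass

def withdraw(accounts, account_id, amount):
    """Apply the withdrawal atomically: stage changes, commit only if all checks pass."""
    staged = dict(accounts)          # work on a copy (the "staging" area)
    if staged.get(account_id, 0) < amount:
        raise InsufficientFunds(f"balance too low for {account_id}")
    staged[account_id] -= amount
    accounts.update(staged)          # commit: apply the staged state
    return accounts[account_id]

accounts = {"alice": 100}
withdraw(accounts, "alice", 30)      # succeeds; balance is now 70
try:
    withdraw(accounts, "alice", 500) # fails before any change is applied
except InsufficientFunds:
    pass
assert accounts["alice"] == 70       # no partial debit from the failed attempt
```

Because the failed second withdrawal raises before the commit step, the visible account state is never left half-updated.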
What are the benefits of using Apache Iceberg?
Some of the key benefits of using Apache Iceberg for transactional data lakes include:
- Familiarity of SQL: Structured query language (SQL) is a popular query language that is frequently used in all types of applications. Data analysts and developers learn and use SQL because it integrates well with different programming languages and is fairly easy to learn, as it uses common English keywords in its statements. Apache Iceberg allows anyone who is familiar with SQL to build data lakes and perform most data lake operations without needing to learn a new language.
- Data Consistency: Apache Iceberg provides data consistency to ensure that any user who reads and writes to the data sees the same data.
- Data structure: Apache Iceberg allows for easy changes to your data structure, also known as schema evolution, meaning that users can add, rename, or remove columns from a data table without disrupting the underlying data.
- Data Versioning: Apache Iceberg provides support for data versioning, which allows users to track changes to data over time. This enables the time travel feature, which allows users to access and query historical versions of data and analyze changes to the data between updates and deletes.
- Cross-platform support: Apache Iceberg supports a variety of different storage systems and query engines, including Apache Spark, Apache Hive, and Presto. This makes it easy to use Iceberg in a variety of different data processing environments.
- Incremental processing: Iceberg supports incremental processing, which allows users to process only the data that has changed since the last run, also known as CDC (Change Data Capture). This can help improve data processing efficiency and performance.
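The schema-evolution benefit above rests on one idea that can be shown in a few lines: Iceberg tracks columns by stable field IDs in table metadata, so adding or renaming a column is a metadata-only change and existing data files are never rewritten. The sketch below is a toy model of that idea, not Iceberg's actual metadata format.

```python
# Toy model of metadata-driven schema evolution.
# Columns are tracked by field ID, so renames never touch data files.

table = {
    "schema": {1: "id", 2: "amount"},      # field ID -> column name (metadata)
    "data_files": [{1: 7, 2: 19.99}],      # rows keyed by field ID, not by name
}

def add_column(table, field_id, name):
    table["schema"][field_id] = name       # metadata-only change

def rename_column(table, field_id, new_name):
    table["schema"][field_id] = new_name   # existing data files are untouched

add_column(table, 3, "region")
rename_column(table, 2, "total")

# Reads resolve names through field IDs, so old files still line up
# with the new schema:
row = {table["schema"][fid]: v for fid, v in table["data_files"][0].items()}
assert row == {"id": 7, "total": 19.99}
```

Because readers map field IDs to names at query time, a rename is visible immediately without disrupting the underlying data, which is the behavior the "Data structure" bullet describes.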
What are common use cases for Apache Iceberg?
Apache Iceberg is suited for many data lake use cases, including:
- Data tables in data lakes that require frequent deletes, such as when enforcing data privacy laws.
- Data tables in data lakes that require record-level updates. This is helpful when your dataset requires frequent updates after data settles; for example, sales data that may change due to later events such as customer returns. Iceberg provides capabilities to update individual records without needing to republish the entire dataset.
- Data tables in data lakes that have unpredictable changes, such as Slowly Changing Dimension (SCD) tables. An example of an SCD is a customer record table that includes name, location, and contact information which may change over time at unknown intervals.
- When transactions with the data lake require guaranteed data validity, durability, and reliability, Apache Iceberg table formats can be deployed to ensure ACID transactions.
- When there is a need to go back in time to query historical versions of data to perform trend analysis, analyze changes to data over a period of time, or to restore or rollback to a previous version to correct issues.
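The time-travel and rollback use case above can be modeled with an append-only snapshot log: every commit adds an immutable snapshot, and reads can target the latest snapshot or any historical one. This is a conceptual sketch only, not Iceberg's snapshot implementation.

```python
# Toy snapshot log illustrating time travel and incremental reads.
# Illustrative only; Iceberg's real snapshots track files, not rows.

snapshots = []                           # append-only history of table states

def commit(rows):
    snapshots.append({"id": len(snapshots), "rows": tuple(rows)})

def read(snapshot_id=None):
    """Read the latest snapshot, or time-travel to a specific one."""
    snap = snapshots[-1] if snapshot_id is None else snapshots[snapshot_id]
    return list(snap["rows"])

commit([("sale", 100)])
commit([("sale", 100), ("refund", -100)])    # a later event corrects the data

assert read() == [("sale", 100), ("refund", -100)]   # current version
assert read(snapshot_id=0) == [("sale", 100)]        # time travel

# Incremental processing: examine only what changed since snapshot 0.
added = set(snapshots[1]["rows"]) - set(snapshots[0]["rows"])
assert added == {("refund", -100)}
```

The same snapshot log also shows why rollback is cheap in this model: restoring a previous version just means pointing reads back at an earlier snapshot.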
Who uses Apache Iceberg?
Data engineers, data administrators, data analysts, and data scientists are among the personas that use Apache Iceberg. Data engineers and administrators can use Apache Iceberg to design and build scalable data storage systems. Data analysts and data scientists can use Apache Iceberg to analyze large datasets efficiently.
Why should you choose Apache Iceberg?
Apache Iceberg offers a fast, efficient way to process large datasets at scale. It brings the following benefits:
- Open source: Apache Iceberg is an open source project, which means that it is free to use and can be customized to meet your specific needs. It also has an active community of developers who are continually improving and adding new features to the project.
- Scalability: Apache Iceberg is designed to handle large datasets efficiently. It can partition and organize data across multiple nodes, which helps distribute the workload and speed up data processing.
- Performance: Apache Iceberg has a variety of features to optimize query performance, including columnar storage and compression, as well as query optimizations such as predicate pushdown and partition pruning.
- Flexibility: Apache Iceberg allows you to change how your data is organized so that it can evolve over time without requiring you to rewrite your queries or rebuild your data structures. It also supports multiple data formats and data sources, which makes it easy to integrate with existing systems.
- Reliability: Apache Iceberg ensures data consistency and reliability through its support for transactions. You can track how data changes over time and roll back to historical versions to help you correct issues.
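Partition pruning, mentioned in the performance bullet above, is easy to see in miniature: when data is grouped by a partition key, a filter on that key lets the reader skip whole partitions without scanning them. The sketch below illustrates the principle; it is not Iceberg's metadata-based pruning.

```python
# Sketch of partition pruning: a filter on the partition key means only
# the matching partition is scanned, and all others are skipped.
from collections import defaultdict

def partition_by(rows, key):
    """Group rows into partitions by the value of the partition key."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return parts

def query(parts, key_value):
    # Only the matching partition is read; every other partition is pruned.
    return parts.get(key_value, [])

rows = [
    {"region": "us", "amount": 10},
    {"region": "eu", "amount": 20},
    {"region": "us", "amount": 30},
]
parts = partition_by(rows, "region")
assert sum(r["amount"] for r in query(parts, "us")) == 40
```

In Iceberg the same effect is achieved through table metadata, so the query engine can decide which data files to skip before opening any of them.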
What AWS services support Iceberg?
Apache Iceberg supports popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, and Presto. AWS services such as Amazon Redshift, Amazon Athena, Amazon EMR, and AWS Glue include native support for transactional data lake frameworks, including Apache Iceberg. Apache Iceberg, in combination with supported AWS services, enables a transactional data lake, often built on storage in Amazon S3.
- Amazon Redshift is a fully managed, AI-powered SQL analytics and data warehousing service that helps customers drive business decisions quickly and cost-effectively. Amazon Redshift delivers a Massively Parallel Processing (MPP), multi-cluster, serverless architecture with zero-ETL capabilities that drive near real-time analytics and AI/ML use cases with data sharing. Customers can use Amazon Redshift to query Apache Iceberg tables in the data lake, accessed through the AWS Glue Data Catalog, and join them with their tables in the data warehouse for comprehensive analytics.
- Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives. Athena provides native support for read, time travel, write, and DDL queries for Apache Iceberg tables that use the Apache Parquet format for data and the AWS Glue catalog for their metastore.
- Amazon EMR is a big data solution for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Hadoop, Presto, and Hive. Starting with Amazon EMR 6.5.0, you can use Apache Spark 3 on Amazon EMR clusters with the Iceberg table format. EMR frameworks including Spark, Trino, Flink and Hive support Apache Iceberg.
- AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. AWS Glue 3.0 and later supports the Apache Iceberg framework for data lakes. You can use AWS Glue to perform read and write operations on Iceberg tables in Amazon S3, or work with Iceberg tables using the AWS Glue Data Catalog. Additional operations, including inserts and updates, as well as all Spark queries and Spark writes, are also supported.
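To make the Spark integration above concrete, here is a hedged configuration sketch for launching `spark-sql` with an Iceberg catalog backed by the AWS Glue Data Catalog and Amazon S3. The catalog name (`glue_catalog`) and the S3 warehouse path are placeholders you would replace with your own values, and exact settings vary by Iceberg and Spark version.

```shell
# Sketch: spark-sql configured to use Iceberg with the AWS Glue Data Catalog.
# "glue_catalog" and the s3:// warehouse path are placeholder values.
spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
  --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.glue_catalog.warehouse=s3://your-bucket/warehouse
```

With a catalog configured this way, tables created under `glue_catalog` are registered in the AWS Glue Data Catalog and their data and metadata files are written to the S3 warehouse location, which is the setup the Athena, EMR, and AWS Glue bullets above assume.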