Apache Iceberg: An Introduction from Rackspace on Running the New Open Table Format on AWS
By Chaitanya Varma Mudundi, Professional Services Big Data Engineer – Rackspace
Data-driven decision making is accelerating and defining the way organizations work. With this transformation, there has been a rapid adoption of data lakes across the industry.
To fuel this transformation, data lakes have evolved over the last decade. Apache Hive is a standard for data lakes, but while it solves some data processing problems, it falls short of several objectives for next-generation data processing.
In this post, I will discuss the drawbacks of existing data lake architecture, what Apache Iceberg is, and how it overcomes the shortcomings of the current state of data lakes. Additionally, I will review design differences between Apache Hive and Iceberg.
As a Professional Services Big Data Engineer at Rackspace Technology, I have architected enterprise-level solutions that include developing data lakes, designing data warehouses, and implementing event-driven architectures.
Rackspace is an AWS Premier Tier Services Partner and Managed Cloud Services Provider (MSP) that helps businesses tap the power of Amazon Web Services (AWS) from a trusted partner with a track record of managing business-critical applications.
Why Use Apache Iceberg?
Apache Iceberg is an open table format, originally designed at Netflix in order to overcome the challenges faced when using existing data lake formats like Apache Hive. It was open sourced in 2018 as an Apache Incubator project and graduated from the incubator in 2020.
The key problems Iceberg tries to address are:
- Using data lakes at scale (petabyte-scalable tables).
- Data and schema evolution.
- Consistent concurrent writes.
Iceberg is a new table format designed to address the issues that arise in Apache Hive. When Hive is used at scale with large datasets, many of those issues stem from its design.
Some of the key challenges faced by Apache Hive are:
- Data changes for large datasets are inefficient. When changes are made to the existing data, like when updates or deletes are performed, the changes cannot be handled at a file level in Apache Hive.
- This is inefficient for large partitions since the complete partitions need to be rewritten to a new location frequently, for each update or delete.
- Concurrent writes on the same dataset are not a safe operation in Apache Hive. Data can be lost because the last write operation wins, and queries that run during these concurrent writes can return inconsistent results.
- Listing every file in a partition takes a long time for large tables. In Apache Hive, the files in a partition are scanned at runtime, while Iceberg keeps a manifest file, which improves performance.
- Querying data in Apache Hive slows down as datasets grow, because of the directory structure it uses to store partitions. When many partitions are present, this adds further overhead to queries. Users even need to keep track of the physical layout of tables while writing queries.
- Changing the partitioning of an existing Apache Hive table requires recreating the table and mapping it to a new location, because partitions are defined when the table is created and cannot be modified as the table grows. For example, if your raw zone was
\raw\year\month\day and you wanted to change it to
\raw\year\month\day\hour then you would need to rebuild your entire raw zone partition structure.
Design Benefits of Apache Iceberg
Apache Iceberg is designed to overcome the drawbacks faced when using Apache Hive. The key difference is in how Iceberg stores records in object storage.
Figure 1 – Apache Iceberg table architecture.
- The design of Iceberg differs from that of Apache Hive: in Iceberg, both the metadata layer and the data layer are managed and maintained on storage such as HDFS or Amazon Simple Storage Service (Amazon S3).
- It uses a file structure (metadata and manifest files) that is managed in the metadata layer. Whenever data is added, the commit is stored as an event on the data layer. The metadata layer manages the snapshot list, and Iceberg supports integration with multiple query engines.
- Concurrent commits on the same dataset are atomic, which Iceberg ensures through optimistic concurrency control.
- Any update or delete to the data layer creates a new snapshot in the metadata layer, derived from the previous latest snapshot and chained to it. This enables faster query processing, as the query provided by users pulls data at the file level rather than the partition level.
- Schema and partition changes on an existing table can be performed with ease, as these changes are tracked as separate components in snapshots on the metadata layer. When a partition is changed, Iceberg stores the previous and latest partitions as separate plans. When a query is performed on old data, Iceberg does a split plan and pulls data with different partitions from multiple snapshots.
- Iceberg uses a snapshot-based querying model, where data files are mapped using manifest and metadata files. Even when data grows at scale, querying these tables performs well, as data is accessible at the file level. These mappings are stored in an Iceberg catalog.
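The optimistic concurrency control mentioned in the list above can be illustrated with a toy model. This is a minimal sketch and not Iceberg's implementation: `ToyCatalog`, `CommitConflict`, and `append_file` are hypothetical names, and the catalog is reduced to an in-memory pointer to the current snapshot. A writer reads the current snapshot, prepares a new one, and commits only if the pointer has not moved in the meantime; otherwise it retries from the latest state.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed between read and commit."""


class ToyCatalog:
    """Hypothetical stand-in for a catalog that points at the current snapshot."""

    def __init__(self):
        self._lock = threading.Lock()  # makes the compare-and-swap itself atomic
        self.current_snapshot = 0      # ID of the snapshot the table points to
        self.snapshots = {0: []}       # snapshot ID -> list of data files

    def load(self):
        """Return the current snapshot ID and a copy of its file list."""
        with self._lock:
            sid = self.current_snapshot
            return sid, list(self.snapshots[sid])

    def compare_and_swap(self, expected_id, new_files):
        """Commit a new snapshot only if nobody committed since expected_id."""
        with self._lock:
            if self.current_snapshot != expected_id:
                raise CommitConflict(
                    f"expected snapshot {expected_id}, found {self.current_snapshot}"
                )
            new_id = self.current_snapshot + 1
            self.snapshots[new_id] = new_files
            self.current_snapshot = new_id
            return new_id


def append_file(catalog, path, max_retries=5):
    """Append one data file, retrying from the latest snapshot on conflict."""
    for _ in range(max_retries):
        base_id, files = catalog.load()
        try:
            return catalog.compare_and_swap(base_id, files + [path])
        except CommitConflict:
            continue  # another writer won the race; re-read and try again
    raise CommitConflict("gave up after repeated conflicts")
```

Older snapshots are never modified in this model, which is also what makes reading a previous table state possible.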
Key Features of Apache Iceberg
Apache Iceberg supports flexible SQL commands to merge new data, update existing rows, and perform targeted deletes on tables. Due to its architecture under the hood, Iceberg supports execution of analytical queries on data lakes.
Adding, renaming, and reordering columns works well, and schema changes never require rewriting the complete table. Columns are uniquely identified in the metadata layer by IDs rather than by the name of the column itself.
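To see why ID-based column tracking avoids table rewrites, consider the following toy model. It is a sketch under simplifying assumptions, not Iceberg's file format: rows are plain dictionaries keyed by column ID, and `read_rows`, `rename_column`, and `add_column` are hypothetical helpers.

```python
# Toy model: data files store values keyed by column ID,
# while the schema maps each ID to its current name.
schema = {1: "id", 2: "ts", 3: "msg"}
data_file = [{1: 100, 2: "2021-01-01", 3: "hello"}]  # written once, never rewritten


def read_rows(schema, rows):
    """Project stored rows through the current schema (unknown IDs read as None)."""
    return [{name: row.get(col_id) for col_id, name in schema.items()} for row in rows]


def rename_column(schema, old, new):
    """Rename by changing only the name attached to an existing ID."""
    return {col_id: (new if name == old else name) for col_id, name in schema.items()}


def add_column(schema, name):
    """Add a column under a fresh ID (simplified: no dropped IDs to avoid here)."""
    return {**schema, max(schema) + 1: name}
```

After `rename_column(schema, "msg", "message")`, the same stored rows are read back under the new name, and a column added with `add_column` reads as `None` for rows written before it existed. In both cases `data_file` is left untouched.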
Partitioning in Iceberg is dynamic. For example, if an event time (timestamp) column is present in the table, the table can be partitioned by date derived from that column. Iceberg manages the relationship between the event timestamp column and the date, so writers and readers do not have to maintain the partitioning themselves. Additional levels of partitioning can be added, and these are tracked in snapshots via metadata files.
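The idea of deriving a date partition from a timestamp column can be sketched as follows. This is an illustration only: `day_partition` and `partition_rows` are hypothetical helpers, the column name `event_ts` is assumed, and timestamps are assumed to be timezone-aware and bucketed by UTC day.

```python
from datetime import date, datetime, timezone

EPOCH = date(1970, 1, 1)


def day_partition(event_ts: datetime) -> int:
    """Days since 1970-01-01 (UTC) for a timezone-aware timestamp."""
    return (event_ts.astimezone(timezone.utc).date() - EPOCH).days


def partition_rows(rows):
    """Group rows by the derived day value; writers never supply a date column."""
    parts = {}
    for row in rows:
        parts.setdefault(day_partition(row["event_ts"]), []).append(row)
    return parts
```

Because the partition value is computed from the timestamp rather than stored as a separate user-managed column, a filter on the timestamp is enough to locate the relevant partitions.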
Time Travel and Rollback
Iceberg supports two types of read options for snapshots, which support time travel and incremental reads.
These are the options supported:
- snapshot-id – selects a specific table snapshot.
- as-of-timestamp – selects the current snapshot at a timestamp, in milliseconds.
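A small helper can build these read options. The sketch below is hypothetical (`time_travel_options` is not part of any library) and assumes a timezone-aware `datetime` for the timestamp case. The resulting dictionary is intended to be passed to a Spark reader, as shown in the trailing comment, where `spark` and the table name `db.events` are assumed.

```python
from datetime import datetime, timezone


def time_travel_options(snapshot_id=None, as_of=None):
    """Build Iceberg read options; exactly one of the two arguments must be set."""
    if (snapshot_id is None) == (as_of is None):
        raise ValueError("pass exactly one of snapshot_id or as_of")
    if snapshot_id is not None:
        return {"snapshot-id": str(snapshot_id)}
    # as-of-timestamp is expressed in milliseconds since the Unix epoch
    millis = int(as_of.timestamp() * 1000)
    return {"as-of-timestamp": str(millis)}


# Intended use with an existing SparkSession named `spark` (assumed):
#   opts = time_travel_options(as_of=datetime(2021, 1, 1, tzinfo=timezone.utc))
#   df = spark.read.format("iceberg").options(**opts).load("db.events")
```

The helper rejects calls that set both arguments because the two options select a snapshot in different ways and should not be combined in one read.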
Apache Iceberg integrates with multiple AWS services, covering query engines, catalogs, and the infrastructure to run on.
AWS supports integrations with the following engines, as well as setting up custom catalogs:
- Spark – Spark 3.0 and AWS client version 2.15.40 support integration with Apache Iceberg.
- Flink – The AWS Flink module supports creation of Iceberg tables from the Flink SQL client.
- Apache Hive – The AWS module, with Hive dependencies included, enables creation of Iceberg tables.
There are multiple options users can choose from to build an Iceberg catalog with AWS.
AWS Glue Catalog
Iceberg supports integration with AWS Glue catalog, where an Iceberg namespace is stored as a Glue database, an Iceberg table is stored as a Glue table, and every Iceberg table version is stored as a Glue table version.
The following is an example for configuring Spark SQL and Glue catalog:
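A minimal sketch of such a configuration is shown below, expressed as a Python helper that assembles the Spark properties and renders them as `--conf` flags. The helper names, the catalog name `my_catalog`, the bucket path, and the lock table name are assumptions for illustration. The property keys and class names follow the Iceberg AWS module, but the lock manager class name has changed between Iceberg releases, so check it against the version you run.

```python
def glue_catalog_conf(catalog, warehouse, lock_table=None):
    """Spark properties for an Iceberg catalog backed by AWS Glue (sketch)."""
    prefix = f"spark.sql.catalog.{catalog}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }
    if lock_table:
        # Assumption: class name as in recent Iceberg releases; older releases differ.
        conf[f"{prefix}.lock-impl"] = "org.apache.iceberg.aws.dynamodb.DynamoDbLockManager"
        conf[f"{prefix}.lock.table"] = lock_table
    return conf


def to_cli_flags(conf):
    """Render properties as a flat list of --conf key=value arguments."""
    return [arg for key, value in conf.items() for arg in ("--conf", f"{key}={value}")]


# Hypothetical names: replace the bucket and lock table with your own.
flags = to_cli_flags(
    glue_catalog_conf("my_catalog", "s3://my-bucket/my/key/prefix", "myGlueLockTable")
)
print("spark-sql " + " ".join(flags))
```

The Iceberg Spark runtime and the AWS SDK dependencies also need to be on the Spark classpath; how they are supplied depends on the environment and is not shown here.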
For commit locking under concurrent commits, the Glue catalog uses Amazon DynamoDB; for file I/O and storage, it uses Amazon S3.
Amazon DynamoDB Catalog
The Amazon DynamoDB catalog capability is still in the preview stage. DynamoDB Catalog avoids hot partition issues during heavy write traffic to the tables. DynamoDB provides the best performance through optimistic locking when high rates of read and write throughputs are required.
Amazon RDS JDBC Catalog
The JDBC catalog uses a table in a relational database to manage the Iceberg tables. Amazon Relational Database Service (Amazon RDS) can serve as the catalog, which is recommended when the organization already maintains a managed or serverless relational database. This provides easy integration.
Here’s an example to configure Iceberg with RDS as a catalog for the Spark engine:
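A minimal sketch of such a configuration follows, again as a Python helper that assembles the Spark properties. The helper name, the catalog name, the bucket, the RDS endpoint, and the environment variable `ICEBERG_JDBC_PASSWORD` are all hypothetical. The property keys (`catalog-impl`, `uri`, `jdbc.user`, `jdbc.password`) follow the Iceberg JDBC catalog.

```python
import os


def jdbc_catalog_conf(catalog, warehouse, jdbc_uri, user, password):
    """Spark properties for an Iceberg catalog stored in a relational database (sketch)."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.uri": jdbc_uri,
        f"{prefix}.jdbc.user": user,
        f"{prefix}.jdbc.password": password,
    }


# Hypothetical endpoint and credentials; the password is read from the
# environment rather than hard-coded.
rds_conf = jdbc_catalog_conf(
    catalog="my_catalog",
    warehouse="s3://my-bucket/my/key/prefix",
    jdbc_uri="jdbc:mysql://my-db.example.us-west-2.rds.amazonaws.com:3306/default",
    user="admin",
    password=os.environ.get("ICEBERG_JDBC_PASSWORD", "change-me"),
)
for key, value in rds_conf.items():
    # Avoid echoing the secret when printing the configuration.
    shown = "****" if key.endswith("jdbc.password") else value
    print(f"--conf {key}={shown}")
```

The matching JDBC driver for the database engine must also be available on the Spark classpath.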
How to Run Apache Iceberg on AWS
Amazon Athena provides integration with Iceberg, which is currently in preview. To run Iceberg queries with Athena, create a workgroup named “AmazonAthenaIcebergPreview” and run the Iceberg-related queries in this workgroup. The Athena engine currently supports reads, writes, and updates on Iceberg tables.
Amazon EMR 6.5.0 and later have Iceberg dependencies pre-installed, without requiring any additional bootstrap actions. For versions before 6.5.0, these dependencies need to be added through bootstrap actions in order to use Iceberg tables. Amazon EMR provides Spark, Hive, Flink, and Trino, all of which can run Iceberg.
Amazon EKS and Amazon Kinesis
Amazon Elastic Kubernetes Service (Amazon EKS) can run any Spark, Hive, Flink, Trino, and Presto clusters which can be integrated with Iceberg. Similarly, Amazon Kinesis can be integrated with Flink, which has connectivity to use Apache Iceberg.
Apache Iceberg has integrations with various query and execution engines, where the Iceberg tables can be created and managed by these connectors. The engines that support Iceberg are Spark, Flink, Hive, Presto, Trino, Dremio, and Snowflake.
As organizations move towards data-driven decision making, the importance of lake house-style architectures is increasing rapidly. Apache Iceberg is an open table format which can scale and evolve seamlessly while providing key benefits over its predecessor, Apache Hive.
Apache Iceberg is best suited for batch and micro-batch processing of datasets. The growing open source community and integrations from multiple cloud providers make it easier to integrate Iceberg into existing architectures effectively.
For any questions about modernizing your data strategy, feel free to contact Rackspace at firstname.lastname@example.org.
The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this post.