Build an Apache Iceberg data lake using Amazon Athena, Amazon EMR, and AWS Glue
Most businesses store their critical data in a data lake, where you can bring data from various sources into centralized storage. The data is processed by specialized big data compute engines, such as Amazon Athena for interactive queries, Amazon EMR for Apache Spark applications, Amazon SageMaker for machine learning, and Amazon QuickSight for data visualization.
Apache Iceberg is an open-source table format for data stored in data lakes. It is optimized for data access patterns in Amazon Simple Storage Service (Amazon S3) cloud object storage. Iceberg helps data engineers tackle complex challenges in data lakes such as managing continuously evolving datasets while maintaining query performance. Iceberg allows you to do the following:
- Maintain transactional consistency where files can be added, removed, or modified atomically with full read isolation and multiple concurrent writes
- Implement full schema evolution to process safe table schema updates as the table data evolves
- Organize tables into flexible partition layouts with partition evolution, enabling updates to partition schemes as queries and data volume change without relying on physical directories
- Perform row-level update and delete operations to satisfy new regulatory requirements such as the General Data Protection Regulation (GDPR)
- Provide versioned tables and support time travel queries to query historical data and verify changes between updates
- Roll back tables to prior versions to return tables to a known good state in case of any issues
In 2021, AWS teams contributed the Apache Iceberg integration with the AWS Glue Data Catalog to open source, which enables you to use open-source compute engines like Apache Spark with Iceberg on AWS Glue. In 2022, Amazon Athena announced support for Iceberg, and Amazon EMR added support for Iceberg starting with version 6.5.0.
In this post, we show you how to use Amazon EMR Spark to create an Iceberg table, load sample book review data, and use Athena to query the data and perform schema evolution, row-level updates and deletes, and time travel, all coordinated through the AWS Glue Data Catalog.
We use the Amazon Customer Reviews public dataset as our source data. The dataset contains data files in Apache Parquet format on Amazon S3. We load all the book-related Amazon review data as an Iceberg table to demonstrate the advantages of using the Iceberg table format on top of raw Parquet files. The following diagram illustrates our solution architecture.
To set up and test this solution, we complete the following high-level steps:
- Create an S3 bucket.
- Create an EMR cluster.
- Create an EMR notebook.
- Configure a Spark session.
- Load data into the Iceberg table.
- Query the data in Athena.
- Perform a row-level update in Athena.
- Perform a schema evolution in Athena.
- Perform time travel in Athena.
- Consume Iceberg data across Amazon EMR and Athena.
To follow along with this walkthrough, you must have the following:
- An AWS account with a role that has sufficient access to provision the required resources.
Create an S3 bucket
To create an S3 bucket that holds your Iceberg data, complete the following steps:
- On the Amazon S3 console, choose Buckets in the navigation pane.
- Choose Create bucket.
- For Bucket name, enter a name (for this post, we enter
Because S3 bucket names are globally unique, choose a different name when you create your bucket.
- For AWS Region, choose your preferred Region (for this post, we use
- Complete the remaining steps to create your bucket.
- If this is the first time that you’re using Athena to run queries, create another globally unique S3 bucket to hold your Athena query output.
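If you prefer to script the bucket creation instead of using the console, the steps above can be sketched with boto3. This is a minimal sketch, assuming boto3 is installed and your credentials are configured; the bucket name and Region shown are hypothetical placeholders, not the values used in this post.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build keyword arguments for the boto3 S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name, region):
    """Create the bucket. boto3 is imported lazily so the helper above runs without it."""
    import boto3

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket_name, region))


# Hypothetical placeholder values; use your own globally unique name and Region.
print(create_bucket_kwargs("my-iceberg-demo-bucket", "us-west-2"))
```

Call `create_bucket` once for the Iceberg data bucket and, if needed, once more for the Athena query output bucket.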
Create an EMR cluster
Now we’re ready to start an EMR cluster to run Iceberg jobs using Spark.
- On the Amazon EMR console, choose Create cluster.
- Choose Advanced options.
- For Software Configuration, choose your Amazon EMR release version.
Iceberg requires release 6.5.0 or later.
- Select JupyterEnterpriseGateway and Spark as the software to install.
- For Edit software settings, select Enter configuration and enter
- Leave other settings at their default and choose Next.
- You can change the hardware used by the Amazon EMR cluster in this step. In this demo, we use the default setting.
- Choose Next.
- For Cluster name, enter Iceberg Spark Cluster.
- Leave the remaining settings unchanged and choose Next.
- You can configure security settings such as adding an EC2 key pair to access your EMR cluster locally. In this demo, we use the default setting.
- Choose Create cluster.
You’re redirected to the cluster detail page, where you wait for the EMR cluster to transition from
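The console steps above could also be expressed as a request to the Amazon EMR `run_job_flow` API. The following is a sketch under several assumptions: the `iceberg-defaults` classification with `iceberg.enabled` set to `true` is how Amazon EMR documents enabling Iceberg, and is not necessarily the exact configuration entered in this walkthrough; the instance types, instance count, and default role names are illustrative choices.

```python
import json


def supports_iceberg(release_label, minimum=(6, 5, 0)):
    """Return True if an EMR release label such as 'emr-6.5.0' meets the minimum."""
    version = tuple(int(part) for part in release_label.split("-", 1)[1].split("."))
    return version >= minimum


def iceberg_cluster_request(name="Iceberg Spark Cluster", release="emr-6.5.0"):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    if not supports_iceberg(release):
        raise ValueError(f"{release} is older than emr-6.5.0")
    return {
        "Name": name,
        "ReleaseLabel": release,
        "Applications": [{"Name": "Spark"}, {"Name": "JupyterEnterpriseGateway"}],
        # Assumed configuration: the EMR classification that turns Iceberg on.
        "Configurations": [
            {"Classification": "iceberg-defaults", "Properties": {"iceberg.enabled": "true"}}
        ],
        # Assumed hardware; the walkthrough keeps the console defaults.
        "Instances": {
            "MasterInstanceType": "m5.xlarge",
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = iceberg_cluster_request()
print(json.dumps(request, indent=2))
# To launch: boto3.client("emr").run_job_flow(**request)
```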
Create an EMR notebook
When the cluster is active and in the Waiting state, we’re ready to run Spark programs in the cluster. For this demo, we use an EMR notebook to run Spark commands.
- On the Amazon EMR console, choose Notebooks in the navigation pane.
- Choose Create notebook.
- For Notebook name, enter a name (for this post, we enter iceberg-spark-notebook).
- For Cluster, select Choose an existing cluster and choose Iceberg Spark Cluster.
- For AWS service role, choose Create a new role to create EMR_Notebook_DefaultRole or choose a different role to access resources in the notebook.
- Choose Create notebook.
You’re redirected to the notebook detail page.
- Choose Open in JupyterLab next to your notebook.
- Choose to create a new notebook.
- Under Notebook, choose Spark.
Configure a Spark session
In your notebook, run the following code:
This sets the following Spark session configurations:
- spark.sql.catalog.demo – Registers a Spark catalog named demo, which uses the Iceberg Spark catalog plugin
- spark.sql.catalog.demo.catalog-impl – The demo Spark catalog uses AWS Glue as the physical catalog to store Iceberg database and table information
- spark.sql.catalog.demo.warehouse – The demo Spark catalog stores all Iceberg metadata and data files under the root path
- spark.sql.extensions – Adds support to Iceberg Spark SQL extensions, which allows you to run Iceberg Spark procedures and some Iceberg-only SQL commands (you use this in a later step)
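The four settings listed above can be assembled as follows. This sketch generates a `%%configure` cell for an EMR notebook; the class names are the standard Iceberg ones for the Spark catalog plugin, the AWS Glue catalog, and the Spark SQL extensions, while the warehouse path is a placeholder you must replace with your own bucket.

```python
import json


def iceberg_session_conf(warehouse_path, catalog="demo"):
    """Return the Spark settings that register an Iceberg catalog backed by AWS Glue."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.warehouse": warehouse_path,
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
    }


# Placeholder warehouse path; point it at the S3 bucket you created earlier.
conf = iceberg_session_conf("s3://<your-iceberg-bucket>/")

# Print a cell you can paste into the notebook; -f restarts the session with these settings.
print("%%configure -f")
print(json.dumps({"conf": conf}, indent=2))
```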
Load data into the Iceberg table
In our Spark session, run the following commands to load data:
Iceberg format v2 is needed to support row-level updates and deletes. See Format Versioning for more details.
It may take up to 15 minutes for the commands to complete. When it’s complete, you should be able to see the table on the AWS Glue console, under the reviews database, with the table_type property shown as
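One way to express the load is as two Spark SQL statements: create the database, then create a format v2 Iceberg table from the Parquet files. This is a sketch rather than the exact commands used in this post. The source path is a placeholder, and the Python wrapper is for illustration only; the generated SQL text can be passed to `spark.sql` in whichever notebook kernel you chose.

```python
def load_statements(source_path, table="demo.reviews.book_reviews"):
    """Return Spark SQL that creates the database and a format v2 Iceberg table."""
    database = table.rsplit(".", 1)[0]
    return [
        f"CREATE DATABASE IF NOT EXISTS {database}",
        (
            f"CREATE TABLE {table} USING iceberg "
            # Format v2 is what enables row-level updates and deletes.
            "TBLPROPERTIES ('format-version'='2') "
            f"AS SELECT * FROM parquet.`{source_path}`"
        ),
    ]


def run_all(spark, statements):
    """Run each statement on an existing Spark session."""
    for statement in statements:
        spark.sql(statement)


# Placeholder path; substitute the S3 location of the book review Parquet files.
statements = load_statements("s3://<source-bucket>/<books-prefix>/")
for statement in statements:
    print(statement)
```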
Query in Athena
Navigate to the Athena console and choose Query editor. If this is your first time using the Athena query editor, you need to configure it to store query results in the S3 bucket you created earlier.
The book_reviews table is available for querying. Run the following query:
The following screenshot shows the first five records from the table being displayed.
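The same preview query can be submitted programmatically. The following sketch builds the arguments for the Athena `start_query_execution` API; the output location is a placeholder for the query results bucket, and the `LIMIT 5` mirrors the five records shown in the console.

```python
def athena_query_request(sql, output_location, database="reviews"):
    """Build keyword arguments for boto3's Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


PREVIEW_SQL = "SELECT * FROM reviews.book_reviews LIMIT 5"

# Placeholder output location; use the bucket that holds your Athena query output.
request = athena_query_request(PREVIEW_SQL, "s3://<your-athena-results-bucket>/")
print(request["QueryString"])
# To submit: boto3.client("athena").start_query_execution(**request)
```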
Perform a row-level update in Athena
In the next few steps, let’s focus on a record in the table with review ID RZDVOUQG1GBG7. Currently, it has no total votes when we run the following query:
Let’s update the total_votes value to 2 using the following query:
After your update command runs successfully, run the following query and note the updated result showing a total of two votes:
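The lookup and update for this record can be sketched as the following Athena SQL. The column name `review_id` is an assumption based on the dataset's schema; `total_votes` and the review ID come from this walkthrough.

```python
def sql_literal(value):
    """Quote a string for SQL, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def select_votes(review_id, table="reviews.book_reviews"):
    """Return a query that reads the vote count for one review."""
    return f"SELECT total_votes FROM {table} WHERE review_id = {sql_literal(review_id)}"


def update_votes(review_id, total_votes, table="reviews.book_reviews"):
    """Return a row-level UPDATE that sets the vote count for one review."""
    return (
        f"UPDATE {table} SET total_votes = {int(total_votes)} "
        f"WHERE review_id = {sql_literal(review_id)}"
    )


print(select_votes("RZDVOUQG1GBG7"))
print(update_votes("RZDVOUQG1GBG7", 2))
```

Run the SELECT before and after the UPDATE to confirm that the value changed.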
Athena enforces ACID transaction guarantees for all write operations against an Iceberg table. This is done through the Iceberg format’s optimistic locking specification. When concurrent attempts are made to update the same record, a commit conflict occurs. In this scenario, Athena displays a transaction conflict error, as shown in the following screenshot.
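To build intuition for how optimistic locking produces a commit conflict, consider the following toy model. It is not Iceberg's or Athena's implementation; the class and exception names are hypothetical. Each writer remembers the table version it read, and a commit succeeds only if that version is still current.

```python
class CommitConflict(Exception):
    """Raised when a writer commits against a stale table version."""


class TablePointer:
    """Stands in for the catalog entry that points at the current table version."""

    def __init__(self):
        self.version = 0
        self.rows = {}

    def commit(self, base_version, changes):
        # The swap succeeds only if nobody else committed since this writer read the table.
        if base_version != self.version:
            raise CommitConflict(
                f"based on v{base_version}, but the table is at v{self.version}"
            )
        self.rows = {**self.rows, **changes}
        self.version += 1
        return self.version


table = TablePointer()
base_a = table.version  # writer A reads the table
base_b = table.version  # writer B reads the same version

table.commit(base_a, {"RZDVOUQG1GBG7": 2})  # A commits first and wins
try:
    table.commit(base_b, {"RZDVOUQG1GBG7": 3})  # B is now stale
    conflict = False
except CommitConflict as err:
    conflict = True
    print("conflict:", err)
```

The losing writer never overwrites the winner's change; it has to re-read the table and try again.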
Delete queries work in a similar way; see DELETE for more details.
Perform a schema evolution in Athena
Suppose the review suddenly goes viral and gets 10 billion votes:
Based on the AWS Glue table information, total_votes is an integer column. If you try to update it to 10 billion, which is greater than the maximum allowed integer value, you get an error reporting a type mismatch.
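The arithmetic behind this error is straightforward. An Iceberg integer column is a 32-bit signed value, so 10 billion does not fit, whereas a 64-bit BIGINT column has ample room:

```python
INT_MAX = 2**31 - 1      # largest 32-bit signed integer: 2,147,483,647
BIGINT_MAX = 2**63 - 1   # largest 64-bit signed integer
votes = 10_000_000_000   # 10 billion

fits_in_int = votes <= INT_MAX
fits_in_bigint = votes <= BIGINT_MAX
print(f"fits in INT: {fits_in_int}, fits in BIGINT: {fits_in_bigint}")
```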
Iceberg supports most schema evolution features as metadata-only operations, which don’t require a table rewrite. This includes adding, dropping, renaming, and reordering columns, and promoting column types. To solve this issue, you can change the integer column total_votes to a BIGINT type by running the following statement:
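A type promotion of this kind can be sketched with Athena's `ALTER TABLE ... CHANGE COLUMN` syntax. The helper below is hypothetical and only allows two of the widening promotions defined by the Iceberg specification (int to bigint and float to double); decimal precision widening is omitted for brevity.

```python
# Widening promotions that Iceberg treats as safe, metadata-only changes.
SAFE_PROMOTIONS = {("int", "bigint"), ("float", "double")}


def promote_column(table, column, old_type, new_type):
    """Return an ALTER TABLE statement that widens a column's type."""
    pair = (old_type.lower(), new_type.lower())
    if pair not in SAFE_PROMOTIONS:
        raise ValueError(f"{old_type} -> {new_type} is not a supported promotion")
    # CHANGE COLUMN takes the old name, the new name, and the new type.
    return f"ALTER TABLE {table} CHANGE COLUMN {column} {column} {pair[1]}"


statement = promote_column("reviews.book_reviews", "total_votes", "int", "BIGINT")
print(statement)
```

Because the helper rejects narrowing changes such as bigint to int, it cannot generate a statement that would lose data.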
You can now update the value successfully:
Querying the record now gives us the expected result in
Perform time travel in Athena
In Iceberg, the transaction history is retained, and each transaction commit creates a new version. You can perform time travel to look at a historical version of a table. In Athena, you can use the following syntax to travel to a point in time after the first version was committed:
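A time travel query can be sketched as follows. The clause shown, `FOR TIMESTAMP AS OF`, is the form documented for Athena engine version 3; earlier engine versions used `FOR SYSTEM_TIME AS OF`, so check the documentation for the engine version you run. The timestamp in the example is arbitrary, and `review_id` is an assumed column name.

```python
from datetime import datetime, timezone


def time_travel_query(table, as_of, where=None):
    """Return a query that reads the table as it was at a timezone-aware datetime."""
    stamp = as_of.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    sql = f"SELECT * FROM {table} FOR TIMESTAMP AS OF TIMESTAMP '{stamp} UTC'"
    if where:
        sql += f" WHERE {where}"
    return sql


# Arbitrary example instant; pick one between two of your own commits.
as_of = datetime(2022, 6, 1, 10, 0, 0, tzinfo=timezone.utc)
query = time_travel_query("reviews.book_reviews", as_of, "review_id = 'RZDVOUQG1GBG7'")
print(query)
```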
Consume Iceberg data across Amazon EMR and Athena
One of the most important features of a data lake is that different systems can work together seamlessly, in this case through the open-source Iceberg table format. After all the operations are performed in Athena, let’s go back to Amazon EMR and confirm that Amazon EMR Spark can consume the updated data.
First, run the same Spark SQL and see if you get the same result for the review used in the example:
Spark shows 10 billion total votes for the review.
Check the transaction history of the operation in Athena through Spark Iceberg’s history system table:
This shows three transactions: the initial data load and the two updates you ran in Athena.
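Iceberg exposes the history as a metadata table that you address by appending `.history` to the table name. A minimal sketch of the Spark SQL, ordered by commit time so that the oldest transaction appears first:

```python
def history_query(table="demo.reviews.book_reviews"):
    """Return Spark SQL that lists the table's snapshot history, oldest first."""
    return (
        "SELECT made_current_at, snapshot_id, parent_id, is_current_ancestor "
        f"FROM {table}.history ORDER BY made_current_at"
    )


query = history_query()
print(query)
# In the notebook: spark.sql(query).show(truncate=False)
```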
Iceberg offers a variety of Spark procedures to optimize the table. For example, you can run an expire_snapshots procedure to remove old snapshots and free up storage space in Amazon S3:
Note that, after running this procedure, time travel can no longer be performed against expired snapshots.
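A call to the procedure can be sketched as follows, using the `table`, `older_than`, and `retain_last` arguments that the Iceberg Spark procedure accepts. The cutoff timestamp is an arbitrary example; choose one that is later than the snapshots you want to expire.

```python
from datetime import datetime


def expire_snapshots_call(catalog, table, older_than, retain_last=1):
    """Return a Spark SQL CALL that expires snapshots older than a cutoff."""
    stamp = older_than.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{stamp}', "
        f"retain_last => {int(retain_last)})"
    )


# Arbitrary example cutoff; retain_last => 1 keeps the most recent snapshot.
call = expire_snapshots_call("demo", "reviews.book_reviews", datetime(2022, 6, 2, 0, 0, 0))
print(call)
# In the notebook: spark.sql(call).show()
```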
Examine the history system table again and notice that it shows you only the most recent snapshot.
Running the following query in Athena results in an error “No table snapshot found before timestamp…” as older snapshots were deleted, and you are no longer able to time travel to the older snapshot:
To avoid incurring ongoing costs, complete the following steps to clean up your resources:
- Run the following code in your notebook to drop the AWS Glue table and database:
- On the Amazon EMR console, choose Notebooks in the navigation pane.
- Select the notebook iceberg-spark-notebook and choose Delete.
- Choose Clusters in the navigation pane.
- Select the cluster Iceberg Spark Cluster and choose Terminate.
- Delete the S3 buckets and any other resources that you created as part of the prerequisites for this post.
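The notebook cleanup in the first step above can be sketched as two Spark SQL statements. The table has to be dropped before its database, so the helper returns them in that order.

```python
def cleanup_statements(table="demo.reviews.book_reviews"):
    """Return Spark SQL that drops the Iceberg table and then its database."""
    database = table.rsplit(".", 1)[0]
    # Order matters: the table is dropped before the database that contains it.
    return [f"DROP TABLE {table}", f"DROP DATABASE {database}"]


statements = cleanup_statements()
for statement in statements:
    print(statement)
# In the notebook: run spark.sql(statement) for each statement, in order.
```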
In this post, we showed you an example of using Amazon S3, AWS Glue, Amazon EMR, and Athena to build an Iceberg data lake on AWS. An Iceberg table can seamlessly work across two popular compute engines, and you can take advantage of both to design your customized data production and consumption use cases.
With AWS Glue, Amazon EMR, and Athena, you can already use many features through AWS integrations, such as SageMaker Athena integration for machine learning, or QuickSight Athena integration for dashboard and reporting. AWS Glue also offers the Iceberg connector, which you can use to author and run Iceberg data pipelines.
In addition, Iceberg supports a variety of other open-source compute engines that you can choose from. For example, you can use Apache Flink on Amazon EMR for streaming and change data capture (CDC) use cases. The strong transaction guarantee and the efficient row-level update, delete, time travel, and schema evolution experience offered by Iceberg provide a sound foundation for users to unlock the power of big data.
About the Authors
Kishore Dhamodaran is a Senior Solutions Architect at AWS. Kishore helps strategic customers with their cloud enterprise strategy and migration journey, leveraging his years of industry and cloud experience.
Mohit Mehta is a Principal Architect at AWS with expertise in AI/ML and data analytics. He holds 12 AWS certifications and is passionate about helping customers implement cloud enterprise strategies for digital transformation. In his free time, he trains for marathons and plans hikes across major peaks around the world.
Giovanni Matteo Fumarola is the Engineering Manager of the Athena Data Lake and Storage team. He is an Apache Hadoop Committer and PMC member. He has been focusing on the big data analytics space since 2013.
Jared Keating is a Senior Cloud Consultant with AWS Professional Services. Jared assists customers with their cloud infrastructure, compliance, and automation requirements, drawing from his 20+ years of IT experience.