AWS Big Data Blog
Understanding Apache Iceberg on AWS with the new technical guide
We’re excited to announce the launch of the Apache Iceberg on AWS technical guide. Whether you are new to Apache Iceberg or already running production workloads on AWS, this comprehensive technical guide offers detailed guidance, from foundational concepts to advanced optimizations, to help you build your transactional data lake with Apache Iceberg on AWS.
Apache Iceberg is an open source table format that simplifies data processing on large datasets stored in data lakes. It brings the familiarity of SQL tables to big data, along with capabilities such as ACID transactions, row-level operations (merge, update, delete), partition evolution, data versioning, incremental processing, and efficient query scan planning. Apache Iceberg integrates seamlessly with popular open source big data processing frameworks such as Apache Spark, Apache Hive, Apache Flink, Presto, and Trino, and it is natively supported by AWS analytics services such as AWS Glue, Amazon EMR, Amazon Athena, and Amazon Redshift.
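To make these capabilities concrete, the following PySpark sketch shows an Iceberg table being created, upserted into with a row-level MERGE, and queried with time travel. It is a minimal sketch, not guidance from the technical guide itself: it assumes a Spark session already configured with the Iceberg runtime and the AWS Glue Data Catalog, and the catalog name glue_catalog, database db, table orders, and S3 path are all hypothetical placeholders.

```python
# Minimal sketch of Iceberg row-level operations and time travel in PySpark.
# Assumes the Spark session is configured with the Iceberg runtime and the
# AWS Glue Data Catalog; glue_catalog, db, orders, and the S3 path are
# hypothetical placeholders.
from datetime import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

# Create an Iceberg table in the Glue Data Catalog, partitioned by day.
spark.sql("""
    CREATE TABLE IF NOT EXISTS glue_catalog.db.orders (
        order_id BIGINT,
        status   STRING,
        amount   DOUBLE,
        order_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(order_ts))
    LOCATION 's3://amzn-s3-demo-bucket/warehouse/db/orders'
""")

# A small batch of change records, standing in for a real CDC feed.
updates = spark.createDataFrame(
    [(1, "SHIPPED", 42.0, datetime(2024, 6, 1, 12, 0))],
    ["order_id", "status", "amount", "order_ts"],
)
updates.createOrReplaceTempView("updates")

# Row-level MERGE: an atomic upsert, committed as a new table snapshot.
spark.sql("""
    MERGE INTO glue_catalog.db.orders AS t
    USING updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET t.status = s.status, t.amount = s.amount
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: read the table as of an earlier point in time (replace the
# timestamp with one that falls after a committed snapshot).
spark.sql("""
    SELECT * FROM glue_catalog.db.orders
    TIMESTAMP AS OF '2024-06-01 13:00:00'
""").show()
```

Every write commits a new immutable snapshot of the table, which is what makes the ACID guarantees and the time-travel query above possible.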
The following diagram illustrates a reference architecture of a transactional data lake with Apache Iceberg on AWS.
AWS customers and data engineers use the Apache Iceberg table format for its many benefits, including its high performance and reliability at scale, to build transactional data lakes and write-optimized solutions on Amazon Simple Storage Service (Amazon S3) with Amazon EMR, AWS Glue, Athena, and Amazon Redshift.
We believe Apache Iceberg adoption on AWS will continue to grow rapidly. This technical guide delivers practical guidance on working with Apache Iceberg on supported AWS services, best practices for cost optimization and performance, and effective monitoring and maintenance policies.
Related resources
- Orca Security’s journey to a petabyte-scale data lake with Apache Iceberg and AWS Analytics
- Choosing an open table format for your transactional data lake on AWS
- Build a serverless transactional data lake with Apache Iceberg, Amazon EMR Serverless, and Amazon Athena
- Load data incrementally from transactional data lakes to data warehouses
- Build a data lake with Apache Flink on Amazon EMR
- Use Apache Iceberg in your data lake with Amazon S3, AWS Glue, and Snowflake
- Apache Iceberg optimization: Solving the small files problem in Amazon EMR
- Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes
About the Authors
Carlos Rodrigues is a Big Data Specialist Solutions Architect at AWS. He helps customers worldwide build transactional data lakes on AWS using open table formats like Apache Iceberg and Apache Hudi. He can be reached via LinkedIn.
Imtiaz (Taz) Sayed is the WW Tech Leader for Analytics at AWS. He is an expert on data engineering and enjoys engaging with the community on all things data and analytics. He can be reached via LinkedIn.
Shana Schipers is an Analytics Specialist Solutions Architect at AWS, focusing on big data. She supports customers worldwide in building transactional data lakes using open table formats like Apache Hudi, Apache Iceberg, and Delta Lake on AWS.