Posted On: Nov 15, 2023

AWS Glue Data Catalog now supports automatic compaction of Apache Iceberg tables, making it easier for you to keep your transactional data lakes performant. Enabling automatic compaction reduces metadata overhead on your Iceberg tables and improves query performance.

Apache Iceberg is an open table format that provides fast query performance over large tables in data lakes. Apache Iceberg tracks the data files of a table in its metadata on Amazon S3. As more changes are made to a table, more data files are created, and queries can become less efficient. To improve performance and control cost, organizations had to create custom data pipelines that periodically compact small files. Building these custom pipelines is time-consuming and expensive. This launch provides automatic compaction of Apache Iceberg tables in AWS Glue Data Catalog. Once enabled, AWS Glue Data Catalog continuously monitors new data writes, tracks the small files in the underlying Amazon S3 storage, and automatically triggers compaction jobs in the background with no additional input from you. You can now get an always-optimized Amazon S3 layout for your Iceberg tables, which results in faster read performance on data lakes.

In addition to the AWS console, you can use the AWS CLI or AWS SDKs to automate enabling compaction for Apache Iceberg tables. For more details, please go here.
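
As a rough illustration of the SDK route, the following is a minimal sketch using the AWS SDK for Python (boto3) and the Glue `create_table_optimizer` operation. The account ID, database name, table name, and IAM role ARN are hypothetical placeholders, not values from this announcement, and the role is assumed to already have the permissions compaction needs. Treat it as a starting point and confirm the parameters against the current AWS Glue documentation.

```python
from typing import Any, Dict


def build_compaction_request(
    catalog_id: str, database: str, table: str, role_arn: str
) -> Dict[str, Any]:
    """Build the keyword arguments for Glue's CreateTableOptimizer call."""
    return {
        "CatalogId": catalog_id,      # AWS account ID that owns the catalog
        "DatabaseName": database,
        "TableName": table,
        "Type": "compaction",         # optimizer type for Iceberg compaction
        "TableOptimizerConfiguration": {
            "roleArn": role_arn,      # role that Glue assumes to run compaction
            "enabled": True,
        },
    }


def enable_compaction(
    glue_client: Any, catalog_id: str, database: str, table: str, role_arn: str
) -> Dict[str, Any]:
    """Enable automatic compaction on one Iceberg table via the given client."""
    request = build_compaction_request(catalog_id, database, table, role_arn)
    return glue_client.create_table_optimizer(**request)


def main() -> None:
    # boto3 is imported here so the helpers above can be used without it.
    import boto3

    # All identifiers below are hypothetical placeholders; replace them.
    glue = boto3.client("glue", region_name="us-east-1")
    enable_compaction(
        glue,
        catalog_id="123456789012",
        database="example_db",
        table="example_iceberg_table",
        role_arn="arn:aws:iam::123456789012:role/ExampleCompactionRole",
    )
```

Calling `main()` with real values sends a single request for one table. To enable compaction across many tables, the same helper could be called in a loop over table names.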

Automatic compaction for Iceberg tables is available in Asia Pacific (Tokyo), US East (N. Virginia), US East (Ohio), US West (Oregon), and Europe (Ireland). To learn more, read the blog and visit the AWS Glue Data Catalog documentation.