AWS Glue Data Catalog offers advanced automatic optimization for Apache Iceberg tables

Posted on: Dec 19, 2024

AWS Glue Data Catalog now offers advanced automatic optimization for Apache Iceberg tables. This update adds compaction of delete files, support for nested data types, partial progress commits, and support for partition evolution, making it easier to maintain consistently performant transactional data lakes. These features address a challenge faced by customers who continuously ingest streaming data into Apache Iceberg tables, which produces a large number of delete files that track changes in data files.

With this new capability, Glue Data Catalog continuously monitors table partitions for positional and equality delete files, initiates the compaction process, and regularly commits partial progress to reduce conflicts. Glue Data Catalog optimizers now support schema evolution, such as reordering or renaming columns, as well as partition spec evolution. In addition, Glue Data Catalog has expanded support for heavily nested complex data and for the Parquet compression codecs zstd, brotli, lz4, gzip, and snappy. Enabling automatic compaction reduces delete files and metadata overhead on your Iceberg tables and improves query performance. These new features are automatically applied to existing and new Glue Data Catalog optimizers.
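
To make the idea of delete-file compaction concrete, here is a toy sketch in plain Python. It is not how Glue Data Catalog or Iceberg implements compaction: the function name `compact_deletes`, the in-memory representation of data files, and the file names are all hypothetical, and the sketch ignores details of the real table format such as sequence numbers, which decide which data files an equality delete applies to. It only illustrates why folding deletes into rewritten data removes work that readers would otherwise repeat on every query.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, object]


def compact_deletes(
    data_files: Dict[str, List[Row]],
    positional_deletes: Set[Tuple[str, int]],
    equality_deletes: List[Row],
) -> List[Row]:
    """Return only the live rows, dropping every row marked as deleted.

    positional_deletes holds (file path, row position) pairs.
    equality_deletes holds column/value predicates; a row is deleted
    when it matches every column in at least one predicate.
    """

    def matches_equality_delete(row: Row) -> bool:
        # Skip empty predicates so they cannot match every row.
        return any(
            pred and all(row.get(col) == val for col, val in pred.items())
            for pred in equality_deletes
        )

    live_rows: List[Row] = []
    for path, rows in data_files.items():
        for position, row in enumerate(rows):
            if (path, position) in positional_deletes:
                continue  # removed by a positional delete
            if matches_equality_delete(row):
                continue  # removed by an equality delete
            live_rows.append(row)
    return live_rows


# Hypothetical example data: two data files and two kinds of deletes.
files = {
    "a.parquet": [{"id": 1}, {"id": 2}, {"id": 3}],
    "b.parquet": [{"id": 4}, {"id": 5}],
}
compacted = compact_deletes(
    files,
    positional_deletes={("a.parquet", 1)},  # second row of a.parquet
    equality_deletes=[{"id": 5}],           # any row where id == 5
)
print([row["id"] for row in compacted])
```

Before compaction, every reader has to apply both delete sets while scanning. After compaction, the rewritten data contains only live rows and the delete files are no longer needed.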

In addition to the AWS Management Console, customers can use the AWS CLI or AWS SDKs to automate optimization for Apache Iceberg tables. The feature is available in 14 AWS Regions: US East (N. Virginia, Ohio), US West (Oregon), Europe (Ireland, London, Frankfurt, Stockholm), Canada (Central), Asia Pacific (Tokyo, Seoul, Mumbai, Singapore, Sydney), and South America (São Paulo). To learn more, read the blog and visit the AWS Glue Data Catalog documentation.
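
As a minimal sketch of the SDK route, the Python snippet below enables a compaction optimizer on one table through the boto3 Glue client's `create_table_optimizer` call. The helper names `build_compaction_request` and `enable_compaction`, and the account ID, database, table, and role ARN shown in the usage comment, are placeholders assumed for illustration. The IAM role must already exist with the permissions the optimizer needs; check the AWS Glue Data Catalog documentation for the exact requirements.

```python
from typing import Any, Dict


def build_compaction_request(
    catalog_id: str, database: str, table: str, role_arn: str
) -> Dict[str, Any]:
    """Build the keyword arguments for Glue's CreateTableOptimizer call."""
    return {
        "CatalogId": catalog_id,  # AWS account ID that owns the catalog
        "DatabaseName": database,
        "TableName": table,
        "Type": "compaction",
        "TableOptimizerConfiguration": {
            "roleArn": role_arn,  # role the optimizer assumes
            "enabled": True,
        },
    }


def enable_compaction(request: Dict[str, Any], region_name: str) -> Dict[str, Any]:
    """Send the request to AWS Glue. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the request builder works without boto3

    glue = boto3.client("glue", region_name=region_name)
    return glue.create_table_optimizer(**request)


# Usage with placeholder values (not run here because it calls AWS):
#
# request = build_compaction_request(
#     catalog_id="123456789012",
#     database="example_db",
#     table="example_iceberg_table",
#     role_arn="arn:aws:iam::123456789012:role/ExampleGlueOptimizerRole",
# )
# enable_compaction(request, region_name="us-east-1")
```

Keeping the request builder separate from the network call makes it easy to review or test the configuration before applying it across many tables.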