AWS Glue Data Catalog now supports generating statistics for Apache Iceberg tables

Posted on: Jul 9, 2024

AWS Glue Data Catalog now supports generating column-level aggregated statistics for Apache Iceberg tables. These statistics are integrated with the cost-based optimizer (CBO) in Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

Apache Iceberg supports statistics such as null counts, min, and max, but lacks support for generating aggregation statistics such as the number of distinct values (NDV). With this launch, you now have an integrated end-to-end experience in which NDVs are collected on the columns of an Apache Iceberg table and stored in Apache Iceberg Puffin files. Amazon Redshift uses these aggregation statistics to optimize queries by applying the most restrictive filters as early as possible in query processing, thereby limiting memory usage and the number of records read to produce the query results.
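To illustrate why NDVs help an optimizer pick the most restrictive filter, here is a minimal sketch of a textbook selectivity estimate. It is not Amazon Redshift's actual optimizer logic. It assumes equality filters and uniformly distributed values, so that a filter on a column keeps roughly 1/NDV of the rows. The names `EqualityFilter`, `order_filters`, and `estimated_rows` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EqualityFilter:
    """A hypothetical `column = value` predicate with the column's NDV."""
    column: str
    ndv: int  # number of distinct values reported for the column

    def selectivity(self) -> float:
        # Uniformity assumption: every distinct value is equally likely,
        # so an equality filter keeps about 1/NDV of the rows.
        return 1.0 / max(self.ndv, 1)


def order_filters(filters: List[EqualityFilter]) -> List[EqualityFilter]:
    """Return filters sorted so the most restrictive one comes first."""
    return sorted(filters, key=lambda f: f.selectivity())


def estimated_rows(total_rows: int, filters: List[EqualityFilter]) -> float:
    """Estimate surviving rows, treating the filters as independent."""
    rows = float(total_rows)
    for f in filters:
        rows *= f.selectivity()
    return rows
```

With a column of 4 distinct values and another of 1,000,000 distinct values, the sketch places the filter on the high-NDV column first, because it is estimated to discard far more rows. Without NDVs, an optimizer has no basis for that choice.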

To get started, you can generate statistics for an Apache Iceberg table using the AWS Glue console or the AWS Glue APIs. With each run, the AWS Glue Data Catalog computes statistics for the current Iceberg table snapshot and stores them in an Iceberg Puffin file and in the Data Catalog. When you run queries from Amazon Redshift Spectrum, you automatically get the query performance improvements through the built-in integration with Apache Iceberg.
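As a rough sketch of the API route, the following uses the boto3 Glue client's `start_column_statistics_task_run` and `get_column_statistics_task_run` operations, on the assumption that these are the AWS Glue APIs the announcement refers to. The helper names, the database, table, and role values, and the polling interval are hypothetical placeholders, and the IAM role is assumed to already have access to the table's data.

```python
import time


def start_iceberg_stats_run(glue, database, table, role_arn, columns=None):
    """Start a column statistics task run and return its run ID.

    `glue` is a boto3 Glue client (or any object with the same methods).
    """
    request = {
        "DatabaseName": database,
        "TableName": table,
        "Role": role_arn,
    }
    if columns:
        # Restrict the run to specific columns; when omitted, the
        # request carries no column list at all.
        request["ColumnNameList"] = list(columns)
    response = glue.start_column_statistics_task_run(**request)
    return response["ColumnStatisticsTaskRunId"]


def wait_for_stats_run(glue, run_id, poll_seconds=30, max_polls=60):
    """Poll until the run reaches a terminal status and return that status."""
    for _ in range(max_polls):
        run = glue.get_column_statistics_task_run(
            ColumnStatisticsTaskRunId=run_id
        )
        status = run["ColumnStatisticsTaskRun"]["Status"]
        if status in ("SUCCEEDED", "FAILED", "STOPPED"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Statistics run {run_id} did not finish in time")


def example_usage():
    """Hypothetical invocation; every value below is a placeholder."""
    import boto3  # imported here so the helpers above need no AWS setup

    glue = boto3.client("glue", region_name="us-east-2")
    run_id = start_iceberg_stats_run(
        glue,
        database="example_db",
        table="example_iceberg_table",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueStatsRole",
    )
    print(wait_for_stats_run(glue, run_id))
```

Passing the client in as a parameter keeps the helpers easy to exercise against a stub before pointing them at a real account.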

Support for generating AWS Glue Data Catalog statistics is generally available in the following AWS Regions: US East (Ohio), US West (N. California), Europe (Frankfurt), and Asia Pacific (Mumbai). Read the blog post and visit the AWS Glue Data Catalog documentation to learn more.