AWS Glue Data Catalog now supports scheduled generation of column-level statistics

Posted on: Nov 13, 2024

AWS Glue Data Catalog now supports the scheduled generation of column-level statistics for Apache Iceberg tables and file formats such as Parquet, JSON, CSV, XML, ORC, and ION. With this launch, you can simplify and automate the generation of statistics by creating a recurring schedule in the Glue Data Catalog. These statistics are integrated with the cost-based optimizer (CBO) from Amazon Redshift Spectrum and Amazon Athena, resulting in improved query performance and potential cost savings.

Previously, to set up a recurring statistics generation schedule, you had to call AWS services using a combination of AWS Lambda and Amazon EventBridge Scheduler. With this new feature, you can provide the recurring schedule, along with a sampling percentage, as additional configuration to the Glue Data Catalog. For each scheduled run, the number of distinct values (NDVs) is collected for Apache Iceberg tables, and additional statistics such as the number of nulls, maximum, minimum, and average length are collected for other file formats. As the statistics are updated, Amazon Redshift and Amazon Athena use them to apply query optimizations such as optimal join ordering and cost-based aggregation pushdown. You have visibility into the status and timing of each statistics generation run, as well as the updated statistics values.
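
As a rough illustration of supplying a schedule and sampling percentage as configuration, the sketch below builds a request for the Glue `CreateColumnStatisticsTaskSettings` API and, optionally, submits it with boto3. The helper names, the database, table, and role values, and the cron expression are all hypothetical, and the parameter names (`Schedule`, `SampleSize`, `ColumnNameList`) reflect our reading of the API, so check them against the AWS Glue API reference before relying on them.

```python
from typing import Any, Dict, List, Optional


def build_stats_schedule_settings(
    database: str,
    table: str,
    role_arn: str,
    schedule: str,
    sample_percent: float = 100.0,
    columns: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a scheduled column-statistics task.

    Hypothetical helper: the keys are assumed to match the Glue
    CreateColumnStatisticsTaskSettings request shape.
    """
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be in the range (0, 100]")

    settings: Dict[str, Any] = {
        "DatabaseName": database,
        "TableName": table,
        "Role": role_arn,            # IAM role Glue assumes for the run
        "Schedule": schedule,        # assumed cron syntax, e.g. "cron(0 2 * * ? *)"
        "SampleSize": sample_percent,
    }
    if columns:
        # Restrict statistics generation to the listed columns.
        settings["ColumnNameList"] = list(columns)
    return settings


def apply_settings(settings: Dict[str, Any]) -> None:
    """Submit the settings to Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    glue = boto3.client("glue")
    # Assumed boto3 method name for CreateColumnStatisticsTaskSettings.
    glue.create_column_statistics_task_settings(**settings)


# Example with placeholder values: a daily run sampling 20% of rows.
example = build_stats_schedule_settings(
    database="sales_db",
    table="orders",
    role_arn="arn:aws:iam::123456789012:role/GlueStatsRole",
    schedule="cron(0 2 * * ? *)",
    sample_percent=20.0,
    columns=["order_id", "customer_id"],
)
```

Keeping the request construction separate from the API call lets the configuration be validated and reviewed before anything is sent to AWS; `apply_settings(example)` would then register the schedule, assuming the role and table exist.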

To get started, you can schedule statistics generation using the AWS Glue Data Catalog console or AWS Glue APIs. Support for scheduled generation of AWS Glue Data Catalog statistics is generally available in all AWS Regions where Amazon EventBridge Scheduler is available. Visit the AWS Glue Data Catalog documentation to learn more.