Posted On: Nov 17, 2023

AWS Glue Data Catalog now supports generating column-level statistics for AWS Glue tables. These statistics are integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings.

With this launch, customers have an integrated end-to-end experience in which statistics on AWS Glue tables are collected, stored in the AWS Glue Data Catalog, and made available to analytics services for improved query planning and execution. The statistics are column-level values such as the number of distinct values, the number of nulls, the maximum, and the minimum, collected for tables stored in file formats such as Parquet, ORC, JSON, ION, CSV, and XML. With statistics, analytics services such as Amazon Athena and Amazon Redshift can optimize queries by applying the most restrictive filters as early as possible in query processing, thereby limiting memory usage and the number of records read to produce the query results.
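To illustrate why these four statistics help a planner, here is a minimal Python sketch of selectivity estimation. It is a toy model under a uniform-distribution assumption, not the actual optimizer logic of Amazon Athena or Amazon Redshift. The `ColumnStats` class, the function names, and the sample table figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical container for the column-level statistics named above."""
    distinct: int    # number of distinct values
    nulls: int       # number of nulls
    minimum: float   # min
    maximum: float   # max


def non_null_fraction(stats: ColumnStats, row_count: int) -> float:
    """Fraction of rows whose value in this column is not null."""
    return 1.0 - stats.nulls / row_count


def equality_selectivity(stats: ColumnStats, row_count: int) -> float:
    """Estimated fraction of rows matching `col = constant`.

    Assumes values are spread evenly over the distinct values.
    """
    if stats.distinct == 0:
        return 0.0
    return non_null_fraction(stats, row_count) / stats.distinct


def less_than_selectivity(stats: ColumnStats, row_count: int, value: float) -> float:
    """Estimated fraction of rows matching `col < value`.

    Assumes values are spread evenly between min and max.
    """
    span = stats.maximum - stats.minimum
    if span <= 0:
        covered = 1.0 if value > stats.minimum else 0.0
    else:
        covered = min(1.0, max(0.0, (value - stats.minimum) / span))
    return non_null_fraction(stats, row_count) * covered


def order_filters(estimates: dict) -> list:
    """Return filter descriptions sorted from most to least restrictive."""
    return sorted(estimates, key=estimates.get)


# Made-up statistics for a hypothetical 1,000,000-row table.
ROW_COUNT = 1_000_000
customer_id = ColumnStats(distinct=50_000, nulls=0, minimum=1, maximum=50_000)
order_total = ColumnStats(distinct=10_000, nulls=100_000, minimum=0.0, maximum=1000.0)

estimates = {
    "order_total < 500": less_than_selectivity(order_total, ROW_COUNT, 500.0),
    "customer_id = 42": equality_selectivity(customer_id, ROW_COUNT),
}
plan = order_filters(estimates)
print(plan)
```

In this sketch the equality filter on `customer_id` is estimated to keep far fewer rows than the range filter on `order_total`, so it is placed first. Without statistics, a planner would have no basis for that choice.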

To get started, users can generate and view statistics for AWS Glue Data Catalog tables using the AWS Glue console or AWS Glue APIs. Customers who run queries from Amazon Athena and Amazon Redshift Spectrum automatically get the query performance improvements through the built-in integration with the AWS Glue Data Catalog.
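As a rough sketch of the API route, the Python helpers below use the boto3 Glue client operations `start_column_statistics_task_run`, `get_column_statistics_task_run`, and `get_column_statistics_for_table`. The helper names, database name, table name, column names, Region, and IAM role ARN are placeholders chosen for illustration. The polling interval and timeout are arbitrary assumptions, and error handling is omitted.

```python
import time


def start_statistics_run(glue, database, table, role_arn, columns=None):
    """Start a column statistics task run and return its run ID.

    If `columns` is omitted, the request does not restrict the column list.
    """
    params = {"DatabaseName": database, "TableName": table, "Role": role_arn}
    if columns:
        params["ColumnNameList"] = list(columns)
    response = glue.start_column_statistics_task_run(**params)
    return response["ColumnStatisticsTaskRunId"]


def wait_for_run(glue, run_id, poll_seconds=15.0, max_polls=120):
    """Poll the task run until it reaches a terminal status, then return it."""
    for _ in range(max_polls):
        response = glue.get_column_statistics_task_run(
            ColumnStatisticsTaskRunId=run_id
        )
        status = response["ColumnStatisticsTaskRun"]["Status"]
        if status in ("SUCCEEDED", "FAILED", "STOPPED"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"Task run {run_id} did not finish in time")


def fetch_statistics(glue, database, table, columns):
    """Return a mapping of column name to its stored statistics data."""
    response = glue.get_column_statistics_for_table(
        DatabaseName=database, TableName=table, ColumnNames=list(columns)
    )
    return {
        item["ColumnName"]: item["StatisticsData"]
        for item in response["ColumnStatisticsList"]
    }


def run_example():
    """Illustrative end-to-end call; all names below are placeholders."""
    import boto3  # needs boto3 installed and AWS credentials configured

    glue = boto3.client("glue", region_name="us-east-1")
    run_id = start_statistics_run(
        glue,
        database="sales_db",  # placeholder database
        table="orders",       # placeholder table
        role_arn="arn:aws:iam::123456789012:role/GlueStatsRole",  # placeholder
    )
    if wait_for_run(glue, run_id) == "SUCCEEDED":
        print(fetch_statistics(glue, "sales_db", "orders", ["order_total"]))
```

Calling `run_example()` against a real account would start a statistics task, wait for it to finish, and print the stored statistics for one column. The helpers take the client as a parameter so they can be exercised with a stub before any AWS resources are touched.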

Support for generating AWS Glue Data Catalog statistics is generally available in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Asia Pacific (Tokyo), and Asia Pacific (Osaka). Read the Athena blog post and visit the AWS Glue Data Catalog documentation to learn more.