AWS Big Data Blog

Accelerate query performance with Apache Iceberg statistics on the AWS Glue Data Catalog

August 2024: This post was updated with Amazon Athena support.

Today, we are pleased to announce a new capability for the AWS Glue Data Catalog: generating column-level aggregation statistics for Apache Iceberg tables to accelerate queries. These statistics are utilized by the cost-based optimizer (CBO) in Amazon Redshift Spectrum and Amazon Athena, resulting in improved query performance and potential cost savings.

Apache Iceberg is an open table format that provides the capability of ACID transactions on your data lakes. It’s designed to process large analytics datasets and is efficient for even small row-level operations. It also enables useful features such as time-travel, schema evolution, hidden partitioning, and more.

AWS has invested in service integration with Iceberg to enable Iceberg workloads based on customer feedback. One example is the AWS Glue Data Catalog. The Data Catalog is a centralized repository that stores metadata about your organization’s datasets, making the data visible, searchable, and queryable for users. The Data Catalog supports Iceberg tables and tracks the table’s current metadata. It also allows automatic compaction of individual small files produced by each transactional write on tables into a few large files for faster read and scan operations.

In 2023, the Data Catalog announced support for column-level statistics for non-Iceberg tables. That feature collects table statistics used by the query engine’s CBO. Now, the Data Catalog expands this support to Iceberg tables. The Iceberg column statistics that the Data Catalog generates are based on the Puffin spec and stored on Amazon Simple Storage Service (Amazon S3) alongside the other table data. This way, various engines that support Iceberg can utilize and update them.

This post demonstrates how column-level statistics for Iceberg tables work with Redshift Spectrum and Amazon Athena. Furthermore, we showcase the performance benefit of the Iceberg column statistics with the TPC-DS dataset.

How Iceberg table column statistics work

The AWS Glue Data Catalog generates table column statistics using the Theta Sketch algorithm from Apache DataSketches to estimate the number of distinct values (NDV) and stores them in a Puffin file.

For SQL planners, NDV is an important statistic for optimizing query planning. There are a few scenarios where NDV statistics can improve query performance. For example, when joining two tables on a column, the optimizer can use the NDV to estimate the selectivity of the join. If one table has a low NDV for the join column compared to the other table, the optimizer may choose a broadcast join instead of a shuffle join, reducing data movement and improving query performance. Moreover, when more than two tables are joined, the optimizer can estimate the output size of each join and plan an efficient join order. Furthermore, NDV can be used for various optimizations such as GROUP BY, DISTINCT, and COUNT queries.

However, computing an exact NDV requires O(N) space. Theta Sketch is an efficient algorithm that estimates the NDV of a dataset without storing every distinct value in memory or on storage. The key idea behind Theta Sketch is to hash the data into the range 0–1 and then retain only the small portion of hashed values that fall below a threshold (denoted as θ). By analyzing this small subset of data, the Theta Sketch algorithm can provide an accurate estimate of the NDV in the original dataset.

Iceberg’s Puffin file format is designed to store information such as indexes and statistics as blobs. One of the representative blob types is apache-datasketches-theta-v1, a serialized Theta Sketch used to estimate the NDV. Puffin files are linked to a snapshot ID in Iceberg’s metadata and are utilized by the query engine’s CBO to optimize query plans.
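To build intuition for what these blobs contain, the following is a minimal sketch using the Apache DataSketches Python bindings (the datasketches package); the sample values are made up for illustration. It builds a Theta Sketch, estimates the NDV, and serializes the compact form, which is the kind of payload an apache-datasketches-theta-v1 blob carries.

# pip install datasketches
from datasketches import update_theta_sketch

# Feed a column's values into an update sketch. Here we simulate
# 1,000,000 rows that contain only 50,000 distinct values.
sketch = update_theta_sketch()
for i in range(1_000_000):
    sketch.update(f"customer-{i % 50_000}")

# The sketch returns an estimate of the number of distinct values (NDV)
# without keeping every distinct value in memory.
print(f"estimated NDV: {sketch.get_estimate():.0f}")

# A compact, immutable form of the sketch can be serialized to bytes.
# Serialized Theta Sketches like this are what Puffin stores as
# apache-datasketches-theta-v1 blobs.
blob = sketch.compact().serialize()
print(f"serialized sketch size: {len(blob)} bytes")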

Leverage Iceberg column statistics through Amazon Redshift

To demonstrate the performance benefit of this capability, we employ the industry-standard TPC-DS 3 TB dataset. We compare the query performance with and without Iceberg column statistics for the tables by running queries in Redshift Spectrum and Amazon Athena. We have included the queries used in this post, and we recommend trying your own queries by following the workflow.

The following are the overall steps:

  1. Create Iceberg tables: Run an AWS Glue job that extracts the TPC-DS dataset from a public Amazon S3 bucket and saves it as Iceberg tables in your S3 bucket. The AWS Glue Data Catalog stores those tables’ metadata location so you can query them using Amazon Redshift Spectrum and Amazon Athena.
  2. Generate column statistics: Employ the enhanced capabilities of the AWS Glue Data Catalog to generate column statistics for each table. This generates Puffin files storing Theta Sketches.
  3. Query with Amazon Redshift Spectrum and Amazon Athena: Evaluate the benefit of column statistics on query performance by using Amazon Redshift Spectrum and Amazon Athena to run queries on the dataset.

The following diagram illustrates the architecture.

To try this new capability, we complete the following steps:

  1. Set up resources with AWS CloudFormation.
  2. Run an AWS Glue job to create Iceberg tables for the 3TB TPC-DS dataset in your S3 bucket. The Data Catalog stores those tables’ metadata location.
  3. Run queries on Redshift Spectrum and Amazon Athena and note the query duration.
  4. Generate Iceberg column statistics for Data Catalog tables.
  5. Run queries on Redshift Spectrum and Amazon Athena and compare the query duration with the previous run.
  6. Optionally, schedule AWS Glue column statistics jobs using AWS Lambda and an Amazon EventBridge schedule.

Set up resources with AWS CloudFormation

This post includes a CloudFormation template for a quick setup. You can review and customize it to suit your needs. Note that this CloudFormation template requires a region with at least 3 Availability Zones. The template generates the following resources:

  • A virtual private cloud (VPC), public subnet, private subnets, and route tables
  • An Amazon Redshift Serverless workgroup and namespace
  • An S3 bucket to store the TPC-DS dataset, column statistics, job scripts, and so on
  • Data Catalog databases
  • An AWS Glue job to extract the TPC-DS dataset from the public S3 bucket and save the data as Iceberg tables in your S3 bucket
  • AWS Identity and Access Management (IAM) roles and policies
  • A Lambda function and EventBridge schedule to run the AWS Glue column statistics on a schedule

To launch the CloudFormation stack, complete the following steps:

  1. Sign in to the AWS CloudFormation console.
  2. Choose Launch Stack.
  3. Choose Next.
  4. Leave the parameters as default or make appropriate changes based on your requirements, then choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

This stack can take around 10 minutes to complete, after which you can view the deployed stack on the AWS CloudFormation console.
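If you prefer to launch the stack programmatically rather than through the console, the following Boto3 sketch shows the equivalent call; the stack name and template URL are placeholders for the values behind this post’s Launch Stack link.

import boto3

cfn = boto3.client("cloudformation")

# Placeholder values: choose your own stack name and use the template URL
# behind the Launch Stack button in this post.
stack_name = "iceberg-stats-blog"
template_url = "https://<bucket>.s3.amazonaws.com/<template>.yaml"

cfn.create_stack(
    StackName=stack_name,
    TemplateURL=template_url,
    # Required because the template creates IAM roles and policies.
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Block until the stack finishes creating (around 10 minutes).
cfn.get_waiter("stack_create_complete").wait(StackName=stack_name)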

Run an AWS Glue job to create Iceberg tables for the 3TB TPC-DS dataset

When the CloudFormation stack creation is complete, run the AWS Glue job to create Iceberg tables for the TPC-DS dataset. This AWS Glue job extracts the TPC-DS dataset from the public S3 bucket and transforms the data into Iceberg tables. Those tables are loaded into your S3 bucket and registered to the Data Catalog.

To run the AWS Glue job, complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Choose InitialDataLoadJob-<your-stack-name>.
  3. Choose Run.

This AWS Glue job can take around 30 minutes to complete. The process is complete when the job processing status shows as Succeeded.
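If you would rather start and monitor the job from code, the following is a minimal Boto3 sketch; substitute your stack name in the job name.

import time
import boto3

glue = boto3.client("glue")
job_name = "InitialDataLoadJob-<your-stack-name>"  # replace with your stack name

# Start the job and poll until it finishes (around 30 minutes).
run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
while True:
    state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
    print(f"Job state: {state}")
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(60)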

The AWS Glue job creates tables storing the TPC-DS dataset in two identical databases: tpcdsdbnostats and tpcdsdbwithstats. The tables in tpcdsdbnostats will have no generated statistics, and we use them as a reference. We generate statistics on the tables in tpcdsdbwithstats. Confirm the creation of those two databases and the underlying tables on the AWS Glue console. At this time, both databases hold the same data and no statistics have been generated on the tables.

Run queries on Redshift Spectrum without statistics

In the previous steps, you set up a Redshift Serverless workgroup with the given RPU (128 by default), prepared the TPC-DS 3TB dataset in your S3 bucket, and created Iceberg tables (which currently don’t have statistics).

To run your query in Amazon Redshift, complete the following steps:

  1. Download the Amazon Redshift queries.
  2. In the Redshift query editor v2, run the queries listed in the Redshift Query for tables without column statistics section in the downloaded file redshift-tpcds-sample.sql.
  3. Note the query runtime of each query.
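If you want to run and time the queries programmatically instead of in the query editor, the Amazon Redshift Data API is one way to do it. The following sketch assumes the Redshift Serverless workgroup created by the stack; the workgroup name and SQL string are placeholders, with the SQL taken from redshift-tpcds-sample.sql.

import time
import boto3

rsd = boto3.client("redshift-data")

# Placeholders: the workgroup created by the CloudFormation stack, the default
# database, and one query from the downloaded file.
response = rsd.execute_statement(
    WorkgroupName="<your-workgroup-name>",
    Database="dev",
    Sql="<one query from the 'without column statistics' section>",
)

# Poll for completion and report the runtime (Duration is in nanoseconds).
while True:
    desc = rsd.describe_statement(Id=response["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(5)
print(f"status: {desc['Status']}, runtime: {desc['Duration'] / 1e9:.2f} s")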

Run queries on Amazon Athena without statistics

Let’s query the TPC-DS tables (which currently don’t have statistics) from Amazon Athena as well. To run your query in Amazon Athena, complete the following steps:

  1. Download the Amazon Athena queries.
  2. In the Athena query editor, run the queries listed in the Athena Query for tables without column statistics section in the downloaded file athena-tpcds-sample.sql.
  3. Note the query runtime of each query.
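As with Amazon Redshift, you can run and time the Athena queries from code. The following Boto3 sketch is one way to do it; the output location and query string are placeholders, with the SQL taken from athena-tpcds-sample.sql.

import time
import boto3

athena = boto3.client("athena")

# Placeholders: the database without statistics, an S3 output location you own,
# and one query from the downloaded file.
start = athena.start_query_execution(
    QueryString="<one query from the 'without column statistics' section>",
    QueryExecutionContext={"Database": "tpcdsdbnostats"},
    ResultConfiguration={"OutputLocation": "s3://<your-bucket>/athena-results/"},
)

# Poll until the query finishes, then read the engine execution time.
while True:
    execution = athena.get_query_execution(QueryExecutionId=start["QueryExecutionId"])
    state = execution["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)
runtime_ms = execution["QueryExecution"]["Statistics"]["EngineExecutionTimeInMillis"]
print(f"state: {state}, engine execution time: {runtime_ms} ms")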

Generate Iceberg column statistics

To generate statistics on the Data Catalog tables, complete the following steps:

  1. On the AWS Glue console, choose Databases under Data Catalog in the navigation pane.
  2. Choose the tpcdsdbwithstats database to view all available tables.
  3. Select any of these tables (for example, call_center).
  4. Go to Column statistics – new and choose Generate statistics.
  5. Keep the default options:
    1. For Choose columns, select Table (All columns).
    2. For Row sampling options, select All rows.
    3. For IAM role, choose AWSGluestats-blog-<your-stack-name>.
  6. Choose Generate statistics.

You’ll be able to see the status of the statistics generation run, as shown in the following screenshot.

After you generate the Iceberg table column statistics, you should be able to see detailed column statistics for that table.

In the advanced properties section, you will find the table property use_iceberg_statistics=true. This property is added by the AWS Glue column statistics job. Amazon Athena utilizes column statistics only when it is set to true. Amazon Redshift, on the other hand, utilizes the statistics whenever they are available, regardless of this property.
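One way to check this property programmatically is to read the table definition from the Data Catalog, as in the following Boto3 sketch:

import boto3

glue = boto3.client("glue")

# Read the table definition from the Data Catalog and inspect its parameters.
table = glue.get_table(DatabaseName="tpcdsdbwithstats", Name="call_center")["Table"]
print(table["Parameters"].get("use_iceberg_statistics"))  # expected: 'true' once statistics exist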

Following the statistics generation, you will find an <id>.stat file in the AWS Glue table’s underlying data location in Amazon S3. This file is a Puffin file that stores the Theta Sketch data structure. Query engines can use this Theta Sketch algorithm to efficiently estimate the NDV when operating on the table, which helps optimize query performance.

Repeat the previous steps to generate statistics for all tables, such as catalog_sales, catalog_returns, warehouse, item, date_dim, store_sales, customer, customer_address, web_sales, time_dim, ship_mode, web_site, and web_returns. Alternatively, you can manually run the Lambda function that instructs AWS Glue to generate column statistics for all tables. We discuss the details of this function later in this post.

After you generate statistics for all tables, you can assess the query performance for each query.

Run queries on Redshift Spectrum with statistics

In the previous steps, you set up a Redshift Serverless workgroup with the given RPU (128 by default), prepared the TPC-DS 3TB dataset in your S3 bucket, and created Iceberg tables with column statistics.

To run the provided query using Redshift Spectrum on the statistics tables, complete the following steps:

  1. In the Redshift query editor v2, run the queries listed in the Redshift Query for tables with column statistics section in the downloaded file redshift-tpcds-sample.sql.
  2. Note the query runtime of each query.

With Redshift Serverless 128 RPU and the TPC-DS 3TB dataset, we conducted sample runs for 10 selected TPC-DS queries where NDV information was expected to be beneficial. We ran each query 10 times. The results shown in the following table are sorted by the percentage of the performance improvement for the queries with column statistics.

TPC-DS 3 TB Query    Without Column Statistics (s)    With Column Statistics (s)    Performance Improvement (%)
Query 16             305.0284                         51.7807                       83.0
Query 75             398.0643                         110.8366                      72.2
Query 78             169.8358                         52.8951                       68.9
Query 95             35.2996                          11.1047                       68.5
Query 94             160.52                           57.0321                       64.5
Query 68             14.6517                          7.4745                        49.0
Query 4              217.8954                         121.996                       44.0
Query 72             123.8698                         76.215                        38.5
Query 29             22.0769                          14.8697                       32.6
Query 25             43.2164                          32.8602                       24.0

The results demonstrated clear performance benefits ranging from 24.0% to 83.0%.

To dive deep, let’s explore query 16, which showed the highest performance benefit:

TPC-DS Query 16:

select
   count(distinct cs_order_number) as "order count"
  ,sum(cs_ext_ship_cost) as "total shipping cost"
  ,sum(cs_net_profit) as "total net profit"
from
   "awsdatacatalog"."tpcdsdbwithstats"."catalog_sales" cs1
  ,"awsdatacatalog"."tpcdsdbwithstats"."date_dim"
  ,"awsdatacatalog"."tpcdsdbwithstats"."customer_address"
  ,"awsdatacatalog"."tpcdsdbwithstats"."call_center"
where
    d_date between '2000-2-01' 
    and dateadd(day, 60, cast('2000-2-01' as date))
    and cs1.cs_ship_date_sk = d_date_sk
    and cs1.cs_ship_addr_sk = ca_address_sk
    and ca_state = 'AL'
    and cs1.cs_call_center_sk = cc_call_center_sk
    and cc_county in ('Dauphin County','Levy County','Luce County','Jackson County',
                    'Daviess County')
and exists (select *
            from "awsdatacatalog"."tpcdsdbwithstats"."catalog_sales" cs2
            where cs1.cs_order_number = cs2.cs_order_number
            and cs1.cs_warehouse_sk <> cs2.cs_warehouse_sk)
and not exists(select *
               from "awsdatacatalog"."tpcdsdbwithstats"."catalog_returns" cr1
               where cs1.cs_order_number = cr1.cr_order_number)
order by count(distinct cs_order_number)
limit 100;

You can compare the difference between the query plans with and without column statistics using the EXPLAIN command.

The following screenshot shows the results without column statistics.

The following screenshot shows the results with column statistics.

You can observe some notable differences as a result of using column statistics. At a high level, the overall estimated cost of the query is significantly reduced, from 20633217995813352.00 to 331727324110.36.

The two query plans chose different join strategies.

The following is an excerpt from the query plan without column statistics:

XN Hash Join DS_DIST_BOTH (cost=45365031.50 rows=10764790749 width=44)
  Outer Dist Key: "outer".cs_order_number
  Inner Dist Key: volt_tt_61c54ae740984.cs_order_number
  Hash Cond: (("outer".cs_order_number = "inner".cs_order_number) AND ("outer".cs_warehouse_sk = "inner".cs_warehouse_sk))

The following is the corresponding excerpt from the query plan with column statistics:

XN Hash Join DS_BCAST_INNER (cost=307193250965.64..327130154786.68 rows=17509398 width=32)
  Hash Cond: (("outer".cs_order_number = "inner".cs_order_number) AND ("outer".cs_warehouse_sk = "inner".cs_warehouse_sk))

The query plan for the table without column statistics used DS_DIST_BOTH when joining large tables, whereas the query plan for the table with column statistics chose DS_BCAST_INNER. The join order has also changed based on the column statistics. Those join strategy and join order changes are mainly driven by more accurate join cardinality estimations, which are possible with column statistics, and result in a more optimized query plan.

Run queries on Amazon Athena with statistics

Let’s also examine how Iceberg column statistics affect performance in Amazon Athena.

To run the provided query using Amazon Athena on the statistics tables, complete the following steps:

  1. In the Athena query editor, run the queries listed in the Athena Query for tables with column statistics section in the downloaded file athena-tpcds-sample.sql.
  2. Note the query runtime of each query.

With Amazon Athena and the TPC-DS 3TB dataset, we conducted sample runs for 10 selected TPC-DS queries where NDV information was expected to be beneficial. We ran each query 10 times. The results shown in the following table are sorted by the percentage of the performance improvement for the queries with column statistics.

TPC-DS 3 TB Query    Without Column Statistics (s)    With Column Statistics (s)    Performance Improvement (%)
Query 70             65.831                           22.075                        66.47
Query 95             24.231                           8.334                         65.56
Query 36             41.497                           17.073                        58.86
Query 94             17.787                           6.122                         58.6
Query 35             44.749                           19.05                         57.43
Query 16             18.609                           8.696                         53.27
Query 18             10.487                           5.965                         43.12
Query 2              32.823                           18.788                        42.76
Query 59             39.496                           24.15                         38.85
Query 11             58.844                           39.224                        33.34

The results demonstrated clear performance benefits ranging from 33.34% to 66.47%.

To dive deep, let’s explore query 70, which showed the highest performance benefit:

TPC-DS Query 70:

select
   sum(ss_net_profit) total_sum
    ,s_state
    ,s_county
    ,(grouping (s_state) + grouping (s_county)) lochierarchy
    ,rank() over (partition by (grouping (s_state) + grouping (s_county))
    ,(case when (grouping (s_county) = 0) then s_state end)
   order by sum(ss_net_profit) desc) rank_within_parent
from
   tpcdsdbnostats.store_sales
  ,tpcdsdbnostats.date_dim d1
  ,tpcdsdbnostats.store
where (d1.d_month_seq between 1200 and (1200 + 11))
   and (d1.d_date_sk = ss_sold_date_sk)
   and (s_store_sk = ss_store_sk)
   and (s_state in (
      select s_state
      from
         (
         select
            s_state s_state
            ,rank() over (partition by s_state
         order by sum(ss_net_profit) desc) ranking
         from
            tpcdsdbnostats.store_sales
            ,tpcdsdbnostats.store
            ,tpcdsdbnostats.date_dim
         where (d_month_seq between 1200 and (1200 + 11))
         and (d_date_sk = ss_sold_date_sk)
         and (s_store_sk = ss_store_sk)
         group by s_state
        ) tmp1
      where (ranking <= 5)
))
group by rollup (s_state, s_county)
order by lochierarchy desc,
(case when (lochierarchy = 0) then s_state end) asc, rank_within_parent asc
limit 100;

You can compare the difference between the query plans with and without column statistics with the EXPLAIN query.

The following screenshot shows the results without column statistics.

The following screenshot shows the results with column statistics.

The use of column statistics reveals some significant differences.

Without statistics, many estimates, such as the number of rows to be scanned and the CPU cost, are shown as “?”. With statistics, specific numbers are provided.

The following are the lines included in the query plan without column statistics:

Project[projectLocality = LOCAL, protectedBarrier = NONE]
       │   Layout: [s_state$gid_92:varchar, expr_98:varchar, s_county$gid:varchar, expr_95:integer, rank_97:bigint, sum_93:decimal(38,2)]
       │   Estimates: {rows: ? (?), cpu: ?, memory: 0B, network: 0B}

The following are the corresponding lines in the query plan with column statistics:

Project[projectLocality = LOCAL, protectedBarrier = NONE]
       │   Layout: [s_state$gid_92:varchar, expr_98:varchar, s_county$gid:varchar, expr_95:integer, rank_97:bigint, sum_93:decimal(38,2)]
       │   Estimates: {rows: 1447198784 (264.17GB), cpu: 264.17G, memory: 0B, network: 0B}

The CBO can generate a more efficient plan with these specific estimations.

For example, the two query plans chose different join strategies.

The following is one line included in the query plan without column statistics:

InnerJoin[criteria = ("s_state" = "s_state_53"), hash = [$hashvalue_105, $hashvalue_126], distribution = PARTITIONED]

The following is the corresponding line in the query plan with column statistics:

InnerJoin[criteria = ("ss_store_sk" = "s_store_sk"), hash = [$hashvalue_109, $hashvalue_110], distribution = REPLICATED]

Without statistics, the optimizer chose PARTITIONED distribution, requiring extensive data movement across nodes. With statistics, it can choose REPLICATED distribution when appropriate, broadcasting smaller tables to all nodes. This reduces network traffic and improves parallel processing efficiency.

Schedule AWS Glue column statistics runs

Maintaining up-to-date column statistics is crucial for optimal query performance. This section guides you through automating the process of generating Iceberg table column statistics using Lambda and EventBridge Scheduler. This automation keeps your column statistics up to date without manual intervention.

The required Lambda function and EventBridge schedule are already created through the CloudFormation template. The Lambda function is used to invoke the AWS Glue column statistics run. First, complete the following steps to explore how the Lambda function is configured:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Open the function GlueTableStatisticsFunctionv1.

For a clearer understanding of the Lambda function, we recommend reviewing the code in the Code section and examining the environment variables under Configuration.

As shown in the following code snippet, the Lambda function invokes the start_column_statistics_task_run API through the AWS SDK for Python (Boto3) library.
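The exact code deployed by the template is visible in the function’s Code section; the following is a minimal sketch of such a handler, with environment variable names chosen for illustration rather than taken from the template.

import os
import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # Illustrative environment variable names; the deployed function defines its own.
    database = os.environ["DATABASE_NAME"]      # e.g., tpcdsdbwithstats
    role = os.environ["STATISTICS_ROLE_ARN"]    # IAM role used for the statistics run

    # Start a column statistics task run for every table in the database.
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            run = glue.start_column_statistics_task_run(
                DatabaseName=database,
                TableName=table["Name"],
                Role=role,
            )
            print(f"Started statistics run {run['ColumnStatisticsTaskRunId']} for {table['Name']}")
    return {"status": "started"}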

Next, complete the following steps to explore how the EventBridge schedule is configured:

  1. On the EventBridge console, choose Schedules under Scheduler in the navigation pane.
  2. Locate the schedule created by the CloudFormation stack.

This page is where you manage and configure the schedules for your events. As shown in the following screenshot, the schedule is configured to invoke the Lambda function daily at a specific time, in this case 08:27 PM UTC. This makes sure the AWS Glue column statistics runs happen on a regular and predictable basis.
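If you want to create a similar schedule yourself, the Amazon EventBridge Scheduler API can do it with a single call. The following Boto3 sketch mirrors the daily 08:27 PM UTC schedule; the schedule name and the two ARNs are placeholders.

import boto3

scheduler = boto3.client("scheduler")

scheduler.create_schedule(
    Name="glue-column-statistics-daily",          # placeholder schedule name
    ScheduleExpression="cron(27 20 * * ? *)",     # daily at 08:27 PM UTC
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "Arn": "<lambda-function-arn>",           # ARN of GlueTableStatisticsFunctionv1
        "RoleArn": "<scheduler-execution-role-arn>",  # role allowed to invoke the function
    },
)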

Clean up

When you have finished all the above steps, remember to clean up all the AWS resources you created using AWS CloudFormation:

  1. Delete the CloudFormation stack.
  2. Delete the S3 bucket storing the Iceberg tables for the TPC-DS dataset and the AWS Glue job script.
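The following Boto3 sketch performs both cleanup steps; the stack and bucket names are placeholders for the ones you created.

import boto3

stack_name = "<your-stack-name>"     # placeholder
bucket_name = "<your-data-bucket>"   # placeholder: the bucket created by the stack

# Delete the CloudFormation stack.
boto3.client("cloudformation").delete_stack(StackName=stack_name)

# Empty and delete the S3 bucket holding the Iceberg tables and job script.
bucket = boto3.resource("s3").Bucket(bucket_name)
bucket.objects.all().delete()
bucket.delete()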

Conclusion

This post introduced a new feature in the Data Catalog that enables you to create Iceberg table column-level statistics. The statistics are stored as Theta Sketches in Puffin files alongside the table data and can be used to estimate the NDV efficiently. The CBOs in Redshift Spectrum and Amazon Athena can use these statistics to optimize query plans, resulting in improved query performance and potential cost savings.

Try out this new feature in the Data Catalog to generate column-level statistics and improve query performance, and let us know your feedback in the comments section. Visit the AWS Glue Catalog documentation to learn more.


About the Authors

Sotaro Hikita is a Solutions Architect. He supports customers in a wide range of industries, especially the financial industry, to build better solutions. He is particularly passionate about big data technologies and open source software.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Kyle Duong is a Senior Software Development Engineer on the AWS Glue and AWS Lake Formation team. He is passionate about building big data technologies and distributed systems.

Kalaiselvi Kamaraj is a Senior Software Development Engineer with Amazon. She has worked on several projects within the Amazon Redshift query processing team and currently focusing on performance-related projects for Redshift data lakes.

Sandeep Adwankar is a Senior Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.