AWS Big Data Blog

How to use Parquet Column Indexes with Amazon Athena

On November 21, 2025, Amazon Athena added support for reading Parquet Column Indexes in Apache Iceberg tables. With this optimization, Athena can perform page-level data pruning to skip unnecessary data within Parquet row groups, potentially reducing the amount of data scanned and improving runtime for queries with selective filters. For data teams, this can enable faster insights and help reduce costs when analyzing large-scale data lakes.

Data teams building data lakes often choose Apache Iceberg for its ACID transactions, schema evolution, and metadata management capabilities. Athena is a serverless query engine that lets you query Amazon S3-based data lakes using SQL without managing infrastructure. Depending on the type of data and query logic, Athena can apply multiple query optimizations to improve performance and reduce costs.

In this blog post, we use Athena and Amazon SageMaker Unified Studio to explore Parquet Column Indexes and demonstrate how they can improve Iceberg query performance. We explain what Parquet Column Indexes are, demonstrate their performance benefits, and show you how to use them in your applications.

Overview of Parquet Column Indexes

Parquet Column Indexes store metadata that query engines can use to skip irrelevant data with greater precision than row group statistics alone. To understand how they work, consider how data is structured within Parquet files and how engines like Athena process them.

Parquet files organize data hierarchically by dividing data into row groups (typically 128-512 MB each) and further subdividing them into pages (typically 1 MB each). Traditionally, Parquet maintains min/max statistics for each row group, allowing engines like Athena to skip row groups that don’t satisfy query predicates. Although this approach reduces the bytes scanned and query runtime, it has a limitation: if even a single page within a row group overlaps with the values you are searching for, Athena scans all pages within that row group.

Parquet Column Indexes help address this problem by storing page-level min/max statistics in the Parquet file footer. Row group statistics provide coarse-grained filtering, but Parquet Column Indexes enable finer-grained filtering by allowing query engines like Athena to skip individual pages within a row group. Consider a Parquet file with a single row group containing 5 pages for a column. The row group has min/max statistics of (1, 20), and each page for that column has the following min/max statistics.

row-group-0: min=1, max=20
    page-0: min=1, max=10
    page-1: min=1, max=10
    page-2: min=5, max=15
    page-3: min=6, max=16
    page-4: min=10, max=20

When Athena runs a query filtering for values equal to 2, it first checks the row group statistics and confirms that 2 falls within the range (1, 20). Athena will then plan to scan the pages within that row group. Without Parquet Column Indexes, Athena scans each of the 5 pages in the row group. With Parquet Column Indexes, Athena examines the page-level statistics and determines that only page-0 and page-1 need to be read, skipping the remaining 3 pages.
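The pruning decision described above can be sketched in a few lines of Python. This is an illustrative simulation, not Athena’s implementation; the row group and page statistics are the hypothetical values from the example above.

```python
# Illustrative sketch of page-level pruning with Parquet Column Indexes.
# The min/max values mirror the hypothetical example above; this is not
# Athena's actual implementation.

ROW_GROUP = {"min": 1, "max": 20}
PAGES = [
    {"name": "page-0", "min": 1, "max": 10},
    {"name": "page-1", "min": 1, "max": 10},
    {"name": "page-2", "min": 5, "max": 15},
    {"name": "page-3", "min": 6, "max": 16},
    {"name": "page-4", "min": 10, "max": 20},
]

def pages_to_scan(value, row_group, pages):
    """Return the pages that might contain `value`."""
    # Row-group statistics provide coarse-grained filtering first.
    if not (row_group["min"] <= value <= row_group["max"]):
        return []
    # Page-level statistics (the column index) prune within the row group.
    return [p["name"] for p in pages if p["min"] <= value <= p["max"]]

print(pages_to_scan(2, ROW_GROUP, PAGES))   # only page-0 and page-1 survive
print(pages_to_scan(25, ROW_GROUP, PAGES))  # outside the row group: nothing scanned
```

For a point lookup, only pages whose min/max range contains the value survive; the remaining pages are skipped without being read.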

How to use Parquet Column Indexes with Athena

Athena uses Parquet Column Indexes based on table type:

  • Amazon S3 Tables: Athena automatically uses Parquet Column Indexes by default when they are present.
  • Iceberg tables in S3 general purpose buckets: Athena does not use Parquet Column Indexes by default. To allow Athena to use Parquet Column Indexes, add an AWS Glue table property named use_iceberg_parquet_column_index and set it to true. Use the AWS Glue console or AWS Glue UpdateTable API to perform these actions.

Read more about how to use this feature in Use Parquet column indexing.

Measuring Athena performance gains when using Parquet Column Indexes

Now that we understand what Parquet Column Indexes are, we’ll demonstrate their performance benefits by analyzing the catalog_sales table from a 3 TB TPC-DS dataset. This table contains ecommerce transaction data including order dates, sales amounts, customer IDs, and product information. This dataset is a good proxy for the types of business analysis that you might perform on your own data, such as identifying sales trends, analyzing customer purchasing patterns, and calculating revenue metrics. We compare query execution statistics with and without Parquet Column Indexes to quantify the performance improvement.

Prerequisites

Before you begin, you must have the following resources:

  1. A SageMaker Unified Studio IAM-based domain.
  2. An Execution IAM Role configured within the SageMaker Unified Studio IAM-based domain with access to S3, AWS Glue Data Catalog, and Athena.
  3. An S3 bucket in your account to store Iceberg table data and Athena query results.

Create catalog_sales Iceberg table

Complete the following steps using SageMaker Unified Studio notebooks. There, you can use SageMaker Unified Studio’s multi-dialect notebook functionality to work with your data using the Athena SQL and Spark engines. To create a catalog_sales Iceberg table in your account, follow these steps:

  1. Navigate to Amazon SageMaker in the AWS Management Console and choose Open under Get started with Amazon SageMaker Unified Studio.
  2. From the side navigation, select Notebooks and choose Create Notebook. The subsequent steps in this post will execute scripts in this notebook.
  3. Create a new SQL cell in the notebook and set the connection type to Athena (Spark). Execute the following query to create a database for the tables in this post.
    CREATE DATABASE parquet_column_index_blog;
  4. Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the following query to create a Hive table pointing to the location of the TPC-DS catalog_sales table data at the public S3 bucket.
    CREATE TABLE IF NOT EXISTS parquet_column_index_blog.catalog_sales_hive (
    	  cs_sold_time_sk int,
    	  cs_ship_date_sk int,
    	  cs_bill_customer_sk int,
    	  cs_bill_cdemo_sk int,
    	  cs_bill_hdemo_sk int,
    	  cs_bill_addr_sk int,
    	  cs_ship_customer_sk int,
    	  cs_ship_cdemo_sk int,
    	  cs_ship_hdemo_sk int,
    	  cs_ship_addr_sk int,
    	  cs_call_center_sk int,
    	  cs_catalog_page_sk int,
    	  cs_ship_mode_sk int,
    	  cs_warehouse_sk int,
    	  cs_item_sk int,
    	  cs_promo_sk int,
    	  cs_order_number bigint,
    	  cs_quantity int,
    	  cs_wholesale_cost decimal(7, 2),
    	  cs_list_price decimal(7, 2),
    	  cs_sales_price decimal(7, 2),
    	  cs_ext_discount_amt decimal(7, 2),
    	  cs_ext_sales_price decimal(7, 2),
    	  cs_ext_wholesale_cost decimal(7, 2),
    	  cs_ext_list_price decimal(7, 2),
    	  cs_ext_tax decimal(7, 2),
    	  cs_coupon_amt decimal(7, 2),
    	  cs_ext_ship_cost decimal(7, 2),
    	  cs_net_paid decimal(7, 2),
    	  cs_net_paid_inc_tax decimal(7, 2),
    	  cs_net_paid_inc_ship decimal(7, 2),
    	  cs_net_paid_inc_ship_tax decimal(7, 2),
    	  cs_net_profit decimal(7, 2))
    	USING parquet
    	PARTITIONED BY (cs_sold_date_sk int)
    	LOCATION 's3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/catalog_sales/'
    	TBLPROPERTIES (
    	  'parquet.compression'='SNAPPY'
    	);
  5. Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the following query to add the Hive partitions to the AWS Glue metadata.
    MSCK REPAIR TABLE parquet_column_index_blog.catalog_sales_hive;
  6. Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Replace s3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/catalog_sales/ with the S3 URI where you want to store your Iceberg table data, then execute the following query to create the catalog_sales Iceberg table from the Hive table.
    CREATE TABLE parquet_column_index_blog.catalog_sales
    	USING iceberg
    	PARTITIONED BY (cs_sold_date_sk)
    	LOCATION 's3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/catalog_sales/'
    	AS
    	SELECT * FROM parquet_column_index_blog.catalog_sales_hive;
  7. Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the following query to delete the catalog_sales_hive table, which was only needed to create the catalog_sales Iceberg table.
    DROP TABLE parquet_column_index_blog.catalog_sales_hive;

Run an Athena query without Parquet Column Indexes

After creating the catalog_sales Iceberg table in the preceding steps, we run a simple query that analyzes shipping delays of the top 10 most ordered items. This type of analysis could be critical for ecommerce and retail operations. By identifying which popular items experience the greatest delays, fulfillment teams can focus resources where they matter most. For example, you can adjust inventory placement, change warehouse assignments, or address carrier issues. Additionally, popular items with significant shipping delays are more likely to result in order cancellations or returns, so proactively identifying these issues helps protect revenue.

SELECT cs_item_sk,
    SUM(cs_quantity) as total_orders,
    AVG(cs_ship_date_sk - cs_sold_date_sk) as avg_ship_delay_days,
    MIN(cs_ship_date_sk - cs_sold_date_sk) as min_ship_delay,
    MAX(cs_ship_date_sk - cs_sold_date_sk) as max_ship_delay,
    SUM(
        CASE
            WHEN cs_ship_date_sk - cs_sold_date_sk > 7 THEN 1 ELSE 0
        END
    ) as late_shipments,
    SUM(
        CASE
            WHEN cs_ship_date_sk - cs_sold_date_sk > 7 THEN 1 ELSE 0
        END
    ) * 100.0 / COUNT(*) as late_shipment_pct,
    AVG(cs_ext_ship_cost) as avg_shipping_cost
FROM parquet_column_index_blog.catalog_sales
WHERE cs_item_sk IN (
        SELECT cs_item_sk
        FROM parquet_column_index_blog.catalog_sales
        WHERE cs_item_sk IS NOT NULL
        GROUP BY cs_item_sk
        ORDER BY SUM(cs_quantity) DESC
        LIMIT 10
    )
    AND cs_ship_date_sk IS NOT NULL
    AND cs_sold_date_sk IS NOT NULL
GROUP BY cs_item_sk
ORDER BY avg_ship_delay_days DESC;

Additionally, this query is a good candidate for demonstrating the effectiveness of Parquet Column Indexes because it has a selective filter predicate on a single column, cs_item_sk. When Athena executes this query, it first identifies row groups whose min/max ranges overlap with the top 10 most ordered items. Without Parquet Column Indexes, Athena has to scan every page of data within those matched row groups. With Parquet Column Indexes, Athena can prune data further by skipping individual pages within those row groups whose min/max ranges do not overlap with those item IDs. Complete the following steps to establish a baseline of query performance when Athena does not use Parquet Column Indexes.

  1. Create a new Python cell in the notebook. Replace s3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/query_results/ with the S3 URI where you want to store your Athena query results, then execute the following script. Note the runtime and bytes scanned that are printed. The script runs the query five times with query result reuse disabled and reports the minimum runtime and the corresponding bytes scanned among those iterations. See our numbers in the Athena query time and bytes scanned improvement section.
    import boto3
    import time
    
    # Configuration
    DATABASE = "parquet_column_index_blog"
    OUTPUT_LOCATION = "s3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/query_results/"
    
    def run_athena_query(query: str, database: str, output_location: str):
        athena_client = boto3.client('athena')
        
        response = athena_client.start_query_execution(
            QueryString=query,
            QueryExecutionContext={'Database': database},
            ResultConfiguration={'OutputLocation': output_location}
        )
        
        query_execution_id = response['QueryExecutionId']
        
        while True:
            result = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
            state = result['QueryExecution']['Status']['State']
            
            if state in ['SUCCEEDED', 'FAILED', 'CANCELLED']:
                break
            
            time.sleep(5)
        
        if state != 'SUCCEEDED':
            raise Exception(f"Query failed with state: {state}")
        
        stats = result['QueryExecution']['Statistics']
        
        return {
            'execution_time_sec': stats['EngineExecutionTimeInMillis'] / 1000,
            'data_scanned_gb': stats['DataScannedInBytes'] / (1024 ** 3)
        }
    
    
    def benchmark_query(query: str, database: str, output_location: str, num_runs: int = 5):
        results = []
        
        for i in range(num_runs):
            stats = run_athena_query(query, database, output_location)
            results.append(stats)
        
        best_run = min(results, key=lambda r: r['execution_time_sec'])
        
        execution_time = round(best_run['execution_time_sec'], 1)
        data_scanned = round(best_run['data_scanned_gb'], 1)
        
        print(f"Execution time: {execution_time} sec")
        print(f"Data scanned: {data_scanned} GB")
    
    
    QUERY = """
    SELECT cs_item_sk,
        SUM(cs_quantity) as total_orders,
        AVG(cs_ship_date_sk - cs_sold_date_sk) as avg_ship_delay_days,
        MIN(cs_ship_date_sk - cs_sold_date_sk) as min_ship_delay,
        MAX(cs_ship_date_sk - cs_sold_date_sk) as max_ship_delay,
        SUM(
            CASE
                WHEN cs_ship_date_sk - cs_sold_date_sk > 7 THEN 1 ELSE 0
            END
        ) as late_shipments,
        SUM(
            CASE
                WHEN cs_ship_date_sk - cs_sold_date_sk > 7 THEN 1 ELSE 0
            END
        ) * 100.0 / COUNT(*) as late_shipment_pct,
        AVG(cs_ext_ship_cost) as avg_shipping_cost
    FROM parquet_column_index_blog.catalog_sales
    WHERE cs_item_sk IN (
            SELECT cs_item_sk
            FROM parquet_column_index_blog.catalog_sales
            WHERE cs_item_sk IS NOT NULL
            GROUP BY cs_item_sk
            ORDER BY SUM(cs_quantity) DESC
            LIMIT 10
        )
        AND cs_ship_date_sk IS NOT NULL
        AND cs_sold_date_sk IS NOT NULL
    GROUP BY cs_item_sk
    ORDER BY avg_ship_delay_days DESC;
    """
    
    # Run benchmark
    benchmark_query(QUERY, DATABASE, OUTPUT_LOCATION, num_runs=5)

Sort the catalog_sales table

Before rerunning the query with Athena using Parquet Column Indexes, you need to sort the catalog_sales table by the cs_item_sk column. In the preceding query, there is a dynamic filter as a subquery on the cs_item_sk column:

cs_item_sk IN (
        SELECT cs_item_sk
        FROM parquet_column_index_blog.catalog_sales
        WHERE cs_item_sk IS NOT NULL
        GROUP BY cs_item_sk
        ORDER BY SUM(cs_quantity) DESC
        LIMIT 10
    )

When executing this query, Athena pushes down the filter predicate to the data source level, fetching only rows that match the top 10 most ordered items. To maximize page pruning with Parquet Column Indexes, rows with the same cs_item_sk values should be stored near each other in the Parquet file. Without sorting, matching values could be scattered across many pages, forcing Athena to read more data. Sorting the table by cs_item_sk clusters similar values together, enabling Athena to read fewer pages.
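A small simulation illustrates why sorting matters (hypothetical values, not the actual catalog_sales data): split the same values into fixed-size pages before and after sorting, compute each page’s min/max, and count how many pages a point lookup must read.

```python
# Sketch: how sorting affects page-level min/max pruning.
# Hypothetical values, not the actual catalog_sales data.

def page_stats(values, page_size):
    """Split values into fixed-size pages and record each page's min/max."""
    pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
    return [(min(p), max(p)) for p in pages]

def matching_pages(stats, target):
    """Count pages whose min/max range could contain target."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)

unsorted_values = [17, 3, 42, 8, 29, 14, 50, 1, 36, 22, 9, 45]
sorted_values = sorted(unsorted_values)

unsorted_stats = page_stats(unsorted_values, page_size=4)
sorted_stats = page_stats(sorted_values, page_size=4)

# Unsorted, every page's wide min/max range happens to contain the value 9...
print(matching_pages(unsorted_stats, 9))  # 3 pages must be read
# ...while sorting confines it to a single page.
print(matching_pages(sorted_stats, 9))    # 1 page must be read
```

The same effect at data-lake scale is what the rewrite_data_files sort in the next step produces: clustered values yield narrow, mostly non-overlapping page ranges that Athena can prune.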

Let’s examine the Parquet Column Indexes in one of the Parquet files to understand how the data in the catalog_sales table is currently organized. First, download the Parquet file from the cs_sold_date_sk = 2450815 partition and install the open-source parquet-cli tool on your local machine. Replace <local-path-to-parquet-file> with the path to the downloaded Parquet file, then run the following command on your local machine:

parquet column-index <local-path-to-parquet-file>

This displays Parquet Column Indexes for all columns. For brevity, only the first 11 pages of the cs_item_sk column from the first row group are shown in the following example:

row-group 0:
column index for column cs_item_sk:
Boundary order: UNORDERED
         null_count  min  max
page-0            0    4  359989
page-1            0    2  359996
page-2            0   10  359995
page-3            0   13  359996
page-4            0   22  359989
page-5            0   25  359984
page-6            0   13  359989
page-7            0   56  359990
page-8            0   14  359984
page-9            0    7  359978
page-10           0    1  359998

Notice that nearly every page contains a wide range of values. This overlap means Athena cannot eliminate pages when filtering with Parquet Column Indexes on cs_item_sk. For example, searching for cs_item_sk = 100 requires scanning each of the 11 pages because the value 100 falls within every page’s min/max range. With this overlap, enabling Athena to use Parquet Column Indexes would provide no performance benefit. Sorting the data by cs_item_sk eliminates this overlap, creating distinct, non-overlapping ranges for each page. To make Parquet Column Indexes more effective, sort the table by completing the following step:

  1. Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the following query to sort the catalog_sales table by the cs_item_sk column in ascending order, placing null values in the last few Parquet pages. This query generates new Iceberg data files.
    CALL spark_catalog.system.rewrite_data_files(
      table => 'parquet_column_index_blog.catalog_sales',
      strategy => 'sort',
      sort_order => 'cs_item_sk ASC NULLS LAST',
      options => map(
        'target-file-size-bytes', '1073741824',
        'rewrite-all', 'true',
        'max-concurrent-file-group-rewrites', '200'));

Running the parquet column-index command on the sorted data file from the cs_sold_date_sk = 2450815 partition shows that the Parquet Column Indexes are now sorted and have non-overlapping ranges. The first 11 pages of the cs_item_sk column from the first row group are shown in the following example:

row-group 0:
column index for column cs_item_sk:
Boundary order: ASCENDING
         null_count  min    max
page-0           0      1   5282
page-1           0   5282  10556
page-2           0  10556  15842
page-3           0  15842  21154
page-4           0  21154  26434
page-5           0  26434  31669
page-6           0  31669  36916
page-7           0  36916  42205
page-8           0  42205  47528
page-9           0  47528  52808
page-10          0  52808  58189

Now when searching for cs_item_sk = 100, Athena only needs to read page-0, skipping the remaining 10 pages entirely.

Run Athena query with Parquet Column Indexes

Now that the data is sorted to eliminate overlapping pages within the row groups for the cs_item_sk column, we run two experiments on the sorted data. The first measures the impact of sorting alone, and the second measures the combined effect of sorting with Parquet Column Indexes.

  1. Create a new Python cell in the notebook. Execute the same script from the section Run an Athena query without Parquet Column Indexes and note the query runtime and bytes scanned. This measures the performance of querying sorted data without using Parquet Column Indexes.
  2. Create a new Python cell in the notebook. Execute the following Python script to set the use_iceberg_parquet_column_index table property to true for the catalog_sales table in the AWS Glue Data Catalog.
    import boto3
    
    def add_iceberg_parquet_column_index(database_name: str, table_name: str):
        glue_client = boto3.client('glue')
        
        # Get current table definition
        response = glue_client.get_table(DatabaseName=database_name, Name=table_name)
        table = response['Table']
        
        # Build TableInput with only allowed fields
        table_input = {'Name': table['Name']}
        
        allowed_fields = [
            'Description', 'Owner', 'LastAccessTime', 'LastAnalyzedTime',
            'Retention', 'StorageDescriptor', 'PartitionKeys', 'ViewOriginalText',
            'ViewExpandedText', 'TableType', 'Parameters', 'TargetTable'
        ]
        
        for field in allowed_fields:
            if field in table:
                table_input[field] = table[field]
        
        # Add the property
        if 'Parameters' not in table_input:
            table_input['Parameters'] = {}
        table_input['Parameters']['use_iceberg_parquet_column_index'] = 'true'
        
        # Update the table
        glue_client.update_table(DatabaseName=database_name, TableInput=table_input)
    
    # Usage
    add_iceberg_parquet_column_index("parquet_column_index_blog", "catalog_sales")
  3. Create a new Python cell in the notebook. Execute the same script from the section Run an Athena query without Parquet Column Indexes and note the query runtime and bytes scanned. This measures the performance of querying sorted data using Parquet Column Indexes.

Athena query time and bytes scanned improvement

The following table summarizes the results from each experiment. The percentage improvements for the sorted experiments are measured against the unsorted baseline.

Experiment                               Runtime (sec)        Bytes Scanned (GB)
Unsorted without Parquet Column Indexes  20.6                 45.2
Sorted without Parquet Column Indexes    15.4 (25.2% faster)  27.8 (38.5% fewer bytes)
Sorted with Parquet Column Indexes       10.3 (50.0% faster)  13.0 (71.2% fewer bytes)

Recommendations

To maximize Athena’s ability to use Parquet Column Indexes and achieve optimal query performance, we recommend the following:

  1. Sort data by frequently filtered columns. This allows Athena to efficiently read Parquet Column Indexes and skip irrelevant pages, potentially reducing scan time. When data is sorted by a filter column, similar values are clustered together within pages. Because Parquet Column Indexes store min/max values for each page, Athena can quickly determine which pages contain matching values and skip the rest.
  2. Sort data by high-cardinality columns. This creates distinct value ranges between pages, maximizing the opportunity for Athena to skip pages during query execution. High-cardinality (many distinct values) columns produce non-overlapping min/max ranges across pages, allowing Athena to more effectively filter out irrelevant pages. In contrast, low-cardinality columns such as boolean or status fields result in overlapping ranges across many pages, reducing the number of skipped pages.
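A quick sketch illustrates the cardinality point (hypothetical data): a sorted high-cardinality column produces narrow, distinct page ranges, while a flag column whose two values repeat throughout the file produces pages that all span the full range, leaving nothing to skip.

```python
# Sketch: effect of column cardinality on page-level pruning.
# Hypothetical data; page size of 4 values for illustration.

def page_ranges(values, page_size=4):
    """Split values into pages and record each page's min/max."""
    pages = [values[i:i + page_size] for i in range(0, len(values), page_size)]
    return [(min(p), max(p)) for p in pages]

def skippable_pages(ranges, target):
    """Count pages whose min/max range excludes target."""
    return sum(1 for lo, hi in ranges if not (lo <= target <= hi))

# High cardinality: 12 distinct sorted values -> narrow, distinct ranges.
high_card = list(range(1, 13))
# Low cardinality: a two-valued flag repeating through the file -> every
# page spans the full (0, 1) range, so no page can be ruled out.
low_card = [0, 1] * 6

print(skippable_pages(page_ranges(high_card), target=5))  # 2 of 3 pages skipped
print(skippable_pages(page_ranges(low_card), target=1))   # 0 of 3 pages skipped
```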

Clean up

When you have finished the steps in this post, complete the following cleanup actions to avoid incurring ongoing charges:

  1. Create a new SQL cell in the notebook and set the connection type to Athena (Spark). Execute the following command to drop the parquet_column_index_blog database and the catalog_sales table.
    DROP DATABASE parquet_column_index_blog CASCADE;
  2. Delete the Iceberg table data and the Athena query results from your S3 bucket.
  3. Delete the SageMaker Unified Studio IAM-based domain if it is no longer needed.

Conclusion

In this post, we showed you how Athena uses Parquet Column Indexes to speed up queries and reduce the number of bytes scanned. By using Parquet Column Indexes, Athena can skip irrelevant data pages to improve query performance, especially for queries with selective filters on sorted data. Refer to Optimize Iceberg tables to learn more about this feature and try it out on your own queries.


About the Author


Matt Wong

Matt is a Software Development Engineer on Amazon Athena. He has worked on several projects within the Amazon Athena Datalake and Storage team and is continuing to build out more Athena features. Outside of work, Matt likes to spend time juggling, biking, and running with family and friends.