AWS Big Data Blog
How to use Parquet Column Indexes with Amazon Athena
On November 21, 2025, Amazon Athena added support for reading Parquet Column Indexes in Apache Iceberg tables. With this optimization, Athena can perform page-level data pruning to skip unnecessary data within Parquet row groups, potentially reducing the amount of data scanned and improving query runtime for queries with selective filters. For data teams, this may help enable faster insights and reduce costs when analyzing large-scale data lakes.
Data teams building data lakes often choose Apache Iceberg for its ACID transactions, schema evolution, and metadata management capabilities. Athena is a serverless query engine that allows you to query Amazon S3-based data lakes using SQL, and you don’t need to manage infrastructure. Based on the type of data and query logic, Athena can apply multiple query optimizations to improve performance and reduce costs.
In this blog post, we use Athena and Amazon SageMaker Unified Studio to explore Parquet Column Indexes and demonstrate how they can improve Iceberg query performance. We explain what Parquet Column Indexes are, demonstrate their performance benefits, and show you how to use them in your applications.
Overview of Parquet Column Indexes
Parquet Column Indexes store metadata that query engines can use to skip irrelevant data with greater precision than row group statistics alone. To understand how they work, consider how data is structured within Parquet files and how engines like Athena process them.
Parquet files organize data hierarchically by dividing data into row groups (typically 128-512 MB each) and further subdividing them into pages (typically 1 MB each). Traditionally, Parquet maintains min/max statistics for each row group, allowing engines like Athena to skip row groups that don’t satisfy query predicates. Although this approach reduces the bytes scanned and query runtime, it has limitations: if a row group’s min/max range overlaps with the values you are searching for, Athena scans every page within that row group, even when only a few pages actually contain matching values.
Parquet Column Indexes help address this problem by storing page-level min/max statistics in the Parquet file footer. Row group statistics provide coarse-grained filtering, but Parquet Column Indexes enable finer-grained filtering by allowing query engines like Athena to skip individual pages within a row group. Consider a Parquet file with a single row group containing 5 pages for a column. The row group has min/max statistics of (1, 20), and each page for that column carries its own min/max statistics.
When Athena runs a query filtering for values equal to 2, it first checks the row group statistics and confirms that 2 falls within the range (1, 20). Athena will then plan to scan the pages within that row group. Without Parquet Column Indexes, Athena scans each of the 5 pages in the row group. With Parquet Column Indexes, Athena examines the page-level statistics and determines that only page-0 and page-1 need to be read, skipping the remaining 3 pages.
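As a sketch, the pruning logic described above can be expressed in a few lines of Python. The page min/max values below are illustrative, chosen to match the example in the text (a single row group spanning (1, 20) whose first two pages contain the value 2); they are not taken from a real file.

```python
# Illustrative sketch of page-level pruning with Parquet Column Indexes.
# The per-page min/max values are hypothetical, chosen to match the
# example above: one row group spanning (1, 20) with 5 pages.

def prune_pages(page_stats, value):
    """Return the indexes of pages whose min/max range may contain value."""
    return [i for i, (lo, hi) in enumerate(page_stats) if lo <= value <= hi]

row_group_stats = (1, 20)                                    # row-group min/max
page_stats = [(1, 3), (2, 7), (6, 12), (10, 18), (15, 20)]   # per-page min/max

value = 2
# Row-group statistics alone: (1, 20) contains 2, so without Column
# Indexes every page in the row group must be scanned.
rg_min, rg_max = row_group_stats
if rg_min <= value <= rg_max:
    pages_to_read = prune_pages(page_stats, value)
    print(f"Pages read with Column Indexes: {pages_to_read}")       # [0, 1]
    print(f"Pages skipped: {len(page_stats) - len(pages_to_read)}")  # 3
```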
How to use Parquet Column Indexes with Athena
Athena uses Parquet Column Indexes based on table type:
- Amazon S3 Tables: Athena automatically uses Parquet Column Indexes by default when they are present.
- Iceberg tables in S3 general purpose buckets: Athena does not use Parquet Column Indexes by default. To allow Athena to use Parquet Column Indexes, add an AWS Glue table property named `use_iceberg_parquet_column_index` and set it to `true`. Use the AWS Glue console or AWS Glue UpdateTable API to perform these actions.
Read more about how to use this feature in Use Parquet column indexing.
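As an illustration of the Glue-based setup, the following Python sketch sets the property with boto3. Because UpdateTable replaces the table definition, a common pattern is to fetch the table with GetTable, copy the writable fields into a TableInput, and write it back; the database and table names here are placeholders, and your own setup script may differ.

```python
# Sketch: enable Parquet Column Indexes for an Iceberg table via the
# AWS Glue UpdateTable API. Database and table names are placeholders.

def with_column_index_enabled(parameters):
    """Return a copy of a Glue table's Parameters dict with the
    use_iceberg_parquet_column_index property set to true."""
    updated = dict(parameters or {})
    updated["use_iceberg_parquet_column_index"] = "true"
    return updated

def enable_column_index(database, table_name):
    """Fetch the table, set the property, and write it back.
    Requires AWS credentials; not invoked in this sketch."""
    import boto3
    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    # UpdateTable expects a TableInput, which accepts only a subset of the
    # fields returned by GetTable, so copy over the writable ones.
    writable = ["Name", "Description", "Owner", "Retention",
                "StorageDescriptor", "PartitionKeys", "TableType", "Parameters"]
    table_input = {k: table[k] for k in writable if k in table}
    table_input["Parameters"] = with_column_index_enabled(table.get("Parameters"))
    glue.update_table(DatabaseName=database, TableInput=table_input)

print(with_column_index_enabled({"table_type": "ICEBERG"}))
```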
Measuring Athena performance gains when using Parquet Column Indexes
Now that we understand what Parquet Column Indexes are, we’ll demonstrate their performance benefits by analyzing the catalog_sales table from a 3 TB TPC-DS dataset. This table contains ecommerce transaction data including order dates, sales amounts, customer IDs, and product information. This dataset is a good proxy for the types of business analysis that you might perform on your own data, such as identifying sales trends, analyzing customer purchasing patterns, and calculating revenue metrics. We compare query execution statistics with and without Parquet Column Indexes to quantify the performance improvement.
Prerequisites
Before you begin, you must have the following resources:
- A SageMaker Unified Studio IAM-based domain.
- An Execution IAM Role configured within the SageMaker Unified Studio IAM-based domain with access to S3, AWS Glue Data Catalog, and Athena.
- An S3 bucket in your account to store Iceberg table data and Athena query results.
Create catalog_sales Iceberg table
Complete the following steps using SageMaker Unified Studio notebooks, whose multi-dialect functionality lets you work with your data using both the Athena SQL and Spark engines. To create a catalog_sales Iceberg table in your account, follow these steps:
- Navigate to Amazon SageMaker in the AWS Management Console and choose Open under Get started with Amazon SageMaker Unified Studio.
- From the side navigation, select Notebooks and choose Create Notebook. The subsequent steps in this post will execute scripts in this notebook.
- Create a new SQL cell in the notebook and set the connection type to Athena (Spark). Execute the following query to create a database for the tables in this post.
- Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the following query to create a Hive table pointing to the location of the TPC-DS `catalog_sales` table data in the public S3 bucket.
- Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the following query to add the Hive partitions to the AWS Glue metadata.
- Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Replace `s3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/catalog_sales/` with the S3 URI where you want to store your Iceberg table data, then execute the following query to create the `catalog_sales` Iceberg table from the Hive table.
- Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the following query to delete the `catalog_sales_hive` table, which was only needed to create the `catalog_sales` Iceberg table.
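The DDL for these steps is not reproduced above. As a rough sketch only, the database-creation and Iceberg CTAS steps might look like the following Spark SQL; the Hive table definition and partition-repair statements are omitted because the public TPC-DS bucket location is not shown in the text, and the LOCATION below is the example URI from the step.

```sql
-- Sketch only: plausible shapes for the database and Iceberg CTAS steps.
-- Replace the LOCATION with your own bucket before running.
CREATE DATABASE IF NOT EXISTS parquet_column_index_blog;

CREATE TABLE parquet_column_index_blog.catalog_sales
USING iceberg
LOCATION 's3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/catalog_sales/'
AS SELECT * FROM parquet_column_index_blog.catalog_sales_hive;
```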
Run an Athena query without Parquet Column Indexes
After creating the catalog_sales Iceberg table in the preceding steps, we run a simple query that analyzes shipping delays of the top 10 most ordered items. This type of analysis could be critical for ecommerce and retail operations. By identifying which popular items experience the greatest delays, fulfillment teams can focus resources where they matter most. For example, you can adjust inventory placement, change warehouse assignments, or address carrier issues. Additionally, popular items with significant shipping delays are more likely to result in order cancellations or returns, so proactively identifying these issues helps protect revenue.
Additionally, this query is a good candidate for demonstrating the effectiveness of using Parquet Column Indexes because it has a selective filter predicate on a single column cs_item_sk. When Athena executes this query, it first identifies row groups whose min/max ranges overlap with the top 10 most ordered items. Without using Parquet Column Indexes, Athena has to scan every page of data within those matched row groups. However, when using Parquet Column Indexes, Athena can prune data further by skipping individual pages within those row groups whose min/max ranges do not overlap with the ids. Complete the following steps to establish baseline query performance when Athena does not use Parquet Column Indexes during the query.
- Create a new Python cell in the notebook. Replace `s3://amzn-s3-demo-bucket/athena_parquet_column_index_blog/query_results/` with the S3 URI where you want to store your Athena query results, then execute the following script. Note the runtime and bytes scanned that are printed. The script runs the query five times with query result reuse disabled and chooses the minimum runtime and the corresponding bytes scanned among those iterations. See our numbers in the Run Athena query with Parquet Column Indexes section.
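The benchmarking script itself is not shown above. The following Python sketch illustrates one way to implement it with boto3: run the query five times with result reuse disabled and keep the fastest run. The query string, database, and output location are placeholders, and the original script may differ.

```python
# Sketch of a benchmarking loop: run an Athena query several times with
# result reuse disabled and keep the fastest run. The query string and
# S3 output location are placeholders.
import time

def best_run(stats):
    """Given (runtime_ms, bytes_scanned) pairs, return the pair with the
    minimum runtime and its corresponding bytes scanned."""
    return min(stats, key=lambda s: s[0])

def run_query_once(athena, query, database, output_location):
    """Start one Athena query with result reuse disabled and wait for it.
    Requires AWS credentials; not invoked in this sketch."""
    qid = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
        ResultReuseConfiguration={
            "ResultReuseByAgeConfiguration": {"Enabled": False}
        },
    )["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)
    stats = execution["Statistics"]
    return stats["EngineExecutionTimeInMillis"], stats["DataScannedInBytes"]

# Selecting the best of five hypothetical runs:
runs = [(20600, 45.2e9), (21100, 45.2e9), (20800, 45.2e9),
        (22000, 45.2e9), (20900, 45.2e9)]
runtime_ms, bytes_scanned = best_run(runs)
print(f"Best runtime: {runtime_ms / 1000:.1f} s, scanned {bytes_scanned / 1e9:.1f} GB")
```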
Sort the catalog_sales table
Before rerunning the query with Athena using Parquet Column Indexes, you need to sort the catalog_sales table by the cs_item_sk column. The preceding query contains a dynamic filter in the form of a subquery on the cs_item_sk column that selects the top 10 most ordered items.
When executing this query, Athena pushes down the filter predicate to the data source level, fetching only rows that match the top 10 most ordered items. To maximize page pruning with Parquet Column Indexes, rows with the same cs_item_sk values should be stored near each other in the Parquet file. Without sorting, matching values could be scattered across many pages, forcing Athena to read more data. Sorting the table by cs_item_sk clusters similar values together, enabling Athena to read fewer pages.
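The query is not reproduced above; the following SQL is a sketch of its general shape only. The shipping-delay expression and the ranking criteria for the top 10 most ordered items are assumptions based on the description, using standard TPC-DS catalog_sales column names.

```sql
-- Sketch only: the general shape of a selective dynamic filter on
-- cs_item_sk. The delay expression and ranking criteria are assumptions.
SELECT cs_item_sk,
       AVG(cs_ship_date_sk - cs_sold_date_sk) AS avg_shipping_delay_days
FROM catalog_sales
WHERE cs_item_sk IN (
    SELECT cs_item_sk
    FROM catalog_sales
    GROUP BY cs_item_sk
    ORDER BY COUNT(*) DESC
    LIMIT 10
)
GROUP BY cs_item_sk;
```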
Let’s examine the Parquet Column Indexes in one of the Parquet files to understand how the data in the catalog_sales table is currently organized. First, download the Parquet file from the cs_sold_date_sk = 2450815 partition and install the open-source parquet-cli tool on your local machine. Replace <local-path-to-parquet-file> with the path to the downloaded Parquet file, then run `parquet column-index <local-path-to-parquet-file>` on your local machine:
This displays Parquet Column Indexes for all columns. For brevity, only the first 11 pages of the cs_item_sk column from the first row group are shown in the following example:
Notice that nearly every page contains a wide range of values. This overlap means Athena cannot eliminate pages when filtering with Parquet Column Indexes on cs_item_sk. For example, searching for cs_item_sk = 100 requires scanning each of the 11 pages because the value 100 falls within every page’s min/max range. With this overlap, enabling Athena to use Parquet Column Indexes would provide no performance benefit. Sorting the data by cs_item_sk eliminates this overlap, creating distinct, non-overlapping ranges for each page. To make Parquet Column Indexes more effective, sort the table by completing the following step:
- Create a new SQL cell in the notebook and verify the connection type is Athena (Spark). Execute the query to sort the `cs_item_sk` column values of the `catalog_sales` table in ascending order, placing all the null values in the last few Parquet pages. New Iceberg data files are generated by this query.
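The sort query itself is not shown above. With the Spark engine and Iceberg’s SQL extensions, one plausible approach is to declare a table sort order and then rewrite the data files, as in the following sketch; the exact query used in the original notebook may differ.

```sql
-- Sketch only: one way to sort an Iceberg table with Spark SQL.
-- Declare the sort order, then rewrite existing data files so they
-- honor it. NULLS LAST places null values in the final pages.
ALTER TABLE parquet_column_index_blog.catalog_sales
WRITE ORDERED BY cs_item_sk ASC NULLS LAST;

CALL system.rewrite_data_files(
  table => 'parquet_column_index_blog.catalog_sales',
  strategy => 'sort'
);
```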
Running the parquet column-index command on the sorted data file from the cs_sold_date_sk = 2450815 partition shows that the Parquet Column Indexes are now sorted and have non-overlapping ranges. The first 11 pages of the cs_item_sk column from the first row group are shown in the following example:
Now when searching for cs_item_sk = 100, Athena only needs to read page-0, skipping the remaining 10 pages entirely.
Run Athena query with Parquet Column Indexes
Now that the data is sorted to eliminate overlapping pages within the row groups for the cs_item_sk column, we run two experiments on the sorted data. The first measures the impact of sorting alone, and the second measures the combined effect of sorting with Parquet Column Indexes.
- Create a new Python cell in the notebook. Execute the same script from the section Run an Athena query without Parquet Column Indexes and take note of the query runtime and bytes scanned results. This measures the performance of querying sorted data without using Parquet Column Indexes.
- Create a new Python cell in the notebook. Execute the following Python script to set the `use_iceberg_parquet_column_index` table property to `true` for the `catalog_sales` table in the AWS Glue Data Catalog.
- Create a new Python cell in the notebook. Execute the same script from the section Run an Athena query without Parquet Column Indexes and take note of the query runtime and bytes scanned results. This measures the performance of querying sorted data using Parquet Column Indexes.
Athena query time and bytes scanned improvement
The following table summarizes the results from each experiment. The percentage improvements for the sorted experiments are measured against the unsorted baseline.
| Experiment | Runtime (sec) | Bytes Scanned (GB) |
| --- | --- | --- |
| Unsorted without Parquet Column Indexes | 20.6 | 45.2 |
| Sorted without Parquet Column Indexes | 15.4 (25.2% faster) | 27.8 (38.5% fewer bytes) |
| Sorted with Parquet Column Indexes | 10.3 (50.0% faster) | 13.0 (71.2% fewer bytes) |
Recommendations
To maximize Athena’s ability to use Parquet Column Indexes and achieve optimal query performance, we recommend the following.
- Sort data by frequently filtered columns. This allows Athena to efficiently read Parquet Column Indexes and skip irrelevant pages, potentially reducing scan time. When data is sorted by a filter column, similar values are clustered together within pages. Because Parquet Column Indexes store min/max values for each page, Athena can quickly determine which pages contain matching values and skip the rest.
- Sort data by high-cardinality columns. This creates distinct value ranges between pages, maximizing the opportunity for Athena to skip pages during query execution. High-cardinality (many distinct values) columns produce non-overlapping min/max ranges across pages, allowing Athena to more effectively filter out irrelevant pages. In contrast, low-cardinality columns such as boolean or status fields result in overlapping ranges across many pages, reducing the number of skipped pages.
Clean up
When you have finished the steps in this post, complete the following cleanup actions to avoid incurring ongoing charges:
- Create a new SQL cell in the notebook and set the connection type to Athena (Spark). Execute the following command to drop the `parquet_column_index_blog` database and the `catalog_sales` table.
- Delete the Iceberg table data and the Athena query results from your S3 bucket.
- Delete the SageMaker Unified Studio IAM-based domain if it is no longer needed.
Conclusion
In this post, we showed you how Athena uses Parquet Column Indexes to speed up queries and reduce the number of bytes scanned. By using Parquet Column Indexes, Athena can skip irrelevant data pages to improve query performance, especially for queries with selective filters on sorted data. Refer to Optimize Iceberg tables to learn more about this feature and try it out on your own queries.