Speed up your Amazon Athena queries using partition projection
This post is co-written with Steven Wasserman of Vertex, Inc.
Amazon Athena is an interactive query service that makes it easy to analyze data stored in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use—simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.
Athena has added support for partition projection, a new functionality that you can use to speed up query processing of highly partitioned tables.
In this post, we explore the partition projection feature and how it can speed up query runs. We also dig into the details of how Vertex Inc. used partition projection to improve the performance of their high-volume reporting system. Vertex Inc. provides comprehensive solutions that automate indirect tax processes for businesses worldwide, helping them manage the increasingly complex tax landscape.
What is partition projection
When processing queries, Athena retrieves metadata information from your metadata store such as the AWS Glue Data Catalog or your Hive metastore before performing partition pruning. Partition pruning refers to the step where Athena gathers metadata information and trims it down to only the partitions that apply to your query. This often speeds up queries and results in a comparatively smaller amount of data scanned for the query. Athena uses partition pruning for all tables with partition columns, including those tables configured for partition projection.
In cases when your tables have a large number of partitions, retrieving metadata can be time-consuming. To avoid this, you can use partition projection. Partition projection allows you to specify partition projection configuration, giving Athena the information necessary to build the partitions without retrieving metadata information from your metadata store. Athena reads the partition values and locations from the configuration, rather than reading from a repository like the AWS Glue Data Catalog. Partition projection reduces the runtime of queries against highly partitioned tables because in-memory operations are often faster than remote operations.
Use cases for partition projection
Partition projection can help speed up your queries in several use cases:
- Queries against a highly partitioned table don’t complete as quickly as you would like.
- You regularly add partitions to tables as new date or time partitions are created in your data. With partition projection, you configure relative date ranges to use as new data arrives.
- You have highly partitioned data in Amazon S3. The data is impractical to model in your Data Catalog or Hive metastore, and your queries read only small parts of it.
For more information and usage examples, see Partition Projection with Amazon Athena.
How Vertex used partition projection to improve query performance
Vertex provides capabilities that enable customers to generate reports on the amount of taxes collected against their transactions for a designated period (usually monthly). Customers use this data to reconcile and meet their month-end reporting needs, as well as ad hoc reports.
To support their customers’ compliance requirements, Vertex needed a solution that provided on-demand access to reports against high volumes of transactional data. Vertex was looking for ways to improve the customer experience by reducing query runtime and avoid causing delays to customer processes.
Vertex and AWS account teams dove deep into the details of their datasets to identify opportunities for optimization and reduction of query processing times. The AWS account team understood Vertex’s access patterns and the partitioned nature of the data, and partnered with the Athena service team to explore roadmap items of interest and opportunities to leverage features that could further improve query performance. The Athena team provided access to partition projection, a new capability that was in preview at the time, for the Vertex team to test.
In the Vertex multi-tenant cloud solution, a reporting service runs queries on the customer’s behalf. The data is partitioned by tenant and date in order to support all their processing and reporting needs. Before partition projection, each query run needed to request the required partitioning metadata from the Data Catalog, resulting in growing query latency as new data and time partitions were created with incoming data.
Before partition projection was enabled on the table, the production query took 137 seconds to run. With partition projection, it ran in 10 seconds, an improvement of approximately 92% in runtime.
Let’s discuss the partition projection properties to understand how partition projection enabled a 92% improvement in query latency.
At the time of this test, the table contained approximately 18,000 partitions with the following partition columns:
In the preceding code,
id_column represents a unique tenant in this table, and
postdate represents the date of transaction activity for a tenant.
The following partition projection attributes were defined in the table’s DDL:
The attributes are defined as follows:
- projection.enabled – This is set to
TRUEto enable partition projection.
- projection.postdate.range – In the preceding code,
postdateis one of the partitioned columns in the table. The value for this property defines the range of values to be projected when a query is run. The minimum value of range is
2013-10-01and the maximum value is
NOW+12YEARS(current date + 12 years).
- projection.postdate.type – The projection type is used for the
postdatecolumn. The date value signals the use of
- projection.postdate.format – This date format string is based on the Java date format DateTimeFormatter. In this code, it’s
yyyy-MM-dd, representing year, month, and day.
- projection.id_column.type –
id_columnis another partition column in the table. The value for this property is
INJECTED. We use the injected type for partition columns with possible values that can’t be procedurally generated within some logical range but are provided in a query’s
WHEREclause as a single value. An example value of
9|P7130. Vertex specifies the value for
id_columnin each query.
The following code is one such query, with and without partition projection enabled:
For this query run, with partition projection disabled, the response time was approximately 85 seconds. With partition projection enabled, the query response time was approximately 15 seconds, resulting in an 82% runtime improvement.
Month-end batch processing involves similar queries for every tenant and jurisdiction. When Vertex processed month-end reports for all customers and jurisdictions, their processing time went from 4.5 hours to 40 minutes, an 85% improvement with the partition projection feature.
Partition projection is usable only when the table is queried through Athena. If the same table is read through another service such as Amazon Redshift Spectrum or Amazon EMR, the standard partition metadata is used.
Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Vertex used Athena to provide customers valuable tax reporting capabilities to support core business processes. Recently, Athena added support for partition projection, a new functionality to speed up query processing of highly partitioned tables and automate partition management. Vertex used partition projection to improve production query response times by 92% and month-end batch processing of reports by 85%.
About the Authors
Steven Wasserman is a Principal Enterprise/Solution Architect for Vertex, Inc. and a subject matter expert in big data, databases, technical solutioning, enterprise architecture, and cloud technologies. Steve has over 30 years of experience working with clients and employers developing profit-producing, data-centric solutions.
Juan Lamadrid is a New York-based Solutions Architect for AWS. He works with numerous enterprise customers helping them achieve their digital innovation and modernization goals.
Pathik Shah is a Big Data Architect at AWS.
Janak Agarwal is a product manager for Athena at AWS.