AWS Big Data Blog
Improve query performance using AWS Glue partition indexes
While creating data lakes on the cloud, the data catalog is crucial to centralize metadata and make the data visible, searchable, and queryable for users. With the recent exponential growth of data volume, it becomes much more important to optimize data layout and maintain the metadata on cloud storage to keep the value of data lakes.
Partitioning has emerged as an important technique for optimizing data layout so that the data can be queried efficiently by a variety of analytic engines. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. Over time, hundreds of thousands of partitions get added to a table, resulting in slow queries. To speed up query processing of highly partitioned tables cataloged in AWS Glue Data Catalog, you can take advantage of AWS Glue partition indexes.
Partition indexes are available for queries in Amazon EMR, Amazon Redshift Spectrum, and AWS Glue extract, transform, and load (ETL) jobs (Spark DataFrame and Glue DynamicFrame). When partition indexes are enabled on the heavily partitioned AWS Glue Data Catalog tables, all these query engines are accelerated. You can add partition indexes to both new tables and existing tables. This post demonstrates how to utilize partition indexes, and discusses the benefit you can get with partition indexes when working with highly partitioned data.
Partition indexes
AWS Glue partition indexes are an important configuration for reducing overall data transfer and processing, and for reducing query processing time. In the AWS Glue Data Catalog, the GetPartitions API is used to fetch the partitions in the table. The API returns partitions that match the expression provided in the request. If no partition indexes are present on the table, all the partitions of the table are loaded, and then filtered using the query expression provided by the user in the GetPartitions request. On a table with no indexes, the query takes more time to run as the number of partitions increases. With an index, the GetPartitions request fetches only a subset of the partitions instead of loading all the partitions in the table.
The following are key benefits of partition indexes:
- Increased query performance
- Increased concurrency as a result of fewer GetPartitions API calls
- Cost savings:
  - Analytic engine cost (query performance is related to the charges in Amazon EMR and AWS Glue ETL)
  - AWS Glue Data Catalog API request cost
Setting up resources with AWS CloudFormation
This post provides an AWS CloudFormation template for a quick setup. You can review and customize it to suit your needs. Some of the resources that this stack deploys incur costs when in use.
The CloudFormation template generates the following resources:
- AWS Identity and Access Management (IAM) users, roles, and policies
- AWS Glue database, tables, and partitions
If you’re using AWS Lake Formation permissions, you need to ensure that the IAM user or role running AWS CloudFormation has the required permissions (to create a database on the Data Catalog).
The tables use sample data located in an Amazon Simple Storage Service (Amazon S3) public bucket. Initially, no partition indexes are configured in these AWS Glue Data Catalog tables.
To create your resources, complete the following steps:
- Sign in to the CloudFormation console.
- Choose Launch Stack:
- Choose Next.
- For DatabaseName, leave as the default.
- Choose Next.
- On the next page, choose Next.
- Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create.
Stack creation can take up to 5 minutes. When the stack is complete, you have two Data Catalog tables: table_with_index and table_without_index. Both tables point to the same S3 bucket, and the data is highly partitioned based on the year, month, day, and hour columns, covering more than 42 years (1980-2021). In total, there are 367,920 partitions, and each partition has one JSON file, data.json. In the following sections, you see how the partition indexes work with these sample tables.
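As a quick sanity check, the partition count follows directly from the layout described above (assuming 365 days per year, which the stated total implies the sample data uses):

```python
# Partitions are year/month/day/hour for 42 years of data (1980-2021).
years = 2021 - 1980 + 1        # 42 years
days = years * 365             # the sample dataset uses 365 days per year (no leap days)
partitions = days * 24         # one partition per hour
print(partitions)              # 367920
```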
Setting up a partition index on the AWS Glue console
You can create partition indexes at any time. If you want to create a new table with partition indexes, you can make the CreateTable API call with a list of PartitionIndex objects. If you want to add a partition index to an existing table, make the CreatePartitionIndex API call. You can also perform these actions on the AWS Glue console. You can create up to three partition indexes on a table.
Let's configure a new partition index for the table table_with_index we created with the CloudFormation template.
- On the AWS Glue console, choose Tables.
- Choose the table table_with_index.
- Choose Partitions and indices.
- Choose Add new index.
- For Index name, enter year-month-day-hour.
- For Selected keys from schema, select year, month, day, and hour.
- Choose Add index.
The Status column of the newly created partition index shows the status as Creating. Wait for the partition index to be Active. The process takes about 1 hour: the more partitions a table has, the longer index creation takes, and this table has 367,920 partitions.
Now the partition index is ready for the table table_with_index. You can use this index from various analytic engines when you query against the table. You see the default behavior in the table table_without_index, because no partition indexes are configured for this table.
You can follow (or skip) any of the following sections based on your interest.
Making a GetPartitions API call with an expression
Before we use the partition index from various query engines, let's try making the GetPartitions API call using the AWS Command Line Interface (AWS CLI) to see the difference. The AWS CLI get-partitions command makes multiple GetPartitions API calls if needed. In this section, we simply use the time command to compare the duration for each table, and use debug logging to compare the number of API calls for each table.
- Run the get-partitions command against the table table_without_index with the expression year='2021' and month='04' and day='01':
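A sketch of this step, assuming the database name partition_index used elsewhere in this post:

```shell
time aws glue get-partitions \
  --database-name partition_index \
  --table-name table_without_index \
  --expression "year='2021' and month='04' and day='01'"
```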
The command took about 4 minutes. Note that you used only three partition columns out of four.
- Run the same command with debug logging to get the number of GetPartitions API calls:
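One way to count the calls is to grep the debug output for the GetPartitions request target (a sketch; the exact string logged can vary across AWS CLI versions):

```shell
# --debug writes request details to stderr; count lines mentioning the target operation
aws glue get-partitions \
  --database-name partition_index \
  --table-name table_without_index \
  --expression "year='2021' and month='04' and day='01'" \
  --debug 2>&1 | grep -c "AWSGlue.GetPartitions"
```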
There were 737 GetPartitions API calls when the partition indexes aren’t used.
- Next, run the get-partitions command against table_with_index with the same expression:
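The same sketch as before, pointed at the indexed table:

```shell
time aws glue get-partitions \
  --database-name partition_index \
  --table-name table_with_index \
  --expression "year='2021' and month='04' and day='01'"
```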
The command took just 2.7 seconds. You can see how quickly the required partitions were returned.
- Run the same command with debug logging to get the number of GetPartitions API calls:
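Again as a sketch, counting calls via the debug output (the exact logged string can vary by AWS CLI version):

```shell
aws glue get-partitions \
  --database-name partition_index \
  --table-name table_with_index \
  --expression "year='2021' and month='04' and day='01'" \
  --debug 2>&1 | grep -c "AWSGlue.GetPartitions"
```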
There were only four GetPartitions API calls when the partition indexes are used.
Querying a table using Apache Spark on Amazon EMR
In this section, we explore querying a table using Apache Spark on Amazon EMR.
- Launch a new EMR cluster with Apache Spark.
For instructions, see Setting Up Amazon EMR. You need to specify the AWS Glue Data Catalog as the metastore. In this example, we use the default EMR cluster (release: emr-6.2.0, three m5.xlarge nodes).
- Connect to the EMR node using SSH.
- Run the spark-sql command on the EMR node to start an interactive shell for Spark SQL:
- Run the following SQL against partition_index.table_without_index:
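For example, a count over the same partition used in the CLI test (count(*) here is a stand-in; the post's exact aggregate isn't shown):

```sql
-- run inside the spark-sql interactive shell
SELECT count(*)
FROM partition_index.table_without_index
WHERE year='2021' AND month='04' AND day='01';
```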
The query took 35 seconds. Even though you aggregated records in only a specific partition, the query took a long time because there are many partitions and the GetPartitions API call takes time.
Now let’s run the same query against table_with_index to see how much benefit the partition index introduces.
- Run the following SQL against partition_index.table_with_index:
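The same sketch against the indexed table (count(*) again stands in for the post's exact aggregate):

```sql
SELECT count(*)
FROM partition_index.table_with_index
WHERE year='2021' AND month='04' AND day='01';
```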
The query took just 2 seconds. The difference in query duration comes from the smaller number of GetPartitions calls made possible by the partition index.
The following chart shows the granular metrics for query planning time without and with the partition index. The query planning time with the index is far less than that without the index.
For more information about comparing metrics in Apache Spark, see Appendix 2 at the end of this post.
Querying a table using Redshift Spectrum
To query with Redshift Spectrum, complete the following steps:
- Launch a new Amazon Redshift cluster. You need to configure an IAM role for the cluster to utilize Redshift Spectrum and the Amazon Redshift query editor. Choose dc2.large, 1 node in this example. You need to launch the cluster in the us-east-1 Region because your cluster must be in the same Region as the bucket location.
- Connect with the Redshift query editor. For instructions, see Querying a database using the query editor.
- Create an external schema for the partition_index database to use it in Redshift Spectrum (replace <your IAM role ARN> with your IAM role ARN).
- Run the following SQL against spectrum_schema.table_without_index:
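The SQL for these two steps would look like the following (a sketch; the schema name spectrum_schema comes from this post, and count(*) stands in for the exact aggregate):

```sql
-- Map the Data Catalog database into Redshift Spectrum
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'partition_index'
IAM_ROLE '<your IAM role ARN>';

-- Query the non-indexed table
SELECT count(*)
FROM spectrum_schema.table_without_index
WHERE year='2021' AND month='04' AND day='01';
```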
The following screenshot shows our output.
The query took more than 3 minutes.
- Run the following SQL against spectrum_schema.table_with_index:
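The same sketch against the indexed table:

```sql
SELECT count(*)
FROM spectrum_schema.table_with_index
WHERE year='2021' AND month='04' AND day='01';
```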
The following screenshot shows our output.
The query for the table using indexes took just 8 seconds, which is much faster than the table without indexes.
Querying a table using AWS Glue ETL
Let’s launch an AWS Glue development endpoint and an Amazon SageMaker notebook.
- On the AWS Glue console, choose Dev endpoints.
- Choose Add endpoint.
- For Development endpoint name, enter partition-index.
- For IAM role, choose your IAM role.
For more information about roles, see Managing Access Permissions for AWS Glue Resources.
- For Worker type under Security configuration, script libraries, and job parameters (optional), choose G.1X.
- For Number of workers, enter 4.
- For Dependent jar path, enter s3://crawler-public/json/serde/json-serde.jar.
- Select Use Glue data catalog as the Hive metastore under Catalog options (optional).
- Choose Next.
- For Networking, leave as is (by default, Skip networking configuration is selected), and choose Next.
- For Add an SSH public key (Optional), leave it blank, and choose Next.
- Choose Finish.
- Wait for the development endpoint partition-index to show as READY.
The endpoint may take up to 10 minutes to be ready.
- Select the development endpoint partition-index, and choose Create SageMaker notebook on the Actions menu.
- For Notebook name, enter partition-index.
- Select Create an IAM role.
- For IAM role, enter partition-index.
. - Choose Create notebook.
- Wait for the notebook aws-glue-partition-index to show the status as Ready.
The notebook may take up to 3 minutes to be ready.
- Select the notebook aws-glue-partition-index, and choose Open notebook.
- Choose Sparkmagic (PySpark) on the New menu.
- Enter the following code snippet against table_without_index, and run the cell:
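A cell of the following shape reproduces the query (a sketch; in a Sparkmagic PySpark notebook the spark session is provided by the kernel, and count(*) stands in for the post's exact aggregate):

```python
# Sparkmagic (PySpark) cell; `spark` is provided by the kernel
spark.sql("""
    SELECT count(*)
    FROM partition_index.table_without_index
    WHERE year='2021' AND month='04' AND day='01'
""").show()
```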
The following screenshot shows our output.
The query took 3 minutes.
- Enter the following code snippet against partition_index.table_with_index, and run the cell:
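The same sketch against the indexed table:

```python
spark.sql("""
    SELECT count(*)
    FROM partition_index.table_with_index
    WHERE year='2021' AND month='04' AND day='01'
""").show()
```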
The following screenshot shows our output.
The cell took just 7 seconds. The query for the table using indexes is faster than for the table without indexes.
Cleaning up
Now to the final step, cleaning up the resources:
- Delete the CloudFormation stack.
- Delete the EMR cluster.
- Delete the Amazon Redshift cluster.
- Delete the AWS Glue development endpoint and SageMaker notebook.
Conclusion
In this post, we explained how to use partition indexes and how they accelerate queries in various query engines. If you have several million partitions, the performance benefit is even more significant. You can learn more about partition indexes in Working with Partition Indexes.
Appendix 1: Setting up a partition index using AWS CLI
If you prefer using the AWS CLI, run the following create-partition-index command to set up a partition index:
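A sketch of the command, using the database, table, and index name from this post:

```shell
aws glue create-partition-index \
  --database-name partition_index \
  --table-name table_without_index \
  --partition-index '{"Keys": ["year", "month", "day", "hour"], "IndexName": "year-month-day-hour"}'
```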
To get the status of the partition index, run the following get-partition-indexes command:
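For example:

```shell
aws glue get-partition-indexes \
  --database-name partition_index \
  --table-name table_without_index
```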
Appendix 2: Comparing breakdown metrics in Apache Spark
If you’re interested in comparing the breakdown metrics for query planning time, you can register a SQL listener with the following Scala code snippet:
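A minimal listener that prints Spark's per-phase planning durations might look like the following (a sketch using the QueryPlanningTracker available in Spark 3.x, which the emr-6.2.0 cluster in this post runs):

```scala
import org.apache.spark.sql.execution.QueryExecution
import org.apache.spark.sql.util.QueryExecutionListener

spark.listenerManager.register(new QueryExecutionListener {
  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
    // Print the duration of each planning phase (analysis, optimization, planning) in ms
    val phases = qe.tracker.phases.map { case (name, p) =>
      s"$name=${p.endTimeMs - p.startTimeMs}ms"
    }
    println(phases.mkString(", "))
  }
  override def onFailure(funcName: String, qe: QueryExecution, error: Exception): Unit = ()
})
```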
If you use spark-shell, you can register the listener in the same way, because the spark session is predefined (use :paste for multi-line input).
Then run the same query without using the index to get the breakdown metrics:
In this example, we use the same setup for the EMR cluster (release: emr-6.2.0, three m5.xlarge nodes). The console output includes an additional line:
Apache Spark’s query planning mechanism has three phases: analysis, optimization, and physical planning (shown as just planning). This line means that the query planning took 4 milliseconds in analysis, 29,002 milliseconds in optimization, and 208 milliseconds in physical planning.
Let’s try running the same query using the index:
The query planning took 2 milliseconds in analysis, 608 milliseconds in optimization, and 7 milliseconds in physical planning.
About the Authors
Noritaka Sekiyama is a Senior Big Data Architect at AWS Glue and AWS Lake Formation. He is passionate about big data technology and open source software, and enjoys building and experimenting in the analytics area.
Sachet Saurabh is a Senior Software Development Engineer at AWS Glue and AWS Lake Formation. He is passionate about building fault tolerant and reliable distributed systems at scale.
Vikas Malik is a Software Development Manager at AWS Glue. He enjoys building solutions that solve business problems at scale. In his free time, he likes playing and gardening with his kids and exploring local areas with family.