AWS Big Data Blog

Improve performance of workloads containing repetitive scan filters with multidimensional data layout sort keys in Amazon Redshift

Amazon Redshift, a widely used cloud data warehouse, has evolved significantly to meet the performance requirements of the most demanding workloads. This post covers one such new feature—the multidimensional data layout sort key.

Amazon Redshift now improves your query performance by supporting multidimensional data layout sort keys, which is a new type of sort key that sorts a table’s data by filter predicates instead of physical columns of the table. Multidimensional data layout sort keys will significantly improve the performance of table scans, especially when your query workload contains repetitive scan filters.

Amazon Redshift already provides the capability of automatic table optimization (ATO), which automatically optimizes the design of tables by applying sort and distribution keys without the need for administrator intervention. In this post, we introduce multidimensional data layout sort keys as an additional capability offered by ATO and fortified by Amazon Redshift’s sort key advisor algorithm.

Multidimensional data layout sort keys

When you define a table with the AUTO sort key, Amazon Redshift ATO will analyze your query history and automatically select either a single-column sort key or multidimensional data layout sort key for your table, based on which option is better for your workload. When multidimensional data layout is selected, Amazon Redshift will construct a multidimensional sort function that co-locates rows that are typically accessed by the same queries, and the sort function is subsequently used during query runs to skip data blocks and even skip scanning the individual predicate columns.

Consider the following user query, which is a dominant query pattern in the user’s workload:

SELECT season, sum(metric2) AS "__measure__0"
FROM titles
WHERE lower(subregion) like '%united states%'
GROUP BY 1
ORDER BY 1;

Amazon Redshift stores data for each column in 1 MB disk blocks and stores the minimum and maximum values in each block as part of the table’s metadata. If a query uses a range-restricted predicate, Amazon Redshift can use the minimum and maximum values to rapidly skip over large numbers of blocks during table scans. However, this query’s filter on the subregion column can’t be used to determine which blocks to skip based on minimum and maximum values, and as a result, Amazon Redshift scans all rows from the titles table:

SELECT table_name, input_rows, step_attribute
FROM sys_query_detail
WHERE query_id = 123456789;

When the user’s query was run with titles using a single-column sort key on subregion, the result of the preceding query is as follows:

  table_name | input_rows | step_attribute
-------------+------------+---------------
  titles     | 2164081640 | 
(1 rows)

This shows that the table scan read 2,164,081,640 rows.

To improve scans on the titles table, Amazon Redshift might automatically decide to use a multidimensional data layout sort key. All rows that satisfy the lower(subregion) like '%united states%' predicate would be co-located to a dedicated region of the table, and therefore Amazon Redshift will only scan data blocks that satisfy the predicate.

When the user’s query is run with titles using a multidimensional data layout sort key that includes lower(subregion) like '%united states%' as a predicate, the result of the sys_query_detail query is as follows:

  table_name | input_rows | step_attribute
-------------+------------+---------------
  titles     | 152324046  | multi-dimensional
(1 rows)

This shows that the table scan read 152,324,046 rows, which is only 7% of the original, and it used the multidimensional data layout sort key.

Note that this example uses a single query to showcase the multidimensional data layout feature, but Amazon Redshift will consider all the queries running against the table and can create multiple regions to satisfy the most commonly run predicates.

Let’s take another example, with more complex predicates and multiple queries this time.

Imagine having a table items (cost int, available int, demand int) with four rows as shown in the following example.

#id cost available demand
1 4 3 3
2 2 23 6
3 5 4 5
4 1 1 2

Your dominant workload consists of two queries:

  • 70% queries pattern:
    select * from items where cost > 3 and available < demand
  • 20% queries pattern:
    select avg(cost) from items where available < demand

With traditional sorting techniques, you might choose to sort the table over the cost column, such that the evaluation of cost > 3 will benefit from the sort. So, the items table after sorting using a single cost column will look like the following.

#id cost available demand
Region #1, with cost <= 3
Region #2, with cost > 3
#id cost available demand
4 1 1 2
2 2 23 6
1 4 3 3
3 5 4 5

By using this traditional sort, we can immediately exclude the top two (blue) rows with ID 4 and ID 2, because they don’t satisfy cost > 3.

On the other hand, with a multidimensional data layout sort key, the table will be sorted based on a combination of the two commonly occurring predicates in the user’s workload, which are cost > 3 and available < demand. As a result, the table’s rows are sorted into four regions.

#id cost available demand
Region #1, with cost <= 3 and available < demand
Region #2, with cost <= 3 and available >= demand
Region #3, with cost > 3 and available < demand
Region #4, with cost > 3 and available >= demand
#id cost available demand
4 1 1 2
2 2 23 6
3 5 4 5
1 4 3 3

This concept is even more powerful when applied to entire blocks instead of single rows, when applied to complex predicates that use operators not suitable for traditional sorting techniques (such as like), and when applied to more than two predicates.

System tables

The following Amazon Redshift system tables will show users if multidimensional data layouts are used on their tables and queries:

  • To determine if a particular table is using a multidimensional data layout sort key, you can check whether sortkey1 in svv_table_info is equal to AUTO(SORTKEY(padb_internal_mddl_key_col)).
  • To determine if a particular query uses multidimensional data layout to accelerate table scans, you can check step_attribute in the sys_query_detail view. The value will be equal to multi-dimensional if the table’s multidimensional data layout sort key was used during the scan.

Performance benchmarks

We performed internal benchmark testing for multiple workloads with repetitive scan filters and see that introducing multidimensional data layout sort keys produced the following results:

  • A 74% total runtime reduction compared to having no sort key.
  • A 40% total runtime reduction compared to having the best single-column sort key on each table.
  • A 80% reduction in total rows read from tables compared to having no sort key.
  • A 47% reduction in total rows read from tables compared to having the best single-column sort key on each table.

Feature comparison

With the introduction of multidimensional data layout sort keys, your tables can now be sorted by expressions based off of the commonly occurring filter predicates in your workload. The following table provides a feature comparison for Amazon Redshift against two competitors.

Feature Amazon Redshift Competitor A Competitor B
Support for sorting on columns Yes Yes Yes
Support for sorting by expression Yes Yes No
Automatic column selection for sorting Yes No Yes
Automatic expressions selection for sorting Yes No No
Automatic selection between columns sorting or expressions sorting Yes No No
Automatic use of sorting properties for expressions during scans Yes No No

Considerations

Keep in mind the following when using a multidimensional data layout:

  • Multidimensional data layout is enabled when you set your table as SORTKEY AUTO.
  • Amazon Redshift Advisor will automatically choose either a single-column sort key or multidimensional data layout for the table by analyzing your historical workload.
  • Amazon Redshift ATO adjusts the multidimensional data layout sorting results based on the manner in which ongoing queries interact with the workload.
  • Amazon Redshift ATO maintains multidimensional data layout sort keys the same way as it currently does for existing sort keys. Refer to Working with automatic table optimization for more details on ATO.
  • Multidimensional data layout sort keys will work with both provisioned clusters and serverless workgroups.
  • Multidimensional data layout sort keys will work with your existing data as long as the AUTO SORTKEY is enabled on your table and a workload with repetitive scan filters is detected. The table will be reorganized based on the results of multi-dimensional sort function.
  • To disable multidimensional data layout sort keys for a table, use alter table: ALTER TABLE table_name ALTER SORTKEY NONE. This disables the AUTO sort key feature on the table.
  • Multidimensional data layout sort keys are preserved when restoring or migrating your provisioned cluster to a serverless cluster or vice versa.

Conclusion

In this post, we showed that multidimensional data layout sort keys can significantly improve query runtime performance for workloads where dominant queries have repetitive scan filters.

To create a preview cluster from the Amazon Redshift console, navigate to the Clusters page and choose Create preview cluster. You can create a cluster in the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), Europe (Ireland), and Europe (Stockholm) Regions and test your workloads.

We would love to hear your feedback on this new feature and look forward to your comments on this post.


About the authors

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Jialin Ding is an Applied Scientist in the Learned Systems Group, specializing in applying machine learning and optimization techniques to improve the performance of data systems such as Amazon Redshift.

Yanzhu Ji is a Product Manager in the Amazon Redshift team. She has experience in product vision and strategy in industry-leading data products and platforms. She has outstanding skill in building substantial software products using web development, system design, database, and distributed programming techniques. In her personal life, Yanzhu likes painting, photography, and playing tennis.