Optimize your analytical workloads using the automatic query rewrite feature of Amazon Redshift materialized views

Amazon Redshift materialized views enable you to significantly improve performance of complex queries that are frequently run as part of your extract, load, and transform (ELT), business intelligence (BI), or dashboarding applications. Materialized views precompute and store the result sets of the SQL query in the view definition. Materialized views speed up data access, because the query doesn’t need to rerun the computation each time the query runs, which also reduces the resource consumption.

Amazon Redshift has the ability to automatically rewrite your SQL queries that don’t explicitly reference existing materialized views to use an existing materialized view if it will improve performance. This feature is valuable and, in some cases, the only option for performance optimization. Consider packaged ISV apps or even just reports— users often don’t have access to the SQL to optimize. In some cases, even if they do have access, the code or script is so old that nobody is familiar with it and you don’t know what regressions even a small change might introduce.

In this post, we describe how the automatic query rewrite feature works and some scenarios where you could take advantage of this feature. For information about the materialized view feature itself, refer to Speed up your ELT and BI queries with Amazon Redshift materialized views and Creating materialized views in Amazon Redshift.

All examples in this post are run on an 8 node ra3.4xlarge cluster with the 3 TB TPC-DS cloud benchmark dataset.

Let’s look at three different scenarios where the automatic query rewrite feature could help: optimizing joins between two large tables, optimizing joins for tables that have multiple join paths, and optimizing table scans.

Optimize joins between two large tables

There are many situations where you have two large tables that are joined frequently. In this case, creating a materialized view that joins these two tables could help improve the performance of those queries. Materialized views precompute the join and store the results so subsequent runs only need to retrieve the saved results; no need to run the expensive JOINs each time. With automatic query rewrite, none of the end-user queries have to be modified to refer to the materialized view. When creating the explain plan for the query, Amazon Redshift replaces the join between the two tables with the materialized view.

By default, the automatic query rewrite uses a materialized view only if it’s up to date and reflects all changes from its base tables. This means that the query isn’t rewritten to use the materialized view if the base tables have more recent updates that aren’t yet reflected in the materialized view.

For example, consider the following SQL query. The query joins two tables: store_sales (8,639,936,081 rows) and customer (30,000,000 rows):

SELECT 
cust.c_customer_id 
FROM store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk
GROUP BY cust.c_customer_id;

The query runs in 545,520 milliseconds; the following is the explain plan for the query:

XN HashAggregate  (cost=9602679386653.98..9602679386653.98 rows=29705556 width=20)
  ->  XN Hash Join DS_BCAST_INNER  (cost=375000.00..9602659714194.54 rows=7868983773 width=20)
        Hash Cond: (""outer"".ss_customer_sk = ""inner"".c_customer_sk)
        ->  XN Seq Scan on store_sales sales  (cost=0.00..86399365.12 rows=8245454518 width=4)
              Filter: (ss_customer_sk IS NOT NULL)
        ->  XN Hash  (cost=300000.00..300000.00 rows=30000000 width=24)
              ->  XN Seq Scan on customer cust  (cost=0.00..300000.00 rows=30000000 width=24)

Let’s create a materialized view that pre-computes the join between the store_sales and customer tables using the following SQL statement:

CREATE MATERIALIZED VIEW cust_store_sales
AS 
SELECT         
  cust.c_customer_id
, cust.c_first_name
, cust.c_last_name
, sales.ss_item_sk
, sales.ss_quantity
, cust.c_current_addr_sk
FROM  store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk;

Let’s now rerun the original query:

SELECT 
cust.c_customer_id 
FROM store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk
GROUP BY cust.c_customer_sk;

The query runs much faster (46,493 milliseconds). This is because of the automatic query rewrite feature, which has rewritten the preceding query to use the newly created materialized view instead of joining both tables. The explain plan for this query shows this change:

XN HashAggregate  (cost=103138905.60..103138905.60 rows=29705556 width=20)
  ->  XN Seq Scan on mv_tbl__cust_store_sales__0 derived_table1  (cost=0.00..82511124.48 rows=8251112448 width=20)

The original query run also consumed 1,263 CPU seconds and read 45,013 blocks of data, whereas the query that ran after the creation of the materialized view only consumed 898 CPU seconds and read 29,256 blocks. That is a reduction of 29% in CPU consumption and 35% in blocks read.

The optimizer can also rewrite the following query to use the previously created materialized view, which includes the additional join to the customer_address table:

SELECT
cust.c_customer_id
,addr.ca_state
FROM store_sales sales
INNER JOIN customer cust
ON sales.ss_customer_sk = cust.c_customer_sk
INNER JOIN customer_address addr
ON cust.c_current_addr_sk = addr.ca_address_sk
GROUP BY cust.c_customer_id, addr.ca_state;
     
      XN HashAggregate  (cost=30242919089.37..30242919089.37 rows=1544688912 width=26)
         ->  XN Hash Join DS_BCAST_INNER  (cost=542661.20..30201663527.13 rows=8251112448 width=26)
        Hash Cond: ("outer".c_current_addr_sk = "inner".ca_address_sk)
        ->  XN Seq Scan on mv_tbl__cust_store_sales_1__0 derived_table1  (cost=0.00..82511124.48 rows=8251112448 width=24)
        ->  XN Hash  (cost=150000.00..150000.00 rows=15000000 width=10)
              ->  XN Seq Scan on customer_address addr  (cost=0.00..150000.00 rows=15000000 width=10)

Optimize joins for tables that have multiple join paths

For large tables on Amazon Redshift, the ideal distribution style would be ‘KEY’, with the distribution key being the column that is used most frequently in the JOIN clause. There are situations where some large tables have multiple join paths. 50% of the queries may use a particular column to join to the table, and the other 50% of the queries may use a different column to join to the table. Both types of queries are important and have stringent performance requirements. In this case, you could pick one column as the distribution key for the table and then create a materialized view with the second column as the distribution key. This is possible because materialized views can have their own distribution and sort keys.

Here’s an example to illustrate how this works.

The web_sales table (2,159,968,881 rows) has the distribution key ws_order_number. This helps optimize a majority of the queries (70% of the joins to this table use ws_order_number as the join column). The remaining 30% use the column ws_bill_customer_sk to join to the table, as shown in the following SQL statement. This query took 12,790 milliseconds to run.

SELECT 
  c_customer_id
, c_email_address 
FROM web_sales ws
INNER JOIN customer cs
ON ws.ws_bill_customer_sk=cs.c_customer_sk;

We can create the materialized view to help improve the performance of the remaining 30% of the queries. Note the DISTKEY keyword in the following code. We have defined a new distribution key for the materialized view (ws_bill_customer_sk):

CREATE MATERIALIZED VIEW web_sales_cust_dist
DISTKEY (ws_bill_customer_sk)
AS
SELECT * FROM web_sales;

Rerunning the following query returns rows much faster than before (7,715 milliseconds vs. 12,790 milliseconds):

SELECT 
  c_customer_id
, c_email_address 
FROM web_sales ws
INNER JOIN customer cs
ON ws.ws_bill_customer_sk=cs.c_customer_sk;

Again, the explain plan of the query has changed; it now references the materialized view even though the SQL statement doesn’t explicitly reference the materialized view:

XN Hash Join DS_DIST_NONE  (cost=375000.00..696964927.69 rows=2159968768 width=74)
  Hash Cond: (""outer"".ws_bill_customer_sk = ""inner"".c_customer_sk)
  ->  XN Seq Scan on mv_tbl__web_sales_cust_dist__0 derived_table1  (cost=0.00..21599687.68 rows=2159968768 width=4)
  ->  XN Hash  (cost=300000.00..300000.00 rows=30000000 width=78)
        ->  XN Seq Scan on customer cs  (cost=0.00..300000.00 rows=30000000 width=78)

Optimize table scans

Table scans on Amazon Redshift are made efficient through the use of sort keys. Sort keys determine the order in which the columns are stored in the data blocks. Picking a column that appears frequently in your filtering conditions as a sort key can improve query performance significantly.

Compound sort keys with multiple columns can be defined on your table in case multiple columns are good candidates for sort keys. But in some situations where two or more high cardinality columns are sort key candidates, the compound sort key may not provide adequate performance. In these cases, a materialized view could be created with a different sort key to maintain that data in an alternate sorted order to help cater to a subset of the queries.

In the following example query, the web_sales table uses the column ws_sold_date_sk for the sort key, because this is the column that is used commonly for filtering rows. A smaller set of queries use ws_sales_price for filtering rows. Given that both ws_sold_date_sk and ws_sales_price are high cardinality columns with lots of unique values, a compound sort key with both columns may not be performant for all query patterns.

SELECT *
FROM web_sales 
WHERE ws_sales_price BETWEEN 50 AND 100;

Let’s create the following materialized view and see how it can help improve the performance of the preceding query:

CREATE MATERIALIZED VIEW web_sales_sort_on_price
SORTKEY (ws_sales_price)
AS
SELECT * FROM web_sales;

Running the following query returns rows much faster (5 milliseconds vs. 3,548 milliseconds) because the automatic query rewrite is using the materialized view:

SELECT *
FROM web_sales 
WHERE ws_sales_price BETWEEN 50 AND 100;

The following is the new explain plan:

XN Seq Scan on mv_tbl__web_sales_cust_dist__0 derived_table1  (cost=0.00..32399531.52 rows=10799844 width=260)
  Filter: ((ws_sales_price <= 100.00) AND (ws_sales_price >= 50.00))

Conclusion

Materialized views on Amazon Redshift can be a powerful optimization tool if used appropriately. With automatic query rewrite, you can optimize queries without any impact to end-users or their queries. This allows you to create materialized views after the application has gone live. Some customers plan this as part of their performance-optimization strategy when building new apps. The real value is that you can optimize queries and workloads without needing to modify the source code or scripts, and you can benefit even with a partial match.

About the Authors

Harshida Patel is a Specialist Sr. Solutions Architect, Analytics with AWS.

Jeetesh Srivastva is a Sr. Manager, Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and works with customers to implement scalable solutions using Amazon Redshift and other AWS Analytic services. He has worked to deliver on-premises and cloud-based analytic solutions for customers in banking and finance and hospitality industry verticals.

Sain Das is an Analytics Specialist Solutions Architect at AWS and helps customers build scalable cloud solutions that help turn data into actionable insights.

Somdeb Bhattacharjee is an Enterprise Solutions Architect at AWS.