AWS Big Data Blog

Speed up your ELT and BI queries with Amazon Redshift materialized views

The Amazon Redshift materialized views function helps you achieve significantly faster query performance on repeated or predictable workloads such as dashboard queries from Business Intelligence (BI) tools, such as Amazon QuickSight. It also speeds up and simplifies extract, load, and transform (ELT) data processing. You can use materialized views to store frequently used precomputations and seamlessly use them to achieve lower latency on subsequent analytical queries.

This post demonstrates how to create a materialized view, refresh it after data ingestion, and speed up your BI workload.

Setting up your sample datasets

This walkthrough uses the Amazon Customer Reviews Dataset. It is a public dataset stored in the us-east-1 Region. You will create the following three tables:

  • product_reviews – Contains customer reviews for a specific product
  • customer – Contains customer profile data
  • customer_address – Contains customer address information

The following diagram shows the relationship of the three tables.

To download the script and set up the tables, choose mv_blog.sql.

Creating and using materialized views

For this use case, your marketing team wants to build a report that shows how many customers per state like your products. You also want to drill down to each product category when needed.

In this first step, you create a regular view. See the following code:

CREATE VIEW v_reviews_byprod_and_state AS
SELECT PR.product_category,
       A.ca_state AS customer_state,
       count(PR.star_rating) AS cnt
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND PR.marketplace = 'US'
GROUP BY 1,
         2;

The following code is a report to analyze the product review count per state:

SELECT customer_state,
       sum(cnt)
FROM v_reviews_byprod_and_state
GROUP BY 1
ORDER BY 2;

The following code is a report to analyze the product review count per state for specific categories:

SELECT customer_state,
       sum(cnt)
FROM v_reviews_byprod_and_state
WHERE product_category IN ('Home',
                           'Grocery')
GROUP BY 1
ORDER BY 2;

The preceding reports take approximately 4 seconds to run. As you sell more products and get more reviews, this elapsed time gradually gets longer. To speed up those reports, you can create a materialized view to precompute the count of reviews per product category and per state. See the following code:

CREATE MATERIALIZED VIEW mv_reviews_byprod_and_state AS
SELECT PR.product_category,
       A.ca_state AS customer_state,
       count(PR.star_rating) AS cnt
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND PR.marketplace = 'US'
GROUP BY 1,
         2;

The following code are the reports to analyze the product review against the materialized view.

SELECT customer_state,
       sum(cnt)
FROM mv_reviews_byprod_and_state
GROUP BY 1
ORDER BY 2;

SELECT customer_state,
       sum(cnt)
FROM mv_reviews_byprod_and_state
WHERE product_category IN ('Home',
                           'Grocery')
GROUP BY 1
ORDER BY 2;

The same reports against materialized views take less than 200 milliseconds because the new queries access precomputed joins, filters, grouping, and partial sums instead of the multiple, larger base tables.

Speeding up and simplifying ELT data processing

To achieve similar performance without the use of materialized views, many users use the CREATE TABLE AS (CTAS) command. However, as you update base tables with new data inserts, updates, or deletes, the CTAS tables become stale; you must recreate them to keep them up-to-date with the latest changes from the base tables. Now with Amazon Redshift materialized views, you can overcome this problem by efficiently and incrementally refreshing the materialized views with supported SQL. For example, the following code ingests another 10,000 reviews:

INSERT INTO product_reviews
SELECT   marketplace, 
  cast(customer_id as bigint) customer_id, 
  review_id, 
  product_id, 
  cast(product_parent as bigint) product_parent, 
  product_title, 
  star_rating, 
  helpful_votes, 
  total_votes, 
  vine, 
  verified_purchase, 
  review_headline, 
  review_body, 
  review_date, 
  year,
  product_category
FROM demo.products_reviews
WHERE review_date = '2015-07-01' LIMIT 10000;

Now the materialized view is out-of-date. To refresh the materialized view, enter the following code:

REFRESH MATERIALIZED VIEW mv_reviews_byprod_and_state;

Within 200 milliseconds, the materialized view is up-to-date again. Your report queries have the same consistent, fast performance.

The following screenshot is the query log that shows query performance. The log shows newer statements at the top.

The materialized views refresh is much faster because it’s incremental: Amazon Redshift only uses the new data to update the materialized view instead of recomputing the entire materialized view again from the base tables.  For more information, see REFRESH MATERIALIZED VIEW.

Materialized views also simplify and make ELT easier and more efficient. Without materialized views, you might create an ELT job and use CTAS to precompute the product analysis data. The ELT job recomputes this data after new data is ingested and stores the data in the precomputed product analysis table to meet the dashboard latency requirement.

In particular, the ETL job drops and recreates the precomputed product analysis table after each ingestion. See the following code:

BEGIN;
    DROP TABLE IF EXISTS latest_product_analysis;
    CREATE TABLE latest_product_analysis as SELECT ...;
END;

With materialized views, you just need to create the materialized view one time and refresh to keep it up-to-date. To refresh materialized views after ingesting new data, add REFRESH MATERIALIZED VIEW to the ELT data ingestion scripts. Redshift will automatically and incrementally bring the materialized view up-to-date.

Achieving faster performance for BI dashboards

You can use materialized views to help your BI team build a dashboard to analyze product trends.

For example, to create a materialized view to join customer and customer_address dimension tables and precompute reviews and ratings, enter the following code:

CREATE MATERIALIZED VIEW mv_product_analysis 
    sortkey(product_category, Customer_State, review_date) 
AS
SELECT PR.product_category,
       A.ca_state AS Customer_State,
       PR.review_date AS Review_Date,
       PR.product_title,
       COUNT(1) AS review_total,
       SUM(PR.star_rating) AS rating
FROM product_reviews PR,
     customer C,
     customer_address A
WHERE PR.customer_id = C.c_customer_sk
  AND C.c_current_addr_sk = A.ca_address_sk
  AND marketplace = 'US'
GROUP BY 1,2,3,4;

You access materialized views the same as you do a regular table. For this walkthrough, choose a materialized view as the source for an Amazon QuickSight dataset. As showing by the following screenshot.

You can preview data of the materialized view in Amazon QuickSight to understand what information can be used to build the dashboard. The following screenshot shows the sample data of mv_product_analysis.

To track how many reviews customers post over time, use review_date as the X-axis and Sum(review_total) as the Y-axis. The following graph shows this visualization.

The following screenshot shows a complete dashboard “Product trend” that analyzes the top product category, product popularity by state, and more.

Because you are using materialized views, the product trend dashboard loads in seconds and is always up-to-date. You can gain the latest insights, understand customer purchase behavior, and identify business opportunities and optimizations.

You can compare the performance of materialized views with other possible alternatives, such as using regular views and using CTAS. The following graph shows the overall query execution for the product trend dashboard. Materialized views not only improve query performance by more than an order of magnitude compared to using a regular view, but also have low maintenance costs compared to using a CTAS because the incremental refresh time is proportional to the delta of changes in the base tables. In contrast, the CTAS recreate approach needs to processes all the data in the base tables.

The following animated gif shows the actual response time for the product trend dashboard built using Amazon QuickSight in direct query mode.

Conclusion

This post showed how to create Amazon Redshift materialized views with one or more base tables to speed up both BI queries and ELT. You can easily build and maintain efficient data processing pipelines and seamlessly extend the low latency query execution benefits of materialized views to data analysis.

 


About the Authors

Juan Yu is a Data Warehouse Specialist Solutions Architect at AWS.

 

 

 

 

Jose Kunnackal John is principal product manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.