AWS Big Data Blog
Incremental refresh for Amazon Redshift materialized views on data lake tables
Amazon Redshift is a fast, fully managed cloud data warehouse that makes it cost-effective to analyze your data using standard SQL and business intelligence tools. You can use Amazon Redshift to analyze structured and semi-structured data and seamlessly query data lakes and operational databases, using AWS designed hardware and automated machine learning (ML)-based tuning to deliver top-tier price performance at scale.
Amazon Redshift delivers price performance right out of the box. However, it also offers additional optimizations that you can use to further improve this performance and achieve even faster query response times from your data warehouse.
One such optimization for reducing query runtime is to precompute query results in the form of a materialized view. Materialized views in Redshift speed up running queries on large tables. This is useful for queries that involve aggregations and multi-table joins. Materialized views store a precomputed result set of these queries and also support incremental refresh capability for local tables.
Customers use data lake tables to achieve cost effective storage and interoperability with other tools. With open table formats (OTFs) such as Apache Iceberg, data is continuously being added and updated.
Amazon Redshift now provides the ability to incrementally refresh your materialized views on data lake tables including open file and table formats such as Apache Iceberg.
In this post, we will show you step-by-step what operations are supported on both open file formats and transactional data lake tables to enable incremental refresh of the materialized view.
Prerequisites
To walk through the examples in this post, you need the following prerequisites:
- You can test the incremental refresh of materialized views on standard data lake tables in your account using an existing Redshift data warehouse and data lake. However, if you want to test the examples using sample data, download the sample data. The sample files are ‘|’ delimited text files.
- An AWS Identity and Access Management (IAM) role attached to Amazon Redshift to grant the minimum permissions required to use Redshift Spectrum with Amazon Simple Storage Service (Amazon S3) and AWS Glue.
- Set the IAM Role as the default role in Amazon Redshift.
Incremental materialized view refresh on standard data lake tables
In this section, you learn how to can build and incrementally refresh materialized views in Amazon Redshift on standard text files in Amazon S3, maintaining data freshness with a cost-effective approach.
- Upload the first file,
customer.tbl.1
, downloaded from the Prerequisites section in your desired S3 bucket with the prefixcustomer
. - Connect to your Amazon Redshift Serverless workgroup or Redshift provisioned cluster using Query editor v2.
- Create an external schema.
- Create an external table named
customer
in the external schemadatalake_mv_demo
created in the preceding step. - Validate the sample data in the external customer.
- Create a materialized view on the external table.
- Validate the data in the materialized view.
- Upload a new file
customer.tbl.2
in the same S3 bucket andcustomer
prefix location. This file contains one additional record.
- Using Query editor v2 , refresh the materialized view
customer_mv
. - Validate the incremental refresh of the materialized view when the new file is added.
- Retrieve the current number of rows present in the materialized view
customer_mv
. - Delete the existing file
customer.tbl.1
from the same S3 bucket and prefixcustomer
. You should only havecustomer.tbl.2
in thecustomer
prefix of your S3 bucket.
- Using Query editor v2, refresh the materialized view
customer_mv
again. - Verify that the materialized view is refreshed incrementally when the existing file is deleted.
- Retrieve the current row count in the materialized view
customer_mv
. It should now have one record as present in thecustomer.tbl.2
file. - Modify the contents of the previously downloaded
customer.tbl.2
file by altering the customer key from999999999
to111111111
.
- Save the modified file and upload it again to the same S3 bucket, overwriting the existing file within the
customer
prefix. - Using Query editor v2, refresh the materialized view
customer_mv
- Validate that the materialized view was incrementally refreshed after the data was modified in the file.
- Validate that the data in the materialized view reflects your prior data changes from
999999999
to111111111
.
Incremental materialized view refresh on Apache Iceberg data lake tables
Apache Iceberg is a data lake open table format that’s rapidly becoming an industry standard for managing data in data lakes. Iceberg introduces new capabilities that enable multiple applications to work together on the same data in a transactionally consistent manner.
In this section, we will explore how Amazon Redshift can seamlessly integrate with Apache Iceberg. You can use this integration to build materialized views and incrementally refresh them using a cost-effective approach, maintaining the freshness of the stored data.
- Sign in to the AWS Management Console, go to Amazon Athena, and execute the following SQL to create a database in an AWS Glue catalog.
- Create a new Iceberg table
- Add some sample data to
iceberg_mv_demo.category
. - Validate the sample data in
iceberg_mv_demo.category
. - Connect to your Amazon Redshift Serverless workgroup or Redshift provisioned cluster using Query editor v2.
- Create an external schema
- Query the Iceberg table data from Amazon Redshift.
- Create a materialized view using the external schema.
- Validate the data in the materialized view.
- Using Amazon Athena, modify the Iceberg table
iceberg_mv_demo.category
and insert sample data. - Using Query editor v2, refresh the materialized view
mv_category
. - Validate the incremental refresh of the materialized view after the additional data was populated in the Iceberg table.
- Using Amazon Athena, modify the Iceberg table
iceberg_mv_demo.category
by deleting and updating records. - Validate the sample data in
iceberg_mv_demo.category
to confirm thatcatid=4
has been updated andcatid=3
has been deleted from the table. - Using Query editor v2, Refresh the materialized view
mv_category
. - Validate the incremental refresh of the materialized view after one row was updated and another was deleted.
Performance Improvements
To understand the performance improvements of incremental refresh over full recompute, we used the industry-standard TPC-DS benchmark using 3 TB data sets for Iceberg tables configured in copy-on-write. In our benchmark, fact tables are stored on Amazon S3, while dimension tables are in Redshift. We created 34 materialized views representing different customer use cases on a Redshift provisioned cluster of size ra3.4xl with 4 nodes. We applied 1% inserts and deletes on fact tables, i.e., tables store_sales
, catalog_sales
and web_sales
. We ran the inserts and deletes with Spark SQL on EMR serverless. We refreshed all 34 materialized views using incremental refresh and measured refresh latencies. We repeated the experiment using full recompute.
Our experiments show that incremental refresh provides substantial performance gains over full recompute. After insertions, incremental refresh was 13.5X faster on average than full recompute (maximum 43.8X, minimum 1.8X). After deletions, incremental refresh was 15X faster on average (maximum 47X, minimum 1.2X). The following graphs illustrate the latency of refresh.
Inserts
Deletes
Clean up
When you’re done, remove any resources that you no longer need to avoid ongoing charges.
- Run the following script to clean up the Amazon Redshift objects.
- Run the following script to clean up the Apache Iceberg tables using Amazon Athena.
Conclusion
Materialized views on Amazon Redshift can be a powerful optimization tool. With incremental refresh of materialized views on data lake tables, you can store pre-computed results of your queries over one or more base tables, providing a cost-effective approach to maintaining fresh data. We encourage you to update your data lake workloads and use the incremental materialized view feature. If you’re new to Amazon Redshift, try the Getting Started tutorial and use the free trial to create and provision your first cluster and experiment with the feature.
See Materialized views on external data lake tables in Amazon Redshift Spectrum for considerations and best practices.
About the authors
Raks Khare is a Senior Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers across varying industries and regions architect data analytics solutions at scale on the AWS platform. Outside of work, he likes exploring new travel and food destinations and spending quality time with his family.
Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 15+ years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.
Raza Hafeez is a Senior Product Manager at Amazon Redshift. He has over 13 years of professional experience building and optimizing enterprise data warehouses and is passionate about enabling customers to realize the power of their data. He specializes in migrating enterprise data warehouses to AWS Modern Data Architecture.
Enrico Siragusa is a Senior Software Development Engineer at Amazon Redshift. He contributed to query processing and materialized views. Enrico holds a M.Sc. in Computer Science from the University of Paris-Est and a Ph.D. in Bioinformatics from the International Max Planck Research School in Computational Biology and Scientific Computing in Berlin.