AWS Big Data Blog
Extend your Amazon Redshift Data Warehouse to your Data Lake
Amazon Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.
Many companies today are using Amazon Redshift to analyze data and perform various transformations on the data. However, as data continues to grow and become even more important, companies are looking for more ways to extract valuable insights from the data, such as big data analytics, numerous machine learning (ML) applications, and a range of tools to drive new use cases and business processes. Companies are looking to access all their data, all the time, by all users and get fast answers. The best solution for all those requirements is for companies to build a data lake, which is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale.
With a data lake built on Amazon Simple Storage Service (Amazon S3), you can easily run big data analytics using services such as Amazon EMR and AWS Glue. You can also query structured data (such as CSV, Avro, and Parquet) and semi-structured data (such as JSON and XML) by using Amazon Athena and Amazon Redshift Spectrum. You can also use a data lake with ML services such as Amazon SageMaker to gain insights.
A large startup company in Europe uses an Amazon Redshift cluster to allow different company teams to analyze vast amounts of data. They wanted a way to extend the collected data into the data lake and allow additional analytical teams to access more data to explore new ideas and business cases.
Additionally, the company wanted to reduce their storage utilization, which had already reached more than 80% of their Amazon Redshift cluster’s storage capacity. The high storage utilization necessitated ongoing cleanup of growing tables to avoid purchasing additional nodes and the associated costs. The cleanup operations, however, added operational overhead. The proposed solution implemented a hot/cold storage pattern using Amazon Redshift Spectrum and reduced the local disk utilization on the Amazon Redshift cluster, which kept costs under control.
In this post we demonstrate how the company, with the support of AWS, implemented a lake house architecture by employing the following best practices:
- Unloading data into Amazon Simple Storage Service (Amazon S3)
- Instituting a hot/cold pattern using Amazon Redshift Spectrum
- Using AWS Glue to crawl and catalog the data
- Querying data using Athena
The following diagram illustrates the solution architecture.
The solution includes the following steps:
- Unload data from Amazon Redshift to Amazon S3
- Create an AWS Glue Data Catalog using an AWS Glue crawler
- Query the data lake in Amazon Athena
- Query Amazon Redshift and the data lake with Amazon Redshift Spectrum
To complete this walkthrough, you must have the following prerequisites:
- An AWS account.
- An Amazon Redshift cluster.
- The following AWS services and access: Amazon Redshift, Amazon S3, AWS Glue, and Athena.
- The appropriate AWS Identity and Access Management (IAM) permissions for Amazon Redshift Spectrum and AWS Glue to access Amazon S3 buckets. For more information, see IAM policies for Amazon Redshift Spectrum and Setting up IAM Permissions for AWS Glue.
To demonstrate the process performed by the company, we use the industry-standard TPC-H dataset provided publicly by the TPC organization.
The Orders table has the following columns:
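For reference, the TPC-H specification defines nine columns for the ORDERS table. The following is a minimal sketch of a matching Amazon Redshift table. The column names come from the TPC-H specification; the data types are one common mapping of the TPC-H types, and the table name orders is an assumption.

```sql
-- Sketch of the TPC-H ORDERS table in Amazon Redshift.
-- Column names follow the TPC-H specification; data types are an assumed mapping.
CREATE TABLE orders (
    o_orderkey      BIGINT        NOT NULL,  -- order identifier
    o_custkey       BIGINT        NOT NULL,  -- customer who placed the order
    o_orderstatus   CHAR(1)       NOT NULL,
    o_totalprice    DECIMAL(12,2) NOT NULL,
    o_orderdate     DATE          NOT NULL,  -- used later as the partition key
    o_orderpriority CHAR(15)      NOT NULL,
    o_clerk         CHAR(15)      NOT NULL,
    o_shippriority  INTEGER       NOT NULL,
    o_comment       VARCHAR(79)   NOT NULL
);
```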
Unloading data from Amazon Redshift to Amazon S3
Amazon Redshift allows you to unload your data using a data lake export to an Apache Parquet file format. Parquet is an efficient open columnar storage format for analytics. Parquet format is up to twice as fast to unload and consumes up to six times less storage in Amazon S3, compared with text formats.
To unload cold or historical data from Amazon Redshift to Amazon S3, you need to run an UNLOAD statement similar to the following code (substitute your IAM role ARN):
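A minimal sketch of such a statement is shown here, assuming the table is named orders and that rows older than a cutoff date count as cold. The bucket name, prefix, cutoff date, and IAM role ARN are placeholders, not values from the company's environment.

```sql
-- Unload orders placed before a cutoff date to Amazon S3 as Parquet,
-- partitioned by order date.
-- The bucket, prefix, cutoff date, and role ARN are hypothetical placeholders.
UNLOAD ('SELECT o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate,
                o_orderpriority, o_clerk, o_shippriority, o_comment
         FROM orders
         WHERE o_orderdate < ''1995-01-01''')
TO 's3://your-data-lake-bucket/orders/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<role-name>'
FORMAT AS PARQUET
PARTITION BY (o_orderdate);
```

With PARTITION BY, the files are written under Hive-style prefixes such as orders/o_orderdate=1994-12-31/. By default the partition column is not stored inside the Parquet files.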
It is important to define a partition key or column that minimizes Amazon S3 scans as much as possible, based on the intended query patterns. Queries often filter by date range; for this use case, use the o_orderdate field as the partition key.
Another important recommendation when unloading is to have file sizes between 128 MB and 512 MB. By default, the UNLOAD command splits the results into one or more files per node slice (a virtual worker in the Amazon Redshift cluster), which lets you use the Amazon Redshift MPP architecture. However, this can cause the files created by each slice to be small. In the company’s use case, the default PARALLEL ON yielded dozens of small files of only a few megabytes each. For the company, PARALLEL OFF yielded the best results because it aggregated all the slices’ work into the leader node and wrote it out as a single stream with a controlled file size.
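The serial unload described above can be sketched as follows. PARALLEL OFF and MAXFILESIZE are standard UNLOAD options; the 256 MB cap is an assumed value chosen to fall inside the recommended 128 MB to 512 MB range, and the bucket, prefix, cutoff date, and role ARN are placeholders.

```sql
-- Serial unload: a single output stream, split into files of at most 256 MB.
-- The 256 MB value, bucket, prefix, cutoff date, and role ARN are assumptions.
UNLOAD ('SELECT * FROM orders WHERE o_orderdate < ''1995-01-01''')
TO 's3://your-data-lake-bucket/orders/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<role-name>'
FORMAT AS PARQUET
PARTITION BY (o_orderdate)
PARALLEL OFF
MAXFILESIZE 256 MB;
```

MAXFILESIZE only sets an upper bound. Whether files actually reach the recommended range depends on how much data each partition holds, so treat the value as a starting point to tune.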
Another performance enhancement applied in this use case was the use of Parquet’s min and max statistics. Parquet files have min_value and max_value column statistics for each row group that allow Amazon Redshift Spectrum to prune (skip) row groups that are out of scope for a query (range-restricted scan). To use row group pruning, you should sort the data by frequently used columns. Min/max pruning helps scan less data from Amazon S3, which results in improved performance and reduced cost.
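One way to get sorted row groups is to add an ORDER BY to the unloaded query. The sketch below isolates that idea and is not the company's exact statement: it writes an unpartitioned copy to a separate, hypothetical prefix and assumes o_orderdate is the column most queries filter on.

```sql
-- Sort the unloaded rows so that each Parquet row group covers a narrow,
-- mostly non-overlapping range of o_orderdate values.
-- The prefix orders_sorted/, bucket, cutoff date, and role ARN are hypothetical.
UNLOAD ('SELECT * FROM orders
         WHERE o_orderdate < ''1995-01-01''
         ORDER BY o_orderdate')
TO 's3://your-data-lake-bucket/orders_sorted/'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<role-name>'
FORMAT AS PARQUET
PARALLEL OFF
MAXFILESIZE 256 MB;
```

Because the rows arrive in date order, a query restricted to a short date range only needs the row groups whose min/max range overlaps that interval.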
After unloading the data to your data lake, you can view your Parquet file’s content in Amazon S3 (assuming it’s under 128 MB). From the Actions drop-down menu, choose Select from.
You’re now ready to populate your Data Catalog using an AWS Glue crawler.
Creating a Data Catalog with an AWS Glue crawler
To query your data lake using Athena, you must catalog the data. The Data Catalog is an index of the location, schema, and runtime metrics of the data.
An AWS Glue crawler accesses your data store, extracts metadata (such as field types), and creates a table schema in the Data Catalog. For instructions, see Working with Crawlers on the AWS Glue Console.
Querying the data lake in Athena
After you create and run the crawler, you can view the schema and tables in AWS Glue and Athena, and can immediately start querying the data in Athena. The following screenshot shows the table in the Athena Query Editor.
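As an illustration, a query along the following lines could be run in the Athena Query Editor. The database name datalake_db and table name orders are assumptions (a crawler typically names the table after the S3 folder), and the filter assumes the crawler typed the o_orderdate partition column as a string.

```sql
-- Aggregate one year of cold orders directly from the data lake.
-- datalake_db and orders are hypothetical names for the crawled database and table.
-- The string comparison assumes o_orderdate was cataloged as a string partition key;
-- if it was cataloged as a date, compare against DATE '1994-01-01' literals instead.
SELECT o_orderstatus,
       COUNT(*)          AS order_count,
       SUM(o_totalprice) AS total_price
FROM "datalake_db"."orders"
WHERE o_orderdate >= '1994-01-01'
  AND o_orderdate <  '1995-01-01'
GROUP BY o_orderstatus
ORDER BY o_orderstatus;
```

Filtering on the partition column lets Athena read only the matching o_orderdate prefixes instead of the whole dataset.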
Querying Amazon Redshift and the data lake using a unified view with Amazon Redshift Spectrum
Amazon Redshift Spectrum is a feature of Amazon Redshift that allows multiple Amazon Redshift clusters to query the same data in the lake. It enables the lake house architecture and allows data warehouse queries to reference data in the data lake as they would any other table. Amazon Redshift clusters transparently use the Amazon Redshift Spectrum feature when a SQL query references an external table stored in Amazon S3. Amazon Redshift Spectrum can run multiple large queries in parallel against external tables, scanning, filtering, and aggregating data in Amazon S3 and returning rows to the Amazon Redshift cluster.
Following best practices, the company decided to persist all their data in their Amazon S3 data lake and only store hot data in Amazon Redshift. They could query both hot and cold datasets in a single query with Amazon Redshift Spectrum.
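Keeping only hot data in the cluster implies removing cold rows from the local table once they are safely in Amazon S3. The post does not show this step, so the following is only a sketch under assumptions: the table is named orders, the cutoff date matches the one used for the unload, and the unloaded files have already been verified.

```sql
-- Remove cold rows from the local table after verifying the unloaded copy in Amazon S3.
-- The table name and cutoff date are assumptions; this permanently deletes local rows.
DELETE FROM orders
WHERE o_orderdate < '1995-01-01';

-- Reclaim the disk space freed by the deleted rows and refresh planner statistics.
VACUUM DELETE ONLY orders;
ANALYZE orders;
```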
The first step is creating an external schema in Amazon Redshift that maps a database in the Data Catalog. See the following code:
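A minimal sketch of this statement follows. The schema name spectrum and the Data Catalog database name datalake_db are assumptions, and the IAM role ARN is a placeholder for a role that can read the Data Catalog and the Amazon S3 bucket.

```sql
-- Map a Data Catalog database to an external schema in Amazon Redshift.
-- spectrum and datalake_db are hypothetical names; substitute your own role ARN.
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'datalake_db'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<role-name>'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
```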
After the crawler creates the external table, you can start querying in Amazon Redshift using the mapped schema that you created earlier. See the following code:
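For example, assuming the external schema is named spectrum and the crawler created a table named orders, a query against the cold data might look like the following sketch.

```sql
-- Query cold data in Amazon S3 from Amazon Redshift through the external schema.
-- spectrum.orders is a hypothetical name for the crawled external table.
SELECT o_orderpriority,
       COUNT(*) AS order_count
FROM spectrum.orders
WHERE o_orderdate >= '1994-01-01'
  AND o_orderdate <  '1995-01-01'
GROUP BY o_orderpriority
ORDER BY o_orderpriority;
```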
Lastly, create a late binding view that unions the hot and cold data:
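The view can be sketched as follows, assuming the hot rows live in public.orders, the cold rows are exposed as spectrum.orders, and 1995-01-01 is the hot/cold cutoff. The view name orders_all and all of these names and dates are illustrative.

```sql
-- Late binding view over hot (local) and cold (Amazon S3) orders.
-- public.orders, spectrum.orders, orders_all, and the cutoff date are assumptions.
CREATE OR REPLACE VIEW public.orders_all AS
SELECT o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate,
       o_orderpriority, o_clerk, o_shippriority, o_comment
FROM public.orders                      -- hot data kept in the cluster
UNION ALL
SELECT o_orderkey, o_custkey, o_orderstatus, o_totalprice,
       CAST(o_orderdate AS DATE) AS o_orderdate,  -- partition key may be cataloged as a string
       o_orderpriority, o_clerk, o_shippriority, o_comment
FROM spectrum.orders                    -- cold data in the data lake
WHERE o_orderdate < '1995-01-01'        -- avoid double counting rows also present locally
WITH NO SCHEMA BINDING;
```

The WITH NO SCHEMA BINDING clause is what makes this a late binding view, which Amazon Redshift requires for views that reference external tables. Table names in such a view must be schema-qualified.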
In this post, we showed how a large startup company unloaded data from Amazon Redshift to a data lake. By doing that, they exposed the data to many additional groups within the organization and democratized the data. These benefits of data democratization are substantial because various teams within the company can access the data, analyze it with various tools, and come up with new insights.
As an additional benefit, the company reduced their Amazon Redshift utilized storage, which allowed them to maintain cluster size and avoid additional spending by keeping all historical data within the data lake and only hot data in the Amazon Redshift cluster. Keeping only hot data on the Amazon Redshift cluster spares the company from frequently deleting data, which saves IT resources, time, and effort.
If you are looking to extend your data warehouse to a data lake and leverage various tools for big data analytics and machine learning (ML) applications, we invite you to try out this walkthrough.
About the Authors
Yonatan Dolan is a Business Development Manager at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value.
Alon Gendler is a Startup Solutions Architect at Amazon Web Services. He works with AWS customers to help them architect secure, resilient, scalable and high performance applications in the cloud.
Vincent Gromakowski is a Specialist Solutions Architect for Amazon Web Services.