AWS Big Data Blog

Get up to 3x better price performance with Amazon Redshift than other cloud data warehouses

Since we announced Amazon Redshift in 2012, tens of thousands of customers have trusted us to deliver the performance and scale they need to gain business insights from their data. Amazon Redshift customers span all industries and sizes, from startups to Fortune 500 companies, and we work to deliver the best price performance for any use case. Earlier in 2020, we published a blog post about improved speed and scalability in Amazon Redshift. This includes optimizations such as dynamically adding cluster capacity when you need it with concurrency scaling, making sure you use cluster resources efficiently with automatic workload management (WLM), and automatically adjusting data layout, distribution keys, and query plans to provide optimal performance for a given workload. We also described how customers, including Codecademy, OpenVault, Yelp, and Nielsen, have taken advantage of Amazon Redshift RA3 nodes with managed storage to scale their cloud data warehouses and reduce costs.

In addition to improving performance and scale, we are constantly looking at how to improve the price performance that Amazon Redshift provides. One of the ways we ensure that we provide the best value for customers is to measure performance regularly using a benchmark derived from the industry-standard TPC-DS benchmark. You can read the details of the benchmark at the end of this post, and can reproduce the results using the scripts, queries, and data in this GitHub repository.

We completed our most recent benchmark derived from the TPC-DS benchmark in November using the latest version of the products available across the vendors tested at that time. For Amazon Redshift, this includes more than 15 new capabilities released this year prior to November, but not new capabilities announced during AWS re:Invent 2020.

Best out-of-the-box and tuned price performance

Our November test of Amazon Redshift and three other leading cloud data warehouses showed that Amazon Redshift delivers up to three times better price performance out-of-the-box. The following chart illustrates these findings.

For this test, we ran all 99 queries derived from the TPC-DS benchmark against a 3 TB data set. We calculated price performance by multiplying the time required to run all queries in hours by the price per hour for each cloud data warehouse. We used clusters with comparable hardware characteristics for each data warehouse. We also used default settings for each cloud data warehouse, except we enabled encryption for all four services because it is enabled on two by default, and we disabled result caching where applicable. The default settings allowed us to determine the price performance delivered with no manual tuning effort. We selected the best result out of three runs for each query in order to take advantage of optimizations provided by each service. Finally, to ensure an apples-to-apples comparison, we used public pricing, and compared price performance rather than performance alone. For Amazon Redshift specifically, we used on-demand pricing; Amazon Redshift Reserved Instance pricing provides up to a 60% discount vs. on-demand pricing.
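The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's actual tooling: the query names, timings, and the hourly price are hypothetical values chosen only to show the arithmetic (best of three runs per query, summed, converted to hours, and multiplied by the hourly price).

```python
def best_of_runs(runs):
    """Return the fastest runtime (seconds) per query across several runs."""
    queries = runs[0].keys()
    return {q: min(run[q] for run in runs) for q in queries}


def run_cost_usd(runtimes_sec, price_per_hour_usd):
    """Cost of one pass over all queries: total hours times price per hour."""
    total_hours = sum(runtimes_sec.values()) / 3600
    return total_hours * price_per_hour_usd


# Hypothetical timings (seconds) for three queries over three runs.
runs = [
    {"q1": 12.0, "q2": 40.0, "q3": 7.5},
    {"q1": 10.5, "q2": 42.0, "q3": 7.0},
    {"q1": 11.0, "q2": 39.0, "q3": 8.0},
]

best = best_of_runs(runs)
# Hypothetical cluster price of 36.00 USD per hour.
cost = run_cost_usd(best, price_per_hour_usd=36.0)
print(best, round(cost, 4))
```

A lower cost for the same set of queries means better price performance, which is why a faster but more expensive cluster does not automatically win this comparison.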

These results show that Amazon Redshift provides the best price performance out-of-the-box, even for a comparatively small 3 TB dataset. This means that you can benefit from Amazon Redshift’s leading price performance from the start without manual tuning.

You can also take advantage of performance tuning techniques for Amazon Redshift to achieve even better results for your workloads. We repeated the benchmark test using tuning best practices provided by each cloud data warehouse vendor. After all cloud data warehouses are tuned, Amazon Redshift has 1.5 times better price performance than the nearest cloud data warehouse competitor, as shown in the following chart.

As with all benchmarks, transparency and reproducibility are crucial. For this reason, we have made the data and queries available on GitHub for anyone to use. See the README in GitHub for detailed instructions on re-running these benchmarks.

Tuned price performance improves as your data warehouse grows

One critical aspect of a data warehouse is how it scales as your data grows. Will you be paying more per TB as you add more data, or will your costs remain consistent and predictable? We work to make sure that Amazon Redshift delivers not only strong performance as your data grows, but also consistent price performance. We tested Amazon Redshift price performance using the queries derived from TPC-DS with 3 TB, 30 TB, and 100 TB datasets on three different cluster sizes. As shown in the following graph, Amazon Redshift tuned price performance improved (from $2.80 to $2.41 per TB per run) as the datasets grew. Tuning reduces the amount of network and disk I/O required for a given workload, and has varying impact depending on the combination of workload and cluster size.

In addition, as shown in the following table, Amazon Redshift out-of-the-box price performance is nearly the same ($4.80 to $5.01 per TB per run) for all three dataset sizes. This linear scaling of price performance across data size and cluster size, both out-of-the-box and tuned, makes sure that Amazon Redshift will scale predictably as your data and workloads grow.

Amazon Redshift results on test derived from TPC-DS benchmark

| Data set (TB) | Cluster | Out-of-box runtime (sec) | Out-of-box price per TB per run | Tuned runtime (sec) | Tuned price per TB per run |
|---|---|---|---|---|---|
| 3 | 10 node ra3.4xlarge | 1,591 | $4.80 | 926 | $2.80 |
| 30 | 5 node ra3.16xlarge | 8,291 | $5.01 | 4,198 | $2.53 |
| 100 | 10 node ra3.16xlarge | 13,343 | $4.83 | 6,644 | $2.41 |
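The price per TB per run figures in the table above can be recomputed from the runtimes. The sketch below assumes on-demand prices of $3.26 per hour for ra3.4xlarge and $13.04 per hour for ra3.16xlarge; these per-node prices are not stated in this post and are an assumption, as is the formula itself (cost of one run divided by the dataset size in TB).

```python
# Assumed on-demand prices in USD per node-hour (not stated in this post).
NODE_PRICE_USD_PER_HOUR = {"ra3.4xlarge": 3.26, "ra3.16xlarge": 13.04}


def price_per_tb_per_run(runtime_sec, nodes, node_type, dataset_tb):
    """Cost of one full run of the queries, divided by dataset size in TB."""
    cluster_price = nodes * NODE_PRICE_USD_PER_HOUR[node_type]
    return runtime_sec / 3600 * cluster_price / dataset_tb


# (dataset TB, node count, node type, out-of-box runtime, tuned runtime)
rows = [
    (3, 10, "ra3.4xlarge", 1591, 926),
    (30, 5, "ra3.16xlarge", 8291, 4198),
    (100, 10, "ra3.16xlarge", 13343, 6644),
]

results = []
for tb, nodes, node_type, oob_sec, tuned_sec in rows:
    oob = round(price_per_tb_per_run(oob_sec, nodes, node_type, tb), 2)
    tuned = round(price_per_tb_per_run(tuned_sec, nodes, node_type, tb), 2)
    results.append((tb, oob, tuned))
    print(f"{tb:>3} TB  out-of-box ${oob:.2f}  tuned ${tuned:.2f}")
```

Under these assumptions the computed values agree with the table to the cent, which suggests the per-TB figure is simply the run cost normalized by dataset size.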

You can learn more about Amazon Redshift’s performance on large datasets in How Amazon Redshift powers large-scale analytics for Amazon.com. This AWS re:Invent 2020 session shows how Amazon.com is using Amazon Redshift to keep up with exploding data growth, and how you can upgrade your existing data warehouse workloads to RA3 nodes to get scale and performance at great value.

Up to 10x better query performance with AQUA

The new capabilities announced during AWS re:Invent 2020 include AQUA (Advanced Query Accelerator) for Amazon Redshift. AQUA is a new distributed and hardware-accelerated cache for Amazon Redshift that delivers up to 10x better query performance than other cloud data warehouses for certain types of queries. AQUA takes a new approach to cloud data warehousing: it brings the compute to storage by doing a substantial share of data processing in place on the cache. In addition, it uses AWS-designed processors and a scale-out architecture to accelerate data processing beyond what traditional CPUs can do today. AQUA’s preview is now open to all customers, and AQUA will be generally available in January 2021. You can learn more about AQUA and other new Amazon Redshift capabilities by joining What’s new with Amazon Redshift at AWS re:Invent 2020.

Price performance continues to improve

We’re investing to make sure Amazon Redshift continues to improve as your data warehouse needs grow. As noted earlier, these benchmark results reflect the latest version of Amazon Redshift as of November 2020. This version includes more than 15 new features released earlier this year, such as distributed Bloom filters, vectorized queries, and automatic WLM, but doesn’t include the benefits from new capabilities announced during AWS re:Invent 2020. You can join What’s new with Amazon Redshift at AWS re:Invent 2020 to learn more about the new capabilities.

Find the best price performance for your workloads

The benchmark used in this blog is derived from the industry-standard TPC-DS benchmark, and has the following characteristics:

  • The schema and data are used unmodified from TPC-DS.
  • The queries are used unmodified from TPC-DS. TPC-approved query variants are used for a warehouse if the warehouse does not support the SQL dialect of the default TPC-DS query.
  • The test includes only the 99 TPC-DS SELECT queries. It does not include maintenance and throughput steps.
  • Three power runs (i.e., single stream) were run with query parameters generated using the default random seed of the TPC-DS kit.
  • The primary metric of total query runtime is used. The runtime is taken as the best of the three runs.
  • Price performance is calculated as cost per hour (USD) divided by queries per hour, which is equivalent to average cost per query. Published on-demand pricing is used for all data warehouses.
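The last characteristic in the list above states that cost per hour divided by queries per hour equals the average cost per query. A short check of that identity follows; the 1,591-second runtime comes from the results table, while the $32.60 hourly cluster price is an assumed value used only for illustration.

```python
import math

NUM_QUERIES = 99  # the 99 SELECT queries derived from TPC-DS


def cost_per_query(cost_per_hour_usd, total_runtime_sec, num_queries=NUM_QUERIES):
    """Price performance as cost per hour divided by queries per hour."""
    queries_per_hour = num_queries / (total_runtime_sec / 3600)
    return cost_per_hour_usd / queries_per_hour


def cost_per_run(cost_per_hour_usd, total_runtime_sec):
    """Total cost of one pass over all queries."""
    return cost_per_hour_usd * total_runtime_sec / 3600


hourly_price = 32.60  # assumed cluster price in USD per hour
runtime_sec = 1591    # out-of-box runtime for the 3 TB dataset

per_query = cost_per_query(hourly_price, runtime_sec)
per_run = cost_per_run(hourly_price, runtime_sec)

# Average cost per query times the number of queries gives the run cost.
print(math.isclose(per_query * NUM_QUERIES, per_run))
```

Because the two quantities differ only by the constant factor of 99 queries, ranking warehouses by cost per query or by cost per run gives the same order.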

We call this benchmark the Cloud Data Warehouse Benchmark, and you can reproduce the benchmark results above using the scripts, queries, and data available on GitHub. It is derived from the TPC-DS benchmark, and as such is not comparable to published TPC-DS results, as the results of our tests do not comply with the specification.

Each workload has unique characteristics, so if you’re just getting started, a proof of concept is the best way to understand how Amazon Redshift performs for your requirements. When running your own proof of concept, it’s important that you focus on proper cluster sizing and the right metrics: query throughput (number of queries per hour) and price performance. You can make a data-driven decision by requesting assistance with a proof of concept or working with a system integration and consulting partner.

If you’re an existing Amazon Redshift customer, connect with us for a free optimization session and briefing on the new features announced at AWS re:Invent 2020.

To stay up-to-date with the latest developments in Amazon Redshift, subscribe to the What’s New in Amazon Redshift RSS feed.


About the Authors

Eugene Kawamoto is a director of product management for Amazon Redshift. Eugene leads the product management and database engineering teams at AWS. He has been with AWS for about 8 years supporting analytics and database services both in Seattle and in Tokyo. In his spare time, he likes running trails in Seattle, loves finding new temples and shrines in Kyoto, and enjoys exploring his travel bucket list.

Stefan Gromoll is a Senior Performance Engineer with Amazon Redshift where he is responsible for measuring and improving Redshift performance. In his spare time, he enjoys cooking, playing with his three boys, and chopping firewood.