AWS Big Data Blog
Compare different node types for your workload using Amazon Redshift
February 2023: This post was reviewed and updated to include support for Amazon Redshift Serverless. The latest release of the Amazon Redshift Node Configuration Comparison utility now supports Amazon Redshift Serverless for testing your workload performance.
If you want to explore different Amazon Redshift Serverless configurations, or a combination of Amazon Redshift provisioned and serverless configurations, based on your workload, follow the instructions outlined in this GitHub repo.
If you want to explore only Amazon Redshift provisioned configurations based on your workload, continue with the steps outlined in this post. This post will be updated soon with the changes related to the latest release.
Amazon Redshift is a fast, fully managed, widely popular cloud data warehouse that offers different node types to meet the flexible workload patterns of our customers. Amazon Redshift RA3 with managed storage is the newest instance type in Amazon Redshift. It allows you to scale and pay for compute and storage independently, and it enables advanced features like AQUA (Advanced Query Accelerator), cross-cluster data sharing, and cross-Availability Zone cluster relocation.
Many customers start their workload on Amazon Redshift with RA3 instances as their default choice, which currently offers three node sizes.
Node size | vCPU | RAM (GiB) | Managed storage quota per node |
--- | --- | --- | --- |
ra3.xlplus | 4 | 32 | 32 TB |
ra3.4xlarge | 12 | 96 | 128 TB |
ra3.16xlarge | 48 | 384 | 128 TB |
A very common question we get from customers is “Which RA3 instance type and number of nodes should we choose for our workload?” In this post, we answer that question with the following two-step process:
- Use the Amazon Redshift sizing calculator on the Amazon Redshift console, which estimates your cluster configuration based on data size and data access frequency.
- Use the Amazon Redshift Node Configuration Comparison utility to find the right configuration for your cluster based on your query performance expectations for sequential or concurrently running queries. If you’re already using Amazon Redshift and want to migrate from your existing DC2 or DS2 instances to RA3, you can follow our recommendations on node count and type when upgrading. Before doing that, you can use this utility to evaluate the new cluster’s performance by replaying your past workloads; it integrates with the Amazon Redshift Simple Replay utility to evaluate performance metrics for different Amazon Redshift configurations. We describe this utility in detail later in this post.
Amazon Redshift sizing calculator
The Amazon Redshift console provides the Help me choose option when you create a new Amazon Redshift cluster. This option allows you to get an estimate of your cluster configuration based on your data size and data access frequency. In this wizard, you need to provide the following information:
- The compressed or uncompressed size of your total dataset
- Whether your data is time based:
- If your data is time based, how many months of data your data warehouse contains and how many of those months are frequently queried
- If your data is not time based, what percentage of the data your queries access frequently
Based on your response, you receive a recommended node configuration, as shown in the following screenshot.
You can use this recommendation to create your Amazon Redshift cluster and quickly get started with your workload to gain insights from your data in the Amazon Redshift data warehouse.
Amazon Redshift Node Configuration Comparison utility
If you have stringent service level agreements (SLAs) for query performance in your Amazon Redshift cluster, or you want to explore different Amazon Redshift configurations based on the price/performance of your workload, you can use the Amazon Redshift Node Configuration Comparison utility. This utility evaluates the performance of your queries across different Amazon Redshift cluster configurations in parallel and compares the results to find the cluster configuration that best meets your needs.
To perform this comparison, you can use the Amazon Redshift sizing calculator recommendations in the previous step as a benchmark and test other configurations with it in parallel to compare performance. If you’re migrating from DC2 or DS2 node types to RA3, you can use our recommendations on node count and type as a benchmark.
Solution overview
The solution uses AWS Step Functions, AWS Lambda, and AWS Batch to run an end-to-end automated orchestration to find the best Amazon Redshift configuration based on your price/performance requirements. We use an AWS CloudFormation template to deploy and run this solution in your AWS account. Along with other resources, this template creates an Amazon Simple Storage Service (Amazon S3) bucket to store all data and metadata related to this process. The following diagram shows the solution architecture.
You need to create a JSON file to provide the following input configurations for your test:
- Amazon Redshift cluster configurations
- DDL and load script (optional)
- Amazon Redshift snapshot identifier (optional)
- SQL script to conduct sequential and concurrency test (optional)
- Amazon Redshift audit log location and simple replay time window (optional)
You need to store this file in an existing S3 bucket and then use the provided CloudFormation template to deploy this solution. The solution uses AWS CloudFormation to automatically provision all the required resources in your AWS account. For more information, see Getting started with AWS CloudFormation.
Deploying the solution also initiates an iteration of this test by invoking a Step Functions state machine in your AWS account.
Prerequisites
If you’re already running an Amazon Redshift workload in production, you can use this solution to replay your past workload using the Amazon Redshift Simple Replay utility. This helps you conduct what-if analysis on Amazon Redshift and evaluate how your workload performs with different configurations. For example, you can use the tool to benchmark your actual workload on a new instance type like RA3, evaluate a new feature like AQUA, or assess different cluster configurations. It replays your workload with the same time interval between connections, transactions, and queries as the source. It also supports extract, transform, and load (ETL) workloads including COPY and UNLOAD statements. The Amazon Redshift Node Configuration Comparison utility uses Simple Replay to automate all these configuration tests. As a prerequisite to use Simple Replay, you need to enable audit logging and user-activity logging in your Amazon Redshift cluster.
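For reference, the following AWS CLI sketch shows one way to enable both logs on an existing cluster; the cluster, bucket, and parameter group names are placeholders, and user-activity logging must be set in a custom (non-default) parameter group:

```bash
# Enable audit logging; connection and user logs are delivered to the S3 bucket.
aws redshift enable-logging \
  --cluster-identifier my-redshift-cluster \
  --bucket-name my-audit-log-bucket

# Enable user-activity logging through the cluster's custom parameter group.
# The change takes effect once the parameter group is applied to the cluster
# (a reboot may be required).
aws redshift modify-cluster-parameter-group \
  --parameter-group-name my-custom-parameter-group \
  --parameters ParameterName=enable_user_activity_logging,ParameterValue=true
```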
Example use case
As an example, assume you have an existing Amazon Redshift cluster with two nodes of dc2.8xlarge instances. You want to evaluate migrating this cluster to RA3 node types. You used the Amazon Redshift sizing calculator in the previous step, which recommends four nodes of ra3.4xlarge, but you also want to evaluate the performance with five nodes of ra3.4xlarge to meet your future growth demands. For that, you want to run five test queries sequentially as well as in 5 and 10 parallel sessions on all these clusters. You also want to replay the past hour's workload on these clusters and compare their performance.
You also want to test your workload performance with concurrency scaling enabled on those clusters, which helps improve concurrent workloads with consistently fast query performance.
The following table summarizes the five cluster configurations that are evaluated as part of this test.
Node Type | Number of Nodes | Option |
--- | --- | --- |
dc2.8xlarge | 2 | concurrency scaling – off |
ra3.4xlarge | 4 | concurrency scaling – off |
ra3.4xlarge | 4 | concurrency scaling – on |
ra3.4xlarge | 5 | concurrency scaling – off |
ra3.4xlarge | 5 | concurrency scaling – on |
To perform this test using the Amazon Redshift Node Configuration Comparison utility, you provide these configurations in a JSON file and store it in an S3 bucket. You then use the provided CloudFormation template to deploy this utility, which performs end-to-end performance testing on all the clusters in parallel and produces a performance evaluation summary to help you decide which configuration works best for you.
JSON file parameters
You need to provide a configuration JSON file to use this solution. The following table explains the input parameters for this JSON file.
JSON Parameter | Allowed Values | Description |
--- | --- | --- |
SNAPSHOT_ID | N/A, Amazon Redshift snapshot identifier | Enter the snapshot identifier if you want to create new Amazon Redshift clusters by restoring from a snapshot. The snapshot identifier is mandatory if you’re using Simple Replay in this test to replay your past workload. If you’re using this solution in a different AWS account, make sure to share your Amazon Redshift cluster snapshot with this account. For more information, see How do I copy an Amazon Redshift cluster to a different AWS account. Enter N/A if not applicable. |
SNAPSHOT_ACCOUNT_ID | N/A, AWS account ID | The AWS account ID where the snapshot was created. Enter N/A if not applicable. |
PARAMETER_GROUP_CONFIG_S3_PATH | N/A, Amazon S3 URI | If you use a custom parameter group for this test, provide its Amazon S3 URI. You can get this JSON by running the aws redshift describe-cluster-parameters command in the AWS Command Line Interface (AWS CLI) for your parameter group. Enter N/A if not applicable. |
DDL_AND_COPY_SCRIPT_S3_PATH | N/A, Amazon S3 URI | If you create tables and load data into them before performing the test, provide the script’s Amazon S3 URI. Enter N/A if not applicable. |
SQL_SCRIPT_S3_PATH | N/A, Amazon S3 URI | If you run performance testing of your queries, provide the Amazon S3 URI of the script containing all your SQL commands. These commands should be delimited by a semicolon (;). We recommend keeping only SELECT statements in this script. Enter N/A if not applicable. |
NUMBER_OF_PARALLEL_SESSIONS_LIST | N/A, comma-separated numbers | Enter comma-separated numbers to denote the number of parallel sessions in which you want to run the preceding script. Enter N/A if not applicable. |
SIMPLE_REPLAY_LOG_LOCATION | N/A, Amazon S3 URI | If you’re already running an Amazon Redshift workload and your cluster has audit logging enabled, provide the Amazon S3 URI of your Amazon Redshift audit logging location. If you’re using this solution in a different AWS account, copy these logs from your source cluster’s audit logging bucket to the S3 bucket in this account. Enter N/A if not applicable. |
SIMPLE_REPLAY_EXTRACT_START_TIME | N/A, time in ISO-8601 format | If you’re using Simple Replay in this test to replay your past workload, provide the start time of that workload in ISO-8601 format (for example, 2021-01-26T21:41:16+00:00). Enter N/A if not applicable. |
SIMPLE_REPLAY_EXTRACT_END_TIME | N/A, time in ISO-8601 format | If you’re using Simple Replay in this test to replay your past workload, provide the end time of that workload in ISO-8601 format (for example, 2021-01-26T21:41:16+00:00). Enter N/A if not applicable. |
SIMPLE_REPLAY_EXTRACT_OVERWRITE_S3_PATH | N/A, Amazon S3 URI | If you’re using Simple Replay and want to use a custom extract.yaml file, provide its Amazon S3 URI. Enter N/A if not applicable. |
SIMPLE_REPLAY_OVERWRITE_S3_PATH | N/A, Amazon S3 URI | If you’re using Simple Replay and want to use a custom replay.yaml file, provide its Amazon S3 URI. Enter N/A if not applicable. |
AUTO_PAUSE | true, false | Enter true if you want to automatically pause all except the first Amazon Redshift cluster created for this test; otherwise, enter false. |
DATABASE_NAME | N/A, Amazon Redshift database name | Specify the primary database name of your Amazon Redshift cluster. If you’re using Simple Replay, provide the database name for which you want to replay the workload. Amazon Redshift automatically creates a default database named dev, which may not be your primary database. |
CONFIGURATIONS | JSON array with parameters NODE_TYPE, NUMBER_OF_NODES, and WLM_CONFIG_S3_PATH | Enter a JSON array describing the Amazon Redshift cluster configurations for which you want to perform this test. We explain these parameters in the next three rows. |
NODE_TYPE | ra3.xlplus, ra3.4xlarge, ra3.16xlarge, dc2.large, dc2.8xlarge, ds2.xlarge, ds2.8xlarge | The Amazon Redshift cluster node type that you want to run for this test. |
NUMBER_OF_NODES | A number between 1 and 128 | The number of nodes for your Amazon Redshift cluster. |
WLM_CONFIG_S3_PATH | N/A, Amazon S3 URI | If you want to use custom workload management (WLM) settings for the different Amazon Redshift clusters, provide the Amazon S3 URI for them. Enter N/A if not applicable. |
The following code is a sample configuration JSON file used to implement this example use case:
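This sample is a representative sketch: the snapshot identifier, account ID, and the S3 object names under the node-config-compare-bucket bucket are placeholders you would replace with your own, and concurrency scaling is toggled per cluster through the WLM configuration file referenced in WLM_CONFIG_S3_PATH.

```json
{
  "SNAPSHOT_ID": "redshift-cluster-manual-snapshot",
  "SNAPSHOT_ACCOUNT_ID": "123456789012",
  "PARAMETER_GROUP_CONFIG_S3_PATH": "s3://node-config-compare-bucket/parameter_group_config.json",
  "DDL_AND_COPY_SCRIPT_S3_PATH": "N/A",
  "SQL_SCRIPT_S3_PATH": "s3://node-config-compare-bucket/test_queries.sql",
  "NUMBER_OF_PARALLEL_SESSIONS_LIST": "1,5,10",
  "SIMPLE_REPLAY_LOG_LOCATION": "s3://node-config-compare-bucket/audit-logs/",
  "SIMPLE_REPLAY_EXTRACT_START_TIME": "2021-08-28T11:15:00+00:00",
  "SIMPLE_REPLAY_EXTRACT_END_TIME": "2021-08-28T12:15:00+00:00",
  "SIMPLE_REPLAY_EXTRACT_OVERWRITE_S3_PATH": "N/A",
  "SIMPLE_REPLAY_OVERWRITE_S3_PATH": "N/A",
  "AUTO_PAUSE": true,
  "DATABASE_NAME": "database_name",
  "CONFIGURATIONS": [
    {
      "NODE_TYPE": "dc2.8xlarge",
      "NUMBER_OF_NODES": "2",
      "WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/wlm-concurrency-scaling-off.json"
    },
    {
      "NODE_TYPE": "ra3.4xlarge",
      "NUMBER_OF_NODES": "4",
      "WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/wlm-concurrency-scaling-off.json"
    },
    {
      "NODE_TYPE": "ra3.4xlarge",
      "NUMBER_OF_NODES": "4",
      "WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/wlm-concurrency-scaling-on.json"
    },
    {
      "NODE_TYPE": "ra3.4xlarge",
      "NUMBER_OF_NODES": "5",
      "WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/wlm-concurrency-scaling-off.json"
    },
    {
      "NODE_TYPE": "ra3.4xlarge",
      "NUMBER_OF_NODES": "5",
      "WLM_CONFIG_S3_PATH": "s3://node-config-compare-bucket/wlm-concurrency-scaling-on.json"
    }
  ]
}
```

The five entries in CONFIGURATIONS map to the five cluster configurations in the preceding table.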
Make sure to use the same S3 bucket to store all your configurations for this test. For example, we use the S3 bucket node-config-compare-bucket to store all the configuration scripts referenced in the preceding JSON configuration. After you populate all the parameters in this JSON file, save the file in that S3 bucket in your AWS account.
Deploy the solution using AWS CloudFormation
After you save the configuration JSON file in an S3 bucket, you can use the provided CloudFormation template to deploy this solution, which also initiates an iteration of this test. This template provisions the required AWS resources except for the Amazon Redshift clusters, which are created in the subsequent step by a Step Functions state machine. The following table lists the required parameters in the template.
CloudFormation Template Parameter | Allowed Values | Description |
--- | --- | --- |
ConfigJsonS3Path | Amazon S3 URI | Enter the Amazon S3 URI where you stored your JSON configuration file from the previous step. The template grants access on this S3 bucket to the appropriate AWS resources created by this solution. |
ClusterIdentifierPrefix | Prefix of Amazon Redshift cluster identifiers | Enter a valid string like rs to be used as the prefix of the Amazon Redshift cluster identifiers created by this solution. |
PreExistingS3BucketToGrantRedshiftAccess | N/A, Amazon S3 bucket name | If you’re using Simple Replay, provide the Amazon Redshift audit logging bucket name so the template can grant appropriate permissions on it to the AWS resources. You can also add an existing S3 bucket in the same AWS Region that Amazon Redshift can access. Enter N/A if not applicable. |
GrantS3ReadOnlyAccessToRedshift | Yes, No | If you’re using Simple Replay in the same AWS account as the source Amazon Redshift cluster, enter Yes to grant AmazonS3ReadOnlyAccess to the new Amazon Redshift clusters so they can replay COPY statements within the account. Otherwise, enter No; in that case, COPY statements can’t be replayed in a different AWS account without manual configuration. |
SourceRedshiftClusterKMSKeyARN | N/A, AWS KMS key ARN | Provide the AWS Key Management Service (AWS KMS) key ARN (Amazon Resource Name) if your source Amazon Redshift cluster is encrypted (available on the stack Outputs tab). If your source cluster is encrypted and you’re using Simple Replay, you need to run extract and replay in the same AWS account. |
OnPremisesCIDR | CIDR notation | The IP range (CIDR notation) for your existing infrastructure to access the target and replica clusters from a SQL client. If unsure, enter your corporate desktop’s CIDR address. For instance, if your desktop’s IP address is 10.156.87.45, enter 10.156.87.45/32. |
VPC | VPC ID | An existing Amazon Virtual Private Cloud (Amazon VPC) where you want to deploy the clusters. |
SubnetId | Subnet ID | An existing subnet within the VPC in which you deploy the Amazon Redshift clusters and the AWS Batch compute environment. |
UseAWSLakeFormationForGlueCatalog | No, Yes | The default value is No. Select Yes if AWS Lake Formation is enabled for the account and manages access to the AWS Glue Data Catalog. |
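If you prefer the AWS CLI to the console, a deployment could look like the following sketch; the stack name, template file name, and parameter values are placeholders, and you would pass any of the remaining parameters from the preceding table the same way:

```bash
# Hypothetical stack name and template file; IAM capabilities are
# acknowledged because the template creates IAM roles.
aws cloudformation create-stack \
  --stack-name redshift-node-config-comparison \
  --template-body file://node-config-compare.yaml \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
  --parameters \
    ParameterKey=ConfigJsonS3Path,ParameterValue=s3://node-config-compare-bucket/test-config.json \
    ParameterKey=ClusterIdentifierPrefix,ParameterValue=rs \
    ParameterKey=PreExistingS3BucketToGrantRedshiftAccess,ParameterValue=node-config-compare-bucket \
    ParameterKey=GrantS3ReadOnlyAccessToRedshift,ParameterValue=Yes \
    ParameterKey=OnPremisesCIDR,ParameterValue=10.156.87.45/32 \
    ParameterKey=VPC,ParameterValue=vpc-0123456789abcdef0 \
    ParameterKey=SubnetId,ParameterValue=subnet-0123456789abcdef0
```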
Performance evaluation
The CloudFormation stack deployed in the preceding step runs a Step Functions state machine to orchestrate the end-to-end workflow. Navigate to the Step Functions console to view the state machine deployed by this solution, as shown in the following screenshot.
This state machine creates the Amazon Redshift clusters you defined in the configuration JSON file and runs the performance evaluation tests on them. If you provided the value true for the parameter AUTO_PAUSE in the input JSON file, it also pauses all Amazon Redshift clusters except the first one. At the end, it unloads the performance comparison results to your S3 bucket.
The following screenshot shows the clusters that were created as part of this state machine run for our use case. The state machine automatically paused all clusters except the first one.
This state machine creates an external schema redshift_config_comparison and three external tables (comparison_stats, cluster_config, and pricing) in that schema to read the raw data created by this solution in the S3 bucket. Based on these external tables, it creates the views redshift_config_comparison_results and redshift_config_comparison_raw in the public schema of your Amazon Redshift clusters to compare their performance metrics. These objects can be accessed from any of the Amazon Redshift clusters created by this solution.
REDSHIFT_CONFIG_COMPARISON_RESULTS
This view provides the aggregated comparison summary of your Amazon Redshift clusters.
It provides the raw value and a percentage number for KPIs like total, mean, median, and max query times, as well as the 75th and 90th percentiles, to help you compare and find the most performant cluster among the different cluster configurations. The test_type column in this view indicates whether the test replayed your past workload using the Simple Replay utility or ran your queries in parallel as a concurrency test with different concurrency numbers.
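We use the following query to view our outcome; this is a minimal form, assuming the views live in the public schema as described above:

```sql
SELECT *
FROM public.redshift_config_comparison_results
ORDER BY test_type, cluster_identifier;
```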
The following table summarizes the performance comparison results:
Test Type | Cluster Identifier | Total Query Time in Seconds | Improved Total Query Time | Mean Query Time in Seconds | Improved Mean Query Time | Median Query Time in Seconds | Improved Median Query Time |
--- | --- | --- | --- | --- | --- | --- | --- |
simple-replay | rs-dc2-8xlarge-2 | 98.75 | 0% | 4.94 | 0% | 5.11 | 0% |
simple-replay | rs-ra3-4xlarge-4 | 94.16 | 5% | 4.71 | 5% | 4.08 | 25% |
simple-replay | rs-ra3-4xlarge-4-cs | 19.36 | 410% | 0.97 | 409% | 0.68 | 651% |
simple-replay | rs-ra3-4xlarge-5 | 16.21 | 509% | 0.85 | 481% | 0.65 | 686% |
simple-replay | rs-ra3-4xlarge-5-cs | 14.66 | 574% | 0.73 | 577% | 0.6 | 752% |
concurrency-1 | rs-dc2-8xlarge-2 | 49.6 | 0% | 9.92 | 0% | 10.67 | 0% |
concurrency-1 | rs-ra3-4xlarge-4 | 45.2 | 10% | 9.51 | 4% | 10.3 | 4% |
concurrency-1 | rs-ra3-4xlarge-4-cs | 45.2 | 10% | 9.51 | 4% | 10.4 | 3% |
concurrency-1 | rs-ra3-4xlarge-5 | 43.93 | 13% | 8.79 | 13% | 7.7 | 39% |
concurrency-1 | rs-ra3-4xlarge-5-cs | 43.94 | 13% | 8.78 | 13% | 7.78 | 37% |
concurrency-5 | rs-dc2-8xlarge-2 | 24.8 | 0% | 0.99 | 0% | 0.79 | 8% |
concurrency-5 | rs-ra3-4xlarge-4 | 21.24 | 17% | 0.85 | 16% | 0.7 | 22% |
concurrency-5 | rs-ra3-4xlarge-4-cs | 22.49 | 10% | 0.9 | 10% | 0.85 | 0% |
concurrency-5 | rs-ra3-4xlarge-5 | 19.73 | 26% | 0.79 | 25% | 0.72 | 19% |
concurrency-5 | rs-ra3-4xlarge-5-cs | 18.73 | 32% | 0.75 | 32% | 0.72 | 18% |
concurrency-10 | rs-dc2-8xlarge-2 | 98.2 | 0% | 1.96 | 0% | 1.63 | 0% |
concurrency-10 | rs-ra3-4xlarge-4 | 85.46 | 15% | 1.71 | 15% | 1.58 | 3% |
concurrency-10 | rs-ra3-4xlarge-4-cs | 88.09 | 11% | 1.76 | 11% | 1.59 | 2% |
concurrency-10 | rs-ra3-4xlarge-5 | 77.54 | 27% | 1.55 | 26% | 1.36 | 20% |
concurrency-10 | rs-ra3-4xlarge-5-cs | 74.68 | 31% | 1.49 | 32% | 1.31 | 24% |
Table 1: Summary of Performance Comparison Results
Based on these results, we can conclude that five nodes of ra3.4xlarge with concurrency scaling enabled was the best-performing configuration.
REDSHIFT_CONFIG_COMPARISON_RAW
This view provides the query-level comparison summary of your Amazon Redshift clusters. We use the following query:
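The following is a minimal form of that query, assuming the view created by the solution; the sort columns mirror the headers in the table below:

```sql
SELECT *
FROM public.redshift_config_comparison_raw
ORDER BY query_hash, cluster_identifier;
```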
The following table shows the performance comparison results per query:
Query Hash | Cluster Identifier | Execution Time in Seconds | Total Query Time in Seconds | Compile Time in Seconds | Queue Time in Seconds | Username | Query |
--- | --- | --- | --- | --- | --- | --- | --- |
0531f3b54885afb | rs-dc2-8xlarge-2 | 2 | 7 | 5 | 0 | awsuser | 599 |
0531f3b54885afb | rs-ra3-4xlarge-4 | 2 | 5 | 3 | 0 | awsuser | 510 |
0531f3b54885afb | rs-ra3-4xlarge-4-cs | 2 | 2 | 0 | 0 | awsuser | 499 |
0531f3b54885afb | rs-ra3-4xlarge-5 | 1 | 6 | 4 | 0 | awsuser | 498 |
0531f3b54885afb | rs-ra3-4xlarge-5-cs | 1 | 1 | 0 | 0 | awsuser | 457 |
10ef3990f05c9f8 | rs-dc2-8xlarge-2 | 0 | 0 | 0 | 0 | awsuser | 516 |
10ef3990f05c9f8 | rs-ra3-4xlarge-4 | 0 | 0 | 0 | 0 | awsuser | 427 |
10ef3990f05c9f8 | rs-ra3-4xlarge-4-cs | 0 | 0 | 0 | 0 | awsuser | 423 |
10ef3990f05c9f8 | rs-ra3-4xlarge-5 | 0 | 0 | 0 | 0 | awsuser | 412 |
10ef3990f05c9f8 | rs-ra3-4xlarge-5-cs | 0 | 0 | 0 | 0 | awsuser | 390 |
Table 2: Performance Comparison Per Query
Access permissions and security
To deploy this solution, you need administrator access on the AWS accounts where you plan to deploy the AWS CloudFormation resources for this solution.
The user deploying the CloudFormation stack needs full permissions on the following services: AWS Identity and Access Management (IAM), Amazon ECR, AWS Batch, AWS Lambda, Amazon CloudWatch, AWS Glue, Amazon S3, AWS Step Functions, Amazon Redshift, AWS Secrets Manager, Amazon EC2 security groups, and AWS Lake Formation (if you selected Yes for the CloudFormation parameter UseAWSLakeFormationForGlueCatalog).
The CloudFormation template provisions all the required resources using security best practices based on the principle of least privilege, and hosts all resources within your account’s VPC. Access to the Amazon Redshift clusters is controlled with the CloudFormation template parameter OnPremisesCIDR, which you need to provide to allow on-premises users to connect to the new clusters using their SQL clients on the Amazon Redshift port.
Access permissions for all the resources are controlled using IAM roles that grant appropriate permissions to Amazon Redshift, AWS Lambda, AWS Step Functions, AWS Glue, and AWS Batch. Read and write access privileges are granted to the Amazon Redshift clusters and AWS Batch jobs on the S3 bucket created by the CloudFormation template so that they can read and update data and metadata configurations from that bucket. Read and write access privileges are also granted on the S3 bucket where the user configuration JSON file is uploaded. AWS Batch requires internet access in order to pull images from the Amazon ECR public repository. AWS Lake Formation is used to manage access control on the AWS Glue catalog tables created for performance evaluation; this is optional, based on the UseAWSLakeFormationForGlueCatalog parameter in the CloudFormation template.
You can find the list of IAM permissions used by the utility here.
Troubleshooting
AWS Batch jobs can fail with the error CannotPullContainerError if the subnet doesn’t have a route to the internet to pull the container image. Refer to this KB article to resolve the issue.
There might be some rare instances in which failures occur in the state machine running this solution. To troubleshoot, refer to its logs, along with logs from the AWS Batch jobs, in Amazon CloudWatch Logs. To view the AWS Batch logs, navigate to the Amazon CloudWatch console and choose Logs in the navigation pane. Find the log group named <Your CloudFormation Stack Name>/log and choose the latest log streams.
To view the Step Functions logs, navigate to the state machine’s latest run on the Step Functions console and choose CloudWatch Logs for the failed Step Functions step.
After you fix the issue, you can restart the state machine by choosing New execution.
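You can also start a new run from the AWS CLI; the state machine name and ARN below are placeholders for the one deployed in your account (visible on the Step Functions console):

```bash
# Placeholder ARN; look up your state machine on the Step Functions console.
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:RedshiftNodeConfigComparison
```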
Clean up
Running this template in your AWS account may have cost implications because it provisions new Amazon Redshift clusters, which may be charged as on-demand instances if you don’t have Reserved Instances. When you complete your evaluations, we recommend deleting the Amazon Redshift clusters to save cost. We also recommend pausing your clusters when they’re not in use. If you don’t plan to run this test in the future, you should also delete the CloudFormation stack, which deletes all the resources it created.
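For example, cleanup from the AWS CLI might look like the following sketch; the cluster identifiers are the ones the utility created in your account, and the stack name is whatever you chose at deployment:

```bash
# Delete a test cluster you no longer need, without a final snapshot.
aws redshift delete-cluster \
  --cluster-identifier rs-ra3-4xlarge-5-cs \
  --skip-final-cluster-snapshot

# Or pause a cluster you want to keep for later evaluations.
aws redshift pause-cluster --cluster-identifier rs-ra3-4xlarge-5

# Delete the CloudFormation stack to remove the resources it created.
aws cloudformation delete-stack --stack-name redshift-node-config-comparison
```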
Limitations
Simple Replay and this automation process have some known limitations:
- Audit log delivery to Amazon S3 can take up to 7 hours; the extract process may fail if there is a lag in the delivery of audit logs for the selected workload window.
- Dependent SQL queries across connections aren’t guaranteed to run in the original order.
- Redshift Spectrum queries aren’t replayed if the target cluster doesn’t have access to external tables.
- Replay using JDBC isn’t supported.
- Test cases like data sharing and Serverless aren’t supported by this automation and require manual setup and execution of Simple Replay.
- Transactions with stored procedures that use bind variables aren’t replayed, which causes those transactions to fail.
You can minimize the impact of these limitations by using the replica and target clusters for comparison rather than direct comparison with the production Amazon Redshift cluster.
Conclusion
In this post, we walked you through the process to find the correct size of your Amazon Redshift cluster based on your performance requirements. You can start with our easy-to-use, out-of-the-box Amazon Redshift sizing calculator, or you can use the Amazon Redshift Node Configuration Comparison tool to evaluate the performance of your workload and find the best cluster configuration that meets your need.
About the Authors
Manash Deb is a Software Development Engineer in the AWS Directory Service team. He has worked on building end-to-end applications with different database technologies for over 15 years. He loves learning new technologies and solving, automating, and simplifying customer problems on AWS.
Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 15 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with cloud solutions.
Srinath Madabushi is a Manager, Database Engineering, at Amazon Redshift. He has worked in various data warehousing, BI, and database technologies for over 16 years. He is an avid big data enthusiast who collaborates with customers around the globe to meet their data warehousing needs.