AWS Database Blog

Five workload characteristics to consider when right sizing Amazon ElastiCache Redis clusters

This post discusses how to determine the right node size and cluster topology for your Amazon ElastiCache workloads, and the important factors to consider. It assumes you have a good working knowledge of Redis and its commands, and that you understand Amazon ElastiCache for Redis and its features, such as online cluster resizing, scaling, online migration from Amazon EC2 to ElastiCache, general-purpose and memory-optimized nodes, and enhanced I/O.

Baseline recommendation

For entry-level cache workloads that are small (2,000 TPS or less and 10 GB or less of data) or medium (2,000–20,000 TPS and 10–100 GB of data), including those that may experience temporary spikes in use, choose a cache node from the T3 family, the next generation of general-purpose burstable T3-Standard cache nodes. If you’re just starting to use ElastiCache for your workloads, start on a T3.micro cache node because it offers a free tier. You can go up to T3.medium cache nodes as you increase the load.

For moderate to high workloads (more than 20,000 TPS and more than 100 GB of data), choose a cache node from the M5 or R5 families, because these newer node types support the latest generation of CPUs and networking capabilities. These cache node families can deliver up to 25 Gbps of aggregate network bandwidth with enhanced networking based on the Elastic Network Adapter (ENA), and over 600 GiB of memory. The R5 node types provide 5% more memory per vCPU and a 10% price-per-GiB improvement over R4 node types. In addition, R5 node types deliver an approximately 20% CPU performance improvement over R4 node types.

If T3.medium is no longer sufficient, you can move to one of the following:

  • M5 cache nodes if you need more throughput with some increased memory
  • R5 cache nodes if you need more throughput and up to 35%–51% higher memory per cache node

To further narrow down the node size and cluster topology suitable for your workloads, you need to do the following:

  • Determine your five workload characteristics
  • Run your benchmark testing

Determining your five workload characteristics

You can determine most workload characteristics using your application metrics, the Redis INFO command, or Amazon CloudWatch metrics. For more information about maximum node memory, see Redis Node-Type Specific Parameters.
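For instance, here is a minimal redis-py sketch that pulls a few of these signals from INFO; the endpoint is a placeholder, and CloudWatch exposes the same kind of data over longer windows, as sketched later in this post:

```python
import redis

# Placeholder endpoint -- replace with your cluster's endpoint.
r = redis.Redis(host="my-redis.xxxxxx.ng.0001.use1.cache.amazonaws.com", port=6379)

info = r.info()
print("Memory in use:     ", info["used_memory_human"])
print("Connected clients: ", info["connected_clients"])
print("Ops per second:    ", info["instantaneous_ops_per_sec"])
hits, misses = info["keyspace_hits"], info["keyspace_misses"]
print("Cache hit rate:     {:.1%}".format(hits / (hits + misses) if hits + misses else 0))
```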

When determining your ElastiCache node requirements, consider the following:

  • Memory
  • Spare or reserved memory
  • Availability
  • Scaling
  • Data

Memory

The following considerations can help you start identifying potential node sizes:

  • Identify your full datastore size, and your key and value data sizes. You can get an approximate estimate of the amount of cache memory you need by multiplying the size of the items you want to cache by the number of items you want to keep cached at once (see the sketch after this list).
  • Decide whether you intend to retain the data or use a TTL to expire keys in the cache. TTLs enable explicit memory management on the node.
  • Identify the existing and preferred cache hit rate if this is an important metric for you, such as in cache-only use cases. You want to make sure that your cluster has the desired hit rate, or that keys aren’t evicted too often. You can achieve this with more memory capacity.
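As a rough illustration of the first point, the following sketch multiplies an assumed item size by an assumed item count, and optionally cross-checks the estimate by sampling MEMORY USAGE on a running Redis. The item counts, sizes, per-key overhead allowance, and endpoint are made-up assumptions, not values from this post:

```python
from itertools import islice

import redis

# Back-of-the-envelope estimate: assumed item counts and sizes -- replace with your own.
ITEM_COUNT = 15_000_000          # items you want to keep cached at once
AVG_KEY_BYTES = 16               # average key size
AVG_VALUE_BYTES = 200            # average value size
PER_KEY_OVERHEAD_BYTES = 50      # rough allowance for Redis per-key metadata

estimate_gb = ITEM_COUNT * (AVG_KEY_BYTES + AVG_VALUE_BYTES + PER_KEY_OVERHEAD_BYTES) / 1024**3
print(f"Estimated dataset size: {estimate_gb:.1f} GB")

# Optional cross-check against a running Redis: MEMORY USAGE reports the bytes
# a sampled key and its value actually consume.
r = redis.Redis(host="localhost", port=6379)   # placeholder endpoint
sampled = [r.memory_usage(key) or 0 for key in islice(r.scan_iter(count=100), 1000)]
if sampled:
    print(f"Average bytes per sampled key: {sum(sampled) / len(sampled):.0f}")
```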

Spare or reserved memory

You should keep at least 25% of the node’s memory in reserve, beyond what your data occupies. Replication uses some memory on the primary node. In addition, the cache nodes should have approximately 10%–15% of spare memory for unexpected load peaks and for early detection, via CloudWatch alarms, of a growing memory footprint. You can use this early detection to decide whether to scale up or scale out, depending on your specific requirements.

Write-heavy applications can require significantly more available memory that isn’t occupied by data. You need this spare memory when taking snapshots or failing over to one of the replicas.
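To make the arithmetic concrete, here is a minimal sketch of that headroom calculation. The 25% and 10%–15% figures come from the guidance above; the 30 GB dataset and the node sizes in the comments are illustrative assumptions:

```python
# Hypothetical example: 30 GB of data, using the reservation guidance above.
dataset_gb = 30.0

reserved_fraction = 0.25        # keep at least 25% of the node apart from the data
peak_headroom_fraction = 0.15   # ~10-15% of the node spare for unexpected load peaks

usable_fraction = 1 - reserved_fraction - peak_headroom_fraction
required_node_gb = dataset_gb / usable_fraction
print(f"Minimum node memory: {required_node_gb:.1f} GB")
# 30 GB of data -> a node with at least ~50 GB of memory, so a ~52 GiB
# cache.r5.2xlarge-class node is a more comfortable fit than a ~26 GiB node.
```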

Availability

If you need your cluster to be available to service your customer requests, consider setting up replication groups with one primary, at least two replicas, and Multi-AZ enabled. This helps protect your data and the cluster continues to serve traffic if the primary fails for any reason. When that happens, one of the replicas becomes the new primary. Replicas can also help you increase your read throughput.
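As a sketch of that topology with boto3, creating a replication group with one primary, two replicas, automatic failover, and Multi-AZ might look like the following; the group ID, node type, subnet group, and security group are placeholders, not values from this post:

```python
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

# One primary + two replicas (NumCacheClusters=3), Multi-AZ, automatic failover.
elasticache.create_replication_group(
    ReplicationGroupId="my-redis-rg",                  # placeholder name
    ReplicationGroupDescription="Primary with two replicas across AZs",
    Engine="redis",
    CacheNodeType="cache.r5.large",
    NumCacheClusters=3,
    AutomaticFailoverEnabled=True,
    MultiAZEnabled=True,
    CacheSubnetGroupName="my-subnet-group",            # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],         # placeholder
)
```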

Watch out for a write-heavy primary whose write:read ratio exceeds 50% and that is running close to 80% of the request rate limit for its node type. Write-heavy primary nodes may do a full sync with the replicas more often, which impacts your cluster’s performance. Frequent full syncs consume time on the primary node that could otherwise be spent processing incoming requests.
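One way to watch that ratio is to compare the SetTypeCmds and GetTypeCmds CloudWatch metrics for the primary node. A rough boto3 sketch follows; the cluster ID and Region are placeholders:

```python
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.datetime.utcnow()
start = end - datetime.timedelta(hours=24)

def command_count(metric_name):
    """Sum a command-count metric for the primary node over the last 24 hours."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName=metric_name,
        Dimensions=[{"Name": "CacheClusterId", "Value": "my-redis-rg-001"}],  # placeholder primary
        StartTime=start,
        EndTime=end,
        Period=3600,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in resp["Datapoints"])

writes = command_count("SetTypeCmds")
reads = command_count("GetTypeCmds")
if writes + reads:
    print(f"Write share of traffic: {writes / (writes + reads):.0%}")
```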

Also, resist the urge to spin up lots of replicas just for availability; it creates unnecessary stress on the primary to sync with many replicas. There is a limit of five replicas per primary node. One or two replicas in a different Availability Zone are sufficient for availability.

Scaling

ElastiCache offers a cluster-mode enabled configuration that supports online scaling both vertically (up and down) and horizontally (in and out) while the cluster continues to serve requests. It’s better to scale out if many simultaneous clients drive the load on the primary node (for example, 10,000 or more requests per second, whether that’s 1 TPS from 10,000 clients or 5 TPS from 2,000 clients), to make sure you have the compute capacity available to service them all. The optimal number of simultaneous clients per primary node depends on your specific use case and overall application architecture. Besides serving a very high number of simultaneous clients, scaling out spreads data across multiple shards, which further increases the availability of your data in the cluster. However, if your business requires higher performance from the existing cluster configuration, you should scale up. Scaling up increases the performance of an individual node.

Migration between the two cluster configurations (cluster-mode disabled and cluster-mode enabled) is supported through backup and restore, an offline operation that uses the .rdb file from your source cluster. Therefore, you should use the cluster-mode enabled configuration by default, because it permits both vertical and horizontal scaling to meet future needs.

If you’re reducing the size and memory capacity of the cluster, by either scaling in or scaling down, make sure that the new configuration has sufficient memory for your data and Redis overhead.
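Assuming a cluster-mode enabled replication group, both operations can be requested online through boto3. A minimal sketch follows; the group ID, target node type, and shard count are placeholders:

```python
import boto3

elasticache = boto3.client("elasticache", region_name="us-east-1")

# Scale up: move every node in the group to a larger node type online.
elasticache.modify_replication_group(
    ReplicationGroupId="my-redis-rg",      # placeholder
    CacheNodeType="cache.r5.xlarge",       # target node type
    ApplyImmediately=True,
)

# Scale out: reshard a cluster-mode enabled group to 4 shards;
# ElastiCache rebalances the hash slots online.
elasticache.modify_replication_group_shard_configuration(
    ReplicationGroupId="my-redis-rg",
    NodeGroupCount=4,
    ApplyImmediately=True,
)
```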

Data

Determine whether your workloads have any hot keys, such as one or more data objects that are requested at very high rates or have suddenly become very large. Hot keys can impair your cache engine’s ability to maintain high performance and serve all requests. For that use case, assuming you’re using the recommended cluster-mode enabled configuration, you can spread the load across shards: keep the hot key in one shard and the rest of the keys in other shards so that other incoming requests aren’t blocked. If there are multiple hot keys, consider spreading them across shards.
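To reason about where keys land, recall that Redis Cluster maps every key to one of 16,384 hash slots using CRC16(key) mod 16384 (honoring {hash tag} syntax), and each shard owns a range of slots. The following standalone sketch reproduces that mapping so you can check which slot a suspected hot key shares with other keys; the sample key names are made up:

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM), the checksum Redis Cluster uses for key slots."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc


def key_slot(key: str) -> int:
    """Map a key to its Redis Cluster hash slot, honoring {hash tag} syntax."""
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:                 # non-empty hash tag
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384


for name in ("hot:item:42", "user:1001", "user:{1001}:profile"):
    print(name, "-> slot", key_slot(name))
```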

In addition, consider separating your read and write workloads. This separation lets you scale reads by adding replicas as your application grows. Replicas provide eventually consistent reads. For the cluster-mode disabled configuration, ElastiCache provides a reader endpoint that load balances read traffic across the replicas, enabling the separation of reads and writes. In addition, some Redis clients for the cluster-mode enabled configuration allow traffic to be routed to replicas; review your specific Redis client’s documentation for this mechanism.
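For a cluster-mode disabled replication group, a minimal redis-py sketch of that separation might look like the following; the endpoints are placeholders, writes go to the primary endpoint, and reads from the reader endpoint are eventually consistent:

```python
import redis

# Placeholder endpoints for a cluster-mode disabled replication group.
PRIMARY_ENDPOINT = "my-redis-rg.xxxxxx.ng.0001.use1.cache.amazonaws.com"
READER_ENDPOINT = "my-redis-rg-ro.xxxxxx.ng.0001.use1.cache.amazonaws.com"

writer = redis.Redis(host=PRIMARY_ENDPOINT, port=6379)
reader = redis.Redis(host=READER_ENDPOINT, port=6379)

writer.set("user:1001:name", "Anu")   # writes always go to the primary
print(reader.get("user:1001:name"))   # reads are load balanced across replicas
                                      # and may briefly lag behind the primary
```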

Running your benchmark testing

After determining the parameters applicable to your case, identify a few best-fitting cache node sizes and cluster topologies. Two large cache nodes may (or may not) be better than one xlarge cache node. Configure your client application and run your benchmark tests against each candidate, in line with your workload characteristics, in a production-like environment. Run your benchmark tests with production data and traffic patterns for no less than 14 days to establish a good baseline of your regular production workload pattern. Once you have the baseline, include seasonality, such as holidays or Black Friday sales, in your workload to get performance benchmark results that reflect your actual workload patterns more closely. Based on the outcome of the benchmark testing, you can select the right node size and cluster configuration for your Redis workloads.
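Dedicated tools such as redis-benchmark or memtier_benchmark are the usual way to generate this load. Purely to illustrate the 80% get / 20% set shape used in the benchmarks below, here is a minimal single-connection redis-py sketch; the endpoint, key space, and request count are illustrative assumptions, not the published test setup:

```python
import random
import time

import redis

r = redis.Redis(host="my-test-cluster.xxxxxx.use1.cache.amazonaws.com", port=6379)  # placeholder

REQUESTS = 100_000
KEYSPACE = 1_000_000
VALUE = "x" * 200            # 200-byte values, as in the published benchmarks
latencies = []

for _ in range(REQUESTS):
    key = f"bench:{random.randrange(KEYSPACE)}"
    start = time.perf_counter()
    if random.random() < 0.8:          # 80% gets, 20% sets
        r.get(key)
    else:
        r.set(key, VALUE)
    latencies.append(time.perf_counter() - start)

latencies.sort()
print("Throughput (single connection): ~", int(REQUESTS / sum(latencies)), "requests/sec")
print(f"p99 latency: {latencies[int(0.99 * len(latencies))] * 1000:.2f} ms")
```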

ElastiCache benchmarking

ElastiCache has already published benchmarking results in two separate blog posts. The first benchmarking post compares R4 and optimized R5 cache nodes. For more information, see Amazon ElastiCache performance boost with Amazon EC2 M5 and R5 instances. The second benchmarking post covers the R5 family and compares Redis 5.0.3 with enhanced I/O against Redis 5.0.0, which doesn’t offer enhanced I/O. For more information, see Boosting application performance and reducing costs with Amazon ElastiCache for Redis. The following section explains their test setup in more detail.

Comparing R4, R5, optimized R5, and optimized R5 with enhanced I/O cache nodes

The two benchmarking exercises used the same configuration for consistent results and an apples-to-apples comparison. Both used 14.7 million unique keys, the same memory usage, 80% gets, 20% sets, and no command pipelining. The benchmarks ran on client instances connecting to an ElastiCache cluster in the same Availability Zone.

The following table summarizes the benchmark test setup.

| Workload attribute | Attribute values used in first benchmarking | Attribute values used in second benchmarking |
| --- | --- | --- |
| Memory | 14.7 million keys with 200-byte string values = 2.9 GB. No TTL. Key was a 4-byte random string with values in the range [a-z, A-Z, 0-9] (62^4 values ≈ 14.7 million keys). Value was a 200-byte non-random/regenerated string. | 14.7 million keys with 200-byte string values = 2.9 GB. No TTL. Key was a 16-byte random string with values in the range [a-z, A-Z, 0-9]. Value was a 200-byte non-random/regenerated string. |
| Spare memory | 5 GB; accounting for 25% for snapshotting, the cache node should be at least 2.9 + 5 + 2.7 = 10.5 GB. | Accounting for 25% for snapshotting. |
| Availability | One primary with no replicas. | One primary with no replicas. |
| Scaling | Cluster-mode disabled. Each test had 20 application nodes. Each application node opened a variable number of connections based on the node type: more connections for larger nodes (to increase throughput), fewer for smaller nodes. The number of connections was based on how many could be opened without significantly increasing the p99.9 request latency. | Cluster-mode disabled. Each test had 800 client connections coming from 15 different EC2 hosts. |
| Data | No hot keys. Up to 160 client connections. Keys were generated randomly. | No hot keys. 800 client connections. Keys were generated randomly. |

The following table summarizes the data from both benchmarking exercises:

| Cache Node Size | ElastiCache R4 Node | ElastiCache Vanilla R5 Node | ElastiCache Optimized R5 Node | ElastiCache Optimized R5 with Enhanced I/O Node* |
| --- | --- | --- | --- | --- |
| large | 88,000 RPS | 179,000 RPS | 215,000 RPS | N/A |
| xlarge | 93,000 RPS | 180,000 RPS | 207,000 RPS | 238,000 RPS |
| 2xlarge | 107,000 RPS | 187,000 RPS | 217,000 RPS | 360,000 RPS |
| 4xlarge | 131,000 RPS | 208,000 RPS | 225,000 RPS | 453,000 RPS |
| 8xlarge/12xlarge | 128,000 RPS | 211,000 RPS | 247,000 RPS | 452,000 RPS |
| 16xlarge/24xlarge | 149,000 RPS | 181,000 RPS | 237,000 RPS | 434,000 RPS |

* available in ElastiCache for Redis version 5.0.3 onwards

Conclusion

Selecting the right node size and cluster configuration for your workloads is an important activity that you should perform regularly, including before migrating to ElastiCache. It’s not a one-time exercise; repeat it throughout the year, especially well in advance of any major upcoming business event. This prepares your teams to handle the scale and expected growth in traffic, so you can continue to serve your customers seamlessly.

If you have any questions or feedback, reach out on the AWS ElastiCache Discussion Forum or in the comments.

 


About the Author

Anumeha is a Product Manager with Amazon Web Services.