Configure Amazon ElastiCache for Redis for higher availability

This post was updated 3/10/2021 to include additional features and enhancements to Amazon ElastiCache for Redis.

Amazon ElastiCache has become synonymous with real-time applications. Redis’ high performance, simplicity, and support for diverse data structures have made it one of the most popular non-relational key-value stores. With the growth of business-critical, real-time use cases on Redis, ensuring availability becomes an important consideration.

To provide high availability, Amazon ElastiCache for Redis supports Redis Cluster configuration, which delivers superior scalability and availability. In addition, Amazon ElastiCache offers multiple Availability Zone (Multi-AZ) support with auto failover that enables you to set up a cluster with one or more replicas across zones. In the event of a failure on the primary node, Amazon ElastiCache for Redis automatically fails over to a replica to ensure high availability.

Amazon ElastiCache for Redis made several announcements that improve the end-to-end availability of your Redis applications.

Cluster availability during planned maintenance improves availability for auto-failover enabled clusters during patching, updates, and other maintenance-related activities that involve node replacements. For Redis Cluster configurations set up to use Redis Cluster clients, planned maintenance and node replacements now complete without any write interruption. For non-Redis Cluster (non-sharded) configurations, you may notice a brief write interruption of up to a few seconds, associated with DNS updates.
Self-service updates allow you to minimize any maintenance impacts by controlling when to initiate maintenance updates. You can find more information, and frequently asked questions about maintenance windows at this dedicated FAQ page.
Single reader endpoints for non-Redis Cluster configuration allow you to direct read traffic, without having to track individual replica endpoint changes. This improves availability by eliminating the need for your application to track changes to individual node endpoints. For Redis Cluster configuration, this capability is typically already handled by the Redis cluster smart clients. Additionally, many clients have a READONLY option which is useful if you wish to leverage Replicas in your Redis Cluster for greater read-scalability within your applications.
Dynamic rename for Redis commands allows you to rename Redis commands in an online manner, without any reboots or availability impact. For Redis 6 workloads, Amazon ElastiCache for Redis supports role-based access control to limit access to commands, and restrict access to certain keys using access strings based on user accounts.

To get the most out of these improvements and overall availability, review your configuration and make sure that it is set up to offer the best availability. The following sections walk through best practices for configuring Amazon ElastiCache for Redis clusters, Redis clients, as well as general application tips for availability.

Configuring Amazon ElastiCache for Redis

Amazon ElastiCache for Redis can be set up by selecting the appropriate node types, Redis configuration (Redis Cluster or non-Redis Cluster), number of replicas, and other opt-in features. As a first step, review the configuration of your Amazon ElastiCache for Redis cluster:

Enable Multi-AZ with automatic failover: Enabling Multi-AZ minimizes downtime by performing automatic failovers from primary node to replicas, in case of any planned or unplanned maintenance. For more information, see Multi-AZ auto failover.
Three-shard minimum Redis Cluster: Having a minimum of three shards provides improved availability by providing faster recovery during both planned and unplanned failovers. Amazon ElastiCache for Redis supports up to 500 total nodes in a cluster, inclusive of shards and replicas.
Set up two or more replicas across Availability Zones: Having two replicas provides improved read scalability and also read availability in scenarios where one replica is undergoing maintenance. This is important if you are not using the single reader endpoint and choose to direct your read requests to read replicas only (client setting).
Use the latest engine versions and cache node types: Latest generation instances such as the R6g and M6g benefit from the advanced Nitro system, which delivers performance indistinguishable from bare metal and enhanced network processing. The Graviton2-based R6g and M6g provide price/performance improvements over previous generation instances. Along with feature enhancements, Redis engine version 6 provides operational enhancements in areas such as replication, snapshotting, eviction, and latency.
Monitor and right-size to deal with anticipated traffic peaks: Under heavy load, the Redis engine may become unresponsive, which affects availability. DatabaseMemoryUsagePercentage is the primary indicator of in-memory storage available per node, whereas ReplicationLag is an indicator of your replication health based on your write rate. You can use these metrics to trigger cluster scaling. For more information about monitoring and sizing, check out this dedicated blog on monitoring your Amazon ElastiCache for Redis workloads.
Avoid maintenance and upgrades during peak hours: A lower write load eases failovers and minimizes any application impact.
Amazon ElastiCache for Redis Global Datastore: Leverage Amazon ElastiCache for Redis Global Datastore to replicate your cluster data out of the Primary AWS Region into another Secondary Region for disaster recovery purposes, and to act as a failover target in the event of a regional issue.
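As a rough illustration of the first three recommendations, the sketch below builds keyword arguments for the boto3 ElastiCache `create_replication_group` call. The helper name `replication_group_params`, the group ID, the description, and the default node type are assumptions made for this example. A real call will likely need further settings (engine version, parameter group, subnet group, security groups) that are omitted here.

```python
from typing import Any, Dict


def replication_group_params(
    group_id: str,
    node_type: str = "cache.r6g.large",  # assumed example node type
    shards: int = 3,
    replicas_per_shard: int = 2,
) -> Dict[str, Any]:
    """Build kwargs for boto3's ElastiCache create_replication_group (sketch)."""
    # These checks mirror the recommendations above; they are not service limits.
    if shards < 3:
        raise ValueError("use at least three shards for faster failover recovery")
    if replicas_per_shard < 2:
        raise ValueError("use at least two replicas per shard")
    return {
        "ReplicationGroupId": group_id,
        "ReplicationGroupDescription": "HA Redis cluster (illustrative)",
        "Engine": "redis",
        "CacheNodeType": node_type,
        "NumNodeGroups": shards,                     # number of shards
        "ReplicasPerNodeGroup": replicas_per_shard,  # replicas in each shard
        "AutomaticFailoverEnabled": True,
        "MultiAZEnabled": True,
    }


params = replication_group_params("my-ha-cluster")  # hypothetical group ID
# With boto3 installed and AWS credentials configured, the call would be:
#   import boto3
#   boto3.client("elasticache").create_replication_group(**params)
```

Keeping the parameters in a plain dictionary makes it easy to review the availability-related settings before anything is created.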

Configuring the Redis client

Redis provides a robust client ecosystem which gives you flexibility to choose a client based on your preference. The list below provides general guidance that is applicable across most clients:

Redis Cluster mode: Use cluster-aware Redis clients and connect to the cluster using the configuration endpoint. This allows the client to automatically discover the shard and slot mappings. Redis Cluster mode also provides online resharding (scale in/out) for resizing your cluster, and allows you to complete planned maintenance and node replacements without any write interruptions. Regularly update the local cluster map using the Redis client in your application, for example when a MOVED error occurs. Learn more about working with Amazon ElastiCache for Redis cluster mode enabled at this dedicated blog on Redis cluster.
Non-Redis Cluster mode: Use the primary endpoint for all write traffic. During any configuration changes or failovers, Amazon ElastiCache ensures that the DNS of the primary endpoint is updated to always point to the primary node. Use the reader endpoint to direct all read traffic. Amazon ElastiCache ensures that the reader endpoint is kept up-to-date with the cluster changes in real time as replicas are added or removed. Individual node endpoints are also available, but using the reader endpoint frees your application from tracking any individual node endpoint changes. The reader endpoint provides a DNS record that resolves to an IP address of one of the replica nodes in a round-robin fashion. Hence, it’s best to use the primary endpoint for writes and the single reader endpoint for reads.
Socket timeout: Ensure that the socket timeout of the client is set to at least one second (vs. the typical “none” default in several clients). Setting the timeout too low can lead to numerous timeouts when the server load is high. Setting it too high can result in your application taking a long time to detect connection issues.
Connection pooling: Enable connection pooling to allow the client to reuse connections. This reduces connection overhead and the likelihood of exhausting node connection limits. It also improves the performance of your clients by reducing the additional latency incurred when opening and closing connections while issuing commands to Amazon ElastiCache for Redis.
DNS caching: If your client has a DNS caching mechanism built in, it is recommended to have a lower TTL (as low as 5–10 seconds). Having a higher TTL poses a risk of your application not reaching the desired node. Also, do not use the “cache forever” option.
Test failover: In the event of an issue with one of your Redis primary nodes, Amazon ElastiCache will initiate a failover to an existing replica node. You can test this event proactively using the TestFailover API. Redis clients handle this process differently, so it’s important to understand the behavior prior to a production failover event. Testing allows you to see how a failover event impacts your connection management and client-side resilience.
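The DNS caching advice above can be sketched with the standard library alone. The class below caches a hostname lookup for a short TTL and resolves again once the entry expires, so an endpoint that moves during a failover is picked up within a few seconds. `ShortTtlResolver` and its parameters are hypothetical names for this illustration; many Redis clients and language runtimes have their own DNS cache settings that should be preferred where available.

```python
import socket
import time
from typing import Callable, Dict, Tuple


class ShortTtlResolver:
    """Cache hostname lookups for a few seconds only (illustrative sketch)."""

    def __init__(
        self,
        ttl_seconds: float = 5.0,
        lookup: Callable[[str], str] = socket.gethostbyname,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._lookup = lookup  # injectable so the class can be tested offline
        self._clock = clock
        # hostname -> (address, expiry time on the monotonic clock)
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, hostname: str) -> str:
        now = self._clock()
        cached = self._cache.get(hostname)
        if cached is not None and now < cached[1]:
            return cached[0]  # entry is still fresh
        # Entry is missing or expired: ask DNS again and restart the TTL.
        address = self._lookup(hostname)
        self._cache[hostname] = (address, now + self._ttl)
        return address
```

A caller would create one resolver per process and call `resolve()` with the primary or reader endpoint each time it opens a new connection, rather than storing the resolved IP address indefinitely.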

Application best practices

In addition to configuring your Amazon ElastiCache for Redis cluster and Redis clients, it is helpful to review your application logic for general best practices and availability tips listed below:

Avoid long-running Lua scripts: These can cause the Redis engine to be unresponsive and affect availability. If you must use a Lua script, make sure that you are sized appropriately to deal with CPU spikes.
Consider expiration over eviction: Your eviction policy can be computationally more expensive than expiration. To reduce memory pressure, consider expiration on your keys.
Avoid expensive command operations: Expensive commands such as KEYS can cause degradation in performance and hamper the managed operations on the cluster. An alternative is to use the SCAN command, which walks the keyspace incrementally with a cursor: each call does a small, bounded amount of work, whereas KEYS performs one blocking linear pass over every key. Likewise, large objects of the Sorted Sets or Hash data type can cause sync issues and affect managed operations, including maintenance and upgrades. To avoid accidental use of expensive commands, consider enabling role-based access control (RBAC). This allows the Amazon ElastiCache for Redis cluster to enforce access policies that restrict the availability of commands based on user authentication. Alternatively, Amazon ElastiCache for Redis allows for commands to be dynamically renamed.
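The SCAN alternative to KEYS can be sketched as a cursor loop. The helper below, with the hypothetical name `iter_keys`, assumes a client object whose `scan(cursor=..., match=..., count=...)` method returns a `(next_cursor, keys)` pair, which is how the redis-py client behaves. Other clients may expose SCAN differently, and in cluster mode each node generally has to be scanned separately.

```python
from typing import Any, Iterator


def iter_keys(client: Any, pattern: str = "*", batch_size: int = 100) -> Iterator[Any]:
    """Yield keys matching pattern by repeated SCAN calls (sketch)."""
    cursor = 0
    while True:
        # Each call asks the server for roughly batch_size keys, so no single
        # command blocks the engine for a pass over the whole keyspace.
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=batch_size)
        yield from keys
        if int(cursor) == 0:  # the server returns cursor 0 when iteration is done
            break
```

SCAN may return the same key more than once during a full iteration, so callers that need uniqueness should deduplicate the results.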

Summary

We are excited to bring these availability improvements and recommendations to you. And this is just Day 1. Our team is continuing to enhance end-to-end system availability. Stay tuned for more updates and best practices. To get started with Amazon ElastiCache for Redis, access the Amazon ElastiCache console.

About the Authors

Ruchita Arora is a Senior Product Manager at Amazon ElastiCache and works closely on all aspects of Amazon ElastiCache service. Besides databases, she has worked across storage, enterprise application development and telecommunication domains, in various engineering and product management roles.

Nirmal George Eapen is a Software Development Engineer at Amazon ElastiCache.

AWS Database Blog