AWS Database Blog

Testing Automatic Failover to a Read Replica on Amazon ElastiCache for Redis

Darin Briskman is a developer evangelist at Amazon Web Services. You can reach him at @briskmad.

“Trust, but verify.”

— U. S. President Ronald Reagan, 1987

In this comment, President Reagan quoted a Russian proverb to describe his philosophy for nuclear disarmament treaties. But the same philosophy also applies well to DevOps!

Amazon ElastiCache for Redis provides high availability with automated failover and recovery. When you create an ElastiCache cluster with Redis Cluster Mode enabled, you set the number of shards in the cluster. Each shard has one primary node (for reads and writes) and zero to five replica nodes (for reads and failover protection). A cluster can be as small as a single shard with zero replicas (1 node) or as large as 15 shards, each with 5 replicas (90 nodes in total).
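The node arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name total_nodes is made up for this example, and the limits it enforces are the ones quoted in this post, not values read from the service.

```python
def total_nodes(shards: int, replicas_per_shard: int) -> int:
    """Return the node count: each shard has 1 primary plus its replicas."""
    # Limits below are the ones stated in this post (1-15 shards, 0-5 replicas).
    if not 1 <= shards <= 15:
        raise ValueError("shards must be between 1 and 15")
    if not 0 <= replicas_per_shard <= 5:
        raise ValueError("replicas_per_shard must be between 0 and 5")
    return shards * (1 + replicas_per_shard)


print(total_nodes(1, 0))   # smallest cluster: 1 node
print(total_nodes(15, 5))  # largest cluster: 90 nodes
```

A shard with zero replicas has nothing to fail over to, so a failover test is only meaningful on shards where replicas_per_shard is at least 1.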

Failures don’t happen often in AWS, but every machine fails eventually. When a replica node fails, ElastiCache detects the failure and replaces the node within a few minutes.

In Redis Cluster, when a primary node fails, the cluster detects the failure and promotes a replica node to be the new primary for the shard. The cluster then informs all of its nodes and all clients about the change. This process usually takes about 30 seconds. The failed node is then replaced and returned to the cluster as a replica node. You can use Redis Cluster with ElastiCache by choosing engine version 3.2.4 and Cluster Mode enabled in the ElastiCache Management Console.
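To check the 30-second figure against your own application, one option is to poll the cluster from a client and time the gap between the first failed request and the next successful one. The following is a minimal sketch under stated assumptions: measure_outage and its parameters are hypothetical names for this example, and probe stands for whatever request your application makes (for example, a write through your Redis client) that returns True on success.

```python
import time
from typing import Callable, Optional


def measure_outage(probe: Callable[[], bool],
                   interval: float = 1.0,
                   timeout: float = 120.0,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> Optional[float]:
    """Poll `probe` until it fails and then succeeds again.

    Returns the seconds between the first observed failure and the next
    success, or None if no failure (or no recovery) is seen before `timeout`.
    """
    start = clock()
    first_failure = None
    while clock() - start < timeout:
        try:
            ok = probe()
        except Exception:
            # Treat any client error (timeout, connection reset) as a failure.
            ok = False
        now = clock()
        if not ok and first_failure is None:
            first_failure = now
        elif ok and first_failure is not None:
            return now - first_failure
        sleep(interval)
    return None
```

The result is only as precise as the polling interval, and it measures what one client sees rather than what the cluster does internally. The clock and sleep parameters exist so that the function can be exercised without waiting in real time.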

You can trust me on this one… but you should also verify.

Verifying is easy, thanks to ElastiCache Test Failover. Using the console or the AWS CLI, you can simulate a failure for any node in your ElastiCache cluster and see how the failover process works for your own applications. You can test failover on both multishard and single-shard environments, and even on the older Redis 2.8 branch. To do that in the console, choose the cluster and then the shard that you want to test to open the Nodes view, and then choose Failover primary.

Be careful with this technique. It works the same on all clusters, because AWS has no way to know if any cluster is in a development, test, or production role. Don’t cause failures on production nodes unless you’re certain that you want to test that production cluster.

Give ElastiCache Test Failover a try!