How do I minimize downtime when my ElastiCache for Redis is scaling?

Last updated: 2022-07-19

I want to minimize downtime while my Amazon ElastiCache for Redis cluster is scaling. How can I do this?

Resolution

To minimize downtime, consider the following:

  • To minimize the impact of synchronization, avoid scaling under high workloads. During scaling, synchronization (a background save, either forked or forkless) might be triggered. Synchronization is compute-intensive and consumes extra memory. If the cluster is under a high workload and scaling is taking a long time, reduce the incoming requests to Redis to prevent synchronization failure. To see when synchronization happened, check the SaveInProgress metric in Amazon CloudWatch. This metric collects data once per minute, so it might not capture synchronization that finished in under one minute. For more information on monitoring workload, see Monitoring best practices with Amazon ElastiCache for Redis using Amazon CloudWatch.
  • Depending on the scaling type, a new node might join the cluster, a node might be removed from the cluster, or the IP address of a node might change during scaling. Amazon ElastiCache for Redis provides different types of connection endpoints for connecting to the cluster. The type of connection endpoint to choose depends on your application requirements. It's a best practice to test scaling in a non-production environment to identify unexpected issues caused by client-side misconfiguration when connecting to the Redis cluster.
  • If the client connects to a new replica that's in the process of synchronization, the LOADING: Redis is loading the dataset in memory error appears. Configure the Redis client or application code to retry the query on another replica, or send the query to the primary node. The time it takes to load the dataset depends on the data size and the performance of the node. Test in a non-production environment to determine whether this will be an issue.
  • You can configure the cluster to scale automatically, instead of manually scaling. Automatic scaling prevents performance issues caused by sudden increases in the incoming workload. For more information, see Auto Scaling ElastiCache for Redis clusters.
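
The retry behavior described for the LOADING error can be sketched as follows. This is a minimal illustration rather than a client library feature: read_with_fallback and is_loading_error are hypothetical names, each "read" is a zero-argument callable that you supply around your own client call, and the check on the error text is an assumption. Many client libraries map this reply to a dedicated exception class instead, so pass your own is_loading test if that applies.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def is_loading_error(exc: Exception) -> bool:
    """Assumption: the client surfaces the server reply text in the exception."""
    return str(exc).startswith("LOADING")


def read_with_fallback(
    replica_reads: Sequence[Callable[[], T]],
    primary_read: Callable[[], T],
    is_loading: Callable[[Exception], bool] = is_loading_error,
) -> T:
    """Try each replica in order; use the primary if every replica is loading.

    Each callable is a hypothetical wrapper around one read against one node,
    for example ``lambda: replica_client.get("key")``.
    """
    for read in replica_reads:
        try:
            return read()
        except Exception as exc:
            if not is_loading(exc):
                raise  # unrelated failures are not masked
            # This replica is still loading its dataset; try the next one.
    return primary_read()
```

Sending the fallback read to the primary adds load to the node that is already serving the synchronization, so consider limiting how often the fallback path is taken.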

Scaling action downtime overview

There are four scaling actions:

  • Scaling in.
  • Scaling out.
  • Changing node types.
  • Changing the number of node groups. This is only supported for Redis (cluster mode enabled) clusters.

For more information, see Scaling ElastiCache for Redis clusters.

Scaling downtime is usually a few seconds at most, depending on the scaling action and cluster configuration. The following sections explain the downtime for each cluster type and scaling action:

Redis (cluster mode disabled) clusters

Scaling in: Scaling in removes a replica node from the cluster. You might do this to reduce costs. Keep in mind that scaling in also decreases data durability. If your applications use only the primary endpoint to connect to the cluster, removing replica nodes doesn't cause any downtime. This is because the primary endpoint points to the primary node.

However, if your applications use the reader endpoint or individual node endpoints to connect to that replica node, then the existing connection to the removed replica node breaks, and the application must establish a new TCP connection to another replica node. The application must also perform the DNS lookup again to avoid connecting to the removed replica node. If the client uses the reader endpoint, DNS propagation might cause a few seconds of downtime.
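
As a rough sketch of the reconnect behavior described above, the following resolves the endpoint again on every reconnect instead of reusing an address from an earlier lookup. The function names are hypothetical, and the sketch uses only the Python standard library. It doesn't bypass DNS caching in the operating system or resolver, so the record's TTL still applies.

```python
import socket
from typing import Any, Callable, List, Tuple

Address = Tuple[str, int]


def resolve_endpoint(host: str, port: int) -> List[Address]:
    """Look up the endpoint now rather than reusing an earlier result."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # sockaddr starts with (ip, port) for both IPv4 and IPv6 entries.
    return [(info[4][0], info[4][1]) for info in infos]


def reconnect(
    host: str,
    port: int,
    resolve: Callable[[str, int], List[Address]] = resolve_endpoint,
    connect: Callable[[Address], Any] = socket.create_connection,
) -> Any:
    """Open a new connection to the first reachable address of the endpoint."""
    last_error = None
    for address in resolve(host, port):
        try:
            return connect(address)
        except OSError as exc:
            last_error = exc  # this node might have been removed; try the next
    raise ConnectionError(f"no reachable address for {host}:{port}") from last_error
```

The resolve and connect parameters exist only so that the logic can be exercised without a network. In practice, check whether your Redis client already re-resolves the host name when it reconnects.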

Scaling out: Scaling out adds a replica node to an existing cluster. The primary node synchronizes with the new node. To avoid downtime during the synchronization, consider scaling out during the hours when the workload is lowest.
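
To confirm when synchronization ran during a scale-out, you can query the SaveInProgress metric in CloudWatch. The following is a minimal sketch, not an official procedure. The function name is hypothetical, the cloudwatch argument is assumed to be a boto3 CloudWatch client (for example, boto3.client("cloudwatch")), and the dimension set is an assumption that you might need to adjust to match how the metric appears in your account.

```python
from datetime import datetime
from typing import Any, List


def save_in_progress_times(
    cloudwatch: Any,
    cache_cluster_id: str,
    start: datetime,
    end: datetime,
) -> List[datetime]:
    """Return the one-minute timestamps at which SaveInProgress was reported."""
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ElastiCache",
        MetricName="SaveInProgress",
        # Assumption: the metric is queried per cache cluster (node) ID.
        Dimensions=[{"Name": "CacheClusterId", "Value": cache_cluster_id}],
        StartTime=start,
        EndTime=end,
        Period=60,  # the metric is collected once per minute
        Statistics=["Maximum"],
    )
    return sorted(
        point["Timestamp"]
        for point in response["Datapoints"]
        if point["Maximum"] >= 1
    )
```

Because the metric is collected once per minute, an empty result doesn't prove that no synchronization happened. A synchronization that finished in under one minute might not be captured.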

Changing node types: When you change node types, new nodes of the specified type are created. The old primary node synchronizes with the new primary node, and the new primary node synchronizes with the new replica nodes. During this scaling process, make sure that:

  • The workload isn't so high that synchronization fails.
  • Your application can handle IP address changes. The IP addresses of the new nodes might not be the same as those of the old nodes. Your application might need to perform the DNS lookup on the primary endpoint or reader endpoint again and establish new connections to the new nodes. DNS propagation takes a few seconds, so your service might be briefly interrupted before the client reaches the new node.
    In Redis versions 5.0.5 or above, the interruption is minimized. The Redis client retries the request to a new node if the request failed due to termination of the old node. It's a best practice to upgrade to a newer Redis version to benefit from optimizations in the ElastiCache service.

Redis (cluster mode enabled) clusters

Scaling in: Scaling in removes a shard from the cluster. Before the shard is removed, the data in that shard is migrated to the other shards. This process is called resharding. Resharding causes extra workload on the cluster, and the client must support Redis Cluster. For information on minimizing downtime during scaling in, see Best practices: Online cluster resizing.

To minimize performance issues caused by excessive scaling in, consider scaling in gradually. For example, if the target is to scale in from 12 shards to 6 shards, scale in from 12 shards to 9 shards first. After the initial scale-in, check the cluster's performance during peak time, and then scale in further.
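
The gradual approach can be expressed as a simple plan of intermediate shard counts. The helper below is a hypothetical illustration only. The step size is a choice that you make, and checking performance between steps remains a manual decision.

```python
from typing import List


def scale_in_plan(current_shards: int, target_shards: int, max_step: int) -> List[int]:
    """Return the shard count to reach after each scale-in step."""
    if max_step < 1:
        raise ValueError("max_step must be at least 1")
    if target_shards < 1 or target_shards > current_shards:
        raise ValueError("target_shards must be between 1 and current_shards")
    plan = []
    shards = current_shards
    while shards > target_shards:
        # Never remove more than max_step shards in one step.
        shards = max(target_shards, shards - max_step)
        plan.append(shards)
    return plan


print(scale_in_plan(12, 6, 3))  # [9, 6]
```

After each step in the plan, check the cluster's performance during peak time before you start the next step.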

Scaling out: Scaling out adds a shard to the cluster. Data in the other shards is migrated to the new shard. This process is called resharding. Resharding causes extra workload on the cluster, and the client must support Redis Cluster. For information on minimizing downtime during scaling out, see Best practices: Online cluster resizing.

Changing node types: When you change node types, for each shard, the old primary node synchronizes with the new primary node. Then, the new primary nodes synchronize with the new replica nodes. During this scaling process, make sure that:

  • The workload isn't so high that synchronization fails.
  • Your client can refresh the cluster topology. The IP addresses of the new nodes might not be the same as those of the old nodes. To determine the new IP addresses, your application can use the CLUSTER NODES or CLUSTER SLOTS command to get updated topology information from the cluster. Most Redis clients that support Redis Cluster can update the cluster topology. However, you might need to configure the Redis client to do this. For information on how to configure your Redis client, see the documentation for your specific client.
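
To illustrate the topology information that the CLUSTER NODES command returns, the following sketch parses its text output into a map from each primary's address to the slot ranges that it serves. The function name and the sample addresses are hypothetical. Clients that support Redis Cluster normally do this internally, so treat this as an aid to understanding rather than something that you need to implement.

```python
from typing import Dict, List, Tuple

SlotRange = Tuple[int, int]


def parse_cluster_nodes(raw: str) -> Dict[str, List[SlotRange]]:
    """Map each primary's "ip:port" to the slot ranges it currently serves."""
    topology: Dict[str, List[SlotRange]] = {}
    for line in raw.strip().splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # not a complete node line
        address, flags = fields[1], fields[2].split(",")
        if "master" not in flags:
            continue  # replicas don't own slots
        # Drop the cluster bus port (and hostname, if present) after "@".
        endpoint = address.split("@", 1)[0]
        ranges: List[SlotRange] = []
        for slot in fields[8:]:
            if slot.startswith("["):
                continue  # slot that is being migrated or imported
            first, _, last = slot.partition("-")
            ranges.append((int(first), int(last or first)))
        topology[endpoint] = ranges
    return topology


sample = """\
07c37dfe 10.0.1.14:6379@16379 slave e7d1eecc 0 1426238317239 4 connected
67ed2db8 10.0.1.12:6379@16379 master - 0 1426238316232 2 connected 5461-10922
e7d1eecc 10.0.1.11:6379@16379 myself,master - 0 0 1 connected 0-5460
"""
print(parse_cluster_nodes(sample))
```

Comparing this map before and after changing node types shows which addresses changed while the slot ownership stayed the same.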

Changing the number of node groups: Changing the number of node groups changes the number of replica nodes in each shard. When the number of replica nodes increases, new replica nodes join the cluster and the primary node synchronizes with each new node. So, check the performance of the primary nodes before adding replica nodes. When the number of replica nodes decreases and the client needs to read from a replica node that was removed, the client must send the request to one of the remaining replica nodes. The client must also update its view of the cluster topology to prevent further requests from being sent to the removed node.