We frequently upgrade our Amazon ElastiCache fleet, with patches and upgrades being applied to instances seamlessly. However, from time to time we need to relaunch your ElastiCache nodes to apply mandatory OS updates to the underlying host. These replacements are required to apply upgrades that strengthen security, reliability, and operational performance.

You also have the option to manage these replacements yourself at any time prior to the scheduled maintenance window. When you manage a replacement yourself, your instance will receive the OS update when you relaunch the node and your scheduled maintenance window will be cancelled.

Q: How long does a node replacement take?

A replacement typically completes within a few minutes, but it may take longer with certain instance configurations and traffic patterns. For example, a Redis primary node may be short on free memory while also experiencing high write traffic. When an empty replica syncs from this primary, the primary may run out of memory trying to handle the incoming writes while also syncing the replica. In that case, the primary disconnects the replica and restarts the sync process. It may take multiple attempts for the replica to sync successfully. The replica may never sync if the incoming write traffic remains high.

Memcached nodes do not need to sync during replacement and are always replaced quickly, regardless of node size.

 

Q: How does a node replacement impact my application?

For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. For single node Redis clusters, ElastiCache dynamically spins up a replica, replicates the data, and then fails over to it. For replication groups consisting of multiple nodes, ElastiCache replaces the existing replicas and syncs data from the primary to the new replicas. If Multi-AZ or Cluster Mode is enabled, replacing the primary triggers a failover to a read replica. If Multi-AZ is disabled, ElastiCache replaces the primary and then syncs the data from a read replica. The primary will be unavailable during this time.

For Memcached nodes, the replacement process brings up an empty new node and terminates the current node. The new node will be unavailable for a short period during the switch. Once switched, your application may see performance degradation while the empty new node is populated with cache data.

 

Q: What best practices should I follow to ensure a smooth replacement experience and minimize data loss?

For Redis nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful Redis replication. We try to replace just enough nodes from the same cluster at a time to keep the cluster stable. You can provision primary and read replicas in different Availability Zones. In this case, when a node is replaced, the data will be synced from a peer node in a different Availability Zone. For single-node Redis clusters, we recommend that sufficient memory is available to Redis, as described here. For Redis replication groups with multiple nodes, we also recommend scheduling the replacement during a period with low incoming write traffic.

For Memcached nodes, schedule your maintenance window during a period with low incoming write traffic, test your application for failover, and use the ElastiCache-provided "smarter" client. You cannot avoid data loss, because Memcached holds data purely in memory.

 

Q: How do I manage node replacements on my own?

We recommend that you allow ElastiCache to manage your node replacements for you during your scheduled maintenance window. You can specify your preferred time for replacements via the weekly maintenance window when you create an ElastiCache cluster. To change your maintenance window to a more convenient time later, use the ModifyCacheCluster API or click Modify in the ElastiCache Management Console.

If you choose to manage the replacement yourself, you can take various actions depending on your use case and cluster configuration. For instructions on all these options, see the Actions You Can Take When a Node is Scheduled for Replacement page.

For Memcached, you can simply delete and re-create the clusters. After replacement, your instance should no longer have a scheduled event associated with it.

 

Q: How do I find out about upcoming scheduled replacements?

ElastiCache will send you email notifications before your node is scheduled for replacement. You can use the Cache Events section of the ElastiCache Management Console or use the describe-events API to check for the upcoming ElastiCache:NodeReplacementScheduled event. Finally, you can set up Amazon SNS notifications for this event in Redis using the information provided here.

To set up SNS notifications in Memcached, use the information provided here.
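As a rough illustration of checking events through the API, the sketch below calls the boto3 `describe_events` operation and filters the returned messages. The helper names (`fetch_recent_events`, `filter_events`), the default search phrase "replacement", and the sample messages are assumptions made for this example. The exact wording of the event message is not specified here, so adjust the phrase to match what you see in the Cache Events section of the console.

```python
def fetch_recent_events(source_type="cache-cluster", days=7):
    """Return recent ElastiCache events for one source type.

    Requires boto3 and AWS credentials; imported lazily so the pure
    filtering helper below can be used without either.
    """
    import boto3

    client = boto3.client("elasticache")
    paginator = client.get_paginator("describe_events")
    events = []
    # Duration is expressed in minutes.
    for page in paginator.paginate(SourceType=source_type, Duration=days * 24 * 60):
        events.extend(page["Events"])
    return events


def filter_events(events, phrase="replacement"):
    """Keep events whose Message contains `phrase` (case-insensitive).

    The default phrase is an assumption, not a documented message format.
    """
    phrase = phrase.lower()
    return [e for e in events if phrase in e.get("Message", "").lower()]


# Made-up sample records, shaped like the entries in the "Events" list.
sample = [
    {"SourceIdentifier": "my-cluster", "Message": "Node 0001 is scheduled for replacement"},
    {"SourceIdentifier": "my-cluster", "Message": "Snapshot succeeded"},
]
for event in filter_events(sample):
    print(event["SourceIdentifier"], "-", event["Message"])
```

With credentials configured, `filter_events(fetch_recent_events())` applies the same filter to live events. Pass `source_type="replication-group"` to look at replication group events instead.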

 

Q: Can I change the scheduled maintenance to a more suitable time?

Yes, you can change your cluster’s maintenance window. To change your maintenance window to a more convenient time, use the API (ModifyCacheCluster or ModifyReplicationGroup) or click Modify in the ElastiCache Management Console.
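The API route might look like the following minimal sketch, which uses the boto3 `modify_cache_cluster` and `modify_replication_group` calls with the `PreferredMaintenanceWindow` parameter. The helper names `format_window` and `set_maintenance_window` are hypothetical, and the sketch assumes the window is given in UTC in the `ddd:hh24:mi-ddd:hh24:mi` form with a minimum length of 60 minutes.

```python
DAYS = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"]


def format_window(day, start_hour, duration_hours=1):
    """Build a 'ddd:hh24:mi-ddd:hh24:mi' maintenance window string."""
    if day not in DAYS:
        raise ValueError(f"day must be one of {DAYS}")
    if not 0 <= start_hour <= 23:
        raise ValueError("start_hour must be between 0 and 23")
    if duration_hours < 1:
        raise ValueError("the window must be at least 60 minutes long")
    total = start_hour + duration_hours
    # Roll over to the following day(s) when the window crosses midnight.
    end_day = DAYS[(DAYS.index(day) + total // 24) % 7]
    return f"{day}:{start_hour:02d}:00-{end_day}:{total % 24:02d}:00"


def set_maintenance_window(window, cluster_id=None, replication_group_id=None):
    """Apply `window` to a replication group or to a single cache cluster.

    Requires boto3 and AWS credentials; imported lazily so that
    format_window can be used on its own.
    """
    import boto3

    client = boto3.client("elasticache")
    if replication_group_id is not None:
        return client.modify_replication_group(
            ReplicationGroupId=replication_group_id,
            PreferredMaintenanceWindow=window,
        )
    if cluster_id is not None:
        return client.modify_cache_cluster(
            CacheClusterId=cluster_id,
            PreferredMaintenanceWindow=window,
        )
    raise ValueError("pass cluster_id or replication_group_id")


print(format_window("fri", 16))
```

For example, `set_maintenance_window(format_window("fri", 16), cluster_id="my-cluster")` would request a Friday 1600 to 1700 window for a cluster with the placeholder ID `my-cluster`.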

Once you change your maintenance window, ElastiCache will schedule your node for maintenance during the newly specified window. The following example shows how the changes take effect.

Suppose it is currently Thursday, 11/09, at 1500 and the next maintenance window is Friday, 11/10, at 1700. The following three scenarios show the outcomes:

  • You change your maintenance window to Friday at 1600 (after the current date time and before the next scheduled maintenance window). The node will be replaced on Friday, 11/10, at 1600.
  • You change your maintenance window to Saturday at 1600 (after the current date time and after the next scheduled maintenance window). The node will be replaced on Saturday, 11/11, at 1600.
  • You change your maintenance window to Wednesday at 1600 (earlier in the week than the current date time). The node will be replaced next Wednesday, 11/15, at 1600.
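The three scenarios above follow a simple "next occurrence of the new weekly window" rule, which can be sketched as date arithmetic. This models only the examples given and is not a description of how ElastiCache schedules internally. The function name `next_window_start` is hypothetical, and the year 2023 is an assumption chosen because 11/09 falls on a Thursday in that year.

```python
from datetime import datetime, timedelta

WEEKDAYS = {"mon": 0, "tue": 1, "wed": 2, "thu": 3, "fri": 4, "sat": 5, "sun": 6}


def next_window_start(now, day, hour):
    """Return the first occurrence of (day, hour) strictly after `now`."""
    days_ahead = (WEEKDAYS[day] - now.weekday()) % 7
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=days_ahead)
    if candidate <= now:
        # Same weekday, but the hour has already passed: use next week.
        candidate += timedelta(days=7)
    return candidate


now = datetime(2023, 11, 9, 15, 0)  # Thursday, 11/09, at 1500 (assumed year)
for day in ("fri", "sat", "wed"):
    start = next_window_start(now, day, 16)
    print(day, start.strftime("%m/%d %H%M"))
```

Running the loop reproduces the dates in the three bullets: the Friday and Saturday windows land later in the same week, while the Wednesday window rolls over to the following week.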

 

Q: Why are you doing these node replacements?

These replacements are needed to apply mandatory software updates to your underlying host. The updates help strengthen our security, reliability, and operational performance.

 

Q: Do these replacements affect my nodes in multiple Availability Zones at the same time?

We may replace multiple nodes from the same cluster, depending on the cluster configuration, while maintaining cluster stability. For Redis sharded clusters, we try not to replace multiple nodes in the same shard at a time. In addition, we try not to replace a majority of the primary nodes in the cluster across all the shards.

For non-sharded clusters, we will attempt to stagger node replacements over the maintenance window as much as possible to continue maintaining cluster stability.

 

Q: Can the nodes in different clusters from different regions be replaced at the same time?

Yes, it is possible that these nodes will be replaced at the same time, if your maintenance window for these clusters is configured to be the same.