Amazon ElastiCache managed maintenance and service updates help page

Overview

We frequently upgrade our Amazon ElastiCache fleet, applying patches and upgrades to instances seamlessly. We do this in one of two ways:

(a) continuous managed maintenance, and (b) service updates. These maintenance and service updates are required to apply upgrades that strengthen security, reliability, and operational performance.

Continuous managed maintenance happens periodically, directly in your maintenance windows, without requiring any action on your part.
Service updates give you the flexibility to apply them yourself. They are timed, and after their due date passes they may be scheduled in your maintenance window and applied by us.

You have the option to manage updates yourself at any time before the scheduled maintenance window. When you manage an update yourself, your instance receives the OS update when you relaunch the node, and the maintenance scheduled for your window is cancelled.

Service Updates

Service updates is a feature in Amazon ElastiCache that enables you to apply certain service updates at your discretion. These updates are either security patches or minor software updates, and they help strengthen the security, reliability, and operational performance of your clusters.

The value of these service updates is that you can control when to apply the update (e.g., you can delay applying service updates when there is an important business event that requires 24x7 availability of ElastiCache clusters).

For details of each service update, refer to the value of its “Update Description” attribute.

When service updates applicable to your clusters become available, we will notify you via several channels, including the Amazon ElastiCache console, email, Amazon Simple Notification Service (SNS), AWS Personal Health Dashboard, and Amazon CloudWatch events.

Updates available via our continuous managed maintenance are separate from those offered by service updates. Updates applied via continuous managed maintenance are scheduled directly in your maintenance windows without any action needed on your part. Service updates are timed and give you control over when to apply them, up to the “Recommended Apply by Date”. If they have not been applied by then, ElastiCache may schedule these updates in your maintenance window.

If your ElastiCache cluster is participating in a HIPAA, PCI, or FedRAMP compliance program, you must apply service updates by their “Recommended Apply by Date” in order to maintain compliance. For more information, please see Self-Service Security Updates for Compliance.

For other clusters, we recommend that you apply service updates according to your business cadence. Even if you are unable to apply a service update by its “Recommended Apply by Date”, you will be able to apply it until its “Update Expiration Date”. However, the “Update Expiration Date” can change at any time depending on the availability of new updates.

When you or Amazon ElastiCache applies a service update to one or more clusters, the update is applied to no more than one node at a time within each shard until all selected clusters are updated. The nodes being updated will experience a few seconds of downtime, while the rest of the cluster continues to serve traffic.

  • There will be no change in the cluster configuration.
  • Your CloudWatch metrics may be delayed, but they will catch up as soon as possible.

Service updates are applied in the same way as “Continuous Managed Maintenance Updates”, through node replacement. Please refer to the following questions in the Continuous Managed Maintenance Updates section on this page for details about how the update is applied and how to prepare your application to minimize the impact.

  • How does a node replacement impact my application?
  • What best practices should I follow for a smooth replacement experience and minimize data loss?
  • What client configuration best practices should I follow to minimize application interruption during maintenance?

Yes, the node is replaced by a new, empty node. The existing cache contents are not carried over, and the cache starts fresh.

You can determine whether you can opt out of a service update by checking the value of its “Auto-Update after Due Date” attribute. If the value is “no”, you can opt out of the service update. However, if the value is “yes” and the “Recommended Apply by Date” has passed, ElastiCache will automatically schedule the service update on any remaining clusters during an upcoming maintenance window. This automatic update is scheduled before the “Update Expiration Date”, and you will receive a notification with the scheduled time one week before the update. We strongly recommend applying security updates even if they can be opted out of. If you apply the service update to the remaining clusters before the maintenance window, ElastiCache will not reapply it during the maintenance window.
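The opt-out rule above reduces to two checks on the update's attributes. The following is a minimal sketch, assuming a hypothetical `ServiceUpdate` record whose fields mirror the attributes named on this page; it is not an ElastiCache API type, and the service itself remains the source of truth for scheduling.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ServiceUpdate:
    # Hypothetical record; field names mirror the attributes described on this page.
    name: str
    auto_update_after_due_date: bool  # "Auto-Update after Due Date": yes -> True, no -> False
    recommended_apply_by_date: date
    update_expiration_date: Optional[date] = None


def can_opt_out(update: ServiceUpdate) -> bool:
    """An update can be opted out of only when "Auto-Update after Due Date" is "no"."""
    return not update.auto_update_after_due_date


def will_be_auto_scheduled(update: ServiceUpdate, today: date, already_applied: bool) -> bool:
    """True if ElastiCache may schedule the update in an upcoming maintenance window."""
    if already_applied:
        # Applying the update yourself first means it is not reapplied in the window.
        return False
    return update.auto_update_after_due_date and today > update.recommended_apply_by_date
```

Note that `can_opt_out` returning `True` only means opting out is possible; the page still recommends applying security updates.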

The purpose of service updates is to give you flexibility on when to apply them. Clusters that are not participating in the ElastiCache-supported compliance programs can choose to not apply these updates, or apply them at a reduced frequency throughout the year. This is true only when the value of “Auto-Update after Due Date” attribute of a service update is “no”. For more information, see Can I opt out of service updates?

No, service updates are separate from the continuous managed maintenance updates applied directly by Amazon ElastiCache during your clusters’ maintenance windows.

A complete list of attributes and their descriptions is available in Applying the Self-Service Updates.

To help determine how soon to apply the available service updates, you can refer to the “Severity” service update attribute which has the following values (in order of priority):

1. critical: Recommended to apply immediately (within 14 days or less)
2. important: Recommended to apply as soon as your business flow allows (within 30 days or less)
3. medium: Recommended to apply within 60 days or less
4. low: Recommended to apply within 90 days or less
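The severity list above can be turned into a planning helper. This is a sketch under one stated assumption: the day counts are measured from the date the update became available to you (the page does not say which date they count from). The “Recommended Apply by Date” attribute on each update remains authoritative; `recommended_deadline` and `sort_by_priority` are hypothetical helper names.

```python
from datetime import date, timedelta

# Maximum recommended days per "Severity" value, in priority order (from the list above).
SEVERITY_MAX_DAYS = {
    "critical": 14,
    "important": 30,
    "medium": 60,
    "low": 90,
}


def recommended_deadline(release_date: date, severity: str) -> date:
    """Latest planning date, assuming days are counted from the update's release date."""
    try:
        days = SEVERITY_MAX_DAYS[severity.lower()]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
    return release_date + timedelta(days=days)


def sort_by_priority(updates):
    """Order (name, severity) pairs from most to least urgent."""
    order = list(SEVERITY_MAX_DAYS)  # dicts preserve insertion order
    return sorted(updates, key=lambda u: order.index(u[1].lower()))
```

For example, `sort_by_priority([("a", "low"), ("b", "critical")])` puts the critical update first, so you can work through pending updates in the order the page recommends.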

For more details refer to our public documentation – Applying Updates.

The release schedule depends on the importance of each service update.

This attribute reflects whether your cluster was updated by the “Recommended Apply by Date”. If a service update is applied after the “Recommended Apply by Date”, the attribute “Service Update SLA Met” is set to “no”.

This information is relevant for Amazon ElastiCache clusters participating in HIPAA, PCI, and FedRAMP compliance programs. For more information, please see Self-Service Security Updates for Compliance.
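The “Service Update SLA Met” rule described above can be mirrored locally, for example to track clusters in a compliance program. A minimal sketch, assuming you keep your own records of when each cluster received an update; the record layout and both function names are hypothetical.

```python
from datetime import date
from typing import Iterable, List, Optional, Tuple


def service_update_sla_met(applied_on: date, recommended_apply_by: date) -> str:
    """Mirror of the attribute: "no" only if the update was applied after the date."""
    return "yes" if applied_on <= recommended_apply_by else "no"


def clusters_missing_sla(
    records: Iterable[Tuple[str, Optional[date], date]], today: date
) -> List[str]:
    """records: (cluster_id, applied_on or None, recommended_apply_by) tuples."""
    missing = []
    for cluster_id, applied_on, apply_by in records:
        if applied_on is None:
            # Not applied yet: only a problem once the recommended date has passed.
            if today > apply_by:
                missing.append(cluster_id)
        elif service_update_sla_met(applied_on, apply_by) == "no":
            missing.append(cluster_id)
    return missing
```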

Yes. Unless noted otherwise in the service update “Description” attribute, service updates are always cumulative: if you miss applying them by the “Update Expiration Date”, they will be included in the next service update. Service updates of type “security” fall under this cumulative category.

No, service updates are applied at the cluster level. If you cancel an ongoing update, a cluster may have some nodes updated and some nodes not updated. In this case, the cluster will continue to show up in the list of clusters to apply the service update to. The cluster will continue to operate normally.

There are two cases when this may happen:

(a) You missed applying an optional service update, and the update is now in “expired” status. This is why clusters participating in compliance programs must always apply all service updates.
(b) If your node(s) are replaced for any other reason, such as a planned maintenance event or node failover, Amazon ElastiCache will provision new node(s) with the latest service updates included.

In both cases, the cluster will continue to operate normally.

New nodes contain all applicable service updates, so you can manually replace the existing nodes that haven’t been updated to get the latest updates.

Yes. A service update may be applicable to only Redis OSS, only Memcached, or both Redis OSS and Memcached. You can look for the “Engine” and “Engine Version” service update attributes to determine the scope of each update.

Yes, you can defer the service update by changing the maintenance window. The scheduled update will only be applied to the cluster if the scheduled date matches the cluster's maintenance window. Once you change the maintenance window and the scheduled date has passed, the service update will be rescheduled to the newly specified window in the following weeks. You will receive a new notification one week before the new date has been reached.

Security at AWS is a shared responsibility. We strongly recommend that you apply the update as early as possible.

Your cluster may be in scope for multiple service updates. Most updates do not need to be applied separately: applying one update to your cluster marks the other updates as completed where applicable. If an update's status does not change to “completed” automatically, you may need to apply that update to the same cluster separately.

ElastiCache will schedule the service update on the remaining clusters after the “Recommended Apply by Date” if the value of “Auto-Update after Due Date” attribute is “yes.” The update will be scheduled in the cluster’s maintenance window and you will receive a new notification one week in advance with the scheduled date before the updates are applied.

A scheduled service update is applied to clusters in the same way as “Continuous Managed Maintenance Updates.” Refer to the following section for details on how the update is applied, how to change the scheduled update, and how to prepare your application for a scheduled update to minimize the impact.

To maintain cluster stability, ElastiCache applies updates to only one node at a time within each shard. If the service update cannot be applied to the entire cluster within a single maintenance window, it will be scheduled to continue in the next ones. You will receive new notifications on the next scheduled date and can prepare accordingly.

You cannot roll back a service update once it starts. If you find an issue after applying a service update, reach out to the AWS Support team.

Continuous Managed Maintenance Updates

These updates are mandatory and are applied directly in your maintenance windows without any action needed on your part. These updates are separate from those offered by service updates.

A replacement typically completes within a few seconds, but it may take longer for certain instance configurations and traffic patterns. For example, a Redis OSS primary node may not have enough free memory while it is experiencing high write traffic. When an empty replica syncs from this primary, the primary may run out of memory trying to handle the incoming writes while also syncing the replica. In that case, the primary disconnects the replica and restarts the sync process. It may take multiple attempts for the replica to sync successfully, and the replica may never sync if the incoming write traffic remains high.

Memcached nodes do not need sync, so their replacement completes faster, irrespective of node sizes.

How does a node replacement impact my application?

For Redis OSS nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful replication. For single-node clusters, ElastiCache dynamically spins up a replica, replicates the data, and then fails over to it. For replication groups consisting of multiple nodes, ElastiCache replaces the existing replicas and syncs data from the primary to the new replicas. If Multi-AZ with auto-failover is enabled, replacing the primary triggers a failover to a read replica. For cluster mode configurations that use cluster clients, and for cluster-mode-disabled configurations with auto-failover enabled, planned node replacements complete while the cluster continues to serve incoming write requests. If Multi-AZ is disabled, ElastiCache replaces the primary and then syncs the data from a read replica. The primary node is unavailable during this time, which leads to a longer write interruption.

For Memcached nodes, the replacement process brings up an empty new node and terminates the current node. The new node will be unavailable for a short period during the switch. Once switched, your application may see performance degradation while the empty new node is populated with cache data.

What best practices should I follow for a smooth replacement experience and minimize data loss?

For Redis OSS nodes, the replacement process is designed to make a best effort to retain your existing data and requires successful replication. We try to replace just enough nodes from the same cluster at a time to keep the cluster stable. You can provision the primary and read replicas in different Availability Zones; in this case, when a node is replaced, its data is synced from a peer node in a different Availability Zone. We also recommend that you upgrade to Redis OSS 5.0.6 or higher, because those engine versions have improved stability and, when auto-failover is enabled, let your clusters continue to serve incoming write requests during patching. If your configuration includes only one primary and a single replica per shard, we recommend adding replicas before patching to reduce the risk of lost availability during the process. For single-node clusters, we recommend that sufficient memory is available to Redis OSS as described here. For replication groups with multiple nodes, we also recommend scheduling the replacement during a period of low incoming write traffic.
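The replica-count and Availability Zone recommendations above can be checked before a maintenance window. This is a sketch, assuming the input is one element of the `ReplicationGroups` list returned by the DescribeReplicationGroups API (for example via boto3's `describe_replication_groups`). It counts replicas as "members minus one primary" rather than reading `CurrentRole`, because that field is only populated for cluster-mode-disabled groups. The function name and the threshold of two replicas are choices made for this illustration.

```python
def patching_readiness(replication_group: dict, min_replicas: int = 2) -> list:
    """Return warnings for shards that look fragile during node replacement.

    `replication_group` is assumed to be shaped like one entry of the
    "ReplicationGroups" list in a DescribeReplicationGroups response.
    """
    warnings = []
    for shard in replication_group.get("NodeGroups", []):
        shard_id = shard.get("NodeGroupId", "?")
        members = shard.get("NodeGroupMembers", [])
        # Each shard has exactly one primary; every other member is a replica.
        replicas = max(len(members) - 1, 0)
        if replicas < min_replicas:
            warnings.append(f"shard {shard_id}: {replicas} replica(s), consider adding more")
        zones = {m.get("PreferredAvailabilityZone") for m in members}
        zones.discard(None)
        if len(members) > 1 and len(zones) < 2:
            warnings.append(f"shard {shard_id}: all nodes in one Availability Zone")
    return warnings
```

A typical (hypothetical) call would be `patching_readiness(boto3.client("elasticache").describe_replication_groups(ReplicationGroupId="my-group")["ReplicationGroups"][0])`, where `my-group` is a placeholder ID.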

For Memcached nodes, schedule your maintenance window during a period of low incoming write traffic, test your application for failover, and use the ElastiCache-provided “smarter” client. Data loss cannot be avoided, because Memcached keeps data purely in memory.

What client configuration best practices should I follow to minimize application interruption during maintenance?

For Redis OSS, cluster mode configurations have the best availability during managed or unmanaged operations, and we always recommend using a cluster-mode-supported client that connects to the cluster discovery endpoint. For cluster mode disabled, always use the primary endpoint for all write operations, and use the individual node endpoints of the replica nodes for read operations. If auto-failover is enabled in the cluster, the primary node may change, so your application should confirm the role of each node and update all read endpoints to ensure that you aren't placing a heavy load on the primary. With auto-failover disabled, node roles do not change, but the downtime during managed or unmanaged operations is higher than for clusters with auto-failover enabled. Avoid directing read requests to read replicas only. If you do configure your client to direct read requests only to read replicas, ensure that you have at least two read replicas to avoid read interruptions during maintenance.
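For cluster mode disabled, the routing advice above (writes to the primary endpoint, reads spread over replica node endpoints) can be sketched as a small router. This is an illustration only, assuming you already know the endpoint hostnames and rediscover node roles yourself after a failover; `ReadWriteRouter` and `replica_only_reads_safe` are hypothetical names, not part of any Redis client library. For cluster mode enabled, a cluster-aware client (for example, redis-py's `RedisCluster`) handles routing instead.

```python
import itertools
from typing import List


class ReadWriteRouter:
    """Hypothetical endpoint picker for a cluster-mode-disabled replication group.

    Writes always go to the primary endpoint. Reads rotate across replica node
    endpoints and fall back to the primary when no replica is known.
    """

    def __init__(self, primary_endpoint: str, replica_endpoints: List[str]):
        self.primary_endpoint = primary_endpoint
        self.set_replicas(replica_endpoints)

    def set_replicas(self, replica_endpoints: List[str]) -> None:
        # Call again after a failover, once you have re-checked each node's role.
        self.replicas = list(replica_endpoints)
        self._cycle = itertools.cycle(self.replicas) if self.replicas else None

    def endpoint_for(self, is_write: bool) -> str:
        if is_write or self._cycle is None:
            return self.primary_endpoint
        return next(self._cycle)


def replica_only_reads_safe(replica_count: int) -> bool:
    # The guidance above: replica-only reads need at least two read replicas.
    return replica_count >= 2
```

Falling back to the primary for reads when no replica is available keeps reads working during maintenance, at the cost of extra load on the primary.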

We recommend that you allow ElastiCache to manage your node replacements during your scheduled maintenance window. You can specify your preferred time for replacements via the weekly maintenance window when you create an ElastiCache cluster. To change your maintenance window to a more convenient time later, you can use the ModifyCacheCluster API or click Modify in the ElastiCache Management Console.

If you choose to manage the replacement yourself, you can take various actions depending on your use case and cluster configuration:

• Change the Maintenance Window.
• Re-launch your instance using the Backup & Restore process.
• If your cluster configuration is Cluster Mode Disabled:

  Replace a read-replica (Cluster-Mode Disabled) – A procedure to manually replace a read-replica in a replication group.
  Replace the primary node (Cluster-Mode Disabled) – A procedure to manually replace the primary node in a replication group.
  Replace a standalone node (Cluster-Mode Disabled) – Two different procedures to replace a standalone node.

• If your cluster configuration is Cluster Mode Enabled:

  Replace a node in a cluster with one or more shards – You can use either backup and restore, or a scale-out followed by a scale-in, to replace the nodes.

For more instructions on all these options, see the Actions You Can Take When a Node is Scheduled for Replacement page.

For Memcached, you can simply delete and re-create the clusters. After the replacement, your instance should no longer have a scheduled event associated with it.

To receive notifications, you can set up Amazon SNS notifications for significant events such as a scheduled node replacement. You can do this in the ElastiCache Management Console under the Events section, or use the DescribeEvents API (describe-events in the AWS CLI) to check for an upcoming ElastiCache:NodeReplacementScheduled event.

To set up SNS notifications, use the information provided here.
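If you poll events instead of (or in addition to) subscribing via SNS, a sketch like the following pages through recent ElastiCache events with boto3's `describe_events` paginator and filters them. The filter is an assumption: it matches on the word "replacement" in the event `Message` text because this page does not show the exact message wording, so verify it against real events in your account. `fetch_recent_events` needs boto3 and AWS credentials; `replacement_events` works on plain dicts.

```python
from datetime import datetime, timedelta, timezone


def replacement_events(events):
    """Keep events whose message mentions a node replacement (assumed wording)."""
    return [e for e in events if "replacement" in e.get("Message", "").lower()]


def fetch_recent_events(hours: int = 24):
    """Page through DescribeEvents for the last `hours` hours (needs boto3 and credentials)."""
    import boto3  # imported lazily so replacement_events() works without boto3 installed

    client = boto3.client("elasticache")
    paginator = client.get_paginator("describe_events")
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    events = []
    for page in paginator.paginate(StartTime=start, EndTime=end):
        events.extend(page.get("Events", []))
    return events
```

Usage: `replacement_events(fetch_recent_events(hours=48))` returns the matching event dicts, each with fields such as `SourceIdentifier`, `Message`, and `Date`.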

Yes, you can change your cluster’s maintenance window. To change your maintenance window to a more convenient time, you can use the API (ModifyCacheCluster or ModifyReplicationGroup) or click Modify in the ElastiCache Management Console.
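Both APIs above take the new window as a `PreferredMaintenanceWindow` string in the `ddd:hh24:mi-ddd:hh24:mi` format (UTC). The helper below is a sketch that builds such a string and handles windows that cross midnight or the end of the week. The 60-minute minimum it enforces reflects the documented minimum window length; `maintenance_window` itself is a hypothetical helper.

```python
DAYS = ("sun", "mon", "tue", "wed", "thu", "fri", "sat")
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_WEEK = 7 * MINUTES_PER_DAY


def maintenance_window(day: str, start_hour: int, start_minute: int = 0,
                       duration_minutes: int = 60) -> str:
    """Build a PreferredMaintenanceWindow value such as "fri:16:00-fri:17:00" (UTC)."""
    day = day.lower()[:3]
    if day not in DAYS:
        raise ValueError(f"unknown day: {day!r}")
    if not (0 <= start_hour < 24 and 0 <= start_minute < 60):
        raise ValueError("invalid start time")
    if duration_minutes < 60:
        raise ValueError("the window must be at least 60 minutes")

    start = DAYS.index(day) * MINUTES_PER_DAY + start_hour * 60 + start_minute
    end = (start + duration_minutes) % MINUTES_PER_WEEK  # wrap from Saturday to Sunday

    def fmt(total: int) -> str:
        d, rem = divmod(total, MINUTES_PER_DAY)
        h, m = divmod(rem, 60)
        return f"{DAYS[d]}:{h:02d}:{m:02d}"

    return f"{fmt(start)}-{fmt(end)}"
```

A hypothetical call would be `boto3.client("elasticache").modify_replication_group(ReplicationGroupId="my-group", PreferredMaintenanceWindow=maintenance_window("fri", 16))`, where `my-group` is a placeholder ID.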

Once you change your maintenance window, ElastiCache schedules your node for maintenance during the newly specified window. See the examples below for how the changes take effect.

For example,

Let's say it is currently Thursday, 11/09, at 1500, and the next maintenance window is Friday, 11/10, at 1700. Here are three scenarios and their outcomes:

• You change your maintenance window to Friday at 1600 (after the current date time and before the next scheduled maintenance window). The node will be replaced on Friday, 11/10, at 1600.
• You change your maintenance window to Saturday at 1600 (after the current date time and after the next scheduled maintenance window). The node will be replaced on Saturday, 11/11, at 1600.
• You change your maintenance window to Wednesday at 1600 (earlier in the week than the current date time). The node will be replaced next Wednesday, 11/15, at 1600.
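The three scenarios follow one rule: the node is replaced at the first occurrence of the new weekly window after the current time. The sketch below reproduces them. It assumes the year 2023, in which 11/09 falls on a Thursday, since the page gives no year. It models only this simple rule, not other scheduling factors such as notification lead time; `next_window_start` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def next_window_start(now: datetime, weekday: int, hour: int, minute: int = 0) -> datetime:
    """First start of the weekly window strictly after `now` (weekday: Monday=0)."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    if candidate <= now:
        # The window already started this week, so move to next week's occurrence.
        candidate += timedelta(days=7)
    return candidate


now = datetime(2023, 11, 9, 15, 0)  # Thursday 11/09 at 1500 (year assumed)
print(next_window_start(now, 4, 16))  # Friday window    -> 2023-11-10 16:00:00
print(next_window_start(now, 5, 16))  # Saturday window  -> 2023-11-11 16:00:00
print(next_window_start(now, 2, 16))  # Wednesday window -> 2023-11-15 16:00:00
```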

These replacements are needed to apply mandatory software updates to your underlying host. The updates help strengthen our security, reliability, and operational performance.

We may replace multiple nodes from the same cluster depending on the cluster configuration, while maintaining cluster stability. For sharded clusters, we try not to replace multiple nodes in the same shard at a time. In addition, we try not to replace a majority of the primary nodes in the cluster across all shards.
For non-sharded clusters, we will attempt to stagger node replacements over the maintenance window as much as possible to continue maintaining cluster stability.
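The two sharded-cluster constraints above (at most one node per shard at a time, and never a majority of the primaries at once) can be illustrated as a batching plan. This is a simplified sketch, not ElastiCache's actual scheduler. It assumes every shard appears in `pending`, replaces replicas before a shard's primary, and lets a single-shard cluster replace its one primary (otherwise it could never be replaced). `plan_waves` is a hypothetical name.

```python
def plan_waves(pending):
    """Group node replacements into waves.

    pending: {shard_id: [(node_id, role), ...]} with role "primary" or "replica".
    Each wave holds at most one node per shard and at most half of the primaries.
    """
    # Replicas first within each shard (False sorts before True); sort is stable.
    queues = {s: sorted(nodes, key=lambda n: n[1] == "primary") for s, nodes in pending.items()}
    # "Not a majority" means at most half of the shards' primaries; allow 1 for a single shard.
    max_primaries = max(1, len(pending) // 2)
    waves = []
    while any(queues.values()):
        wave, primaries = [], 0
        for shard_id in sorted(queues):
            queue = queues[shard_id]
            if not queue:
                continue
            node_id, role = queue[0]
            if role == "primary" and primaries >= max_primaries:
                continue  # defer this shard's primary to a later wave
            wave.append(node_id)
            if role == "primary":
                primaries += 1
            queue.pop(0)
        waves.append(wave)
    return waves
```

Each wave makes progress (the first eligible node is always taken), so the loop terminates once every queue is empty.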

Yes, it is possible that these nodes will be replaced at the same time, if your maintenance window for these clusters is configured to be the same.