How do I implement disaster recovery or fault tolerance for my Amazon ElastiCache Redis cluster?

Last updated: 2019-05-31

I need to implement disaster recovery or fault tolerance for my Amazon ElastiCache Redis cluster data. What options are available?

Short Description

The available fault tolerance solutions each have their own balance of data durability, performance impact, and cost. Choose the one that best fits your use case.

Resolution

Multi-AZ with Automatic Failover

Multi-AZ with Automatic Failover is the best option when data retention, minimal downtime, and application performance are priorities.

  • Data loss potential - Low. Multi-AZ provides fault tolerance for every scenario, including hardware-related issues.
  • Performance impact - Low. Of the available options, Multi-AZ provides the fastest time to recovery, because there is no manual procedure to follow after the process is implemented.
  • Cost - Low to high. Multi-AZ is the lowest-cost option. Use Multi-AZ when you can't risk losing data because of hardware failure or you can't afford the downtime required by other options in your response to an outage.
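As an illustration of turning this option on for an existing replication group, here is a minimal sketch. It assumes the boto3 ElastiCache client and its modify_replication_group call with the AutomaticFailoverEnabled, MultiAZEnabled, and ApplyImmediately parameters; the helper names and the group ID "my-redis" are hypothetical. Check the current API reference before relying on it.

```python
def multi_az_request(replication_group_id, apply_immediately=True):
    """Build the arguments for enabling Multi-AZ with automatic failover.

    Assumption: the replication group already has at least one read
    replica, which automatic failover needs in order to promote a new primary.
    """
    return {
        "ReplicationGroupId": replication_group_id,
        "AutomaticFailoverEnabled": True,
        "MultiAZEnabled": True,
        "ApplyImmediately": apply_immediately,
    }


def enable_multi_az(client, replication_group_id):
    """Send the request with an ElastiCache client supplied by the caller."""
    return client.modify_replication_group(**multi_az_request(replication_group_id))


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   enable_multi_az(boto3.client("elasticache"), "my-redis")
```

Keeping the request builder separate from the API call makes it possible to review or log the exact settings before anything is changed.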

Daily automatic backups

You can schedule daily automatic backups at a time when you expect low resource utilization for your cluster. ElastiCache creates a backup of the cluster, and then writes all data from the cache to a Redis RDB file. Redis versions 2.8.22 and later implement a forkless backup that can improve performance.

Note: Redis backup and restore aren't supported on cache.t1.micro nodes for cluster mode disabled clusters.

  • Data loss potential - High (up to a day’s worth). Daily automatic backups are retained for up to 35 days.
  • Performance impact - Medium to high. Running multiple file backups throughout the day impacts performance. To improve performance, consider enabling RDB snapshots on a designated persistence-only secondary node. Then, disable both RDB snapshots and AOF on the primary node and all other secondary nodes.
  • Cost - Low to medium. Storage costs increase with the number of backups and the data retention duration.

Before implementing backup and restore, consider the limitations described at Backup Constraints. For comprehensive information about implementing backups for ElastiCache clusters running Redis, see ElastiCache for Redis Backup & Restore. For information about creating backups on demand, see Making Manual Backups.
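The daily backup schedule described above could be configured along these lines. This is a sketch only: it assumes the boto3 modify_replication_group call with the SnapshotWindow (a UTC range in hh:mm-hh:mm form) and SnapshotRetentionLimit parameters, and the helper name and example values are hypothetical. The 35-day ceiling comes from the retention limit stated above.

```python
import re

# hh:mm-hh:mm in 24-hour UTC time, for example "05:00-06:00"
_WINDOW_PATTERN = re.compile(
    r"([01]\d|2[0-3]):[0-5]\d-([01]\d|2[0-3]):[0-5]\d"
)


def backup_schedule_request(replication_group_id, window_utc, retention_days):
    """Build the arguments for a daily automatic backup schedule.

    Raises ValueError when the retention period or window looks invalid.
    """
    if not 1 <= retention_days <= 35:
        raise ValueError("retention_days must be between 1 and 35")
    if not _WINDOW_PATTERN.fullmatch(window_utc):
        raise ValueError("window_utc must look like '05:00-06:00'")
    return {
        "ReplicationGroupId": replication_group_id,
        "SnapshotWindow": window_utc,
        "SnapshotRetentionLimit": retention_days,
        "ApplyImmediately": True,
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("elasticache")
#   client.modify_replication_group(
#       **backup_schedule_request("my-redis", "05:00-06:00", 7)
#   )
```

Pick a window that falls in a period of low resource utilization for your cluster, as recommended above.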

Manual backups using Redis append-only file (AOF)

Manual backups using AOF are retained indefinitely and are useful for testing and archiving. You can schedule manual backups to occur up to 20 times per node within any 24-hour period.
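If you script manual backups, it can help to check the 20-per-node limit before requesting another one. The following sketch shows one way to do that. The function name and the idea of tracking backup timestamps yourself are assumptions made for illustration; they aren't part of any ElastiCache API.

```python
from datetime import datetime, timedelta, timezone

# Limit stated above: 20 manual backups per node in any 24-hour period.
MAX_MANUAL_BACKUPS_PER_DAY = 20


def can_take_manual_backup(previous_backup_times, now):
    """Return True if another manual backup fits in the trailing 24 hours.

    previous_backup_times: datetimes of earlier manual backups for one node.
    now: the current time, using the same time zone convention as the list.
    """
    cutoff = now - timedelta(hours=24)
    recent = [t for t in previous_backup_times if cutoff < t <= now]
    return len(recent) < MAX_MANUAL_BACKUPS_PER_DAY


# Example: three backups taken over the last three hours leave room for more.
now = datetime(2019, 5, 31, 12, 0, tzinfo=timezone.utc)
history = [now - timedelta(hours=h) for h in (1, 2, 3)]
print(can_take_manual_backup(history, now))
```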

To enable AOF for a Redis cluster, create a parameter group with the appendonly parameter set to yes. Then, assign the parameter group to your cluster.
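The two steps above might look like the following sketch. It assumes the boto3 ElastiCache client calls create_cache_parameter_group, modify_cache_parameter_group, and modify_cache_cluster. The group name, cluster ID, and the "redis2.8" family default are hypothetical values chosen for illustration.

```python
def aof_parameter_values():
    """Parameter settings that turn on the append-only file."""
    return [{"ParameterName": "appendonly", "ParameterValue": "yes"}]


def enable_aof(client, cache_cluster_id, group_name, family="redis2.8"):
    """Create an AOF parameter group and assign it to a cluster.

    client is an ElastiCache client supplied by the caller.
    """
    # Step 1: create the parameter group and set appendonly to yes.
    client.create_cache_parameter_group(
        CacheParameterGroupName=group_name,
        CacheParameterGroupFamily=family,
        Description="Enable Redis append-only file (AOF)",
    )
    client.modify_cache_parameter_group(
        CacheParameterGroupName=group_name,
        ParameterNameValues=aof_parameter_values(),
    )
    # Step 2: assign the parameter group to the cluster.
    client.modify_cache_cluster(
        CacheClusterId=cache_cluster_id,
        CacheParameterGroupName=group_name,
        ApplyImmediately=True,
    )


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   enable_aof(boto3.client("elasticache"), "my-redis-001", "redis28-aof")
```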

When using AOF, keep the following in mind:

  • To improve performance, consider enabling RDB snapshots on a designated persistence-only secondary node. Then, disable both RDB snapshots and AOF on the primary node and all other secondary nodes.
  • To improve performance, set the value of the appendfsync parameter to everysec (write to disk every second) or no (write to disk as needed, when the operating system decides to).
  • AOF is supported only for use with Redis versions 2.8.21 and earlier.
  • AOF is subject to the limitations described at Mitigating Failures: Redis Append Only Files (AOF).
  • AOF isn't supported for cache.t1.micro and cache.t2.* nodes, or Multi-AZ replication groups. For nodes of these types, the appendonly parameter value is ignored.
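The version, node type, and Multi-AZ restrictions in this list can be expressed as a small pre-flight check. This is a sketch under the rules stated above only. The function names are hypothetical, and the check doesn't query ElastiCache.

```python
def aof_supported(engine_version, node_type, multi_az):
    """Apply the AOF restrictions listed above to one cluster's settings."""
    version = tuple(int(part) for part in engine_version.split("."))
    if version > (2, 8, 21):
        return False  # AOF is supported only for Redis 2.8.21 and earlier
    if node_type == "cache.t1.micro" or node_type.startswith("cache.t2."):
        return False  # appendonly is ignored on these node types
    if multi_az:
        return False  # not supported for Multi-AZ replication groups
    return True


def appendfsync_parameter(value="everysec", allow_always=False):
    """Build the appendfsync setting, rejecting 'always' unless allowed."""
    if value not in ("always", "everysec", "no"):
        raise ValueError("appendfsync must be always, everysec, or no")
    if value == "always" and not allow_always:
        raise ValueError("appendfsync=always has a high performance cost")
    return {"ParameterName": "appendfsync", "ParameterValue": value}


print(aof_supported("2.8.19", "cache.m3.medium", multi_az=False))
print(aof_supported("3.2.4", "cache.m3.medium", multi_az=False))
```

The dictionary returned by appendfsync_parameter has the shape that the ParameterNameValues list of a parameter group update expects, assuming the boto3 modify_cache_parameter_group call.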

AOF is a suitable option for maintaining a high level of data persistence at a relatively low cost by using functionality that is native to Redis versions 2.8.21 and earlier.

  • Data loss potential - Low to medium. Although AOF provides a measure of fault tolerance, it can't protect your data from a hardware-related cache node failure, so there is risk of data loss.
  • Performance impact - Low to high. AOF performance impact is highly correlated with the associated appendfsync parameter value, which controls how often the AOF output buffer is written to disk. The more frequently the output buffer is written to disk, the greater the impact on performance. The always option flushes the buffer every time the cache data is modified, so it isn't recommended. Because the AOF file can grow quickly, it's a best practice to verify your disk space requirements. Another performance consideration for AOF is the time required to replay an AOF file. You might need several minutes to populate the Redis nodes with the cache data. During this time, your application can retrieve data that isn't yet in the cache only by querying your database directly.
  • Cost - Low to medium. AOF cost is most highly correlated with the time requirements and performance considerations involved whenever you need to replay an AOF file. Disk-space requirements are greater than those of the snapshot options already described.

For more information, see ElastiCache for Redis Append Only Files (AOF).