Amazon RDS Under the Hood: Single-AZ instance recovery
Amazon Web Services (AWS) customers rely on Amazon Relational Database Service (Amazon RDS) to store their data for all kinds of workloads. For high availability (HA), you can use the Multi-AZ feature of Amazon RDS to provide additional resiliency by maintaining two copies of data across different Availability Zones (AZs). This feature is described in the blog post Amazon RDS Under the Hood: Multi-AZ.
Failures are rare, but as a best practice, applications should be designed around potential failures. The RDS Multi-AZ configuration is the recommended approach for production environments due to its ability to support low RTO (recovery time objective) and RPO (recovery point objective) requirements. RTO is the targeted amount of time for a recovery to complete in the event of a failure. RPO is the targeted amount of time during which data is at risk of loss in the event of a failure.
With millions of active customers using AWS monthly, there are some customer workloads that do not require the level of HA provided by RDS Multi-AZ and the additional costs associated with it. These workloads might have more relaxed RTO and RPO requirements, and Single-AZ configurations might be sufficient to meet those needs. However, before embarking on a Single-AZ only solution, you should understand what recovery expectations and scenarios are available with a Single-AZ RDS instance.
This post describes Amazon RDS Single-AZ RTO and RPO expectations for MySQL, MariaDB, PostgreSQL, Oracle, and Microsoft SQL Server databases. Amazon Aurora uses a different technology and storage subsystem designed for the cloud. Its single instance recovery process and scenarios are described in the Aurora FAQ.
Each RDS instance runs on an Amazon EC2 instance backed by an Amazon EBS volume for storage. RDS takes daily snapshots of the database, which are stored durably in Amazon S3 behind the scenes. It also regularly copies transaction logs to S3, at intervals of up to 5 minutes, which enables point-in-time recovery when needed. Point-in-time recovery is not automatic in RDS; you must trigger it, either manually or via a script as part of an event. The recovered database is created as a new RDS instance.
RTO for recovery with an RDS Single-AZ instance failure can vary from minutes to hours. The duration depends on the size of the database and the failure and recovery approach required, as described later in this post.
RPO for recovery with an RDS Single-AZ instance failure is typically 5 minutes (the interval required for copying transaction logs to Amazon S3), but it can vary. You can check the current value by calling the RDS describe-db-instances operation and reading the LatestRestorableTime field, which is the latest time to which the database can be restored with point-in-time restore.
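As a rough illustration of that check, the sketch below uses the AWS SDK for Python (boto3) to read LatestRestorableTime and report how far it trails the current time. The function names and the instance identifier are hypothetical, and the script assumes boto3 is installed and AWS credentials are configured; treat it as a starting point rather than a definitive monitoring tool.

```python
from datetime import datetime, timedelta, timezone


def restore_lag(db_instance: dict, now: datetime) -> timedelta:
    """Return how far LatestRestorableTime trails `now` for one instance.

    `db_instance` is one entry from the `DBInstances` list returned by
    describe_db_instances.
    """
    return now - db_instance["LatestRestorableTime"]


def current_restore_lag(instance_id: str) -> timedelta:
    """Fetch the instance description and compute its restore lag."""
    import boto3  # imported lazily so restore_lag works without the SDK

    rds = boto3.client("rds")
    response = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
    # boto3 returns timezone-aware datetimes, so compare against aware UTC.
    return restore_lag(response["DBInstances"][0], datetime.now(timezone.utc))
```

For example, `current_restore_lag("my-db-instance")` (a placeholder identifier) would return a `timedelta` that you could compare against your RPO target.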
When we design and plan for failure with an RDS Single-AZ instance, we look at the following scenarios:
- Recoverable instance failure – The individual EC2 node suffered a hardware failure but could be recovered automatically by RDS.
- Non-recoverable instance failure – The individual EC2 node suffered a hardware failure but could not be recovered automatically by RDS.
- EBS volume failure – The EBS volume suffered a data loss failure.
- Availability Zone disruption – Failure at the Availability Zone level that would affect the RDS instance.
We discuss recovery expectations for these scenarios in the following sections.
Recoverable instance failures
An Amazon RDS instance failure occurs when the underlying EC2 instance suffers a failure. When this occurs, an event notification specific to the issue at hand is sent out to alert the customer (see Using Amazon RDS Event Notification for details). However, the RDS instance status remains available.
RDS automatically tries to launch a new instance in the same Availability Zone, attach the EBS volume, and recover. In this scenario, RTO is typically under 30 minutes, and RPO is zero because the EBS volume was recovered.
The EBS volume is in a single Availability Zone, and this recovery occurs in the same Availability Zone as the original instance. Data and table changes that were in flight when the failure occurred might not have been fully committed and completed. Because the RDS DNS name does not change for Single-AZ instances, the connection endpoints remain the same.
Non-recoverable instance failures or EBS volume failures
If RDS instance recovery attempts are unsuccessful, or the underlying EBS volume suffers a data loss failure, the instance state is set to failed. This situation requires a point-in-time recovery. You can automate the recovery via an AWS Lambda function, or you can do it manually.
The RTO is the time required to start up a new Amazon RDS instance and then apply all changes since the last backup. This time can vary from 10 minutes to hours, depending on the number of logs that need to be applied. It can only be determined by testing because it depends on the size of the database, the number of changes made since the last backup, and the workload levels on the database. The RPO is typically 5 minutes, but you can find the current value by calling the RDS describe-db-instances operation and reading the LatestRestorableTime field. The RDS backups and transaction logs are stored in Amazon S3, so this recovery can occur in any supported Availability Zone in the Region.
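One way the automated recovery could look is sketched below as an AWS Lambda handler written with boto3. It calls restore_db_instance_to_point_in_time with UseLatestRestorableTime so that the new instance is restored to the most recent recoverable point, optionally in a different Availability Zone. The event keys (`source_id`, `target_id`, `availability_zone`) and the helper name are assumptions made for this example, not part of any RDS event format.

```python
from typing import Optional


def build_restore_request(source_id: str, target_id: str,
                          availability_zone: Optional[str] = None) -> dict:
    """Build parameters for restore_db_instance_to_point_in_time."""
    params = {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
        # Restore to the latest point covered by the copied transaction logs.
        "UseLatestRestorableTime": True,
    }
    if availability_zone is not None:
        # Optional: place the new instance in a specific Availability Zone.
        params["AvailabilityZone"] = availability_zone
    return params


def lambda_handler(event, context):
    """Hypothetical handler; the event shape is an assumption."""
    import boto3  # available by default in the Lambda Python runtime

    rds = boto3.client("rds")
    params = build_restore_request(
        event["source_id"], event["target_id"], event.get("availability_zone")
    )
    response = rds.restore_db_instance_to_point_in_time(**params)
    return {"status": response["DBInstance"]["DBInstanceStatus"]}
```

The restore call returns before the new instance is ready, so measuring RTO in a test still means waiting until the restored instance reports the available status.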
When a new RDS instance is created, a new DNS name is also created. You can point your application to the new DNS name. You also have the option to rename the newly restored DB instance using the old DB instance’s endpoint name. This is useful when it is difficult or impossible to change the configuration of your application’s ODBC/JDBC settings. It requires you to delete the old failed instance first. After the old instance is deleted, AWS Support can no longer troubleshoot the root cause of the issue, so rename only if necessary.
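The rename step might be scripted along the following lines with boto3's modify_db_instance call. This is a minimal sketch that assumes the old failed instance has already been deleted, so its identifier is free to reuse; the function names and identifiers are illustrative only.

```python
def build_rename_request(restored_id: str, original_id: str) -> dict:
    """Build modify_db_instance parameters that rename the restored instance."""
    return {
        "DBInstanceIdentifier": restored_id,
        "NewDBInstanceIdentifier": original_id,
        # Without this flag the rename waits for the next maintenance window.
        "ApplyImmediately": True,
    }


def rename_restored_instance(restored_id: str, original_id: str) -> None:
    """Rename the restored instance so it takes over the original identifier.

    Assumes the instance that previously used `original_id` no longer exists.
    """
    import boto3  # imported lazily so build_rename_request works without it

    rds = boto3.client("rds")
    rds.modify_db_instance(**build_rename_request(restored_id, original_id))
```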
Availability Zone disruptions
Availability Zone failures are unlikely, and are usually only temporary. If the Availability Zone failure is more permanent, the instance is set to a failed state. The recovery would work as described previously, and a new instance could be created in a different Availability Zone using point-in-time recovery. Because a Single-AZ instance was chosen as part of the overall architecture, you need to perform this step yourself, either manually or by scripting, and the strategy for this recovery scenario would be part of your larger disaster recovery (DR) plans.
If the Availability Zone failure is temporary, the database will be down, but it remains in the available state. You are responsible for detecting this scenario, using application-level monitoring or third-party tools. In this case, you could wait for the Availability Zone to recover, or you could choose to recover the instance to another Availability Zone with a point-in-time recovery.
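Because the instance status alone does not reveal this condition, a simple application-level check is to probe the database endpoint directly. The sketch below uses only the Python standard library to test whether a TCP connection to the endpoint succeeds. The function name is hypothetical, and a successful connection shows only that the port is reachable, not that queries succeed, so a real monitor would likely also run a lightweight query.

```python
import socket


def endpoint_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

You would call it with your instance's endpoint host name and database port, for example port 3306 for MySQL or 5432 for PostgreSQL, and raise an alert after several consecutive failures.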
The RTO would be the time it takes to start up a new RDS instance and then apply all the changes since the last backup. The RPO might be longer, up to the time the Availability Zone failure occurred.
I/O performance is impacted during backup and snapshot operations on an RDS Single-AZ instance. For MariaDB, MySQL, Oracle, and PostgreSQL, all I/O to the RDS instance is briefly suspended during backups and snapshots. This is an important consideration when scheduling backups or performing snapshots.
For RDS Single-AZ instances, patching the OS affects availability and performance because there is no failover alternative. Patching occurs during maintenance windows, and the system is not available during the patch.
To maximize availability, Multi-AZ instances patch the OS on the standby first, and then on the primary instance. In both Single-AZ and Multi-AZ configurations, the system is not available during RDBMS updates.
A Multi-AZ RDS deployment is also required for the Amazon RDS Service Level Agreement to apply.
As a best practice, AWS recommends an RDS Multi-AZ deployment under the Reliability pillar of the Well-Architected Framework. However, you can use an RDS Single-AZ instance when reduced availability requirements and lower cost are considerations. With this understanding of RTO and RPO expectations for Single-AZ RDS instances, you can make an informed decision based on availability, risk, and costs.
About the Authors
Wesley Wilk is a solutions architect with Amazon Web Services.
David Gardner is a senior solutions architect with Amazon Web Services.