Choose the right Amazon RDS deployment option: Single-AZ instance, Multi-AZ instance, or Multi-AZ database cluster

In addition to offering you a choice of seven well-known engines, Amazon Relational Database Service (Amazon RDS) also offers a number of deployment choices to help you select the option that best suits your workload. You can evaluate your requirements and then choose the right set of service offerings. In the latest set of innovations, Amazon released the Amazon RDS Multi-AZ DB cluster, which provides improved commit latency, faster failover, readable standby instances, and optimizations for replication. The instances are powered by AWS Graviton2 processors and deliver up to 40 percent better price performance and 50 percent more local storage GB per vCPU over comparable x86-based instances.

In the following sections, we dive deeper into the Amazon RDS deployment options (Single-AZ deployment, Multi-AZ DB instance deployment, and Multi-AZ DB cluster deployment) and compare them on factors like automatic failover times, read scalability, available engine options, latency for transaction commits, resiliency to Availability Zone (AZ) outage, and recovery objectives (RTO/RPO). Recovery time objective (RTO) and recovery point objective (RPO) are two key metrics to consider when developing resilient architectures. RTO represents how much time it takes you to return to a working state after a disaster. RPO represents how much data you could lose when a disaster happens. For example, an RPO of 1 hour means that you could lose up to 1 hour of data when a disaster occurs.
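The two metrics can be made concrete with a small worked example. The following is a minimal sketch, not part of any AWS tooling: the function name and the timestamps are hypothetical, chosen only to show that the data-loss window is measured back from the failure to the last recoverable point, and the downtime is measured forward from the failure to the moment service is restored.

```python
from datetime import datetime, timedelta
from typing import Tuple


def recovery_metrics(
    last_recoverable: datetime, failure: datetime, restored: datetime
) -> Tuple[timedelta, timedelta]:
    """Return (data_loss_window, downtime) for a single incident."""
    data_loss_window = failure - last_recoverable  # compare this against your RPO
    downtime = restored - failure                  # compare this against your RTO
    return data_loss_window, downtime


# Hypothetical incident: last recoverable point 5 minutes before the failure,
# service restored 90 minutes after it.
failure = datetime(2024, 1, 1, 12, 0, 0)
loss, downtime = recovery_metrics(
    last_recoverable=failure - timedelta(minutes=5),
    failure=failure,
    restored=failure + timedelta(minutes=90),
)
print(loss, downtime)  # prints "0:05:00 1:30:00"
```

An incident like this one would meet an RPO of 1 hour but miss an RTO of 1 hour.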

Single-AZ instance

A Single-AZ instance runs on a single Amazon RDS managed instance backed by Amazon Elastic Block Store (Amazon EBS) volumes. All application read and write requests are routed to this instance. The following diagram illustrates the high-level architecture of a Single-AZ instance, which applies to all Amazon RDS engines.

Because there is no standby instance, a Single-AZ instance cannot fail over during an AZ outage. The RPO with an Amazon RDS Single-AZ instance is typically 5 minutes, which is based on the timeout interval for copying transaction logs to Amazon S3. This time may vary due to open transactions, engine-specific settings, loss of network connectivity to Amazon S3, and instance class (network/disk/heavy workload) limits. You can find the latest point in time to which the instance can be restored by calling the describe-db-instances API and reading the LatestRestorableTime field. The RTO can range from a few minutes to several hours. However, you can create read replicas within the same Region or across Regions to support read workloads, and a read replica can be promoted manually if the primary instance fails.
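As an illustration of reading LatestRestorableTime, the following sketch computes how far the latest restorable time trails a reference time. It assumes the response shape returned by the boto3 describe_db_instances call (a DBInstances list whose entries carry DBInstanceIdentifier and a timezone-aware LatestRestorableTime). The helper names and the sample identifier example-db are hypothetical, and the function that actually calls AWS needs boto3 and credentials, so it is defined but not run here.

```python
from datetime import datetime, timezone


def restorable_lag_seconds(response: dict, now: datetime) -> dict:
    """Map each DB instance identifier to the seconds between its
    LatestRestorableTime and `now`."""
    lags = {}
    for db in response.get("DBInstances", []):
        latest = db.get("LatestRestorableTime")
        if latest is not None:  # skip instances that do not report the field
            lags[db["DBInstanceIdentifier"]] = (now - latest).total_seconds()
    return lags


def fetch_restorable_lag(instance_id: str) -> dict:
    """Query Amazon RDS for one instance (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above runs without boto3

    rds = boto3.client("rds")
    response = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
    return restorable_lag_seconds(response, datetime.now(timezone.utc))


# Offline example with a hand-built response of the assumed shape.
sample = {
    "DBInstances": [
        {
            "DBInstanceIdentifier": "example-db",
            "LatestRestorableTime": datetime(2024, 1, 1, 11, 56, tzinfo=timezone.utc),
        }
    ]
}
lag = restorable_lag_seconds(sample, datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
print(lag)  # prints "{'example-db': 240.0}"
```

A lag that stays well above the typical 5 minutes may be worth investigating against the factors listed above.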

The Single-AZ instance is not the best fit for production workloads where high availability is required. However, it can be a good fit for development or testing purposes where applications do not require high availability, automatic failover, or low RTO/RPO.

For more information, visit Amazon RDS Under the Hood: Single-AZ instance recovery.

Multi-AZ instance

A Multi-AZ instance deployment consists of two Amazon RDS managed instances in two different AZs, referred to as the primary instance and the standby instance. The primary instance serves all read and write traffic. In this deployment option, the standby instance doesn’t serve any read or write traffic. Storage is replicated synchronously from the primary instance to the standby instance.

The following diagram illustrates the high-level architecture of a Multi-AZ deployment for Amazon RDS for PostgreSQL; it also applies to the other engines that Amazon RDS supports. Multi-AZ deployment with one standby applies to all Amazon RDS offerings.

In the event of failure, Amazon RDS initiates an automated failover to the standby instance. During this time, the roles of the two instances in the Multi-AZ instance deployment are reversed and DNS propagation takes place. The automated failover process promotes the standby instance to the new role of primary without any manual intervention. Amazon RDS automatically performs a failover in the event of any of the following:

Loss of availability in primary Availability Zone
Loss of network connectivity to primary
Compute unit failure on primary
Storage failure on primary

The RPO with an Amazon RDS Multi-AZ instance failover is zero because of the synchronous replication to the standby DB instance. Failover usually takes 1–2 minutes. Several factors can prolong it: long recovery times due to rollback of uncommitted transactions or roll-forward of in-memory committed transactions, limits on the instance class’s I/O throughput, lazy loading from Amazon S3 to Amazon EBS volumes, and the amount of transaction logs that must be copied.

During automated failover, in-flight transactions and queries are terminated. Therefore, it’s best practice to have your own mechanisms in place for detecting query cancellation. For information on how you can respond to failovers, reduce recovery time, and other best practices for Amazon RDS, see Best practices for Amazon RDS.
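One common way to handle terminated queries is to catch the connection error and retry with exponential backoff. The following is a minimal sketch under stated assumptions, not a prescribed pattern. The name run_with_retry is hypothetical, Python's built-in ConnectionError stands in for whatever exception class your database driver raises, and the operation passed in is assumed to open a fresh connection and re-run the whole transaction each time it is called, which is only safe for work that can be repeated.

```python
import time


def run_with_retry(operation, retryable=(ConnectionError,), attempts=5,
                   base_delay=1.0, sleep=time.sleep):
    """Call operation(); on a retryable error, back off exponentially and retry."""
    for attempt in range(attempts):
        try:
            return operation()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default


# Simulated transaction that fails twice (as if a failover were in progress)
# and then succeeds. Delays are recorded instead of slept to keep the demo fast.
calls = {"count": 0}


def flaky_transaction():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection terminated during failover")
    return "committed"


delays = []
result = run_with_retry(flaky_transaction, sleep=delays.append)
print(result, delays)  # prints "committed [1.0, 2.0]"
```

With the default five attempts and a 1-second base delay, the waits add up to 15 seconds, so a failover that takes 1–2 minutes would need more attempts or a longer base delay.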

When you use an Amazon RDS Multi-AZ instance, snapshots and backups are taken from the standby instance. This prevents I/O suspension on the primary instance during the backup process, which avoids disrupting read/write traffic on the primary and keeps latencies lower. However, as discussed above, the standby instance is passive and does not serve read traffic. To serve read-only traffic, you can add read replicas to the Multi-AZ instance and use their endpoints for read-only traffic. You can also use read replica promotion as a data recovery scheme if the primary DB instance fails. For more information about read replicas, see Working with read replicas.

The Multi-AZ instance is suitable for business/mission critical applications that require high availability with low RTO/RPO and resilience to an Availability Zone outage. However, this high availability option isn’t a scaling solution for read-only scenarios. You can’t use a standby replica to serve read traffic. To serve read-only traffic, use a Multi-AZ DB cluster or a read replica instead.

For more information, visit Amazon RDS Under the Hood: Multi-AZ.

Multi-AZ DB cluster

The Multi-AZ DB cluster is the latest deployment offering in Amazon RDS, and is available for the MySQL and PostgreSQL engines. The Multi-AZ DB cluster combines automatic failover with two readable standby instances and provides up to 2x faster commit latencies and automated failovers, typically under 35 seconds. The Amazon RDS managed instances are created in three separate Availability Zones and are equipped with fast NVMe SSDs for local storage, ideal for high-speed, low-latency storage. Unlike a Multi-AZ instance deployment, where the standby instance can’t be accessed for reads or writes, a Multi-AZ DB cluster deployment consists of a primary instance running in one AZ that serves read-write traffic and two standby instances running in two other AZs that serve read traffic.

The Multi-AZ DB cluster helps maximize application performance and scalability by splitting traffic: write traffic goes to the cluster endpoint and read traffic goes to the reader endpoint. The following diagram illustrates the high-level architecture.

This deployment option also offers improved failover time, typically under 35 seconds, compared to Multi-AZ instance deployment with the elimination of the crash recovery step. However, the total recovery time depends on replication lag.

In the following sections, we dive deeper into how the Multi-AZ DB cluster handles read and write traffic compared to the Single-AZ instance and Multi-AZ instance offerings.

Basic Architecture

There are three instances in a Multi-AZ DB cluster: one primary writer instance and two readable standby instances in different AZs. Each of these instances consists of an Amazon Elastic Compute Cloud (Amazon EC2) instance with local SSD storage, which improves performance, and attached Amazon EBS volumes for durable storage. Furthermore, the instances in a Multi-AZ DB cluster use the same instance class, and the instance class and the storage for the EBS volumes are defined at the cluster level. The local storage of an Amazon RDS compute instance isn’t elastic, and its size is tied to the instance class chosen.

Replication Intricacies

Although the Multi-AZ instance deployment option uses synchronous replication, Multi-AZ DB cluster deployment replication is semi-synchronous. Semi-synchronous replication guarantees that if the primary crashes, all committed transactions have been transmitted to at least one readable standby instance. This is called a “quorum.” Compared with asynchronous replication, semi-synchronous replication provides improved data integrity and durability, because when a commit returns successfully, we know that the data exists in at least two places. This makes sure that in the event of failure on the primary instance, one of the standby reader instances can be promoted to primary through an automated failover orchestrated by Amazon RDS. The time taken for failover is typically around 25–75 seconds, but may increase because of replica lag: additional time is required to apply transactions (for example, relay logs in MySQL) from the local SSD to the Amazon EBS volumes before the reader can be promoted as the new writer.

Faster Write Operations

When compared to Multi-AZ instance deployment, Multi-AZ DB cluster deployment provides lower latency for write commits. The primary database instance replicates to two standby reader instances, each in its own Availability Zone, which provides better durability because the data is available in another Availability Zone. To understand this better, let’s see how writes are made in a Multi-AZ DB cluster:

Transactions are committed and applied on the primary only after one of the standby instances acknowledges that the transaction is written to that standby’s local SSD. The Multi-AZ DB cluster uses a quorum mechanism to confirm that at least one standby acknowledged the change.
Data is copied asynchronously from local SSD to attached EBS volumes.

The following figure illustrates this process.
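The effect of waiting for only one of two standbys can be sketched with a toy model. This is an illustration only, not how Amazon RDS is implemented: the function names are hypothetical, the millisecond figures are invented, and only the time spent waiting for standby acknowledgements is modeled.

```python
from typing import Sequence


def commit_wait_single_standby(standby_ack_ms: float) -> float:
    """One synchronous standby: every commit waits for that standby."""
    return standby_ack_ms


def commit_wait_quorum(standby_acks_ms: Sequence[float], required_acks: int = 1) -> float:
    """Commit returns once `required_acks` standbys have acknowledged,
    that is, after the required_acks-th fastest acknowledgement arrives."""
    if not 1 <= required_acks <= len(standby_acks_ms):
        raise ValueError("required_acks must be between 1 and the number of standbys")
    return sorted(standby_acks_ms)[required_acks - 1]


# Invented numbers: one healthy write path (2 ms) and one impaired path (50 ms).
healthy_ms, impaired_ms = 2.0, 50.0

single = commit_wait_single_standby(impaired_ms)
quorum = commit_wait_quorum([healthy_ms, impaired_ms])
print(single, quorum)  # prints "50.0 2.0"
```

In this model a single impaired write path slows every commit when there is only one standby, while the cluster's commit wait follows the faster of its two standbys.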

While Multi-AZ DB clusters provide resiliency using the semi-synchronous replication model, they can still have replication lag if one instance in the quorum set has not applied all of the transactions. If your application needs to read all up-to-date data, which is known as “strong read consistency”, then you must use the writer endpoint for reads. You can use the reader endpoint for applications that are built to handle replication lag.
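The endpoint choice described here can be expressed as a small routing helper. This is a sketch under stated assumptions: the class name, the function name, and both hostnames are hypothetical placeholders rather than real RDS endpoints, and your application would pass the returned hostname to its own database driver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterEndpoints:
    writer: str  # cluster (writer) endpoint: always up to date
    reader: str  # reader endpoint: may lag behind the writer


def pick_endpoint(endpoints: ClusterEndpoints, is_write: bool,
                  needs_strong_read: bool = False) -> str:
    """Send writes and strongly consistent reads to the writer endpoint;
    send lag-tolerant reads to the reader endpoint."""
    if is_write or needs_strong_read:
        return endpoints.writer
    return endpoints.reader


# Placeholder hostnames, for illustration only.
endpoints = ClusterEndpoints(
    writer="writer.example.internal",
    reader="reader.example.internal",
)

print(pick_endpoint(endpoints, is_write=True))                            # writer
print(pick_endpoint(endpoints, is_write=False))                           # reader
print(pick_endpoint(endpoints, is_write=False, needs_strong_read=True))   # writer
```

Keeping the decision in one place makes it easier to audit which reads tolerate replication lag.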

The behavior of readable standby instances observed in a Multi-AZ DB cluster with respect to replica lag is similar to what we observe in a Single-AZ or Multi-AZ instance with one or more read replicas, where each read replica may lag by different values. However, unlike standalone read replicas of Single-AZ instances and Multi-AZ instances, where replication is asynchronous, the replication in a Multi-AZ DB cluster from the primary to readable standby instances is semi-synchronous.

As a result of these improvements in the write process, the RDS Multi-AZ DB cluster provides the following benefits:

Lower latency and higher throughput as writes are made to local storage, which is faster storage when compared to Amazon EBS. Amazon RDS Multi-AZ DB cluster supports up to 2x faster transaction commit latencies than a Multi-AZ or Single-AZ deployment.
Higher resiliency to Availability Zone outage with two standby instances that can serve read traffic.
The Multi-AZ DB cluster uses a two out of three quorum, meaning that writes need to be acknowledged by one of the standbys. This makes the cluster more resilient if a write path is impaired, leading to better overall performance in the event of a failure scenario.

The Multi-AZ DB cluster is suitable for business/mission critical applications that require high availability with automatic failover, low RTO/RPO, improved commit latency, faster failover, readable standby instances for read scalability, and optimized replication.

Summary

In this post, we walked you through the different Amazon RDS deployment options, including the new Multi-AZ DB cluster, and the key factors to consider when choosing the right option for your workloads. The following table summarizes the key considerations when selecting a deployment option.

Considerations	Single-AZ	Multi-AZ (with one standby)	Multi-AZ (with two readable standby instances)
Standby instance can accept reads	No	No	Yes
Commit latency	Low	Higher than Single-AZ	Up to two-times faster commits for writes compared to Multi-AZ instance
Automatic failover	No, because there is no standby	Yes	Yes
Failover time	Not possible	Can take up to 120 seconds, based on crash recovery	Typically 25–75 seconds, but depends on replica lag
AZ outage resiliency	In the event of Availability Zone failure, you risk data loss; RPO can be up to 5 minutes	In the event of Availability Zone failure, your workload automatically fails over to the standby instance	Two standby instances serve as failover targets
Storage jitter	No optimization for jitter	Sensitive to impairments on the write path	Uses two of three quorum: insensitive to up to one impaired write path
Replication mode	None	Synchronous replication	Semi-synchronous engine-native replication
Performance impact of snapshots	Brief I/O suspension	Taken from the standby instance, no I/O suspension	Taken from the primary using the Amazon EBS crash-consistent snapshot feature, which doesn’t result in I/O suspension

The Multi-AZ DB cluster option is ideal when your workloads require lower write latency, automated failovers, and additional read capacity. You can migrate a Single-AZ or Multi-AZ instance to a Multi-AZ DB cluster using a read replica or the snapshot and restore method.
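The considerations in this post can be condensed into a rough decision helper. The following sketch is a deliberate simplification and not an official selection tool: the function name and return strings are hypothetical, it looks only at automatic failover, read scaling, and engine support, and the engine identifiers mysql and postgres are illustrative stand-ins for the two engines this post lists for the Multi-AZ DB cluster.

```python
# Engines that this post lists as available for the Multi-AZ DB cluster.
CLUSTER_ENGINES = {"mysql", "postgres"}


def suggest_deployment(engine: str, needs_automatic_failover: bool,
                       needs_read_scaling: bool) -> str:
    """Suggest a deployment option from three coarse requirements."""
    if needs_automatic_failover:
        if needs_read_scaling:
            if engine in CLUSTER_ENGINES:
                return "Multi-AZ DB cluster"
            # The standby of a Multi-AZ instance cannot serve reads.
            return "Multi-AZ instance with read replicas"
        return "Multi-AZ instance"
    if needs_read_scaling:
        return "Single-AZ instance with read replicas"
    return "Single-AZ instance"


print(suggest_deployment("postgres", True, True))    # Multi-AZ DB cluster
print(suggest_deployment("oracle-ee", True, True))   # Multi-AZ instance with read replicas
print(suggest_deployment("mysql", True, False))      # Multi-AZ instance
print(suggest_deployment("mysql", False, False))     # Single-AZ instance
```

A real decision would also weigh commit latency, failover time, and cost, which this helper leaves out.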

Have follow-up questions or feedback? Let us know by creating a technical support ticket, posting your question in the AWS forums, or leaving a comment. We’d love to hear your thoughts and suggestions.

About the Authors

Ankush Agarwal is a Solutions Architect at AWS. He’s an AWS certified architect and helps customers design resilient workloads on AWS. He also has experience in designing, deploying, and optimizing data analytics workloads on the AWS Cloud. Outside of work, you will find him wandering in urban forests or on a sports field.

Pranshu Mishra is a Solutions Architect at AWS. He’s an AWS certified professional in eight areas and specializes in databases and serverless technologies. He has experience in designing, deploying, and optimizing workloads on the AWS Cloud. Beyond work, he enjoys spending his time exploring the outdoors and immersing himself in nature.