AWS Database Blog

Amazon RDS Under the Hood: Multi-AZ

Amazon Web Services (AWS) customers bet their businesses on their data store and highly available access to it. For these customers, Multi-AZ configurations provide an easy-to-use solution for high availability (HA).

When you enable Multi-AZ, Amazon Relational Database Service (Amazon RDS) maintains a redundant and consistent standby copy of your data. If you encounter problems with the primary copy, Amazon RDS automatically switches to the standby copy to provide continued availability to the data. The two copies are maintained in different Availability Zones (AZs), hence the name “Multi-AZ.” Having separate Availability Zones greatly reduces the likelihood that both copies will concurrently be affected by most types of disturbances. Proper management of the data, simple reconfiguration, and reliable user access to the copies are key to addressing the high availability requirements that customer environments demand.

This post describes Amazon RDS Multi-AZ configurations for MySQL, MariaDB, PostgreSQL, and Oracle database instances. Amazon RDS for SQL Server and Amazon RDS for Amazon Aurora use a different technology stack to provide Multi-AZ capabilities.

Basic design
The Multi-AZ feature is implemented using a replication layer installed between the database application and the Amazon Elastic Block Store (Amazon EBS) volumes. This layer handles application read and write requests and applies them in an environment where two discrete EBS volume copies are maintained—one accessed locally and one accessed remotely.

During normal operation, there are two active Amazon EC2 instances with the replication layer installed. Each instance manages one EBS volume with a full copy of the data. Configuration binds the two instances and their volumes as a Multi-AZ database instance. The replication layers are in direct communication with each other over a TCP connection.

At any moment in time, each instance is assigned a specific role. One is the primary, and it exposes an external endpoint through which users access their data. The other is the standby, and it acts as a secondary instance that synchronously writes all data that it receives from the primary. Database write operations result in the data being properly written to both volumes before a successful response is sent back to the calling application. However, read operations are always performed through the primary EBS volume. Because the database server process is not running on the standby instance, the standby does not expose an external endpoint. Consequently, its copy of the data is not available to users.
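The write and read paths described above can be sketched in a few lines of Python. This is only an illustrative model, not the actual replication layer: `Volume` and `ReplicatedWriter` are hypothetical names, and in-memory dictionaries stand in for the EBS volumes.

```python
class Volume:
    """Hypothetical stand-in for an EBS volume: block number -> bytes."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_no, data):
        self.blocks[block_no] = data
        return True  # a real volume could fail or time out here

    def read(self, block_no):
        return self.blocks.get(block_no)


class ReplicatedWriter:
    """Acknowledges a write only after both copies have accepted it."""

    def __init__(self, local, remote):
        self.local = local    # volume attached to the primary
        self.remote = remote  # volume attached to the standby

    def write(self, block_no, data):
        ok_local = self.local.write(block_no, data)
        ok_remote = self.remote.write(block_no, data)
        # Success is reported only when both writes completed.
        return ok_local and ok_remote

    def read(self, block_no):
        # Reads are served from the primary's volume only.
        return self.local.read(block_no)


primary_vol = Volume("EBS #1")
standby_vol = Volume("EBS #2")
writer = ReplicatedWriter(primary_vol, standby_vol)
acked = writer.write(7, b"row data")
```

After the acknowledged write, both volumes hold the same block, while `writer.read(7)` touches only the primary's copy.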

To improve availability, Multi-AZ tries to consistently ensure that one of the instances is in the primary role, providing access to its copy of the data. If there is an availability issue, the standby instance can automatically be promoted to the primary role and availability can be restored through redirection. This event is referred to as a failover. The previous primary, if it’s still up and running, is demoted to the standby role.

Redirection to the new primary instance is provided through DNS. The relevant records in the results from client DNS queries have very low time-to-live (TTL) values. This is intended to inhibit long-term caching of the name-to-address information, so that clients refresh the information sooner in the failover process and pick up the DNS redirection changes more quickly.
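To illustrate why a low TTL shortens the time a client spends pointing at the old primary, here is a small simulation of a client-side cache that honors record TTLs. Everything in it is an assumption made for illustration: the `TtlCache` class, the endpoint name, the addresses, and the 5-second TTL are hypothetical and are not values published by Amazon RDS.

```python
class TtlCache:
    """Hypothetical client-side DNS cache that honors record TTLs."""

    def __init__(self, resolver, clock):
        self.resolver = resolver  # callable: name -> (address, ttl_seconds)
        self.clock = clock        # callable returning the current time in seconds
        self.entries = {}         # name -> (address, expires_at)

    def lookup(self, name):
        now = self.clock()
        cached = self.entries.get(name)
        if cached and now < cached[1]:
            return cached[0]  # still cached: may be the pre-failover address
        address, ttl = self.resolver(name)
        self.entries[name] = (address, now + ttl)
        return address


now = [0.0]                                   # simulated clock
record = {"address": "10.0.1.10", "ttl": 5}   # assumed record for the endpoint


def resolver(name):
    return record["address"], record["ttl"]


cache = TtlCache(resolver, lambda: now[0])
endpoint = "mydb.example.internal"            # hypothetical endpoint name

before = cache.lookup(endpoint)               # resolved and cached at t=0
record["address"] = "10.0.2.10"               # failover: DNS now points at the new primary
now[0] = 2.0
during = cache.lookup(endpoint)               # TTL not yet expired, old address returned
now[0] = 6.0
after = cache.lookup(endpoint)                # TTL expired, new address picked up
```

In this model the client can keep using the old address for at most one TTL after the record changes, which is why a shorter TTL shortens that window.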

The following diagram depicts a Multi-AZ instance that is running in its normal connected state.

Figure 1: Multi-AZ instance

The database application (DB APP, shown in yellow) uses DNS (shown in orange) to obtain the address information for the current external endpoint that is providing access to the data.

There are two RDS DB instances in this Multi-AZ instance: the primary instance (shown on the left side, in green) and the standby instance (shown on the right side, in blue). In this example, DNS is directing the application to the primary instance EC2 #1, serving the primary copy of the data EBS #1 that is available in Availability Zone #1. The replication layers of the two EC2 instances are connected. Write operations that the application issues also result in writes to the second instance (path shown in gray).

Generally, failover events are rare, but they do occur. For situations in which Amazon RDS detects problems, the failover is initiated by automation. You can also manually trigger failover events through the Amazon RDS API.
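One way to trigger a failover manually is to reboot the DB instance with failover requested. The sketch below uses the boto3 `reboot_db_instance` call with `ForceFailover=True`. The wrapper function `force_failover` and the instance identifier in the usage comment are hypothetical, and the client is passed in so that the function can be exercised without AWS credentials.

```python
def force_failover(rds_client, db_instance_id):
    """Reboot a Multi-AZ DB instance and request a failover to the standby.

    `rds_client` is expected to behave like boto3.client("rds").
    """
    return rds_client.reboot_db_instance(
        DBInstanceIdentifier=db_instance_id,
        ForceFailover=True,
    )


# Usage against a real account (not executed here; identifier is hypothetical):
#   import boto3
#   force_failover(boto3.client("rds"), "my-multi-az-db")
```

Because failover interrupts existing connections, this is typically done in a test environment or a maintenance window to verify that the application reconnects as expected.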

The replication layer has limited visibility above itself and is therefore incapable of making some of the more strategic decisions. For example, it doesn’t know about such things as user connectivity issues, local or regional outages, or the state of its EC2 peer that may have unexpectedly gone silent. For this reason, the two instances are monitored and managed by an external observer that has access to more critical information and periodically queries the instances for status. When appropriate, the observer takes action to ensure that availability and performance requirements are met.

The availability and durability improvements provided by Multi-AZ come at a minimal performance cost. In the normal use case, the replication layers are connected, and synchronous write operations to the standby EBS volume occur. The standby instance and volume are in a distinct and geographically distant Availability Zone. Assessment shows increases in database commit latencies of between 2 ms and 5 ms. However, the actual impact on real-world use cases is highly workload-dependent. Most customer Multi-AZ instances show a minor impact on performance, if any at all.
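As a rough worked example of why the impact depends on the workload, consider a single connection that commits serially, one transaction after another. The 5 ms baseline commit latency below is a hypothetical figure chosen for illustration. Only the 2 ms and 5 ms increases come from the text above.

```python
def serial_commits_per_second(commit_latency_ms):
    """Upper bound on commits/second for one connection committing serially."""
    return 1000.0 / commit_latency_ms


baseline_ms = 5.0  # hypothetical single-AZ commit latency, not from the source
rates = {
    added_ms: serial_commits_per_second(baseline_ms + added_ms)
    for added_ms in (0.0, 2.0, 5.0)  # 2 ms and 5 ms are the bounds quoted above
}
for added_ms, rate in rates.items():
    print(f"+{added_ms:.0f} ms -> {rate:.0f} commits/s per connection")
```

A commit-bound, single-connection workload like this one is close to a worst case. Workloads that are read-heavy, commit infrequently, or spread commits across many concurrent connections would see a much smaller effect.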

This design enables AWS to provide a Service Level Agreement (SLA) that exceeds 99.95 percent availability to customer data. To learn more, see the Amazon RDS Service Level Agreement.
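To put the 99.95 percent figure in perspective, the following arithmetic converts an availability percentage into a downtime budget. The 30-day month is an assumption made for this calculation. The SLA document itself defines how availability is actually measured.

```python
def max_downtime_minutes(availability_percent, days=30):
    """Downtime budget implied by an availability percentage over `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


budget = max_downtime_minutes(99.95)
print(f"{budget:.1f} minutes per 30-day month")
```

Under these assumptions, 99.95 percent corresponds to roughly 21.6 minutes per 30-day month, so an SLA that exceeds that figure implies a smaller budget.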

Intricacies of the implementation
You might think that the design of a volume replication facility is rather simple and straightforward. However, the actual implementation is fairly complex. This is because it must account for all the predicaments that two networked, discrete instances and volumes might find themselves in, inside a constantly changing and sometimes disrupted environment.

Normal ongoing replication assumes that everything is in reasonable working order and is performing well: The EC2 instances are available, regular instance monitoring is functional, the EBS volumes are available, and the network is performing as expected. But what happens when one or more of these pieces is misbehaving? Let’s look at some of the issues that could arise and how they are addressed.

Connectivity issues and synchronization
Occasionally the primary and standby instances are not connected to each other, either due to a problem or a deliberate administrative action. Ongoing replication is not possible, and waiting a long time for connectivity to be restored is not acceptable. When connectivity is lost or deliberately discontinued, the instances momentarily pause, waiting for a decision to be made by the observer. When the observer detects this condition, it directs an available instance to assume the primary role and to proceed on its own without replication. There is now only one current copy of the data, and the other copy is becoming increasingly out of date.
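The observer's decision in this situation might look roughly like the sketch below. The function name, the status fields, and in particular the tie-breaking rule (prefer the instance that was already primary, to avoid an unnecessary failover) are assumptions made for illustration. The post only states that an available instance is directed to assume the primary role.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    STANDBY = "standby"


def choose_solo_primary(statuses):
    """Pick the instance that should continue alone as primary.

    `statuses` maps an instance name to a dict with hypothetical keys:
      "reachable": whether the observer can contact the instance
      "role":      the role the instance held before the disconnect
    Returns the chosen instance name, or None if no instance is available.
    """
    reachable = [name for name, s in statuses.items() if s["reachable"]]
    if not reachable:
        return None
    # Assumption: keep the current primary when possible, so no failover is needed.
    for name in reachable:
        if statuses[name]["role"] is Role.PRIMARY:
            return name
    return reachable[0]


both_up = {
    "ec2-1": {"reachable": True, "role": Role.PRIMARY},
    "ec2-2": {"reachable": True, "role": Role.STANDBY},
}
primary_down = {
    "ec2-1": {"reachable": False, "role": Role.PRIMARY},
    "ec2-2": {"reachable": True, "role": Role.STANDBY},
}
keep_primary = choose_solo_primary(both_up)
promote_standby = choose_solo_primary(primary_down)
```

In the first case the existing primary simply proceeds without replication. In the second, the standby is promoted, which is a failover.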

Connectivity issues are usually investigated, and the problem is often quickly corrected. If the issue persists beyond a minimum amount of time, it raises an alert for operator intervention. The majority of connectivity issues are therefore expected to be relatively short-lived, with connectivity between the two instances soon restored. When connectivity is restored, the volumes must be resynchronized before returning to the normal, ongoing replication state.

The resynchronization process ensures that both copies of the data are restored to a consistent state. In an effort to reduce the time needed for resynchronization, the primary keeps track of blocks that are modified while the two instances are disconnected. When resynchronizing occurs, only those modifications need to be sent from the primary instance to the standby instance, which speeds up the process.
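The block-tracking idea can be sketched as follows. This is a simplified model under stated assumptions: `DirtyBlockTracker` is a hypothetical name, volumes are modeled as dictionaries, and a set records which block numbers changed while the instances were disconnected.

```python
class DirtyBlockTracker:
    """Remembers which blocks changed on the primary while disconnected."""

    def __init__(self):
        self.connected = True
        self.dirty = set()

    def on_write(self, block_no):
        # While connected, writes are replicated immediately; nothing to track.
        if not self.connected:
            self.dirty.add(block_no)

    def resync(self, primary_blocks, standby_blocks):
        """Copy only the modified blocks to the standby; return how many."""
        for block_no in sorted(self.dirty):
            standby_blocks[block_no] = primary_blocks[block_no]
        sent = len(self.dirty)
        self.dirty.clear()
        self.connected = True
        return sent


primary = {n: b"old" for n in range(1000)}
standby = dict(primary)
tracker = DirtyBlockTracker()

tracker.connected = False                 # connectivity lost
for block_no in (3, 42, 42, 999):         # block 42 is written twice
    primary[block_no] = b"new"
    tracker.on_write(block_no)

sent = tracker.resync(primary, standby)   # connectivity restored
```

In this example only 3 of the 1,000 blocks are transferred, and repeated writes to the same block cost a single transfer.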

Fault tolerance in a dynamic environment
AWS is a large-scale, highly dynamic environment, and Amazon RDS Multi-AZ is designed to step in and take action when software and hardware disruptions occur.

In the event of a disruption, instance or volume availability problems are the most common case, and they are predominantly resolved by performing a simple failover operation. This restores availability through the standby instance and volume.

In the unlikely event that a volume experiences a failure, it is replaced with a new one. The process of replacement begins with securing a snapshot of the surviving volume. This is mainly for durability reasons, but it also helps improve the performance of the subsequent resynchronization of the volumes. The instance is then connected to the new volume and the volume is hydrated from the snapshot. Upon completion, the volumes are resynchronized and replication is restored.
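The replacement sequence above can be modeled in a few lines. This sketch is an illustration only: the function name and step labels are hypothetical, volumes are dictionaries, and the `writes_during_hydration` parameter is an assumption used to show why a final resynchronization is still needed after hydrating from the snapshot.

```python
def replace_failed_volume(surviving, writes_during_hydration=()):
    """Model of volume replacement; returns (new_volume, steps_performed)."""
    steps = []

    snapshot = dict(surviving)            # point-in-time copy of the survivor
    steps.append("snapshot surviving volume")

    new_volume = {}                       # fresh, empty replacement volume
    steps.append("attach new volume")

    new_volume.update(snapshot)           # bulk of the data comes from the snapshot
    steps.append("hydrate from snapshot")

    # Writes keep arriving on the surviving volume in the meantime.
    dirty = set()
    for block_no, data in writes_during_hydration:
        surviving[block_no] = data
        dirty.add(block_no)

    # Only blocks changed since the snapshot have to be sent.
    for block_no in dirty:
        new_volume[block_no] = surviving[block_no]
    steps.append("resynchronize changed blocks")

    steps.append("resume replication")
    return new_volume, steps


surviving = {0: b"a", 1: b"b"}
new_volume, steps = replace_failed_volume(surviving, [(1, b"B"), (2, b"c")])
```

Because the snapshot supplies most of the data, the resynchronization step only has to transfer what changed after the snapshot was taken.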

Instance or volume replacement might also be an option in situations where a component exhibits behaviors outside the norm. For example, a substantial or prolonged increase in latency or reduction in bandwidth can indicate an issue with the location of, or the path to, the resource. Replacement is expected to be a permanent solution in such situations. Note that a replacement can impact performance, so it is only performed when necessary.

There could be situations in which an entire AWS Region or Availability Zone is affected—for example, during extreme weather or a widespread power outage. During these situations, special attention is given to ensure that Multi-AZ instances remain available. Care must be taken not to escalate a problem into a more serious situation. The observer uses Region availability information to pause unnecessary automated recovery actions while the underlying issue gets resolved.

Amazon RDS Multi-AZ configurations improve the availability and durability of customer data. With automated monitoring for problem detection and subsequent corrective action to restore availability in the event of a disruption, Multi-AZ ensures that your data remains intact. For more information, see Amazon RDS High Availability.

About the Author

John Gemignani is a principal software engineer in RDS at Amazon Web Services.