Improving application availability with Amazon RDS Proxy

One of the benefits of Amazon RDS Proxy is that it can improve application recovery time after database failovers. RDS Proxy supports both MySQL and PostgreSQL engines; this post uses a MySQL test workload to demonstrate how RDS Proxy reduces client recovery time after failover by up to 79% for Amazon Aurora MySQL and by up to 32% for Amazon RDS for MySQL. This post also explains how RDS Proxy insulates clients from reader-writer transition issues and overcomes sub-optimal client configurations. We discuss how RDS Proxy helps with planned and unplanned failovers through active connection monitoring, and how it helps client connection pools by retaining idle connections through failovers. Finally, this post offers some best practices for client configurations.

Background

RDS Proxy can front your Amazon RDS for MySQL/PostgreSQL and Aurora MySQL/PostgreSQL databases. It allows you to manage an application’s access to the database and provides connection pooling, multiplexing, and graceful failover. It helps you to scale beyond database connection limits and manage bursts of connections and requests from applications. This post focuses on the failover benefits of RDS Proxy.

Failover occurs when the primary database instance becomes inaccessible and another instance takes over as the new primary. This disrupts client connections. Failovers can be planned, when they are induced by administrative actions such as a rolling upgrade, or unplanned, when they occur due to failures. In both cases, you want to reduce downtime to minimize client disruption.

Amazon RDS has multiple high availability options, and RDS Proxy provides failover recovery benefits for each option. The Amazon RDS Multi-AZ option is a primary and standby configuration with synchronous replication. Aurora offers up to 15 asynchronous replicas as availability options. All Aurora replicas are available for read-only access. You can elect and promote any replica as a new primary for the cluster in the event of failover without any data loss.

With all of these options, when a failover occurs, the client has to detect the connection failure, discover a new primary, and reconnect to it as quickly as possible. With RDS Proxy, the applications can avoid the complexity associated with failovers and experience faster recovery (in as few as 3 seconds). With Aurora, where failovers are faster, DNS propagation delay is the largest contributor to your overall failover time. RDS Proxy actively monitors database instances and automatically connects clients to the right target. It also maintains idle client connections through database failover without dropping them. Idle connections in this context are connections that don’t have outstanding requests.

Aurora failovers

An internal test, executed 100 times, compared failover times for a direct connection to an Aurora cluster against a connection to the same cluster via RDS Proxy. The results show a significant improvement with RDS Proxy.

The following table summarizes these findings in milliseconds.

Test	Min	Max	Avg	Proxy Advantage
RDS Proxy with Aurora	1,667	15,511	3,138	87%
Direct to Aurora with MariaDB driver	9,720	31,491	24,037	87%

The average failover time with the RDS Proxy was 3.1 seconds, as opposed to 24 seconds when connecting directly to the database with a MariaDB driver. This is an 87% improvement.
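The Proxy Advantage figure can be reproduced from the two averages as a relative reduction in recovery time. The following is a minimal sketch of that arithmetic; the class and method names are hypothetical and are not part of the original test harness.

```java
final class ProxyAdvantage {
    // Relative reduction in recovery time, expressed as a percentage
    // of the direct-connection time.
    static double improvementPercent(double directMillis, double proxyMillis) {
        return (directMillis - proxyMillis) / directMillis * 100.0;
    }

    public static void main(String[] args) {
        // Average recovery times in milliseconds, taken from the table above.
        double percent = improvementPercent(24_037, 3_138);
        System.out.println("Proxy advantage: " + Math.round(percent) + "%");
    }
}
```

With the averages above, (24,037 - 3,138) / 24,037 is roughly 0.869, which rounds to the 87% shown in the table.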

A 3-second failover disruption is impressive for a relational database, even for a basic benchmark. This is a result of several Aurora innovations. But what’s also evident is that even the recommended MariaDB Aurora driver, which includes optimizations to reduce failover times, isn’t sufficient to experience all the benefits of Aurora. You need RDS Proxy to take full advantage of fast Aurora MySQL failovers because your current bottleneck may well be your client driver’s ability to recover as fast as the database.

DNS time to live

You can tune the client drivers in many different ways. For example, the MariaDB Connector/J driver has nearly 100 configuration settings to play with. There are more configuration options in the OS, JVM, and connection pool frameworks. This post covers the most important ones, starting with DNS client cache configuration.

When a client connects to a database using a DNS name, it must first resolve the name to an IP address by querying a DNS server. The client then caches the response. Per protocol, DNS responses specify the time to live (TTL), which governs how long the client should cache the record. RDS sets the default TTL to 4 seconds. However, many systems implement client caches with different settings and make the TTL longer. Both the OS and the JVM runtime environment have such a cache. The cache settings in the JVM differ by version and OS, so it’s important to set them explicitly. The following code shows the recommended settings:

// Cache successful DNS lookups for 1 second.
java.security.Security.setProperty("networkaddress.cache.ttl", "1");
// Cache failed DNS lookups for 1 second.
java.security.Security.setProperty("networkaddress.cache.negative.ttl", "1");

This reduces the default DNS caching in the JVM, which allows the driver to discover IP address changes for a DNS name faster during failover.
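One practical detail is where these calls go. The JVM typically reads its address cache policy once, when the cache is first initialized, so the properties are best set at application startup before any host name is resolved. The following is a minimal sketch under that assumption; the class and method names are hypothetical.

```java
import java.security.Security;

final class DnsCacheSettings {
    // Apply the same TTL (in seconds) to successful and failed lookups.
    // Intended to be called once at startup, before the first DNS lookup.
    static void apply(int ttlSeconds) {
        String ttl = Integer.toString(ttlSeconds);
        Security.setProperty("networkaddress.cache.ttl", ttl);
        Security.setProperty("networkaddress.cache.negative.ttl", ttl);
    }

    public static void main(String[] args) {
        apply(1);
        // Read the values back to confirm they were applied.
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
        System.out.println(Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}
```

Calling DnsCacheSettings.apply(1) as the first statement of the application's entry point keeps the setting in one place and avoids depending on JVM defaults.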

Failover benchmark with an optimized client

The next benchmark uses a client optimized with DNS TTL settings, as described previously, and other recommended client settings that this post discusses later. It compares the failover time when using a MariaDB driver with optimal settings and connecting directly to Aurora MySQL versus via RDS Proxy.

The following table summarizes these findings in milliseconds.

Test	Min	Max	Avg	Proxy Advantage
Proxy Aurora (MariaDB Driver)	1,644	11,642	2,913	79%
Direct Aurora (MariaDB Aurora Driver)	5,146	30,782	13,783	79%

With a well-tuned MariaDB client, you get an average failover time of 2.9 seconds when using RDS Proxy and 13.8 seconds for direct connections, which is a 79% improvement with RDS Proxy. Note that the improvement in recovery times depends on the specific workload.

Aurora client configuration

The preceding test used the recommended MariaDB Connector/J with Aurora-specific enhancements. Specifically, the direct connections to the database used a connection URL that starts with jdbc:mariadb:aurora://<cluster-endpoint>. This allows the MariaDB Connector/J driver to use Aurora-specific system tables to quickly discover the primary database instance. For more information, see Using the MariaDB JDBC driver with Amazon Aurora with MySQL compatibility.

The connection via RDS Proxy used vanilla driver functionality with a URL such as jdbc:mariadb://<proxy-endpoint>. Despite not using any special client configurations, RDS Proxy was more effective at minimizing the failover times. Having simpler client logic is beneficial because it also means you can use any client you want.
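The difference between the two configurations comes down to the URL prefix and the endpoint. The following sketch builds both URL forms side by side. The endpoint host names and database name are placeholders, the class and method names are hypothetical, and the aurora prefix assumes a Connector/J version that supports it.

```java
final class JdbcUrls {
    // Direct connection: the aurora prefix enables the driver's
    // Aurora-specific primary discovery.
    static String directAurora(String clusterEndpoint, String database) {
        return "jdbc:mariadb:aurora://" + clusterEndpoint + "/" + database;
    }

    // Connection via RDS Proxy: a plain MariaDB URL with no failover mode.
    static String viaProxy(String proxyEndpoint, String database) {
        return "jdbc:mariadb://" + proxyEndpoint + "/" + database;
    }

    public static void main(String[] args) {
        // Placeholder endpoints; substitute the values for your own cluster and proxy.
        System.out.println(directAurora("cluster-endpoint.example.com", "testdb"));
        System.out.println(viaProxy("proxy-endpoint.example.com", "testdb"));
    }
}
```

Either string can be passed to java.sql.DriverManager.getConnection along with credentials; only the direct form relies on driver-specific failover behavior.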

Reader and writer role transition

Aurora automatically performs failovers if the primary instance becomes unavailable for any reason. During this failover, the primary instance changes roles and becomes a reader while one of the readers in the Aurora cluster gets promoted to become the primary. When connecting directly to the cluster endpoint, the client may get reconnected to the old primary instance because DNS records may have been cached. Even though the connection is established, the client can’t perform writes to the instance because it is no longer the primary and is designated as a read-only instance. RDS Proxy eliminates such errors because it always connects the client to the current primary.

Active monitoring and unplanned failovers

RDS Proxy improves failover recovery because it doesn’t rely on DNS propagation to redirect clients. It also eliminates reader and writer transition issues for Aurora cluster clients. It actively monitors each database instance in the Aurora database cluster to act quickly during failover on behalf of the clients.

So far, this post has covered the most basic scenario: a planned (administrative) failover. For Aurora, such failovers happen when you initiate them on demand via the AWS Management Console, an API, or the AWS Command Line Interface (AWS CLI) to change the instance size, or when Amazon performs a failover for scheduled maintenance operations that you opt into. In all such cases, the database host can gracefully close connections and reject new connections when the database process isn’t ready. This means that clients or RDS Proxy can easily detect the problem and retry.

However, consider an unplanned failover scenario where the host becomes unreachable abruptly. The existing client TCP connections remain open and the client must detect that connections are dead due to lack of response. Handling such a scenario requires following best practices for configuring socket timeouts, connection timeouts, and TCP keep-alive. RDS Proxy can assist clients with unplanned failovers.
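As a rough illustration of those best practices, the following sketch collects the three settings into a java.util.Properties object for a JDBC connection. The option names connectTimeout, socketTimeout, and tcpKeepAlive are MariaDB Connector/J connection options; the timeout values, class name, and method name are assumptions chosen for illustration, not recommendations from the benchmark.

```java
import java.util.Properties;

final class FailoverTimeouts {
    // Build connection properties that bound how long a client waits
    // on a dead connection. Timeouts are in milliseconds.
    static Properties build(int connectTimeoutMillis, int socketTimeoutMillis) {
        Properties props = new Properties();
        // Give up on establishing a new connection after this long.
        props.setProperty("connectTimeout", Integer.toString(connectTimeoutMillis));
        // Give up waiting for a response on an open connection after this long.
        props.setProperty("socketTimeout", Integer.toString(socketTimeoutMillis));
        // Ask the OS to probe idle connections; probe intervals are OS settings.
        props.setProperty("tcpKeepAlive", "true");
        return props;
    }

    public static void main(String[] args) {
        // Illustrative values only; tune them for your own workload.
        Properties props = build(3_000, 10_000);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The resulting object can be passed as the second argument to java.sql.DriverManager.getConnection(url, props), after adding user and password entries. A socket timeout that is shorter than the longest expected query will abort that query, so the value needs to reflect the workload.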

RDS Proxy monitors every database instance and can detect failures within seconds. When it detects a failure, RDS Proxy stops directing new queries to the failed database instance. RDS Proxy maintains idle client connections that weren’t in the middle of a transaction during failovers. This means that client connection pools that had inactive connections in the pool can handle failover much more gracefully without having to recreate every connection. This spares the client from the overhead of recreating many idle TLS connections. RDS Proxy also proactively terminates any client connections that were in the middle of executing a transaction on a failed database instance, which allows clients to quickly retry instead of waiting for the timeout.

RDS Proxy always accepts new connections and delays forwarding the query until the new primary is available. Without RDS Proxy, when a failover occurs, the client detects that the primary database instance is unavailable and may try to reconnect immediately. These attempts to reconnect may not succeed because the primary could be in the process of recovery. Too many attempts to reconnect as the primary recovers could cause it to fail again. As a result, clients have to build retry logic with adequate waits. With RDS Proxy, this need to build complicated retry logic goes away.
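For clients that connect directly to the database, retry logic with adequate waits might look like the following. This is a minimal sketch of retrying with exponential backoff, one common way to space out attempts; it is not the logic used in the benchmark, and the class name, method name, and parameters are hypothetical.

```java
import java.util.concurrent.Callable;

final class RetryWithBackoff {
    // Run the action, retrying on any exception. The wait doubles after
    // each failed attempt, up to maxDelayMillis.
    static <T> T call(Callable<T> action, int maxAttempts,
                      long initialDelayMillis, long maxDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break; // out of attempts; rethrow below
                }
                Thread.sleep(delay);
                delay = Math.min(delay * 2, maxDelayMillis);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated connect call that fails twice before succeeding.
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("primary not ready");
            }
            return "connected";
        }, 5, 100, 1_000);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In a real client, the action would open a database connection or run a statement, and the catch clause would usually be narrowed to connection-related exceptions so that genuine application errors are not retried.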

Amazon RDS Multi-AZ

One of the features of Amazon RDS that helps to mitigate the impact of an unplanned failover is Multi-AZ. With a Multi-AZ configuration for RDS for MySQL, the primary database instance exists in one Availability Zone while a synchronous standby instance resides in a second Availability Zone. Client applications connect to the database through a single hostname that points to the primary instance. In the event of a failure, the RDS service switches the roles of the primary and standby instances. It also changes the underlying IP address of the database hostname so that client applications do not need to change their connection settings during a failover.

With Amazon RDS Multi-AZ, upon failover, the original primary doesn’t close TCP connections. The client doesn’t get any more TCP traffic from the database after failover initiates. Instead, it’s up to the client to time out. This deliberate design choice of hard fencing the original primary database on any failover means that the client can expect similar behavior during planned and unplanned failovers. This has the practical benefit of making it easier to test for both planned and unplanned failover scenarios by simply doing administrative failovers via the Amazon RDS API or CLI. However, the default settings for the MariaDB driver as well as many operating systems are inadequate to handle this scenario. By default, the MariaDB driver never times out waiting for a response, and the TCP keep-alive settings for certain operating systems can exceed 2 hours. The good news, however, is that with RDS Proxy these settings no longer matter, because the underlying DNS configuration of the hostname (in this case, the RDS Proxy endpoint) never changes.

RDS Multi-AZ failover recovery benchmark

The results below show the outcome of 50 failovers while running insert queries using MariaDB drivers, both directly to the database and via RDS Proxy. The following table summarizes the findings in milliseconds.

Test	Min	Max	Avg	Proxy Advantage
Proxy Multi-AZ (MariaDB Driver)	21,485	29,176	25,075	32%
Direct Multi-AZ (MariaDB Driver)	27,240	52,234	36,849	32%

In summary, using RDS Proxy with Amazon RDS Multi-AZ databases showed recovery within an average of 25 seconds, whereas direct connections to the database experienced an average of about 37 seconds of downtime.

While Amazon RDS Multi-AZ recovery was 25 seconds, in comparison Aurora provided recovery in under 3 seconds on a db.r5.large instance. This isn’t a surprise because Aurora has several innovations to expedite database recovery times after failover.

Conclusion

This post demonstrated the following benefits of RDS Proxy:

Reduced Aurora MySQL failover time by up to 79% for the test workload
Reduced RDS MySQL Multi-AZ failover time by up to 32% for the test workload
Works equally well with different client drivers without requiring special client logic
Insulates the clients from Aurora reader and writer transition
Actively monitors each database, including for unplanned failovers
Doesn’t drop idle connections during failover, which reduces the impact on client connection pools
Always accepts connections, which insulates clients from connection timeouts

RDS Proxy is easy to use. You can point any MySQL or PostgreSQL client driver at the RDS Proxy endpoint and enjoy the benefits. Try RDS Proxy for your applications today!

About the Authors

Anton Okmyanskiy is a Principal Engineer for Amazon Web Services.

Steve Abraham is a Principal Data Architect for Amazon Web Services. He works with our customers to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.

AWS Database Blog
