Introducing – Aurora Global Database Failover

Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. Aurora combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases. Aurora Global Database lets you span your relational database across multiple Regions. Global Database is an ideal choice when you need cross-Region disaster recovery or low-latency reads in secondary Regions. It is also a great way to scale your readers across multiple Regions.

In August 2023, the Aurora team announced the availability of a new Global Database Failover feature for Aurora MySQL and Aurora PostgreSQL, available for all versions that support Aurora Global Database. Global Database Failover reduces the operational overhead of failing over a Global Database cluster during an unplanned cross-Region failover event. In this post, we dive deep into the Global Database Failover feature of Aurora Global Database and explore how it works and how you can take advantage of it to make your distributed applications more resilient to failures.

Note: We are retiring the old Managed Planned Failover terminology and will use Global Database Switchover from now on to refer to the same functionality.

Overview of Aurora Global Database

A Global Database cluster comprises multiple Regional DB clusters. A Global Database cluster may have up to six Regional DB clusters in supported Regions, for example one DB cluster in us-east-1 and another in us-west-2.

In the Global Database topology, only one Region is primary and all other Regions are secondary. The primary Region contains the only writer DB instance and also contains the only active writer endpoint. The writer endpoint always points to the active writer node. The secondary Regions also have writer endpoints, but they’re inactive.

Each Regional DB cluster also has a reader endpoint. The reader endpoint balances read traffic to read replicas in a Regional DB cluster, if there are any. The reader endpoint isn’t affected after a Global Database Failover.

Customers change their primary Region in two scenarios: in response to a planned event (for example, a regional rotation) or in response to an unplanned event (for example, a regional outage). Global Database Switchover (previously called Managed Planned Failover) is an existing feature for planned events that lets you switch from the current primary Region to a secondary Region in a managed fashion. You can use Global Database Switchover for planned operations, such as a regional rotation (for example, a follow-the-sun model) or a swap of your primary and secondary Regions for disaster recovery (DR) testing. You can invoke the Global Database Switchover process via the AWS Management Console, AWS Command Line Interface (AWS CLI), or RDS API.

During the switchover process, the user-chosen secondary DB cluster becomes the primary and the old primary assumes a secondary role. Global Database Switchover also reverses the replication direction, so that data flows from the new primary Region to the new secondary Region. The writer endpoint in the old primary Region (now secondary) becomes inactive, and the writer endpoint in the new primary DB cluster becomes the active writer endpoint. When this happens, you must reconfigure database users and applications to update their connection strings to use the new endpoint. For a detailed discussion of Global Database Switchover, refer to the Aurora documentation.
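As a rough illustration of the CLI path for a planned event, a switchover call might be wrapped as in the following sketch. It assumes an AWS CLI version that includes the switchover-global-cluster subcommand; the function name and every argument value are placeholders, not names from this post.

```shell
# Sketch: invoke a Global Database Switchover (planned event) from the AWS CLI.
# All arguments are placeholders supplied by the caller:
#   $1 = Region to send the request to (assumed here: the current primary Region)
#   $2 = global cluster identifier
#   $3 = ARN of the secondary DB cluster that should become the primary
switchover_global_db() {
  if [ "$#" -ne 3 ]; then
    echo "usage: switchover_global_db REGION GLOBAL_CLUSTER_ID TARGET_CLUSTER_ARN" >&2
    return 2
  fi
  aws rds switchover-global-cluster \
    --region "$1" \
    --global-cluster-identifier "$2" \
    --target-db-cluster-identifier "$3"
}
```

The wrapper only validates the argument count before delegating to the CLI; it does not wait for the switchover to finish.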

Challenges of an unplanned failover

An unplanned failover may be required when the current primary Region DB cluster experiences a service-level outage or when the entire primary Region is unavailable. Such a scenario is rare because of the high resilience of the AWS regional architecture; however, it cannot be completely ruled out. Before this feature, you performed an Aurora Global Database failover by manually detaching one of the surviving secondary Region DB clusters and promoting it. This approach worked, but it had some challenges.

The existing approach of removing a cluster from the global setup and converting it to a standalone cluster broke the existing Global Database topology. This meant that the old Global Database cluster name was no longer valid. Additionally, the Regional DB clusters became decoupled, and after the surviving Regional DB cluster was promoted, it became the only Regional DB cluster available to applications. You then had to rebuild the Global Database topology manually by adding another secondary Region. You also had to re-create the old primary Region DB cluster and add it to the new Global Database cluster when the Region became available again.

Introducing Global Database Failover

With the launch of the new Global Database Failover feature, you can now perform a managed failover of an Aurora Global Database in response to an unplanned event such as a regional service outage. This feature is available for new Global Database deployments and has also been made available retroactively to existing Aurora Global Database deployments.

To use the Global Database Failover feature, you are not required to make configuration changes. You can initiate a Global Database Failover via the AWS Management Console, AWS Command Line Interface (AWS CLI), or API. You choose one of the surviving Regions in which a DB cluster of the Global Database was originally created and initiate the failover process from there. For example, assume the Global Database was created in the us-east-1, us-west-2, and eu-west-1 Regions, and us-west-2 is experiencing a service outage. In this scenario, Global Database Failover can be initiated from either the us-east-1 or the eu-west-1 Region.
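To make the choice of a surviving Region concrete, the sketch below filters a list of member cluster ARNs and keeps only those outside the impaired Region. It assumes the standard ARN layout arn:aws:rds:<region>:<account-id>:cluster:<name>; the function name and the identifiers in the comments are hypothetical.

```shell
# Sketch: read member DB cluster ARNs (one per line) on stdin and print the
# ones that are NOT in the impaired Region.
# Assumed ARN layout: arn:aws:rds:<region>:<account-id>:cluster:<cluster-name>
#   $1 = Region that is currently experiencing the outage
surviving_members() {
  awk -F: -v impaired="$1" 'NF >= 7 && $4 != impaired { print }'
}

# The ARNs could come from a call like the following, sent to a healthy Region
# (my-global is a placeholder identifier):
#   aws rds describe-global-clusters --region us-east-1 \
#     --global-cluster-identifier my-global \
#     --query 'GlobalClusters[0].GlobalClusterMembers[].DBClusterArn' \
#     --output text | tr '\t' '\n' | surviving_members us-west-2
```

Any ARN the filter prints is a candidate target for promotion; which one to pick remains an operational decision.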

Once the Global Database Failover is initiated, the following steps are taken:

The user-chosen secondary DB cluster promotes one of its read replicas to writer and assumes the role of primary DB cluster in the Global Database topology.
If the Global Database topology has other secondary DB clusters, they are rebuilt.
The Aurora Global Database service continues to monitor the availability of the old primary Region. When the Region becomes available and healthy, Aurora Global Database adds it back to the Global Database by restoring a snapshot of the current primary Region DB cluster.
When the old primary Region is added back to the Global Database cluster, an attempt is made to take a snapshot of the old storage volume. If the attempt succeeds, the snapshot is made available with the naming convention rds:unplanned-global-failover-<cluster name>-timestamp.
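The snapshot naming convention in the last step lends itself to a simple prefix search. The following is a minimal sketch; the function names are hypothetical, and the assumption that the snapshot appears in the old primary Region is ours, not something this post states.

```shell
# Prefix taken from the naming convention described in the steps above.
FAILOVER_SNAPSHOT_PREFIX='rds:unplanned-global-failover-'

# Keep only the snapshot identifiers (one per line on stdin) that start with
# the prefix. index() does a literal match, so no regex escaping is needed.
filter_failover_snapshots() {
  awk -v p="$FAILOVER_SNAPSHOT_PREFIX" 'index($0, p) == 1'
}

# List DB cluster snapshot identifiers in a Region and filter them.
#   $1 = Region to search (assumption: the Region of the old primary)
list_failover_snapshots() {
  aws rds describe-db-cluster-snapshots \
    --region "$1" \
    --query 'DBClusterSnapshots[].DBClusterSnapshotIdentifier' \
    --output text | tr '\t' '\n' | filter_failover_snapshots
}
```

Because the snapshot is only taken on a best-effort basis, an empty result does not necessarily indicate a problem.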

Global Database Failover reduces the operational overhead of manually promoting Regional DB clusters, while preserving the Global Database topology. After a failover is completed, you can point applications to the writer endpoint of the new primary DB cluster and start reading from and writing to the new DB cluster. If you have applications and clients in other surviving secondary Regions, they must wait until the secondary DB clusters are rebuilt before they can start reading from the reader endpoints of those Regional DB clusters.
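To repoint applications, you first need the endpoints of the new primary DB cluster. One way to look them up is sketched below; the function name is hypothetical, and the Region and cluster identifier are supplied by the caller.

```shell
# Sketch: print the writer endpoint and the reader endpoint of a Regional
# DB cluster as a single tab-separated line.
#   $1 = Region of the DB cluster (for example, the new primary Region)
#   $2 = DB cluster identifier
cluster_endpoints() {
  aws rds describe-db-clusters \
    --region "$1" \
    --db-cluster-identifier "$2" \
    --query 'DBClusters[0].[Endpoint,ReaderEndpoint]' \
    --output text
}
```

The first value is the cluster (writer) endpoint to use in application connection strings after the failover; the second is the reader endpoint of the same Regional DB cluster.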

Best Practices

During a Global Database Failover, the user-chosen secondary Region DB cluster is promoted to primary. However, it does not automatically inherit the configuration options of the old primary. To avoid configuration mismatches, we recommend that you create an Aurora DB cluster parameter group in each secondary Region and configure the options in advance. We also recommend setting up monitoring in advance, such as Amazon CloudWatch alarms and any third-party monitoring tools, and configuring external service dependencies such as AWS Secrets Manager, Amazon Simple Storage Service (Amazon S3), and AWS Lambda in your secondary Regions. If you use headless clusters as part of your Global Database setup, make sure you add an appropriately sized DB instance before you initiate a failover.
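The headless-cluster recommendation can be turned into a quick pre-failover check. The sketch below counts the DB instances attached to a secondary DB cluster; the function names are hypothetical, and it assumes a headless cluster reports an empty DBClusterMembers list.

```shell
# Sketch: count the DB instances attached to a Regional DB cluster.
#   $1 = Region, $2 = DB cluster identifier
instance_count() {
  aws rds describe-db-clusters \
    --region "$1" \
    --db-cluster-identifier "$2" \
    --query 'length(DBClusters[0].DBClusterMembers)' \
    --output text
}

# Return 0 if the cluster has at least one DB instance, 1 if it looks
# headless, and 2 if the lookup failed or returned something unexpected.
check_not_headless() {
  count=$(instance_count "$1" "$2") || return 2
  case "$count" in
    ''|*[!0-9]*) echo "unexpected response: $count" >&2; return 2 ;;
  esac
  if [ "$count" -ge 1 ]; then
    echo "ok: $2 has $count DB instance(s)"
  else
    echo "headless: add a DB instance to $2 before failing over"
    return 1
  fi
}
```

The check says nothing about whether the instance is appropriately sized; that still needs to be verified against the workload of the primary.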

For a more complete list of recommendations, refer to the Aurora User Guide.

Performing Global Database Failover

When you create an Aurora Global Database with at least one secondary Region DB cluster that contains at least one DB instance, you have the option to perform both Global Database Failover and Global Database Switchover. To set up an Aurora Global Database, see Getting started with Amazon Aurora Global Databases in the user guide.

In the following example (Figure.1), we have created an Aurora Global Database with two DB clusters. The primary Region is ap-southeast-1 and the secondary Region is ap-south-1.

Figure.1 A Global Database Cluster

To perform the Global Database Failover using the console, complete the following steps:

Select the Aurora Global Database cluster you want to fail over.
On the Actions menu, choose Switch over or fail over global database.

Figure.2 Failover Global Database cluster
For the Target DB cluster, choose the active secondary Aurora DB cluster that you want to promote to primary.
Select Failover (allow data loss) as a Failover reason. To confirm the failover, type confirm and choose Confirm.

After you initiate the failover, the database status changes. You can follow the status column of the database list to monitor the failover process. Because Global Database replication is asynchronous, some data loss is possible on the new primary DB cluster after the failover.

Figure.4 Failover in progress

The new primary Region DB cluster becomes available first, and it takes a few minutes to set up replication from the new primary to the new secondary DB cluster. Depending on the size of the database, new secondary Region clusters may take anywhere from a few minutes to a few hours to set up. If there were other secondary Region DB clusters, they are also re-created, and replication is re-established. After the failover completes, both the primary and secondary Region DB clusters show as available.

Figure.5 Failover completed
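The same progress information can be polled from the command line. The following sketch prints the global cluster status and its members and picks out the current writer; the function names are hypothetical, and the parsing assumes the CLI's text output renders the IsWriter flag as True or False.

```shell
# Sketch: print the overall status of a global cluster.
#   $1 = Region to query (a healthy Region), $2 = global cluster identifier
global_cluster_status() {
  aws rds describe-global-clusters \
    --region "$1" \
    --global-cluster-identifier "$2" \
    --query 'GlobalClusters[0].Status' \
    --output text
}

# Print one line per member cluster: <DBClusterArn><TAB><IsWriter>
global_cluster_members() {
  aws rds describe-global-clusters \
    --region "$1" \
    --global-cluster-identifier "$2" \
    --query 'GlobalClusters[0].GlobalClusterMembers[].[DBClusterArn,IsWriter]' \
    --output text
}

# From member lines on stdin, print the ARN of the cluster flagged as writer.
current_writer() {
  awk 'tolower($2) == "true" { print $1 }'
}
```

For example, global_cluster_members ap-south-1 my-global | current_writer (with my-global as a placeholder identifier) would print the ARN of the DB cluster that currently holds the writer role.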

To perform a Global Database Failover using the AWS CLI, you can use a command similar to the following, sent to the Region of the secondary DB cluster that you want to promote:

aws rds --region secondary-region failover-global-cluster --global-cluster-identifier global_db_identifier \
--target-db-cluster-identifier ARN-of-secondary-to-promote --allow-data-loss

Conclusion

In this post, we explored the newly launched Aurora Global Database Failover feature and its benefits. You can use this feature to rapidly recover an Aurora Global Database cluster from an unplanned outage with reduced operational burden. To learn more about Aurora Global Database, see the detailed documentation. You can also explore these features by heading over to the RDS console and creating a Global Database cluster.

About the authors

Aditya Samant is a relational database industry veteran with over two decades of experience working with commercial and open-source databases. Over the years, he has held many roles, including database consultant, professional support, DBA, and database architect. He currently works at Amazon Web Services as a Sr. Database Specialist Solutions Architect. In his current role, he spends his time working with customers to design scalable, secure, and robust cloud-native architectures. Aditya also works closely with the service teams and collaborates on the design and delivery of new features for Amazon’s flagship relational database, Amazon Aurora.

Surendar Munimohan is a Sr. Database Solutions Architect with Amazon Web Services. He has more than a decade of experience working with relational databases and architecting highly scalable applications in AWS. In his current role, he works with customers to design scalable, highly available, secure, and cost-effective solutions in the AWS Cloud.

AWS Database Blog