Managed planned failovers with Amazon Aurora Global Database

Amazon Aurora is a relational database service that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. Aurora has a distributed architecture that replicates a shared storage volume across three Availability Zones to provide a high availability solution with no data loss and failover time measured in seconds for intra-Region failure scenarios. Amazon Aurora Global Database supports cross-Region disaster recovery (DR) scenarios by using asynchronous storage replication between Regions. This enables very low replication latency, minimizing both the potential for data loss and the time required to fail over the database to a new primary Region. These two measures are referred to as the Recovery Point Objective (RPO), which is the amount of data loss that a business can tolerate, and the Recovery Time Objective (RTO), which is the time it takes for the system to begin taking normal requests from applications again.

Typical use cases in disaster recovery preparedness fall into two main areas: planned failovers and unplanned failovers. Although unplanned failovers, such as Region-wide service outages, get all the headlines, many enterprises make it standard operating procedure to rotate their primary systems between AWS Regions on a regular basis. Not only does this ensure that their organizational procedures are complete and accurate, but more importantly, it ensures that their staff are trained to perform a DR failover before a real disaster happens. Prior to the introduction of managed planned failover for Global Database, failing over from one AWS Region to another was a process that irrevocably severed the cross-Region replication topology of the entire global database cluster. The managed planned failover feature eliminates this limitation, but it can only be performed on a healthy Aurora Global Database cluster. For real disaster recovery use cases where the primary Region cluster is not available, see Manually recovering an Amazon Aurora global database from an unplanned outage.

Managed planned failover

With managed planned failover for Global Database, you can fail over to a secondary AWS Region while maintaining the replication topology and without having to recreate any secondary clusters. You can perform the failover via the AWS Management Console, the AWS Command Line Interface (AWS CLI), or the FailoverGlobalCluster API. During the planned failover process, the cluster in the primary AWS Region is demoted to a secondary and becomes read-only. The storage volumes in all secondary AWS Regions are synchronized with the old primary AWS Region to ensure no data loss (RPO of zero). When the chosen secondary cluster has completely caught up, it is promoted to be the primary cluster and one of its existing read-only nodes becomes the writer node. Database instances in the clusters in all AWS Regions restart and are unavailable during the process. The duration of the failover depends on the amount of replication lag between the primary and secondary AWS Regions.

With the planned failover feature, you can move the primary role to a secondary cluster for the following use cases:

Performing DR drills on your production database
Relocating the primary database cluster to a different Region
Switching back to the previous primary Region without recreating the cluster

To perform the managed planned failover for Aurora Global Database using the console, complete the following steps:

On the Amazon Relational Database Service (Amazon RDS) console, choose Databases.
Select the Aurora global database you want to fail over.
On the Actions menu, choose Fail over global database.
Choose the secondary Aurora DB cluster that you want to promote to primary.
If you have more than one secondary DB cluster, you can compare the lag (the amount of time the secondary Region takes to catch up with changes coming from the primary Region) and choose the one with the smallest amount of lag.
Choose Fail over global database.
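
The choice of target in the steps above can also be scripted. The following is a minimal sketch, not an official procedure: it assumes a boto3 RDS client is passed in by the caller, and that lag values in milliseconds have already been collected per secondary cluster (for example, from the AuroraGlobalDBReplicationLag metric). The function names and the lag dictionary are hypothetical and exist only for this illustration.

```python
def secondary_cluster_arns(rds_client, global_cluster_id):
    """Return the ARNs of all secondary (non-writer) clusters in a global database."""
    resp = rds_client.describe_global_clusters(
        GlobalClusterIdentifier=global_cluster_id
    )
    members = resp["GlobalClusters"][0]["GlobalClusterMembers"]
    return [m["DBClusterArn"] for m in members if not m["IsWriter"]]


def pick_lowest_lag(secondary_arns, lag_ms_by_arn):
    """Pick the secondary with the smallest known replication lag.

    lag_ms_by_arn is an assumed input: a dict mapping cluster ARN to the
    most recent lag in milliseconds, or None when no data is available.
    """
    candidates = [arn for arn in secondary_arns
                  if lag_ms_by_arn.get(arn) is not None]
    if not candidates:
        raise ValueError("no secondary cluster has a known replication lag")
    return min(candidates, key=lambda arn: lag_ms_by_arn[arn])


# Example usage (requires boto3 and AWS credentials; identifiers are placeholders):
#   import boto3
#   rds = boto3.client("rds", region_name="us-east-1")
#   arns = secondary_cluster_arns(rds, "global_database_identifier")
#   target = pick_lowest_lag(arns, {arns[0]: 120.0})
```

Secondaries without a lag reading are skipped rather than treated as zero lag, so a cluster with missing metrics is never chosen by accident.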

The Status column of the database list shows the state of each Aurora DB instance and Aurora DB cluster during the failover process.

Screenshot of the RDS Console showing a blue banner on the top of the page with the message "Failover of aurora-globaldb in progress". There's also a "Cancel failover" button available in the banner. The "status" column shows "Failing over" for the cluster line and "Modifying" for the databases lines.

Choose Cancel failover if you want to cancel the failover.

When the failover completes, you can see the Aurora DB clusters and their current state in the database list.

Screenshot of RDS Console showing a green banner at the top of the page with a success message. The status column shows "Available" for both the cluster and the databases.

To perform the Aurora Global Database failover through the AWS CLI, enter the following command:

aws rds --region aws-Region failover-global-cluster \
    --global-cluster-identifier global_database_identifier \
    --target-db-cluster-identifier ARN-of-secondary-to-promote
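
The same operation is available from an SDK. The sketch below uses the boto3 RDS client methods failover_global_cluster and describe_global_clusters. The helper names, the polling interval, and the completion check (global cluster status "available" and the target reported as writer) are assumptions made for this example, not a documented recipe. The client is assumed to be created for the same Region you would pass to --region in the CLI command.

```python
import time


def start_planned_failover(rds_client, global_cluster_id, target_cluster_arn):
    """Request a managed planned failover to the given secondary cluster."""
    return rds_client.failover_global_cluster(
        GlobalClusterIdentifier=global_cluster_id,
        TargetDbClusterIdentifier=target_cluster_arn,
    )


def failover_finished(rds_client, global_cluster_id, target_cluster_arn):
    """Assumed completion check: cluster is available and the target is the writer."""
    resp = rds_client.describe_global_clusters(
        GlobalClusterIdentifier=global_cluster_id
    )
    cluster = resp["GlobalClusters"][0]
    target_is_writer = any(
        m["DBClusterArn"] == target_cluster_arn and m["IsWriter"]
        for m in cluster["GlobalClusterMembers"]
    )
    return cluster.get("Status") == "available" and target_is_writer


def wait_for_failover(rds_client, global_cluster_id, target_cluster_arn,
                      poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Poll until the failover looks complete; return False if polls run out."""
    for _ in range(max_polls):
        if failover_finished(rds_client, global_cluster_id, target_cluster_arn):
            return True
        sleep(poll_seconds)
    return False


# Example usage (requires boto3 and AWS credentials; identifiers are placeholders):
#   import boto3
#   rds = boto3.client("rds", region_name="us-east-1")
#   start_planned_failover(rds, "global_database_identifier",
#                          "ARN-of-secondary-to-promote")
#   wait_for_failover(rds, "global_database_identifier",
#                     "ARN-of-secondary-to-promote")
```

Passing the client and the sleep function as arguments keeps the helpers easy to exercise against a stub before pointing them at a production database.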

Key considerations and best practices

For the target secondary cluster to handle the application traffic immediately after the failover, make sure of the following:

Check that the target secondary Region cluster nodes are appropriately sized.
Confirm that the secondary Region is configured with the desired number of reader instances.
Configure both the cluster and instance parameter groups to match those of the current primary.
Configure monitoring tools and options, such as Amazon CloudWatch Events and alarms, to match those of the current primary Region cluster. For more information, see Getting CloudWatch Events and Amazon EventBridge events for Amazon RDS.
Configure integration with other AWS services as needed.
If you’re using AWS Secrets Manager, validate the credentials stored in Secrets Manager and re-establish cross-Region replication of the secrets from the new primary Region to the other secondary Regions. For more information, see set up automation to replicate the secrets across AWS Regions.
Consider performing the planned failover during the maintenance window because of the downtime during the failover process.
After performing the failover, update the application configuration to use the new primary cluster endpoint.
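
Some of the sizing and configuration checks above can be approximated with a script. The following is a rough sketch under several assumptions: one boto3 RDS client per Region is passed in, each cluster has few enough instances to fit in a single describe_db_instances response, and parameter groups are compared by name only (groups with different names in different Regions may still hold identical settings). The helper names are hypothetical.

```python
from collections import Counter


def summarize_cluster(rds_client, cluster_id):
    """Collect the settings this sketch compares for one Aurora DB cluster."""
    cluster = rds_client.describe_db_clusters(
        DBClusterIdentifier=cluster_id
    )["DBClusters"][0]
    instances = rds_client.describe_db_instances(
        Filters=[{"Name": "db-cluster-id", "Values": [cluster_id]}]
    )["DBInstances"]
    return {
        "cluster_parameter_group": cluster["DBClusterParameterGroup"],
        # Counter captures both the instance classes and how many of each exist.
        "instance_classes": Counter(i["DBInstanceClass"] for i in instances),
        "instance_parameter_groups": {
            g["DBParameterGroupName"]
            for i in instances
            for g in i.get("DBParameterGroups", [])
        },
    }


def preflight_differences(primary, secondary):
    """Return human-readable findings where the two summaries differ."""
    labels = {
        "instance_classes": "instance classes or counts",
        "cluster_parameter_group": "cluster parameter group name",
        "instance_parameter_groups": "instance parameter group names",
    }
    return [
        f"{label} differ: primary={primary[key]!r} secondary={secondary[key]!r}"
        for key, label in labels.items()
        if primary[key] != secondary[key]
    ]


# Example usage (requires boto3 and AWS credentials; identifiers are placeholders):
#   import boto3
#   primary = summarize_cluster(boto3.client("rds", region_name="us-east-1"),
#                               "primary-cluster-id")
#   secondary = summarize_cluster(boto3.client("rds", region_name="us-west-2"),
#                                 "secondary-cluster-id")
#   for finding in preflight_differences(primary, secondary):
#       print(finding)
```

An empty result only means the compared fields match. It does not cover monitoring, service integrations, or secrets, which still need to be reviewed separately.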

Monitoring Aurora Global Database

To monitor the Aurora Global Database replication between the primary and secondary Region clusters, use the AuroraGlobalDBReplicationLag cluster-level CloudWatch metric in the secondary Region cluster, which shows how far the secondary lags behind replication updates from the primary AWS Region. The Aurora PostgreSQL-based Aurora Global Database also provides the aurora_global_db_status and aurora_global_db_instance_status functions to monitor the replication between the primary and secondary Regions. For more information, see Monitoring Aurora PostgreSQL-based Amazon Aurora global databases.
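
As an illustration of reading that metric programmatically, the sketch below calls the boto3 CloudWatch method get_metric_statistics. It assumes the client is created in the secondary Region, that the metric is published under the AWS/RDS namespace with a DBClusterIdentifier dimension, and that values are in milliseconds. The function names, the 10-minute window, and the threshold are choices made for this example.

```python
from datetime import datetime, timedelta, timezone


def latest_replication_lag_ms(cloudwatch_client, cluster_id,
                              window_minutes=10, now=None):
    """Return the most recent average AuroraGlobalDBReplicationLag, or None."""
    end = now or datetime.now(timezone.utc)
    resp = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="AuroraGlobalDBReplicationLag",
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        StartTime=end - timedelta(minutes=window_minutes),
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    datapoints = resp.get("Datapoints", [])
    if not datapoints:
        return None
    # Datapoints are not guaranteed to be ordered, so select by timestamp.
    return max(datapoints, key=lambda d: d["Timestamp"])["Average"]


def lag_is_acceptable(lag_ms, threshold_ms):
    """Treat missing data as not acceptable so that gaps get investigated."""
    return lag_ms is not None and lag_ms <= threshold_ms


# Example usage (requires boto3 and AWS credentials; identifiers are placeholders):
#   import boto3
#   cw = boto3.client("cloudwatch", region_name="us-west-2")
#   lag = latest_replication_lag_ms(cw, "secondary-cluster-id")
#   print(lag, lag_is_acceptable(lag, threshold_ms=1000))
```

Because the duration of a planned failover depends on replication lag, a check like this can be run shortly before starting the failover.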

Summary

With Aurora Global Database, a single database can span multiple AWS Regions, and you can create disaster recovery solutions with minimal RTO and RPO. With managed planned failover for Global Database, you can fail over from one AWS Region to another and switch back to the previous AWS Region without having to recreate the topology.

Get started with Aurora Global Database today!

For more information about this new feature, see Disaster recovery for Amazon Aurora global databases.

About the author

Surendar Munimohan is a Sr. Database Solutions Architect for Amazon Web Services.

AWS Database Blog