Deploy multi-Region Amazon RDS for SQL Server using cross-Region read replicas with a disaster recovery blueprint – Part 2

In our previous post, we deployed a multi-Region disaster recovery blueprint using Amazon Route 53, Amazon Relational Database Service (Amazon RDS) for SQL Server, and Amazon Simple Storage Service (Amazon S3). In this post, we walk you through promoting the RDS for SQL Server read replica in the secondary AWS Region and performing a cross-Region failover using the blueprint we deployed.

A quick recap from our previous post

At a high level, we performed the following steps:

Verified that the Amazon RDS for SQL Server database is running with a cross-Region read replica.
Created Amazon Route 53 private hosted zone CNAME records to configure application connectivity with the RDS instance.
Configured an Amazon Route 53 public hosted zone to manage internet traffic between the primary and secondary AWS Regions.
Created separate Amazon S3 buckets in both Regions to host the disaster recovery file.
Tested connectivity for internet traffic using the public domain name.

The architecture of this solution is depicted in the following diagram. When both the primary and secondary Regions are up and running, Amazon Route 53 routes internet traffic to the active Region (us-east-1 in this example). No internet traffic is routed to the passive Region (us-east-2 in this example).

Key considerations for cross-Region failover

Initiating a cross-Region failover requires careful consideration of several factors. Before you proceed, confirm the following:

One or more AWS services that your application relies on in the primary Region are not responding.
The application tier deployed in the secondary Region is fully updated and prepared to assume the primary role. Amazon Route 53 Application Recovery Controller provides multi-Region readiness checks. Refer to Readiness check in Amazon Route 53 Application Recovery Controller for more details.
The only option to restore customer services is by switching live traffic to the secondary Region.
Replica lag for the RDS for SQL Server read replica in the secondary Region is within the range the business accepts (its recovery point objective, or RPO). Initiating a cross-Region failover with non-zero replica lag can lead to data loss. Refer to Troubleshooting a SQL Server read replica problem to learn more.
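The replica lag consideration can be turned into a simple go/no-go check. The following is a minimal sketch, not part of the original blueprint: it assumes you pass in a boto3 CloudWatch client created for the secondary Region, and the replica identifier, the five-minute window, and the RPO value are hypothetical placeholders.

```python
from datetime import datetime, timedelta, timezone


def max_replica_lag_seconds(cloudwatch, replica_id, window_minutes=5):
    """Return the worst ReplicaLag (in seconds) over the recent window.

    `cloudwatch` is expected to be a boto3 CloudWatch client, for example
    boto3.client("cloudwatch", region_name="us-east-2").
    Returns None when CloudWatch has no datapoints for the window.
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        StartTime=end - timedelta(minutes=window_minutes),
        EndTime=end,
        Period=60,
        Statistics=["Maximum"],
    )
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None
    return max(point["Maximum"] for point in datapoints)


def safe_to_fail_over(lag_seconds, rpo_seconds):
    """Go/no-go decision. Missing lag data is treated as unsafe."""
    return lag_seconds is not None and lag_seconds <= rpo_seconds
```

Treating missing data as unsafe is a deliberate choice in this sketch: if the metric cannot be read, the decision falls back to a human rather than defaulting to a failover.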

The following diagram illustrates the architecture during a disaster recovery event:

Cross-Region failover steps

Let’s see how this solution works by following the order of events:

Declare a disaster recovery event and disable writes on the primary RDS for SQL Server instance.
Promote the RDS for SQL Server cross-Region read replica to a standalone DB instance in the secondary Region.
After the read replica is promoted successfully, conduct an end-to-end validation, often referred to as a smoke test. This validation confirms that the application is functioning correctly and is ready to accept external traffic in the secondary AWS Region.
Create a disaster recovery file and upload it to the Amazon S3 bucket deployed with the disaster recovery blueprint. Use the same file name specified in the HTTP endpoint of the Route 53 public record health check. Uploading this file causes the Route 53 health check for the primary Region to fail, which initiates the failover to the secondary Region.
Once the failover to the secondary Region is complete, verify that the application stack and databases are healthy, functioning properly, and receiving external traffic. At this point, your application downtime is over.
Restore in-Region high availability by modifying the RDS instance in the secondary Region and enabling the Multi-AZ option. Multi-AZ is also a prerequisite for creating a cross-Region read replica on an RDS for SQL Server instance.
Once the primary AWS Region becomes available, rebuild the cross-Region disaster recovery solution by manually creating a new cross-Region RDS for SQL Server read replica.
Modify the Route 53 private hosted zone rds_sqlserver_private_hosted_zone and update the rdsprimarydb.rds_sqlserver_private_hosted_zone CNAME record to point to the newly created read replica in the primary Region.

Implement the failover

Use the following steps to initiate the cross-Region failover:

Declare a disaster recovery event and disable writes on the RDS for SQL Server instance in the primary Region. For example, if the application is still accessible, stop it to prevent accidental writes to the current primary database while the DR failover is in progress.
In the secondary Region, check the RDS for SQL Server read replica lag using the ReplicaLag Amazon CloudWatch metric. You can also run this query on the primary database to get replica lag information for all read replicas. Proceed to the next steps only if the cross-Region replica lag is zero or within the range the business accepts (RPO). Initiating a cross-Region failover with non-zero replica lag can lead to data loss.
Navigate to the Amazon Route 53 console.
Select the public hosted zone and verify the routing policies.
Navigate to the Amazon Route 53 health checks dashboard.
Note the URL of the S3 file endpoint for the primary Region health check record. At this stage, the health checks for both Regions must have a healthy status.
Navigate to the private hosted zone rds_sqlserver_private_hosted_zone and verify the CNAME entries. The application stack in the secondary Region connects to the RDS instance using the CNAME record rdssecondarydb.rds_sqlserver_private_hosted_zone.
In the secondary Region, promote the cross-Region RDS for SQL Server read replica.
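The promotion step can also be scripted. The following is a minimal sketch of one way to do it with boto3, not the code from the referenced GitHub repo. It assumes an RDS client created for the secondary Region is passed in; the event keys `region` and `replica_id` in the handler are hypothetical names chosen for illustration.

```python
def promote_replica(rds, replica_id):
    """Promote a read replica to a standalone instance and return its endpoint.

    `rds` is expected to be a boto3 RDS client for the secondary Region,
    for example boto3.client("rds", region_name="us-east-2").
    """
    rds.promote_read_replica(DBInstanceIdentifier=replica_id)

    # Caveat: the instance can still report "available" for a short time
    # right after the call, so this waiter may return early. Re-check the
    # instance before sending traffic to it.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=replica_id)

    description = rds.describe_db_instances(DBInstanceIdentifier=replica_id)
    instance = description["DBInstances"][0]
    return instance["Endpoint"]["Address"]


def lambda_handler(event, context):
    """Hypothetical AWS Lambda entry point wrapping promote_replica."""
    import boto3  # imported here so promote_replica can be tested with a stub

    rds = boto3.client("rds", region_name=event["region"])
    endpoint = promote_replica(rds, event["replica_id"])
    return {"promoted": event["replica_id"], "endpoint": endpoint}
```

Returning the endpoint address makes it easy to confirm that it matches the value behind the rdssecondarydb CNAME record.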

You can also automate this step by creating an AWS Lambda function. Sample Python code is available in this GitHub repo. For instructions on creating a function, see the AWS Lambda Developer Guide.
Promoting the RDS for SQL Server instance doesn’t change the RDS endpoint URL, so you don’t have to update the Route 53 private hosted zone CNAME records. Your application should be able to connect to the promoted RDS for SQL Server instance using the rdssecondarydb.rds_sqlserver_private_hosted_zone record.
Perform an internal smoke test or application health checks to ensure that the application is functioning correctly and is ready to accept internet traffic in the secondary Region.
Once the application stack in the secondary Region is ready to take internet traffic, initiate the failover process.
Refer to step 6 to obtain the recovery file URL. Upload this recovery file to the Amazon S3 bucket in the secondary Region. In our example, the file name is initiate_failover.dr.
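The upload in the last step can be scripted as well. The following is a minimal sketch that assumes a boto3 S3 client is passed in. The bucket name is a hypothetical placeholder, the file content is arbitrary, and the sketch assumes the bucket access settings that let the Route 53 health check read the object were already configured in Part 1.

```python
def upload_recovery_file(s3, bucket, key="initiate_failover.dr"):
    """Create the recovery file in the bucket and return its S3 URI.

    `s3` is expected to be a boto3 S3 client, for example
    boto3.client("s3", region_name="us-east-2").
    The key must match the file name in the health check URL from step 6.
    """
    # The health check only looks at whether the object can be fetched,
    # so the body here is an arbitrary marker (an assumption of this sketch).
    s3.put_object(Bucket=bucket, Key=key, Body=b"failover initiated")
    return f"s3://{bucket}/{key}"
```

For example, `upload_recovery_file(s3, "my-dr-bucket")` would write `initiate_failover.dr` to a bucket named `my-dr-bucket`, where the bucket name stands in for the one created by the blueprint.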

Uploading this file to the Amazon S3 bucket causes the Route 53 health check for the primary Region to fail (remember that we enabled the invert health check status option in the Route 53 health checks). Once the primary Region is marked unhealthy, Route 53 initiates the failover of internet traffic to the secondary Region. The failover delay depends on the health check settings, in particular the request interval and the failure threshold.
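To make the inverted health check and the failover delay concrete, here is a rough sketch. It models the delay as the request interval multiplied by the failure threshold, plus the TTL of the public DNS record. That formula is a simplification assumed for illustration: it ignores resolver caching behavior and health check propagation time. The values 30, 3, and 60 are example settings, not values taken from the blueprint.

```python
def primary_reported_healthy(recovery_file_reachable, inverted=True):
    """Status Route 53 reports for the primary Region's health check.

    The raw HTTP check passes only when the recovery file can be fetched.
    With the invert option enabled, a passing raw check is reported as
    unhealthy, so uploading the file marks the primary Region unhealthy.
    """
    raw_check_passes = recovery_file_reachable
    return (not raw_check_passes) if inverted else raw_check_passes


def estimated_failover_seconds(request_interval, failure_threshold, record_ttl):
    """Rough estimate: detection time plus the DNS record TTL."""
    detection = request_interval * failure_threshold
    return detection + record_ttl


print(primary_reported_healthy(recovery_file_reachable=False))  # True
print(primary_reported_healthy(recovery_file_reachable=True))   # False
print(estimated_failover_seconds(30, 3, 60))                    # 150
```

Under these assumptions, lowering the request interval or the failure threshold shortens the detection part of the delay, while the record TTL sets a floor on how quickly clients pick up the new answer.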

Once the failover is complete, your application starts receiving internet traffic in the secondary AWS Region.
At this point, your application has recovered from the downtime but is running in a single Availability Zone. To set up in-Region high availability, modify the RDS instance in the secondary Region and enable Multi-AZ for Amazon RDS for SQL Server.
Once the primary AWS Region becomes available again, rebuild the cross-Region disaster recovery solution by manually creating a new cross-Region RDS for SQL Server read replica.
Update the Route 53 private hosted zone rds_sqlserver_private_hosted_zone and set the value of the rdsprimarydb.rds_sqlserver_private_hosted_zone CNAME record to the endpoint of the new cross-Region RDS read replica.
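The last three steps can be sketched with boto3 as follows. This is an illustrative outline under several assumptions: the clients are created by the caller for the appropriate Regions, the instance identifiers, source ARN, hosted zone ID, and TTL of 60 seconds are hypothetical, and options such as instance class or encryption keys that a real replica may need are omitted.

```python
def enable_multi_az(rds, instance_id):
    """Convert the promoted instance in the secondary Region to Multi-AZ."""
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        MultiAZ=True,
        ApplyImmediately=True,
    )
    # Caveat: the status can lag behind the call, so re-check before relying on it.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=instance_id)


def create_cross_region_replica(rds_in_replica_region, replica_id, source_arn):
    """Create the new replica; the client must belong to the replica's Region.

    For a cross-Region replica, the source is identified by its ARN.
    """
    rds_in_replica_region.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_arn,
    )


def build_cname_upsert(record_name, target_endpoint, ttl=60):
    """Build the Route 53 change batch that repoints a CNAME record."""
    return {
        "Comment": "Point CNAME at the new cross-Region read replica",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_endpoint}],
                },
            }
        ],
    }


def point_cname_at(route53, hosted_zone_id, record_name, target_endpoint):
    """Apply the CNAME update in the private hosted zone."""
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=build_cname_upsert(record_name, target_endpoint),
    )
```

Keeping `build_cname_upsert` separate from the API call lets you review or log the exact change before it is applied to the hosted zone.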

The following diagram shows the final architecture after failover.

Clean up

To delete the resources created to implement this solution, complete the following steps:

Delete the public and private hosted zones you created.
Change the application configuration to its original state.
Delete the Amazon S3 buckets you created.

Summary

In this post, we explored the failover process for an application deployed in a multi-Region setup with an active/passive strategy. Amazon Route 53 public hosted zone policies fail over traffic between the primary and secondary Regions. Amazon Route 53 private hosted zone records route traffic to the appropriate database endpoints. This gives applications a uniform endpoint, so the application configuration doesn’t need to change during failures. You can use an AWS Lambda function to script the manual tasks required for RDS instance promotion, and you can trigger the function manually or in response to events.

Try out this solution in your AWS account and if you have any comments or questions, leave them in the comments section.

About the author

Ravi Mathur is a Sr. Solutions Architect at AWS. He works with customers providing technical assistance and architectural guidance on various AWS services. He brings several years of experience in software engineering and architecture roles for various large-scale enterprises.

AWS Database Blog