Networking & Content Delivery

Manual Failover and Failback Strategy with Amazon Route 53

Introduction

Customers use multi-region architectures, such as Active/Active or Disaster Recovery (DR), to achieve application resiliency. Depending on the DR strategy, customers may need to fail over from one region to another, which is typically done through Domain Name System (DNS) changes. DR strategies, covered in detail in a prior AWS blog, follow either an Active/Passive or a Multi-Site Active/Active approach. Active/Passive covers Backup and Restore, Pilot Light, and Warm Standby. This blog relates to the Pilot Light and Warm Standby strategies.

This blog shows a low-cost way to manually control DNS failover and failback using various AWS services. You can also use it to invoke a failover to test operational readiness in a secondary region. The term for this pattern is Standby Takes Over Primary (STOP). Amazon Route 53 Application Recovery Controller (ARC) provides failover-related functions above and beyond manual DNS failover. We encourage readers to learn about ARC to determine whether this solution or ARC better meets their failover needs. Users of the process described in this blog can control DNS failover manually, out of band, when they decide the time is right. This is an intermediate-level blog. The authors assume the reader has some knowledge of AWS services.

Prerequisites

  1. An AWS account with console access and a user with appropriate IAM permissions.
  2. An application deployed in two regions that exposes an endpoint to the internet.
  3. A Route 53 public hosted zone that you created with Amazon Route 53.
  4. A secured web server in the secondary region that sits outside of your application as a “Control Plane.”

Section 1: Understanding the Flow of Traffic

Figure 1: Overall Traffic Flow – Steady State

Figure 1 illustrates the flow of traffic during normal operations. The primary region services all end-user traffic via an Application Load Balancer (ALB). The ALB spreads traffic across multiple Availability Zones (AZs) to an EC2-based Auto Scaling group (ASG). The EC2 instances use Amazon Aurora Global Database as the datastore, with the writer instance living in the primary region.

In the secondary region, the mechanism to invoke the failover and failback processes is a simple “out of band” Apache web server. We chose an EC2 instance for the purposes of this demonstration. In a production environment, take care to ensure proper EC2 availability and reliability. Alternatively, you could use another HTTP endpoint for this purpose. The secondary region hosts an ALB that services customer requests when needed. The ALB sends traffic to an EC2-based ASG that is created when a failover is declared and traffic is shifted. The secondary region also hosts an Aurora Read Replica as part of the Aurora Global Database service. This blog uses a “Managed Planned Failover” of Aurora Global Database. Please review the documentation on planned versus unplanned failover with Aurora.

Section 2: Health Checks

 

Route 53 Health Checks (HCs) are used to monitor the health and performance of your application. There are three different ways Route 53 HCs validate health:

  • An endpoint you specify (ALB, for example)
  • Status of other health checks (calculated health check)
  • State of CloudWatch alarm

Create four health checks: three that monitor different endpoints and one calculated health check that aggregates the health of the others. The first HC monitors the state of the ALB in the primary region, which is the ALB that services end-user requests.

Figure 2: Route 53 Health Checks

Step 1: Create a health check to monitor the primary region ALB endpoint

  • Log in to AWS Console, select Route 53 service
  • Click on “Health checks” and “Create health check”
  • Enter a name for your health check
  • Select “Endpoint” and provide the IP address or domain name details (we are using the DNS name of the primary region ALB)
  • Configure a CloudWatch alarm and enter the relevant Simple Notification Service (SNS) topic if you wish to be notified upon state change
  • Click “Create health check”
Figure 3: Health Check for Primary Region
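
For readers who prefer to script this step, the console settings above correspond roughly to the `HealthCheckConfig` structure of the Route 53 `CreateHealthCheck` API. The following is a minimal sketch in Python, not the method used in this blog: the helper name, the HTTPS protocol on port 443, the `/` path, and the ALB DNS name are all assumptions for illustration.

```python
import uuid


def build_alb_health_check(alb_dns_name: str, port: int = 443) -> dict:
    """Build CreateHealthCheck arguments for an endpoint health check.

    Protocol, port, path, interval, and threshold are illustrative
    assumptions; adjust them to match your ALB listener.
    """
    return {
        # CallerReference must be unique for each new health check request.
        "CallerReference": str(uuid.uuid4()),
        "HealthCheckConfig": {
            "Type": "HTTPS",
            "FullyQualifiedDomainName": alb_dns_name,
            "Port": port,
            "ResourcePath": "/",
            "RequestInterval": 30,  # seconds between checks
            "FailureThreshold": 3,  # consecutive failures before UNHEALTHY
        },
    }


if __name__ == "__main__":
    # Hypothetical ALB DNS name, for illustration only.
    kwargs = build_alb_health_check("primary-alb-123456.us-east-1.elb.amazonaws.com")
    print(kwargs["HealthCheckConfig"])
    # To create the check with boto3 (requires AWS credentials):
    #   import boto3
    #   boto3.client("route53").create_health_check(**kwargs)
```

The sketch only builds the request arguments, so it can be reviewed or unit tested without touching an AWS account. The CloudWatch alarm and SNS notification from the console steps are not covered here.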

Step 2: Create health checks to monitor secondary region “Failover Controller” web server endpoint

We will monitor a “Failover Controller” web server (FCWS) in the secondary region. The FCWS web server is used as an “out of band” tool to invoke manual failover and failback. The web server has a failover/index.html page containing the text string FailOverNow. The HC monitors failover/index.html and the text string, and reports unhealthy as long as the web server is up and the text string matches. This is because the HC is inverted. An inverted HC evaluates to healthy when the underlying check fails. For example, an inverted HC monitoring an HTTP endpoint that is down shows as healthy. Conversely, an inverted HC evaluates to unhealthy when the target responds positively, such as an HTTP endpoint replying.
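
The inversion described above can be modeled in a few lines. This is an illustrative sketch of the logic only, not Route 53 code; the function name and its parameters are hypothetical.

```python
def reported_status(endpoint_passes: bool, inverted: bool) -> str:
    """Return the status a health check reports.

    endpoint_passes is True when the endpoint responds and, for
    string-match checks, the search string is found in the response.
    """
    healthy = (not endpoint_passes) if inverted else endpoint_passes
    return "HEALTHY" if healthy else "UNHEALTHY"


# FCWS stopped: the probe fails, so the inverted check reports HEALTHY.
print(reported_status(endpoint_passes=False, inverted=True))
# FCWS running and serving FailOverNow: the inverted check reports UNHEALTHY.
print(reported_status(endpoint_passes=True, inverted=True))
```

In other words, starting the FCWS web server is what drives the two inverted checks to unhealthy, which is the manual signal this pattern relies on.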

Step 2a: Create an inverted health check that monitors the failover/index.html on the FCWS web server, looking for a text string FailOverNow. This health check is used for the failover from the primary region to the secondary region.
  • Log in to AWS Console, select Route 53 service
  • Click on “Health checks” and “Create health check”
  • Enter a name for your health check
  • Select “Endpoint” and provide the IP address or domain name details (we are using the IP address of the FCWS web server in the secondary region)
  • In the “Path” box, enter failover/index.html
  • Under Advanced Configuration, for “String Matching”, select “Yes.” In the “Search String” box, enter the text FailOverNow. Then, under “Invert health check status”, check the box to enable it.
  • Configure a CloudWatch alarm and enter the relevant SNS topic if you wish to be notified upon state change
  • Click “Create health check”
Figure 4: Inverted Health Check 1 – Failover
Step 2b: Create a second inverted health check that monitors the failback/index.html page on the FCWS web server. This health check is used for the failback from the secondary region to the primary region.
  • Log in to AWS Console, select Route 53 service
  • Click on “Health checks” and “Create health check”
  • Enter a name for your health check
  • Select “Endpoint” and provide the IP address or domain name details (we are using the IP address of the FCWS web server in the secondary region)
  • In the “Path” box, enter failback/index.html
  • Under Advanced Configuration, set “Invert health check status” to “Yes”
  • Configure a CloudWatch alarm and enter the relevant SNS topic if you wish to be notified upon state change
  • Click “Create health check”
Figure 5: Inverted Health Check 2 – Failback
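
As a scripted counterpart to Steps 2a and 2b, the two inverted checks could be expressed as `HealthCheckConfig` structures for the Route 53 `CreateHealthCheck` API. This is a hedged sketch: the helper name, plain HTTP on port 80, the interval and threshold values, and the IP address (taken from a documentation-only range) are assumptions, not values from this blog. The API expects the path with a leading slash.

```python
import uuid
from typing import Optional


def build_inverted_check(ip_address: str, path: str,
                         search_string: Optional[str] = None) -> dict:
    """Build CreateHealthCheck arguments for an inverted FCWS check.

    When search_string is given, a string-match check is built;
    otherwise a plain HTTP check is built.
    """
    config = {
        "Type": "HTTP_STR_MATCH" if search_string else "HTTP",
        "IPAddress": ip_address,
        "Port": 80,  # assumption: Apache serving plain HTTP
        "ResourcePath": path,
        "Inverted": True,  # healthy when the underlying check fails
        "RequestInterval": 30,
        "FailureThreshold": 3,
    }
    if search_string:
        config["SearchString"] = search_string
    return {"CallerReference": str(uuid.uuid4()), "HealthCheckConfig": config}


# Hypothetical FCWS address from the 203.0.113.0/24 documentation range.
FCWS_IP = "203.0.113.10"
failover_check = build_inverted_check(FCWS_IP, "/failover/index.html", "FailOverNow")
failback_check = build_inverted_check(FCWS_IP, "/failback/index.html")
print(failover_check["HealthCheckConfig"]["Type"])
print(failback_check["HealthCheckConfig"]["Type"])
```

Each dictionary could then be passed to `boto3.client("route53").create_health_check(**kwargs)` in an account with suitable permissions.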

Step 3: Create calculated health check

A Route 53 calculated health check monitors the health of other health checks. In our example, the calculated HC is “healthy” when at least two of the three checks it monitors are healthy.

The three HCs it is monitoring are:

  1. The primary region ALB HC
  2. The secondary region FCWS web server failover/index.html with matching string
  3. The secondary region FCWS web server failback/index.html

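The monitored HCs report healthy under the following conditions:

  1. Primary region ALB HC: the primary region is up and running
  2. FCWS HCs: the FCWS web server is offline
  3. FCWS HCs: the web pages are not up
  4. FCWS failover HC: the text string is not matching

Conditions 2 through 4 count as healthy because those HCs are inverted.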
Step 3a – Calculated Health Check
  • Log in to AWS Console, select Route 53 service
  • Click on “Health checks” and “Create health check”
  • Enter a name for your health check and select “Status of other health checks (calculated health check)”
  • Under the “Monitor other health checks (calculated health check)” section, select the three health checks created in Step 1 and Step 2
  • Under “Report Healthy when”, select the first option, “at least”, and enter “2”
  • Configure CloudWatch Alarm, enter relevant SNS topic, if you wish to notify upon state change
  • Click “Create health check”
Figure 6: Calculated Health Check
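
The calculated health check can also be described as a `HealthCheckConfig` of type `CALCULATED`, where `HealthThreshold` plays the role of the “at least 2” setting. The sketch below is illustrative: the helper name and the child health check IDs are hypothetical placeholders for the IDs returned when the three checks from Step 1 and Step 2 are created.

```python
import uuid
from typing import List


def build_calculated_check(child_ids: List[str], healthy_threshold: int = 2) -> dict:
    """Build CreateHealthCheck arguments for a calculated health check."""
    if not 0 <= healthy_threshold <= len(child_ids):
        raise ValueError("threshold must be between 0 and the number of children")
    return {
        "CallerReference": str(uuid.uuid4()),
        "HealthCheckConfig": {
            "Type": "CALCULATED",
            "ChildHealthChecks": list(child_ids),
            # Healthy when at least this many child checks are healthy.
            "HealthThreshold": healthy_threshold,
        },
    }


# Hypothetical IDs standing in for the ALB, failover, and failback checks.
children = ["alb-check-id", "failover-check-id", "failback-check-id"]
calculated = build_calculated_check(children)
print(calculated["HealthCheckConfig"])
```

With a threshold of 2, a single unhealthy child (for example, the ALB check during an outage) is not enough to flip the calculated check; both FCWS checks must also go unhealthy.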

Section 3: Configuring Route 53 Failover Policy

Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service. In our example architecture, Route 53 hosts a domain whose DNS records end users resolve to reach the demo application. The DNS records are configured with a Failover routing policy, where one DNS record acts as the primary and the second acts as the standby. The primary DNS record is associated with the calculated HC created in Section 2, so the state of that HC determines which record Route 53 returns. The standby record does not have a HC associated with it, which allows that record to be served whenever the primary record is unhealthy.

Step 1: Create Route 53 Failover Policy

  • Log in to AWS Console, select Route 53 service
  • Click on “Hosted zones” in the Route 53 console and select your hosted zone
  • Click “Create record” in the “Quick create record” view and fill in the details for where Route 53 should primarily direct traffic (primary region ALB):
    • Record Name: Enter a subdomain name if required
    • Select: “Alias” and choose the endpoint of an AWS resource (primary region ALB DNS name)
    • Choose Region: Select your primary region and your endpoint
    • Routing Policy: Failover
    • Failover record type: Primary
    • Health Check ID: Name of the calculated health check created in Section 2, Step 3
  • Click on “Add another record” and fill in the details as follows for where Route 53 should direct traffic in the event of a primary region failure (secondary region ALB):
    • Select: “Alias” and choose the endpoint of an AWS resource (secondary region ALB DNS name)
    • Choose Region: Select your secondary region and your endpoint
    • Routing Policy: Failover
    • Failover record type: Secondary
  • Click on “Create Records”
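
The two records above can be sketched as a `ChangeBatch` for the Route 53 `ChangeResourceRecordSets` API. This is an assumption-laden illustration rather than the configuration used in this blog: the helper names, record name, ALB DNS names, hosted zone IDs, and health check ID are hypothetical. In particular, `EvaluateTargetHealth` is set to `False` here on the assumption that only the calculated health check should drive routing; the original steps do not state this setting.

```python
from typing import Optional, Tuple


def failover_alias_record(name: str, role: str, alb_dns_name: str,
                          alb_hosted_zone_id: str,
                          health_check_id: Optional[str] = None) -> dict:
    """Build one UPSERT change for a failover alias record."""
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-region",
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": alb_hosted_zone_id,  # the ALB's own hosted zone ID
            "DNSName": alb_dns_name,
            "EvaluateTargetHealth": False,  # assumption, see lead-in
        },
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


def build_change_batch(name: str, primary: Tuple[str, str],
                       secondary: Tuple[str, str], calculated_hc_id: str) -> dict:
    """primary and secondary are (ALB DNS name, ALB hosted zone ID) pairs."""
    return {
        "Comment": "Manual failover pattern: primary and standby records",
        "Changes": [
            failover_alias_record(name, "PRIMARY", *primary,
                                  health_check_id=calculated_hc_id),
            # The standby record deliberately has no health check.
            failover_alias_record(name, "SECONDARY", *secondary),
        ],
    }


# All values below are hypothetical placeholders.
batch = build_change_batch(
    "app.example.com",
    ("primary-alb.us-east-1.elb.amazonaws.com", "ZPRIMARYALBZONE"),
    ("secondary-alb.us-west-2.elb.amazonaws.com", "ZSECONDARYALBZONE"),
    "calculated-check-id",
)
print([c["ResourceRecordSet"]["Failover"] for c in batch["Changes"]])
```

The batch could be applied with `boto3.client("route53").change_resource_record_sets(HostedZoneId=..., ChangeBatch=batch)`, where the hosted zone ID is that of your public hosted zone.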

Section 4: Failover Process

Steady State
In a steady state, the application is up and running in the primary region. This means the load balancer in the primary region is servicing all the customer requests, the application web servers are running in the primary region, and the database is live there for writes.

The HC state is as follows:

  • HC against primary region ALB = HEALTHY
  • HC against FCWS web server failover/index.html with FailOverNow string match = HEALTHY (web server is down and/or string is not there, HC is inverted)
  • HC against FCWS web server failback/index.html = HEALTHY (web server is down, HC is inverted)
  • Calculated HC = HEALTHY (3 out of 3 HCs are HEALTHY)

Issue in Secondary Region
If an issue occurs in the secondary region, there is no impact to traffic flows. The HCs will look the same as in the steady state above. If the secondary region issue is resolved, there is no change to the traffic flows nor the HC states.

Simulate Outage in Primary Region
To test the failover from the primary region to the secondary region, we will simulate a failure at the ALB by modifying the security group. The HC for the primary region ALB will go UNHEALTHY; however, Route 53 will continue to answer with the primary record because the calculated HC remains healthy. Traffic will remain in the primary region and users will experience downtime until a failover is invoked manually.

The HC state is as follows:

  • HC against primary region ALB = UNHEALTHY
  • HC against FCWS web server failover/index.html with FailOverNow string match = HEALTHY (web server is down and/or string is not there, HC is inverted)
  • HC against FCWS web server failback/index.html = HEALTHY (web server is down, HC is inverted)
  • Calculated HC = HEALTHY (2 out of 3 HCs are HEALTHY)

Failover to Secondary Region
In this simulation, we use a Lambda function that promotes the Aurora Read Replica in the secondary region to primary. For customers that need manual control, this lets them promote the DB, and validate its state, before Route 53 redirects customer traffic to it. Once the decision is made to fail over the DNS, start the FCWS web server instance in the secondary region. Log in to the FCWS web server and start the Apache service. Once Apache is up, verify that failover/index.html and failback/index.html are up and responsive.
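
The blog does not include the source of the Lambda function, so the following is only a sketch of what such a function might look like, assuming it calls the RDS `FailoverGlobalCluster` API for a managed planned failover (newer AWS documentation refers to the planned operation as a switchover). The event keys, environment variable names, and the optional `rds_client` parameter (added so the handler can be exercised without AWS access) are hypothetical.

```python
import os


def lambda_handler(event, context, rds_client=None):
    """Promote the secondary Aurora cluster of a global database.

    Identifiers come from the event or, failing that, from environment
    variables; both naming schemes are assumptions for this sketch.
    """
    if rds_client is None:
        import boto3  # provided by the Lambda Python runtime
        rds_client = boto3.client("rds")

    global_id = event.get("global_cluster_id") or os.environ["GLOBAL_CLUSTER_ID"]
    target_arn = event.get("target_cluster_arn") or os.environ["TARGET_CLUSTER_ARN"]

    response = rds_client.failover_global_cluster(
        GlobalClusterIdentifier=global_id,
        TargetDbClusterIdentifier=target_arn,  # ARN of the cluster to promote
    )
    # Report the global cluster status so the operator can track progress.
    return {"status": response.get("GlobalCluster", {}).get("Status")}
```

The same handler could serve the failback step by passing the ARN of the cluster in the primary region as the target. Validate the database state before starting the FCWS web server, since DNS moves only after the inverted health checks go unhealthy.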

The HC state is as follows:

  • HC against primary region ALB = UNHEALTHY
  • HC against FCWS web server failover/index.html with FailOverNow string match = UNHEALTHY (web server is up and string is there, HC is inverted)
  • HC against FCWS web server failback/index.html = UNHEALTHY (web server is up, HC is inverted)
  • Calculated HC = UNHEALTHY (3 out of 3 HCs are UNHEALTHY)

Traffic is now redirected to the standby DNS record, which sends user requests to the ALB in the secondary region. The simulated failover is now completed. This was done without needing to change anything in the Route 53 console.

Simulate Recovery and Failback to Primary Region
To simulate recovery and failback to the primary region, do the following steps:

  • Remove security group restrictions on the ALB in the primary region
  • Drain connections in the secondary region ALB by deregistering the target instances
  • Invoke the Lambda function to promote the Aurora DB to active in the primary region
  • Stop the FCWS web server

The HC state is as follows:

  • HC against primary region ALB = HEALTHY
  • HC against FCWS web server failover/index.html with FailOverNow string match = HEALTHY (web server is down and/or string is not there, HC is inverted)
  • HC against FCWS web server failback/index.html = HEALTHY (web server is down, HC is inverted)
  • Calculated HC = HEALTHY (3 out of 3 HCs are HEALTHY)

At this point end user requests are now being sent to the primary region ALB and the Aurora DB is active there as well.

Health Check and DNS Failover Policy Matrix

The following table shows the state of the various HCs and maps to the failover policy DNS resolution:

Primary Region ALB HC | FCWS WebServer Failover HC | FCWS WebServer Failback HC | Target Region
HEALTHY               | HEALTHY                    | HEALTHY                    | Primary
UNHEALTHY             | HEALTHY                    | HEALTHY                    | Primary
UNHEALTHY             | UNHEALTHY                  | UNHEALTHY                  | Secondary
HEALTHY               | UNHEALTHY                  | UNHEALTHY                  | Secondary
UNHEALTHY             | HEALTHY                    | UNHEALTHY                  | Secondary
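
The matrix follows directly from the “at least 2 of 3” rule on the calculated HC. The short sketch below reproduces each row; it is a model of the decision logic only, and the function name is hypothetical.

```python
def target_region(alb_ok: bool, failover_ok: bool, failback_ok: bool,
                  threshold: int = 2) -> str:
    """Return the region that the failover policy resolves to.

    Each argument is the reported state of one child HC (True means
    HEALTHY). The standby record has no HC, so it is served whenever
    the calculated HC on the primary record is unhealthy.
    """
    healthy_children = sum([alb_ok, failover_ok, failback_ok])
    calculated_healthy = healthy_children >= threshold
    return "Primary" if calculated_healthy else "Secondary"


rows = [
    (True, True, True),
    (False, True, True),
    (False, False, False),
    (True, False, False),
    (False, True, False),
]
for row in rows:
    print(row, "->", target_region(*row))
```

Running the loop yields Primary, Primary, Secondary, Secondary, Secondary, matching the table row by row.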

Summary

In this post, we showed how an Amazon Route 53 failover policy and health checks can control traffic flows across regions. We demonstrated how to use calculated health checks with an out-of-band web server as a control plane. With this configuration, Route 53 gives you another way to define traffic routing and make routing decisions for your application across regions. Customers can use this pattern as part of their multi-region resiliency strategy. This is a sample based on a specific application. You need to verify and test the pattern with your own application to ensure that it meets your BCP/DR needs.

Get started today with Route 53 policy-based routing by checking the documentation and visiting the AWS console.

Jason Viar

Jason is a Solutions Architect at AWS. He works with enterprise customers throughout their journey in the cloud. Prior to AWS, Jason was a network architect for two different financial services companies. He was also a Solutions Architect with a large networking provider.

Murali Sankar

Murali Sankar is a Senior Solutions Architect at AWS based out of New York. He is a tech enthusiast who is passionate about solving complex business problems. He works with organizations to design and develop architectures on AWS and supports their journey on the cloud.