Running recovery-oriented applications with Amazon Route 53 Application Recovery Controller, AWS CI/CD tools, and Terraform

Introduction

AWS customers in many industries run applications that require extremely high availability across several AWS Regions to meet latency and business continuity requirements. Amazon Route 53 Application Recovery Controller (Route 53 ARC) supports high availability by allowing customers to continuously audit the recovery readiness of their applications and centrally coordinate rerouting workloads around failures in a safe and reliable way.

In my work as a Solutions Architect at AWS, I help Financial Services companies deliver internet-scale applications, and I know from experience that it can be a daunting task. In this blog post, I describe how to use Route 53 ARC components to maximize the availability of a multi-Region web application. For recovery-oriented architectures, automation is paramount, so I use AWS CodePipeline for continuous delivery, AWS CodeBuild for continuous integration, AWS CodeDeploy for automated code deployment, and HashiCorp Terraform as the Infrastructure-as-Code (IaC) tool. I also show you how to deploy an application incrementally to one AWS Region at a time, to avoid correlated failures. Finally, I walk through how to fail over traffic from an active Region to a standby Region, and how to fail back traffic after the issue that required the Regional failover is resolved.

Note: If you use AWS CloudFormation as a deployment tool, you can learn how to deploy an application with AWS Cloud Development Kit (CDK) and configure Route 53 ARC components with AWS CloudFormation in the Route 53 ARC Developer Guide.

Use case overview

Financial Services applications serve hundreds of thousands of customers and process millions of business transactions every day. AWS services such as Elastic Load Balancing, EC2 Auto Scaling, and Amazon DynamoDB are a great fit for these kinds of applications. With Elastic Load Balancing resources such as Application Load Balancers and with Auto Scaling groups, customers can automatically scale applications based on traffic, with minimal operational overhead. DynamoDB global tables provide applications with internet-scale write throughput, and seamlessly replicate data across AWS Regions, typically within one second and with 99.999% availability, while automatically resolving conflicts using a last writer wins process.

Solution implementation

The best way to understand how to use Route 53 ARC, AWS CI/CD tools, and Terraform together is to start small. To that end, let me walk you through deploying a sample web application, the SignUp application. This application allows end-users to enter their contact information so that they can be notified when a new product from the New Startup company officially launches.

The SignUp application is written in NodeJS. It runs in active/standby mode across two AWS Regions, with two Availability Zones per Region, and stores data in a DynamoDB global table that resides in the same two AWS Regions. Route 53 ARC’s routing controls will front each of the deployments at the application layer. To perform a Regional failover, you update routing control states to stop traffic to the active Region and start traffic to the standby Region.

Prerequisites

Before you start, make sure that you have the following created or installed and ready to use:

git clone https://github.com/aws-samples/route-53-application-recovery-controller-codepipeline-with-terraform.git

With the prerequisites complete, you can get started.

Create the AWS resources in two Regions

The first step is to use Terraform to create the required AWS resources, deploy the SignUp application in two Regions, and create the required Route 53 ARC components.

Under the route-53-application-recovery-controller-codepipeline-with-terraform folder, you will find a shell script called create-db-app-cicd-stack.sh that uses Terraform to perform the following actions:

  • Create an Amazon S3 bucket, which will be used as a source code repository for the CI/CD pipeline.
  • Create an Amazon DynamoDB global table and the supporting AWS resources to run the application in two AWS Regions.
  • Create a CI/CD pipeline that includes an approval action to deploy the application, one Region at a time, by using CodePipeline, CodeBuild, and CodeDeploy.
  • Create the Route 53 ARC components for readiness checks and routing controls, Route 53 Health Checks, and Route 53 DNS records.

Set the DNS variables

Before you run the script, update the DNS Hosted Zone and DNS Domain Name variables to use the values that correspond to your Route 53 domain name, as described in the prerequisites section.

To make the updates, do the following:

  1. In the route-53-application-recovery-controller-codepipeline-with-terraform folder, edit the set-terraform-variables.sh file.
  2. On lines 10 and 11, update the following variables to use the values for your own DNS domain name.
    export TF_VAR_DNSHostedZone=Z0ABCDEFG9Z
    export TF_VAR_DNSDomainName=gtphonehome.com
  3. Save the file.
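
Before running anything, it can help to confirm that the two variables are actually set to your own values. The following is a minimal sketch of such a pre-flight check. The function name check_dns_vars is hypothetical and not part of the sample repository, and the check only compares against the sample values shown above.

```shell
# Hypothetical pre-flight check; not part of the sample repository.
# Succeeds only if both variables are set and differ from the sample
# values shown in this post.
check_dns_vars() {
    if [ -z "${TF_VAR_DNSHostedZone:-}" ] || [ -z "${TF_VAR_DNSDomainName:-}" ]; then
        echo "TF_VAR_DNSHostedZone and TF_VAR_DNSDomainName must both be set" >&2
        return 1
    fi
    if [ "$TF_VAR_DNSHostedZone" = "Z0ABCDEFG9Z" ] || [ "$TF_VAR_DNSDomainName" = "gtphonehome.com" ]; then
        echo "Replace the sample DNS values with your own" >&2
        return 1
    fi
    echo "DNS variables look set: $TF_VAR_DNSDomainName ($TF_VAR_DNSHostedZone)"
}
```

One way to use it, assuming that set-terraform-variables.sh can be sourced on its own, is to source the file in your shell and then call check_dns_vars.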

Run the script

Now create the AWS resources by running the shell script.

  1. Open your preferred terminal and change to the script directory.
cd route-53-application-recovery-controller-codepipeline-with-terraform
  2. Run the script, and redirect the script output to a local file so that you can track the deployment progress and make sure that the AWS resources were created successfully.
./create-db-app-cicd-stack.sh > my_terraform_create.log 2>&1

It may take up to 20 minutes for the script to create all the AWS resources in both Regions.
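
To check the log file, you can search it for the lines that Terraform normally prints. The sketch below assumes that the script passes Terraform's standard output through to the log unchanged, so that failures appear as lines containing "Error:" and a successful run ends with "Apply complete!". The helper name check_create_log is hypothetical.

```shell
# Hypothetical helper; not part of the sample repository.
# Scans a log file for Terraform error lines and for the
# "Apply complete!" summary that Terraform prints after a successful apply.
check_create_log() {
    log_file="$1"
    if [ ! -r "$log_file" ]; then
        echo "Cannot read $log_file" >&2
        return 2
    fi
    if grep -q 'Error:' "$log_file"; then
        echo "Errors found in $log_file:" >&2
        grep -n 'Error:' "$log_file" >&2
        return 1
    fi
    if ! grep -q 'Apply complete!' "$log_file"; then
        echo "No 'Apply complete!' line found yet in $log_file" >&2
        return 1
    fi
    echo "Apply completed without reported errors"
}
```

For example, check_create_log my_terraform_create.log reports on the log file created in the previous step. While the script is still running, tail -f my_terraform_create.log shows progress as it happens.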

Review the AWS resources

After the script completes successfully, take a moment to review your AWS resources.

For each AWS Region, the script creates the following resources, using tf-arc as a name prefix:

  • A VPC called tf-arc-VPC with 10.0.0.0/16 as the IPv4 CIDR, plus one internet gateway and one NAT gateway.
  • Two Availability Zones. Each zone has a public subnet and a private subnet.
  • One internet-facing Application Load Balancer.
  • One Auto Scaling group that consists of two Amazon EC2 Linux instances. Each instance has a profile that allows it to access the DynamoDB global table. In addition, the script installs the CodeDeploy agent, which enables an instance to be used in AWS CodeDeploy deployments.
  • Two security groups, to allow access to the load balancer from the internet and to allow access to the Auto Scaling group from the load balancer.

The script creates a DynamoDB global table called nodejs-tutorial with email as the partition key and no sort key.

It creates a CI/CD Pipeline called ARC-Pipeline, which deploys the application in two AWS Regions, using the cross-Region action feature in AWS CodePipeline.

The script also creates a Route 53 ARC recovery group called tf-arc-RecoveryGroup and a Route 53 ARC cluster called tf-arc-Cluster.

Note: These AWS resources will result in charges to your AWS account. The total cost depends on how long you keep the AWS resources. For details about pricing, visit the Route 53 ARC Pricing page.

Deploy the SignUp application

The sample CI/CD pipeline has five stages, including an approval action. Initially, the pipeline deploys the application just to the active Region. The first three stages run automatically, without manual intervention:

  • In the Source stage, the process starts automatically by getting the source code from the S3 bucket. The script created the S3 bucket and uploaded the source code. You can find the source code in the route-53-application-recovery-controller-codepipeline-with-terraform/nodejs-sample-app folder.
  • For the Build stage, AWS CodeBuild uses an Amazon Linux container to install the NodeJS runtime and install all of the application’s dependencies. The buildspec section of the CodeBuild project contains the commands to run the continuous integration process.
  • In the Deploy-to-Region-1 stage, AWS CodeDeploy uses an application specification file (AppSpec file) to manage each deployment as a series of lifecycle event hooks. You can find the AppSpec file at route-53-application-recovery-controller-codepipeline-with-terraform/nodejs-sample-app/appspec.yml. This stage deploys the application to the Auto Scaling group called tf-arc-asg, which is associated with the target group tf-arc-tgrp of the Application Load Balancer. In this example, CodeDeploy deploys the application to only one EC2 instance at a time and uses the Application Load Balancer to prevent internet traffic from being routed to an instance while it’s being updated. The Application Load Balancer also makes the instance available for traffic again after deployment to that instance is complete. In addition, the deployment is configured to automatically roll back when a deployment fails.

At this point, the pipeline execution stops and waits for you to verify that the deployment succeeded. After you verify it, you can manually approve the deployment to the standby Region.

  1. The Manual-Approval stage, shown in the following screenshot, lets you verify that the application deployed successfully in the active Region before you deploy the application in the standby Region. To verify, start by choosing Review, which opens a comments section for you to enter confirmation text.

  2. Before you confirm and continue, check that the application you deployed is up and running. To do this, access the DNS name of the load balancer in the active Region. To view the DNS name to use, in the Amazon EC2 console, navigate to the Load Balancer page. Enter the corresponding DNS name in a browser, and then make sure that you see the home page for the SignUp application, like the following:

  3. Now that you’ve confirmed that the application is deployed, enter a comment in the text field, and choose Approve to continue the deployment to the standby Region, as shown in the following screenshot:

By using this approach, you have the option to stop the deployment if there is a problem with the release, which prevents a bad version from propagating to the standby Region as well as the active Region. This lets you avoid correlated failures between Regions.

  4. After you approve the review, the Deploy-to-Region-2 stage deploys the application to the standby Region, following the same deployment mechanism as the Deploy-to-Region-1 stage.

  5. Finally, check that the application is up and running in the standby Region, by accessing the DNS name of the load balancer in that Region.

Use Route 53 Application Recovery Controller (ARC) routing control

Now you can look at the Route 53 ARC components created by the script, and see how to use routing controls to manage traffic between the active and standby Regions.

In addition to the AWS resources for the SignUp application, the script also creates the following Route 53 ARC components:

  • Two cells, one for each AWS Region.
  • Readiness checks, which audit mismatches in capacity, resource limits, and throttle limits across cells. Together with other monitoring information that tells you about the health of your standby cell (Region), this capability helps you understand whether your standby is ready for failover traffic from the active cell. To determine whether your application is healthy, you can use application-specific metrics from Amazon CloudWatch or another observability tool.
  • Routing controls, which enable you to re-route traffic across cells (Regions) as part of an application failover.
  • Route 53 health checks, which manage traffic failover for the application when you update routing controls.

The script also creates two Route 53 failover records:

  • Primary failover record for the active Region.
  • Secondary failover record for the standby Region.

The script associates both DNS records with the Route 53 health checks that the script also created.

The following diagram shows the Route 53 ARC components for the SignUp application:

To review the Route 53 ARC components and DNS records that the script created, sign in to the Amazon Route 53 console.

Turn on a routing control state

When a routing control state is ON, traffic flows to the cell controlled by that routing control. When the script finishes, both routing control states are set to OFF. To enable traffic to flow to the active Region, turn on the routing control state for that Region by following the steps in the AWS documentation for using the AWS CLI. (You can also update routing control states on the Amazon Route 53 console in the AWS Management Console, but AWS recommends that you use the Route 53 ARC API, for example, by using the AWS CLI.)

Note that to work with routing control states, you must connect to one of the Regional cluster endpoints. You can view the list of Regional cluster endpoints for your cluster in the Route 53 console, or by using the DescribeCluster API action. Your process for getting and changing routing control states should be prepared to try each endpoint in rotation, because cluster endpoints are cycled through available and unavailable states for regular maintenance and updates. For code examples that explain how to rotate through Regional cluster endpoints to get and set routing control states, see API examples for Application Recovery Controller.
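
As an illustration of trying each endpoint in rotation, here is a minimal shell sketch. The names try_each_endpoint and get_state are hypothetical, and ROUTING_CONTROL_ARN and CLUSTER_ARN are variables that you would set yourself. The helper calls a command once per Region and endpoint pair and stops at the first success.

```shell
# Hypothetical helpers; not part of the sample repository.
# Usage: try_each_endpoint COMMAND REGION1 URL1 [REGION2 URL2 ...]
# COMMAND is invoked as: COMMAND REGION URL. The first success wins.
try_each_endpoint() {
    cmd="$1"
    shift
    while [ "$#" -ge 2 ]; do
        region="$1"
        url="$2"
        shift 2
        if "$cmd" "$region" "$url"; then
            return 0
        fi
        echo "Endpoint in $region did not succeed; trying the next one" >&2
    done
    echo "No cluster endpoint accepted the request" >&2
    return 1
}

# Example command: read one routing control state through a given endpoint.
# Assumes ROUTING_CONTROL_ARN is set by the caller.
get_state() {
    aws route53-recovery-cluster get-routing-control-state \
        --routing-control-arn "$ROUTING_CONTROL_ARN" \
        --region "$1" \
        --endpoint-url "$2"
}

# Possible usage (assumes CLUSTER_ARN is set):
#   ENDPOINTS=$(aws route53-recovery-control-config describe-cluster \
#       --cluster-arn "$CLUSTER_ARN" --region us-west-2 \
#       --query 'Cluster.ClusterEndpoints[].[Region,Endpoint]' --output text)
#   try_each_endpoint get_state $ENDPOINTS
```

ENDPOINTS is left unquoted on purpose in the usage comment so that the shell splits the text output into alternating Region and endpoint arguments.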

You can get the Amazon Resource Name (ARN) of your routing control from the Route 53 ARC control panel that contains it, which is called tf-arc-ControlPanel. You’ll update the routing control state for one Region by running the following AWS CLI command, update-routing-control-state:

aws route53-recovery-cluster update-routing-control-state \
--routing-control-arn \
arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/abcdefg1234567 \
--routing-control-state On \
--region us-west-2 \
--endpoint-url https://host-dddddd.us-west-2.example.com/v1

When the request is successful, the response is empty.

Your routing controls are durably stored in clusters located in five AWS Regions and changes are coordinated across cluster endpoints. In addition, these changes go through Route 53’s data plane, which has a 100% availability SLA.

Who can turn routing control states on or off?

You can make sure that only authorized personnel can turn routing control states on or off and trigger an application failover by attaching a policy, such as the following, to a given IAM user:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                   "route53-recovery-cluster:UpdateRoutingControlState",
                   "route53-recovery-cluster:UpdateRoutingControlStates",
                   "route53-recovery-cluster:GetRoutingControlState"
             ],
            "Resource": "*"
        }
    ]
}

For more information, see the identity-based IAM policy examples for Route 53 ARC.

Access the SignUp application

To access the SignUp application, do the following:

  1. Sign in to the Amazon Route 53 console.
  2. On the Hosted Zone page, view the DNS record name that corresponds to the application (tf-arc-<YOUR DOMAIN NAME>).
  3. Paste the URL in a browser.
  4. On the application home page, choose Sign up today, and fill out the web form. You’ll see a page like the following:

After you submit the form, you’ll receive the following message: Thanks for signing up! You’ll be among the first to know when we launch.

To verify that your information has been stored in the nodejs-tutorial table in both AWS Regions, sign in to the AWS Management Console, navigate to the DynamoDB console, and then choose Items. In the Items returned table, you’ll see your email address and other information, as shown in the following screenshot:
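
If you prefer the command line, the same check can be sketched with the AWS CLI. The helper names below are hypothetical. The sketch assumes that the email partition key is stored as a String (type S) and that the address contains no double quotes or backslashes. You pass in the two Regions you deployed to.

```shell
# Hypothetical helpers; not part of the sample repository.
# Builds the DynamoDB key JSON for an email address, assuming the
# "email" partition key is of type String (S).
email_key() {
    printf '{"email": {"S": "%s"}}' "$1"
}

# Looks up the item in every Region listed after the email address.
# Usage: check_signup EMAIL REGION [REGION ...]
check_signup() {
    email="$1"
    shift
    for region in "$@"; do
        echo "Region: $region"
        aws dynamodb get-item \
            --table-name nodejs-tutorial \
            --key "$(email_key "$email")" \
            --region "$region"
    done
}
```

For example, check_signup followed by the address you signed up with and your two Region names queries the table in each Region in turn.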

Test a Regional failover

To fail over from one Region to another, you change routing control states to reroute traffic.

Let’s say that an unplanned event causes an outage in the active Region for the SignUp application that prevents your users from accessing the application. Unplanned events can include elevated latency, application errors, human errors, and infrastructure outages caused by natural disasters or hardware failures. You can use Route 53 ARC to quickly fail over traffic from the active Region to the standby Region. As a result, your users can continue to access the application and you can achieve your low recovery point objective (RPO) and recovery time objective (RTO).

To fail over traffic, you must manually set the routing control state for your active Region to OFF to stop sending traffic to it. Then, you must set the standby Region’s routing control state to ON to start traffic flowing there. You can update several routing controls at the same time with one API call: update-routing-control-states. When the request is successful, the response is empty.

Here is an example:

aws route53-recovery-cluster update-routing-control-states \
--update-routing-control-state-entries '[{"RoutingControlArn": "arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/abcdefg1234567", "RoutingControlState": "Off"},
{"RoutingControlArn": "arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/hijklmnop987654321", "RoutingControlState": "On"}]' \
--region us-west-2 \
--endpoint-url https://host-dddddd.us-west-2.example.com/v1

After a few seconds, the DNS address is updated, and application traffic is now routed to the standby Region:

Now that you’ve successfully shifted traffic to the standby Region, you can start troubleshooting the issue that caused the outage in the active Region.
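
Because the entries JSON is long and easy to mistype during an incident, you might generate it from the two routing control ARNs. The following is a minimal sketch. The function name failover_entries is hypothetical, and the ARNs are assumed to contain no characters that need JSON escaping.

```shell
# Hypothetical helper; not part of the sample repository.
# Prints the JSON for --update-routing-control-state-entries:
# the first ARN is turned Off and the second ARN is turned On.
# Usage: failover_entries ARN_TO_TURN_OFF ARN_TO_TURN_ON
failover_entries() {
    printf '[{"RoutingControlArn": "%s", "RoutingControlState": "Off"}, {"RoutingControlArn": "%s", "RoutingControlState": "On"}]' "$1" "$2"
}
```

You could then pass "$(failover_entries "$ACTIVE_ARN" "$STANDBY_ARN")" as the value of --update-routing-control-state-entries, where ACTIVE_ARN and STANDBY_ARN are variables that you set yourself. Swapping the two arguments produces the entries for shifting traffic in the other direction.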

Test a Regional failback

You can also fail back traffic to the active Region after the issue that triggered the Regional failover is resolved. Set the routing control state for the standby Region to OFF to stop sending traffic there, and set the routing control state for the active Region to ON to start traffic flowing there again. As with failover, you can make both changes in one call by using the update-routing-control-states command. Here’s an example:

aws route53-recovery-cluster update-routing-control-states \
--update-routing-control-state-entries '[{"RoutingControlArn": "arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/abcdefg1234567", "RoutingControlState": "On"},
{"RoutingControlArn": "arn:aws:route53-recovery-control::111122223333:controlpanel/0123456bbbbbbb0123456bbbbbb0123456/routingcontrol/hijklmnop987654321", "RoutingControlState": "Off"}]' \
--region us-west-2 \
--endpoint-url https://host-dddddd.us-west-2.example.com/v1

After a few seconds, the DNS address is updated, and application traffic is now routed back to the active Region:

Clean up

To reduce costs, you can delete the AWS resources that you created for the example in this blog post. In the route-53-application-recovery-controller-codepipeline-with-terraform folder, there is a shell script called destroy-db-app-cicd-stack.sh that performs the following actions:

  • Deletes the Route 53 ARC components for readiness checks and routing controls, Route 53 Health Checks, and Route 53 DNS records.
  • Deletes the CI/CD pipeline.
  • Deletes the AWS resources in two AWS Regions.
  • Deletes the Amazon DynamoDB global table.
  • Deletes the Amazon S3 Bucket where a copy of the source code was stored.

Run the shell script and redirect its output to a local file so that you can track progress of the deletion process and make sure that all AWS resources were deleted successfully. To run the script, type the following at a terminal command prompt:

cd route-53-application-recovery-controller-codepipeline-with-terraform

./destroy-db-app-cicd-stack.sh > my_terraform_delete.log 2>&1

It may take up to 20 minutes to delete all AWS resources in both Regions.

Conclusion

In this blog post, you learned how to use Route 53 ARC together with CodePipeline, CodeBuild, CodeDeploy, and Terraform to deploy and run an application with a recovery-oriented architecture on AWS. The process included partitioning the application into redundant, isolated cells that align with two AWS Regions. Next, you deployed changes incrementally to one cell at a time, to avoid correlated failures between Regions. Finally, you mitigated an outage by removing an impaired cell from service and rerouting traffic to a healthy cell. Feel free to experiment with these scripts and adapt them to your needs. I hope this post has been useful, and if you have questions, feel free to start a new thread in the Amazon Route 53 Application Recovery Controller forum.

Guillermo Tantachuco

I am a Solutions Architect at AWS, where I work with Financial Services customers on all aspects of software delivery and internet-scale systems, including application and data architecture, DevOps, defense in-depth, and fault tolerance. Since 2011, I have led the delivery of cloud-native and digital transformation initiatives at Fortune 500 and global organizations. I am passionate about my family, business, technology, and soccer.