Orchestrate disaster recovery automation using Amazon Route 53 ARC and AWS Step Functions

Note: To learn more about Amazon Route 53 Application Recovery Controller (Route 53 ARC), we recommend that you read Part 1 and Part 2 of the series and try out the examples. They demonstrate how Route 53 ARC lets you coordinate failovers and assess the recovery readiness of your application.

In this blog post, we provide a blueprint for orchestrating the automation of failover during disaster recovery events using Amazon Route 53 Application Recovery Controller (Route 53 ARC), AWS Step Functions, AWS Lambda, and Amazon DynamoDB.

Organizations spend a lot of effort orchestrating manual disaster recovery (DR) runbook actions during a DR scenario. An application can become unavailable for various reasons, including hardware failures, software bugs, or network device problems.

However, to be fast and reliable, the DR runbook process should be practiced regularly and automated so that failover and failback are coordinated centrally with minimal manual steps. Furthermore, you need a simple and effective automation approach to meet Recovery Time Objective (RTO) and Recovery Point Objective (RPO) requirements.

This DR automation solution minimizes the need for manual intervention and helps shorten the recovery time in the event of a Regional impairment. The solution orchestrates the DR runbook processes using a “failover step function” and a “failback step function”, each deployed in both the primary and standby Regions. These sample step functions use a custom Lambda function and DynamoDB global tables to switch the Route 53 ARC Routing Controls on or off, which plays an important role in managing the failover and failback of AWS service entry points.

Cross Region Recovery with Amazon Route 53 ARC

Amazon Route 53 ARC is a global service that consists of a control plane and a data plane. The control plane is located in the us-west-2 (Oregon) Region and lets us create and delete resources in the ARC cluster. The data plane is available in five Regions and provides the service’s core functionality. To be precise, creating or deleting ARC Routing Controls is a control plane operation, while updating ARC Routing Controls (that is, changing their on/off states) is a data plane operation.

Route 53 ARC offers extreme reliability with its data plane to fail over the application during a Regional impairment. Route 53 ARC maintains the routing control states in a cluster, which is a set of five Regional endpoints. We can interact with any one of the cluster endpoints to update the state of a routing control, and the change is propagated across the five Regions of the cluster.

However, a robust failover mechanism should be independent of the Region we are failing away from, so we recommend managing routing state changes programmatically, using Amazon Route 53 ARC API operations through one of the AWS SDKs. As a best practice, we recommend choosing a random cluster endpoint to get or set routing control states. If a request to one cluster endpoint fails, handle the error gracefully and retry with the next endpoint, so that routing control states can still be retrieved or updated even if one cluster endpoint is unavailable.
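The endpoint rotation described above could be sketched in Python as follows. This is a minimal illustration rather than the code shipped with the solution: the function names, the `client_factory` parameter, and the `region`/`url` keys of each endpoint are hypothetical, and the boto3 calls assume the `route53-recovery-cluster` client with its `update_routing_control_state` operation.

```python
import random


def boto3_cluster_client(endpoint):
    """Build a Route 53 ARC data plane client for one cluster endpoint.

    `endpoint` is assumed to be a dict with "region" and "url" keys.
    boto3 is imported lazily so the retry logic can be exercised without it.
    """
    import boto3

    return boto3.client(
        "route53-recovery-cluster",
        region_name=endpoint["region"],
        endpoint_url=endpoint["url"],
    )


def set_routing_control_state(endpoints, routing_control_arn, state,
                              client_factory=boto3_cluster_client):
    """Set a routing control to "On" or "Off", trying endpoints in random order.

    Returns the Region of the endpoint that accepted the update and raises
    RuntimeError only if every endpoint fails.
    """
    candidates = list(endpoints)
    random.shuffle(candidates)  # avoid always hitting the same endpoint first
    errors = []
    for endpoint in candidates:
        try:
            client = client_factory(endpoint)
            client.update_routing_control_state(
                RoutingControlArn=routing_control_arn,
                RoutingControlState=state,
            )
            return endpoint["region"]
        except Exception as exc:  # broad on purpose: any failure moves to the next endpoint
            errors.append((endpoint["region"], repr(exc)))
    raise RuntimeError(f"all cluster endpoints failed: {errors}")
```

Injecting the client factory keeps the retry loop separate from the SDK, so the rotation can be tested with a stub before it is trusted in a real failover.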

Also, we highly recommend that you do not depend on the AWS console to change Route 53 ARC Routing Control states. Hence, we store the Route 53 ARC Regional cluster endpoints, the control panel ARN, and the order of the Routing Controls in DynamoDB global tables. If the AWS console is inaccessible, the failover and failback sample step functions deployed in any other AWS Region can still read the ARC parameters from the DynamoDB global tables. This allows you to automate the failover/failback without accessing the Route 53 ARC console.

Solution architecture

Figure 1: Diagram illustrating multi-Region DR automation using AWS Step Functions and Lambda

Route 53 ARC Control Panel and Routing Controls provide the ability to manage failover or failback for multiple layers of the application stack from one central location, as shown in the preceding figure. Adding Step Functions and Lambda to manage the ARC Routing Control states automatically allows us to run the DR runbook failover/failback sequences in a particular order. The runbook should define the specific actions and the correct order in which they need to run during DR failover or failback events. These ordered actions may vary for different use cases, but it is important to preserve the order of Routing Control state changes in a DynamoDB global table.

We use the Amazon Route 53 Application Recovery Controller APIs to list and update Routing Control states. To perform these API operations, we use the ARC Regional cluster endpoints, control panel ARN, and Routing Control names stored in three separate DynamoDB global tables. As a best practice, we have also implemented logic that picks a random endpoint from the five Route 53 ARC cluster endpoints and cycles through the remaining endpoints if a request to one of them fails.
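Because the tables store Routing Control names while state updates are addressed by ARN, a lookup step is needed in between. The sketch below shows one way this could be done, assuming the `list_routing_controls` operation of the boto3 `route53-recovery-cluster` client; the function name and the shape of its return value are hypothetical and not taken from the solution's repository.

```python
def resolve_routing_control_arns(client, control_panel_arn, ordered_names):
    """Map routing control names to ARNs while preserving the runbook order.

    `client` is assumed to behave like a boto3 "route53-recovery-cluster"
    client. Returns a list of (name, arn) tuples in the order of `ordered_names`.
    """
    arn_by_name = {}
    kwargs = {"ControlPanelArn": control_panel_arn}
    while True:
        page = client.list_routing_controls(**kwargs)
        for control in page["RoutingControls"]:
            arn_by_name[control["RoutingControlName"]] = control["RoutingControlArn"]
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token  # fetch the next page of results

    missing = [name for name in ordered_names if name not in arn_by_name]
    if missing:
        # Fail before any state change if the runbook names an unknown control.
        raise KeyError(f"routing controls not found in control panel: {missing}")
    return [(name, arn_by_name[name]) for name in ordered_names]
```

Resolving every name up front means a typo in the stored runbook order surfaces before the first Routing Control is switched, not halfway through a failover.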

Also in this solution, we automate the global failover/failback of the Amazon RDS cluster between the primary and standby Regions using an embedded step function and custom Lambda code.
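As a rough illustration of what such custom Lambda code might do, the sketch below promotes a target cluster of an Aurora global database and waits until it reports as the writer. It assumes the boto3 RDS operations `failover_global_cluster` and `describe_global_clusters`; the function name, parameters, and polling values are hypothetical, and the solution's actual implementation may differ. Without `AllowDataLoss` the RDS API performs a switchover, which expects both Regions to be healthy, so an unplanned failover would need `allow_data_loss=True`.

```python
import time


def fail_over_global_cluster(rds, global_cluster_id, target_cluster_arn,
                             allow_data_loss=False, poll_seconds=15,
                             max_polls=40, sleep=time.sleep):
    """Promote `target_cluster_arn` and wait until it becomes the writer.

    `rds` is assumed to behave like a boto3 "rds" client. Returns True once the
    target cluster is the writer, or False if it was not promoted in time.
    """
    request = {
        "GlobalClusterIdentifier": global_cluster_id,
        "TargetDbClusterIdentifier": target_cluster_arn,
    }
    if allow_data_loss:
        request["AllowDataLoss"] = True  # unplanned failover instead of switchover
    rds.failover_global_cluster(**request)

    for _ in range(max_polls):
        response = rds.describe_global_clusters(
            GlobalClusterIdentifier=global_cluster_id
        )
        members = response["GlobalClusters"][0]["GlobalClusterMembers"]
        if any(m["DBClusterArn"] == target_cluster_arn and m["IsWriter"]
               for m in members):
            return True
        sleep(poll_seconds)
    return False
```

In a state machine, the polling loop would more naturally be expressed as a Wait state followed by a Choice state, which keeps each Lambda invocation short.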

Prerequisites

In this post, we reuse the multi-Region stack design from the previously mentioned series: Building highly resilient applications using Amazon Route 53 Application Recovery Controller, Part 2: Multi-Region stack.

The multi-Region stack design, as shown in the following figure, supports an active-standby setup. In this design, the primary (active) Region is us-east-1 (N. Virginia), and the recovery (standby) Region is us-west-2 (Oregon).

Figure 2: Diagram illustrating a multi-Region active-standby AWS deployment

  1. Use the AWS CloudFormation template (infra-stackset) to deploy the multi-Region stack in your AWS account. For setup instructions, see the readme file that comes with the template.
  2. Then, to apply Route 53 ARC features to the multi-Region stack, deploy the Route 53 ARC stack in your account using a second CloudFormation template (arc-stack).
  3. Then, deploy the dashboard Lambda functions using the CloudFormation template (lambda-stackset), which deploys a pair of Lambda functions across two AWS Regions to help you operate the stack and understand the behavior set up in step 2. The dashboard app can be accessed using the DNS name of the Application Load Balancer (ALB) (“arcblog-DashboardLambdaAlb”) from the Amazon Elastic Compute Cloud (Amazon EC2) console, as shown in the following figure.

Example: <arcblog-DashboardLambdaAlb-xxxxxx.us-east-1.elb.amazonaws.com>

Figure 3: Status Dashboard shows even distribution across three AZs of the primary Region

Set up the DR Automation stack

As part of this post, we have provided an AWS Cloud Development Kit (AWS CDK) project. Use the Git repository to deploy the DR Automation stack. The AWS CDK is an open-source software development framework developed by AWS for defining and provisioning cloud infrastructure resources using familiar programming languages.

The repository has two sample configuration files under the config folder: one to configure the Amazon Relational Database Service (Amazon RDS) step function stack, and another one to configure the two main step function stacks for DR automation. Follow these steps for the setup.

  1. Clone the project from the Git repository.
  2. Use the following commands to deploy the Amazon RDS failover stack in both Regions and capture the ARNs of the Amazon RDS failover step functions.
    • $ export AWS_DEFAULT_REGION=us-east-1
    • $ cdk deploy -c rdsConfig=rds_failover_config RdsFailoverStackPrimary
    • $ export AWS_DEFAULT_REGION=us-west-2
    • $ cdk deploy -c rdsConfig=rds_failover_config RdsFailoverStackSecondary
  3. The preceding commands deploy one step function in the primary Region and one in the secondary Region. Note the step function ARNs shown on the Outputs tab of the CloudFormation stack in each Region. You will use them in the next step (4).
  4. Use the sample <route53_arc_config_sample.yml> under the config folder to create the “route53_arc_config.yml”. Refer to the following example.
    • Fill in the primaryStepfunctionArn and secondaryStepfunctionArn from Step-3.
    • Use the Route 53 ARC console to identify the ARC cluster name, endpoints, and control panel ARN, which were deployed as part of the prerequisites.
    • Fill in the failover and failback routing controls for the primary and secondary Regions in the order that your DR runbook procedures require.
        account: "<Account>"
        primary: "us-east-1"
        secondary: "us-west-2"
        appName: "test"
        primaryStepfunctionArn: "arn:aws:states:us-east-1:<account number>:stateMachine:DRRDSFailoverStepFunction"
        secondaryStepfunctionArn: "arn:aws:states:us-west-2:<account number>:stateMachine:DRRDSFailoverStepFunction"
        
        arcCluster:
          clusterName: "arcblog-Cluster"
          endpoints:
            - region: "eu-west-1"
              arn: "https://<xxxxx>.route53-recovery-cluster.eu-west-1.amazonaws.com/v1"
            - region: "us-west-2"
              arn: "https://<xxxxx>.route53-recovery-cluster.us-west-2.amazonaws.com/v1"
            - region: "us-east-1"
              arn: "https://<xxxxx>.route53-recovery-cluster.us-east-1.amazonaws.com/v1"
            - region: "ap-southeast-2"
              arn: "https://<xxxxx>.route53-recovery-cluster.ap-southeast-2.amazonaws.com/v1"
            - region: "ap-northeast-1"
              arn: "https://<xxxxx>.route53-recovery-cluster.ap-northeast-1.amazonaws.com/v1"
          controlPanel:
            controlPanelArn: "arn:aws:route53-recovery-control::<account number>:controlpanel/<xxxxx>"
            failoverRoutingControls:
              primary:
                - arcblog-Cell1-us-east-1
                - arcblog-Cell1C-us-east-1c
                - arcblog-Cell1B-us-east-1b
                - arcblog-Cell1A-us-east-1a
                - arcblog-Cell1Aurora-us-east-1
              secondary:
                - arcblog-Cell2C-us-west-2c
                - arcblog-Cell2B-us-west-2b
                - arcblog-Cell2A-us-west-2a
                - arcblog-Cell2Aurora-us-west-2
                - arcblog-Cell2-us-west-2
            failbackRoutingControls:
              primary:
                - arcblog-Cell1Aurora-us-east-1
                - arcblog-Cell1C-us-east-1c
                - arcblog-Cell1B-us-east-1b
                - arcblog-Cell1A-us-east-1a
                - arcblog-Cell1-us-east-1
              secondary:
                - arcblog-Cell2-us-west-2
                - arcblog-Cell2C-us-west-2c
                - arcblog-Cell2B-us-west-2b
                - arcblog-Cell2A-us-west-2a
                - arcblog-Cell2Aurora-us-west-2
  5. Use the following commands to deploy the main failover and failback step function stacks in both Regions.
    • $ export AWS_DEFAULT_REGION=us-east-1
    • $ cdk deploy -c config=route53_arc_config DrStackPrimary
    • $ export AWS_DEFAULT_REGION=us-west-2
    • $ cdk deploy -c config=route53_arc_config DrStackSecondary
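Before running the deploy commands, the configuration from step 4 could be sanity-checked with a short script. The sketch below is not part of the provided project; it assumes the file has already been parsed into a Python dict (for example with a YAML library such as PyYAML, which is a third-party package) and that the key layout matches the sample shown above. The function name and the specific checks are illustrative.

```python
def validate_dr_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []

    for key in ("primaryStepfunctionArn", "secondaryStepfunctionArn"):
        if not str(config.get(key, "")).startswith("arn:aws:states:"):
            problems.append(f"{key} is missing or is not a Step Functions ARN")

    cluster = config.get("arcCluster", {})
    endpoints = cluster.get("endpoints", [])
    if len(endpoints) != 5:  # a Route 53 ARC cluster has five Regional endpoints
        problems.append(f"expected 5 cluster endpoints, found {len(endpoints)}")

    panel = cluster.get("controlPanel", {})
    failover = panel.get("failoverRoutingControls", {})
    failback = panel.get("failbackRoutingControls", {})
    for role in ("primary", "secondary"):
        over = failover.get(role, [])
        back = failback.get(role, [])
        for label, names in (("failover", over), ("failback", back)):
            if len(names) != len(set(names)):
                problems.append(f"{label}/{role} lists a routing control twice")
        # Both runbooks should cover the same controls, only in a different order.
        if set(over) != set(back):
            problems.append(
                f"{role}: failover and failback lists name different routing controls"
            )
    return problems
```

An empty list means none of these checks failed. It does not prove that the ordering itself matches your runbook, which still needs a human review.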

When all of the stacks are deployed, they create the following AWS resources.

  1. DynamoDB Global Tables
    • DrStack-ArcClusterEndpoints
      • This table holds a list of the five ARC cluster endpoints, each with its AWS Region.
    • DrStack-FailoverRoutingControls
      • This table contains the Control Panel ARN and the Routing Control names for the primary and standby Regions in a predefined order for the failover process.
    • DrStack-FailbackRoutingControls
      • This table contains the Control Panel ARN, and the Routing Control names for primary and standby Regions in a pre-defined order for the failback process.
  2. Step Functions state machines in the us-east-1 (N. Virginia) and us-west-2 (Oregon) Regions.

    Figure 4: Step Functions state machines in the us-east-1 (N. Virginia) and us-west-2 (Oregon) Regions

Orchestration

When the DR Automation step functions are ready, it’s time to test the DR orchestration flow. The step function state machines are designed to take the following steps in a specific order for DR failover and in the reverse order for DR failback.

  1. Stop accepting traffic in the primary Region (us-east-1) to prevent writes to the database while you’re failing it over.
  2. Fail over your database to the recovery Region (us-west-2) and make sure it’s ready to accept writes.
  3. Route user traffic to us-west-2, making it the new active Region.

Here are the corresponding Routing Control change sequences in Route 53 ARC for the DR failover scenario:

  • Primary Region – Routing Control change state sequences:
    • Turn off Routing Control of main cell – “arcblog-Cell1-us-east-1”.
    • Turn off Routing Controls of Sub AZ cells – “arcblog-Cell1A-us-east-1a”, “arcblog-Cell1B-us-east-1b”, and “arcblog-Cell1C-us-east-1c”.
    • Turn off Routing Control of RDS – “arcblog-Cell1Aurora-us-east-1”.
  • Secondary Region – Promote AWS resources:
    • Perform a global failover of the Aurora RDS cluster to the secondary Region.
  • Secondary Region – Routing Control change state sequences:
    • Turn on Routing Control of Amazon RDS – “arcblog-Cell2Aurora-us-west-2”.
    • Turn on Routing Controls of Sub AZ cells – “arcblog-Cell2A-us-west-2a”, “arcblog-Cell2B-us-west-2b”, and “arcblog-Cell2C-us-west-2c”.
    • Turn on Routing Control of main cell – “arcblog-Cell2-us-west-2”.
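The failover sequence above can be expressed as an ordered plan that a state machine or script walks through one step at a time. The sketch below is illustrative only: the function name and the tuple layout of each step are hypothetical, and the sample lists follow the order given in the sequence above.

```python
def build_switch_plan(routing_controls, leaving="primary", entering="secondary"):
    """Build the ordered list of DR actions for moving traffic between Regions.

    Each step is a tuple (action, target, state). Routing controls of the
    Region being left are switched off first, then the database is promoted
    in the Region being entered, then that Region's controls are switched on.
    """
    plan = [("set_routing_control", name, "Off")
            for name in routing_controls[leaving]]
    plan.append(("rds_global_failover", entering, None))
    plan.extend(("set_routing_control", name, "On")
                for name in routing_controls[entering])
    return plan


# Routing control names in the order described in the failover sequence.
failover_controls = {
    "primary": [
        "arcblog-Cell1-us-east-1",
        "arcblog-Cell1A-us-east-1a",
        "arcblog-Cell1B-us-east-1b",
        "arcblog-Cell1C-us-east-1c",
        "arcblog-Cell1Aurora-us-east-1",
    ],
    "secondary": [
        "arcblog-Cell2Aurora-us-west-2",
        "arcblog-Cell2A-us-west-2a",
        "arcblog-Cell2B-us-west-2b",
        "arcblog-Cell2C-us-west-2c",
        "arcblog-Cell2-us-west-2",
    ],
}
failover_plan = build_switch_plan(failover_controls)
```

Keeping the plan as plain data makes the order easy to review against the DR runbook. Under the same assumptions, a failback plan would swap the roles, using `leaving="secondary"` and `entering="primary"` with the failback lists.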

Deployment demo

Step-1: To switch user traffic to the standby Region, start an execution of the “DRRoutingControlStepFunction-fail_over” state machine from the Step Functions console in the secondary Region, as shown in the following figure.

Step-2: After the failover state machine execution completes, verify that the RDS primary cluster and writer instance have failed over to the us-west-2 secondary Region, as shown in the following figure.

Step-3: Also, verify the Routing Control states in the Route 53 ARC cluster console. All of the Routing Controls for the primary Region (us-east-1) should be turned OFF, and those for the standby Region (us-west-2) should be turned ON, as shown in the following figure.

Step-4: To revert user traffic to the primary Region, start an execution of the “DRRoutingControlStepFunction-fail_back” state machine from the Step Functions console, as shown in the following figure.

Step-5: Verify in the Amazon RDS console that the RDS primary cluster has switched back to the us-east-1 Region, as shown in the following figure.

Step-6: Verify that the Routing Controls for the standby Region (us-west-2) are turned OFF and those for the primary Region (us-east-1) are turned ON, as shown in the following figure.
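The console steps above could also be driven from a script, which is useful when the DR exercise itself is automated. The sketch below assumes the boto3 Step Functions operations `start_execution` and `describe_execution`; the function name, the empty default input, and the polling values are hypothetical, and the input that the sample state machines expect is not specified here.

```python
import json
import time


def run_state_machine(sfn, state_machine_arn, payload=None,
                      poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Start a state machine execution and wait for it to leave RUNNING.

    `sfn` is assumed to behave like a boto3 "stepfunctions" client. Returns the
    final status string (for example "SUCCEEDED" or "FAILED"), or "RUNNING" if
    the execution had not finished after `max_polls` checks.
    """
    started = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(payload or {}),  # assumed: the state machine accepts {}
    )
    execution_arn = started["executionArn"]
    for _ in range(max_polls):
        status = sfn.describe_execution(executionArn=execution_arn)["status"]
        if status != "RUNNING":
            return status
        sleep(poll_seconds)
    return "RUNNING"
```

For a failover, the client would be created in the standby Region so that starting the execution does not depend on the Region being evacuated.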

Note: The objective of this post is to automate the DR runbook events. The solution above doesn’t handle multi-Region data reconciliation. To minimize or avoid data loss during disaster recovery, we recommend implementing data consistency and replication design patterns for each individual AWS service, based on your RTO/RPO requirements.

Cleaning up

If you used the CloudFormation templates and AWS CDK that we provided to create AWS resources to follow along with this post, then we recommend that you delete them now to avoid future recurring charges.

Conclusion

In this post, we described the steps for orchestrating a DR strategy using AWS Step Functions with Amazon Route 53 ARC to fail over and fail back an Amazon RDS cluster quickly and reliably, and to minimize database downtime from a Regional impairment. This automation minimizes the need for manual intervention, provides greater control when testing the DR solution, and helps shorten the recovery time to meet business RTO/RPO requirements.

We hope this post provided you with guidance that you can use when orchestrating a DR resiliency strategy for your own environment. If your use case requires more customization, you can extend the child step function for additional AWS resources like Amazon ElastiCache, Amazon OpenSearch Service, and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Suren Raju

Suren Raju is a Sr Cloud Application Architect with Amazon Web Services, based out of Dallas. With over 17 years of industry experience, he is focused on accelerating the time from idea to well-architected solutions. He is passionate about learning new cloud technologies, and assists customers in building cloud adoption strategies, designing innovative solutions, and delivering operational excellence.

Varun Sharma

Varun Sharma is a Sr Lead Consultant working with the AWS Professional team. He helps customers and partners with cloud migrations and provides consulting solutions.

Karthik Balasubramanian

Karthik Balasubramanian is a Sr Cloud Application Architect with Amazon Web Services, based out of Dallas. He is a hands-on architect, helping customers design, build, and implement cloud-native architectures. He is particularly interested in building distributed and decentralized architectures. He is passionate about Kubernetes, observability, and resiliency.

March 27, 2024: The steps in the procedure for setting up the DR automation stack have been updated to incorporate additional details. The post was also edited to clarify recommendations regarding failover processes during disaster recovery events.