AWS Storage Blog

Creating a scalable disaster recovery plan with AWS Elastic Disaster Recovery

IT disruptions can occur for many reasons, including human error, weather, or a cyber attack. Enterprises need to have a solution in place that will get them up and running quickly with minimal downtime. When orchestrating disaster recovery at scale, it is important to automate recovery plans as much as possible. This allows for a faster recovery time objective (RTO) and also creates a repeatable, auditable process that can be documented and maintained.

AWS Elastic Disaster Recovery (DRS) is the recommended service for disaster recovery (DR) to AWS. Operated from the AWS Management Console, AWS Elastic Disaster Recovery helps you recover all of your applications and databases that run on supported Windows and Linux operating system versions. They then run natively on Amazon Elastic Compute Cloud (Amazon EC2) during a DR event or drill.

In this blog post, I demonstrate setting up a two-step automated recovery in DRS. I also provide the knowledge needed to build additional automation into your recovery, such as triggering an Amazon Simple Notification Service (Amazon SNS) notification to alert stakeholders of the recovery event.

Note: This post assumes you have a running DRS environment with at least two source servers protected. See the Getting Started section of the DRS documentation for instructions on setting this up.

Why automate your disaster recovery plan?

When performing a disaster recovery at scale, there are often servers that have dependencies on other servers in the environment.

For example:

  • Application servers that connect to a database on boot.
  • Servers that require authentication and need to connect to a domain controller on boot to start services.

Using the architecture described in this post, you can sequence your disaster recovery launch from a single API call that executes the state machine. As you will see, the servers in step 1 will be up and running before the servers in step 2 that depend on them. This ensures that services start as expected when recovering an environment with various dependencies.
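The sequencing idea can be sketched in a few lines of plain Python before any AWS wiring. This is a minimal illustration, assuming a hypothetical mapping of source server IDs to their tags; it is not part of the DRS API:

```python
def batches_by_launch_step(servers):
    """Group server IDs by their 'LaunchStep' tag, ordered by step number."""
    steps = {}
    for server_id, tags in servers.items():
        step = tags.get('LaunchStep')
        if step is not None:
            steps.setdefault(step, []).append(server_id)
    return [steps[k] for k in sorted(steps)]

# Hypothetical environment: the app server depends on the database being up
servers = {
    's-db01': {'LaunchStep': '1'},   # database: no dependencies
    's-dc01': {'LaunchStep': '1'},   # domain controller
    's-app01': {'LaunchStep': '2'},  # app server: needs the database up
}
print(batches_by_launch_step(servers))  # [['s-db01', 's-dc01'], ['s-app01']]
```

Each inner list is a batch that can be launched together; the next batch waits until the previous one is running. The architecture below implements exactly this ordering with Lambda and Step Functions.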

Automated recovery architecture

Automated recovery architecture workflow 

The architecture for this automated recovery includes:

  • An AWS Lambda function named launch_step_1 that launches all source servers tagged “LaunchStep:1”.
  • A Lambda function named launch_step_2 that launches all source servers tagged “LaunchStep:2”.
  • An AWS Step Functions state machine that orchestrates when to invoke these functions.

Lambda functions

In this architecture, Lambda functions call the DRS API to launch the recovery servers.

First, create the functions:

  • In the Lambda console, select Create function.
  • Select the radio button for Author from scratch then fill out the fields as follows:

Function Name: launch_step_1
Runtime: Python 3.9
Architecture: x86_64

  • You need to create an execution role with permissions to call the DRS API. Under Change default execution role, select Use an existing role, then choose that role. The role needs to have the AWS managed policy AWSElasticDisasterRecoveryConsoleFullAccess attached to it.
  • Repeat the above steps but name the second function “launch_step_2.”

At the time of writing this blog post, the DRS API is not included in the version of boto3 that the Lambda runtime bundles. Therefore, you need to create a Lambda layer that contains the latest version of boto3 and attach it to both functions.

Once the functions are created, populate them with code that calls the DRS API and launches the tagged source servers.

The code for launch_step_1:

import boto3

def lambda_handler(event, context):
    drs = boto3.client('drs')

    # Page through all source servers (no MaxItems cap, so environments
    # with more than 200 servers are fully covered)
    paginator = drs.get_paginator('describe_source_servers')
    response_iterator = paginator.paginate(
        filters={},
        PaginationConfig={
            'PageSize': 200
        }
    )
    server_items = []
    for page in response_iterator:
        server_items += page['items']

    # Map each source server ID to its tags
    server_tags = {s['sourceServerID']: s.get('tags', {}) for s in server_items}

    # Collect the servers tagged LaunchStep:1
    launch_step = [server_id for server_id, tags in server_tags.items()
                   if tags.get('LaunchStep') == '1']
    print('Launching', len(launch_step), 'servers')

    # Start a recovery for each server in this step
    for server_id in launch_step:
        drs.start_recovery(
            isDrill=False,
            sourceServers=[
                {'sourceServerID': server_id},
            ]
        )

The code for launch_step_2:

import boto3

def lambda_handler(event, context):
    drs = boto3.client('drs')

    # Page through all source servers (no MaxItems cap, so environments
    # with more than 200 servers are fully covered)
    paginator = drs.get_paginator('describe_source_servers')
    response_iterator = paginator.paginate(
        filters={},
        PaginationConfig={
            'PageSize': 200
        }
    )
    server_items = []
    for page in response_iterator:
        server_items += page['items']

    # Map each source server ID to its tags
    server_tags = {s['sourceServerID']: s.get('tags', {}) for s in server_items}

    # Collect the servers tagged LaunchStep:2
    launch_step = [server_id for server_id, tags in server_tags.items()
                   if tags.get('LaunchStep') == '2']
    print('Launching', len(launch_step), 'servers')

    # Start a recovery for each server in this step
    for server_id in launch_step:
        drs.start_recovery(
            isDrill=False,
            sourceServers=[
                {'sourceServerID': server_id},
            ]
        )

After adding the code to the functions, select Deploy to save the changes.

Tagging DRS source servers

The functions you created launch any source server tagged “LaunchStep:1” or “LaunchStep:2,” respectively. To tag your servers, navigate to the DRS Source servers page.

  • Navigate to any server that you want to launch in step 1, then select Tags > Manage tags.
  • Enter LaunchStep in the Key field and 1 in the Value field.
  • Repeat for each server that you want to launch in step 2, entering LaunchStep as the key and 2 as the value.

Manage tags example

Creating a state machine with AWS Step Functions

To plan your recovery, orchestrate your launches using a state machine created by AWS Step Functions. Step Functions is a serverless orchestration service that lets you combine AWS Lambda functions and other AWS services to build business-critical applications. Through Step Functions’ graphical console, you see your application’s workflow as a series of event-driven steps.

The state machine will perform the following:

  • Invoke launch_step_1 Lambda function
  • Wait 300 seconds
  • Invoke launch_step_2 Lambda function

The state machine can be built in the visual Workflow Studio to look like this:

State machine code created through the visual workflow studio

Alternatively, you can copy the following state machine definition and paste it into the JSON editor to generate the state machine. Be sure to replace $ACCOUNTID with your account ID and $REGION with the Region where you created the Lambda functions.

{
    "Comment": "A description of my state machine",
    "StartAt": "Lambda Invoke",
    "States": {
      "Lambda Invoke": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "OutputPath": "$.Payload",
        "Parameters": {
          "Payload.$": "$",
          "FunctionName": "arn:aws:lambda:$REGION:$ACCOUNTID:function:launch_step_1:$LATEST"
        },
        "Retry": [
          {
            "ErrorEquals": [
              "Lambda.ServiceException",
              "Lambda.AWSLambdaException",
              "Lambda.SdkClientException"
            ],
            "IntervalSeconds": 2,
            "MaxAttempts": 6,
            "BackoffRate": 2
          }
        ],
        "Next": "Wait"
      },
      "Wait": {
        "Type": "Wait",
        "Seconds": 300,
        "Next": "Lambda Invoke (1)"
      },
      "Lambda Invoke (1)": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "OutputPath": "$.Payload",
        "Parameters": {
          "FunctionName": "arn:aws:lambda:$REGION:$ACCOUNTID:function:launch_step_2:$LATEST",
          "Payload": {
            "sampleKey1": "test"
          }
        },
        "Retry": [
          {
            "ErrorEquals": [
              "Lambda.ServiceException",
              "Lambda.AWSLambdaException",
              "Lambda.SdkClientException"
            ],
            "IntervalSeconds": 2,
            "MaxAttempts": 6,
            "BackoffRate": 2
          }
        ],
        "End": true
      }
    }
  }

Once the state machine is executed, the servers tagged for launch step 1 recover first. After the five-minute wait, the servers tagged for launch step 2 recover.

Next steps

The state machine created in this blog is a simple two-step machine. There are many possibilities that can be built on top of this architecture to assist with an at-scale recovery. A few recommendations to explore:

  • Create a second recovery plan that launches the machines as a drill instead of a recovery. This would allow for a custom recovery plan for drills that differs from an actual recovery.
  • Add a step between Lambda Invoke (1) and End where you publish an SNS notification. This would alert necessary stakeholders that a DR plan was completed as well as the status of the recovery (success or failure).
  • Add a step that runs AWS Systems Manager GetInventory after recovery to confirm that failed-over machines are ready to be managed via Systems Manager.
  • Add an Amazon Route 53 step that creates a new hosted zone for the newly failed-over instances to allow traffic flow via DNS.

Cleaning up

To clean up all resources created in this blog post, be sure to delete the two Lambda functions launch_step_1 and launch_step_2 as well as the state machine that invokes them.

Conclusion

In the event of a disaster, it’s important to have a solution in place that will get you up and running quickly with minimal downtime.

Combining DRS API calls with AWS Step Functions state machines allows you to create an automated recovery strategy that can be maintained and executed by anyone with the necessary permissions. This makes your DR strategy repeatable and auditable by the necessary stakeholders. The provided architecture can serve as a building block for additional customization in your DR plan.

If you have any comments or questions, feel free to leave them in the comments section.