AWS Storage Blog

Automating disaster recovery of Amazon RDS and Amazon EC2 instances

Complex environments can sometimes feel like they require complex disaster recovery (DR) solutions, which usually consist of multiple DR offerings from different vendors that may not interact with each other. There are many ways to build a DR solution in the cloud. Luckily, with AWS, you can easily configure multiple DR services and orchestrate them centrally using native AWS services.

In this post, we discuss how to use multiple AWS services to create a DR plan based on Amazon Aurora MySQL built on Amazon Relational Database Service (RDS) and server workloads running on Amazon Elastic Compute Cloud (EC2). We use a mixture of service-based tools, as well as custom scripting, to build out an entire server workload DR solution. With the solution in this post, you can automate your DR solution to swiftly recover your RDS and EC2 instances in the event of a technical disaster, which may help with business continuity goals or even compliance requirements.

Solution overview

We use the built-in cross-Region read replicas of Amazon Aurora MySQL on Amazon RDS to replicate the database to the target Region. For the Amazon EC2 workloads, we use AWS Elastic Disaster Recovery (DRS) to replicate the data to that same target Region. To run a drill or a failover for these services, we use AWS Lambda functions, invoked by AWS Step Functions, to allow a single-button failover of both services.

General architecture diagram for AWS Elastic Disaster Recovery with RDS cross-Region replication configured. The figure also shows the AWS Step Functions and Amazon Simple Notification Service services.

Walkthrough

We create an AWS Step Functions state machine that promotes our RDS cross-Region read replica database to a primary. We validate the promotion, and then launch our AWS DRS instances connected to that primary. We have Amazon Simple Notification Service (SNS) topics configured to send updates to the teams that are monitoring the DR process, keeping them informed.
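The ordering described above can be sketched as plain Python control flow. This is only an illustration of the sequence, not the state machine itself: `run_failover` and its callable parameters are hypothetical names, and the poll delay and poll limit are assumptions that this post does not specify.

```python
import time

def run_failover(promote, check_status, launch_drs, notify,
                 poll_seconds=15, max_polls=60, sleep=time.sleep):
    """Run the DR steps in order; every callable is supplied by the caller."""
    # Step 1: promote the cross-Region read replica and tell subscribers.
    notify("rds-crrr-promoted", promote())

    # Step 2: poll until the promoted cluster reports "available".
    for _ in range(max_polls):
        if check_status() == "available":
            break
        sleep(poll_seconds)
    else:
        raise TimeoutError("cluster did not become available")

    # Step 3: launch the AWS DRS recovery instances and tell subscribers.
    notify("drs-invoked", launch_drs())
```

In the walkthrough, each callable corresponds to a Lambda function or an SNS publish step, so a sketch like this is mainly useful for reasoning about the order of operations.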

Throughout this post, there are two formats for the custom entries you make. Italics mark an option that is chosen (such as from a drop-down list), and the inline code format marks what needs to be typed in a text field.

In this post, we perform the following tasks:

  1. Create SNS Topics
  2. Create Lambda Functions
  3. Create EventBridge rule to update SNS topics
  4. Create Step Function Machines
  5. Kick off machine and validate failover

Prerequisites

For this walkthrough, you should have the following prerequisites configured in your DR target Region:

Create SNS Topics

Screenshot of Amazon SNS Topics

We create two Amazon SNS topics, which notify any teams subscribed to them when each step is kicked off. This keeps those teams up to date while the state machine runs.

1. Navigate to Amazon Simple Notification Service.

2. Choose Create Topic.

a. Under Details and Type choose

b. Under Name enter a name for this topic. (we chose drs-invoked)

c. (Optional) – Enter a display name for text messages to mobile devices.

i. Note: As of June 1, 2021, US telecom providers no longer support person-to-person long codes for application-to-person communications. Please see here for more information.

d. (Optional) – For Tags enter a key/value pair for easy identification later.

3. Select Create topic.

4. Once the topic is created, select the drs-invoked SNS topic from the list.

a. Take note of the topic ARN because you use it in a script later in the post.

b. Choose Create subscription. For Protocol choose Email.

5. Repeat steps 1 through 4 for the promotion of the Amazon RDS cross-Region read replica.

a. We chose to name this topic rds-crrr-promoted.

b. Take note of the topic ARN because you use it in a script later in the post.
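The console steps above could also be scripted. The following is a minimal sketch, assuming a boto3 SNS client is passed in; `create_topic_with_email` and the example email address are hypothetical names used only for illustration.

```python
def create_topic_with_email(sns, name, email):
    """Create an SNS topic and subscribe one email address to it.

    `sns` is expected to be a boto3 SNS client, for example
    boto3.client("sns") created in the DR target Region.
    """
    topic_arn = sns.create_topic(Name=name)["TopicArn"]
    # Email subscriptions stay pending until the recipient confirms them.
    sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint=email)
    return topic_arn

# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   sns = boto3.client("sns")
#   for name in ("drs-invoked", "rds-crrr-promoted"):
#       print(create_topic_with_email(sns, name, "dr-team@example.com"))
```

The returned value is the topic ARN that the state machine definition later in this post refers to.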

Create Lambda Functions

Shows 3 lambda functions configured in the blog

We are going to create three different AWS Lambda functions to be invoked by the step function machine, as well as an AWS Identity and Access Management (IAM) role that allows these functions to operate. Each function has a specific task that will run in order.

  • rds_failover – This promotes the RDS cross Region replication instance into a primary.
  • rds_status_check – This validates that the promotion to primary has succeeded.
  • drs_failover – This invokes AWS DRS failover, launching Amazon EC2 instances.

First, create the IAM role that gives the AWS Lambda functions the permissions they need to call Amazon RDS and AWS DRS.

1. Navigate to the AWS IAM dashboard.

2. Select Roles from the left-hand options.

3. Choose Create role.

a. Under Trusted entity type choose AWS Service.

b. Under Use case choose Lambda.

4. Choose Next.

5. Under Permissions policies add the following permissions:

a. AWSElasticDisasterRecoveryConsoleFullAccess

b. AmazonRDSFullAccess

6. Choose Next.

7. Under Role details and Role name enter a name for this role.

a. In this instance, we named it rds-drs-failover-role.

8. Choose Create role.
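For readers who prefer scripting, the role from the steps above might be created as follows. This is a sketch under stated assumptions: `create_failover_role` is a hypothetical helper, `iam` is assumed to be a boto3 IAM client, and the `arn:aws:iam::aws:policy/` prefix for the two managed policies is an assumption to verify in your account.

```python
import json

# Trust policy that lets the Lambda service assume the role.
LAMBDA_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Managed policies named in the steps above; the ARN prefix is an assumption.
MANAGED_POLICIES = [
    "AWSElasticDisasterRecoveryConsoleFullAccess",
    "AmazonRDSFullAccess",
]

def create_failover_role(iam, role_name="rds-drs-failover-role"):
    """Create the Lambda execution role and attach the managed policies."""
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(LAMBDA_TRUST_POLICY),
    )
    for policy in MANAGED_POLICIES:
        iam.attach_role_policy(
            RoleName=role_name,
            PolicyArn="arn:aws:iam::aws:policy/" + policy,
        )
    return role["Role"]["Arn"]
```

The trust policy corresponds to choosing AWS Service and Lambda in the console.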

Now create the AWS Lambda functions, and add this role to them.

1. Navigate to the AWS Lambda dashboard.

2. Choose Create function.

a. Choose Author from scratch.

b. Under Basic information and for Function name input the name of the function.

i. Name the first one rds_failover.

c. Under Runtime choose Python 3.9.

d. Under Permissions choose Use an existing role.

i. Choose the rds-drs-failover-role role.

3. Leave the rest of the options as default and choose Create function.

a. On the next page you can see the configuration of the rds_failover function.

b. Take note of the Function ARN as you use that in a script later in the post.

c. Under Code source, copy the following code into the environment.

i. Ensure that the value of secondary is the name of your Aurora cross-Region read replica cluster. For this post, it is rds-drs-crrr-cluster-1.

import boto3

rds = boto3.client('rds')

# Cluster identifier of the Aurora cross-Region read replica to promote
secondary = "rds-drs-crrr-cluster-1"

def lambda_handler(event, context):
    # Promote the read replica cluster to a standalone primary cluster
    rds.promote_read_replica_db_cluster(
        DBClusterIdentifier=secondary
    )
    response = "Promoting {} to primary".format(secondary)
    return response

d. Now choose Deploy.

4. Repeat steps 1 through 3 and name the function rds_status_check.

a. Under Code source, copy the following code into the environment:

import boto3

rds = boto3.client('rds')

def lambda_handler(event, context):
    # Look up the promoted cluster and read its current status
    response = rds.describe_db_clusters(
        DBClusterIdentifier='rds-drs-crrr-cluster-1'
    )
    status = response['DBClusters'][0]['Status']
    # The state machine's Choice state reads this "Status" key
    responseJSON = {"Status": status}
    return responseJSON

b. Choose Deploy.

c. Take note of the Function ARN because you use that in a script later in the post.

5. Repeat steps 1 through 3 and name the function drs_failover.

a. Take note of the Function ARN because you use that in a script later in the post.

b. Under Code source, copy the following code into the environment:

import boto3

drs = boto3.client('drs')

def lambda_handler(event, context):
    # Describe all source servers (MaxItems caps the result at 200 servers)
    paginator = drs.get_paginator('describe_source_servers')
    response_iterator = paginator.paginate(
        filters={},
        maxResults=200,
        PaginationConfig={
            'MaxItems': 200,
            'PageSize': 200
        }
    )

    # Make a list of all source server IDs
    serverList = []
    for page in response_iterator:
        for server in page.get('items', []):
            serverList.append(server['sourceServerID'])

    # Fail over each source server. isDrill=False launches a real recovery;
    # set it to True to run a drill instead.
    for serverID in serverList:
        drs.start_recovery(
            isDrill=False,
            sourceServers=[
                {
                    'sourceServerID': serverID
                },
            ]
        )

    # Return a response that tells the user how many servers are failing over
    response = "failing over {} servers to secondary region".format(len(serverList))
    return response

c. Choose Deploy.

d. Take note of the Function ARN because you use that in a script later in the post.
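The drs_failover function above starts one recovery job per source server. Because the `sourceServers` parameter of `start_recovery` is a list, one alternative is to group servers into batches and start one job per batch. The helpers below are a hypothetical sketch of that grouping only; `collect_server_ids`, `build_recovery_batches`, and the batch size are illustrative assumptions, and the maximum list length accepted by the service should be checked before relying on it.

```python
def collect_server_ids(pages):
    """Flatten paginated describe_source_servers pages into a list of IDs."""
    return [
        item["sourceServerID"]
        for page in pages
        for item in page.get("items", [])
    ]

def build_recovery_batches(server_ids, batch_size):
    """Group server IDs into sourceServers lists, one list per recovery job."""
    return [
        [{"sourceServerID": sid} for sid in server_ids[i:i + batch_size]]
        for i in range(0, len(server_ids), batch_size)
    ]

# Example usage inside a handler (requires boto3 and AWS credentials):
#   ids = collect_server_ids(response_iterator)
#   for batch in build_recovery_batches(ids, batch_size=50):
#       drs.start_recovery(isDrill=True, sourceServers=batch)
```

Whether one job per server or one job per batch suits your runbook better is a judgment call; fewer jobs are easier to track, while separate jobs isolate failures.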

Create Amazon EventBridge Rule

Create an Amazon EventBridge rule that publishes to the Amazon SNS topic created previously when AWS DRS launches the recovery servers during the AWS Step Functions run.

  1. Navigate to the Amazon EventBridge console.
  2. Under Create a new rule choose Create rule.

a. Under Rule detail provide a name (in our instance, we use drs-notify).

b. Choose Next.

c. Scroll down to Event pattern, and choose the following:

i. Event source is AWS events or EventBridge partner events.

ii. AWS service is Elastic Disaster Recovery.

iii. Event type is DRS Source Server Launch Result.

d. Choose Next.

e. Under Target 1, choose the following:

i. Target types should be AWS service.

ii. Select a target should be SNS topic.

iii. Topic should be the SNS topic we created earlier. (drs-invoked for our purposes)

iv. Choose Next and Next again on the following page.

v. Choose Create rule.
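The rule configured above can be expressed as an event pattern. The sketch below assumes a boto3 EventBridge client is passed in; `create_notify_rule` and the target ID are hypothetical, and the `aws.drs` source value is an assumption, while the detail type is the event type chosen in the console steps.

```python
import json

def build_drs_launch_pattern():
    """Event pattern for AWS DRS launch results (source value is assumed)."""
    return {
        "source": ["aws.drs"],
        "detail-type": ["DRS Source Server Launch Result"],
    }

def create_notify_rule(events, topic_arn, rule_name="drs-notify"):
    """Create the rule and point it at the SNS topic."""
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_drs_launch_pattern()),
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "drs-invoked-topic", "Arn": topic_arn}],
    )
    return rule_name
```

When a rule is created outside the console, the topic's access policy may also need to allow EventBridge to publish to it.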

Create AWS Step Functions State Machine

Shows the configuration of AWS Step Function state machine created in the blog

The AWS Step Functions state machine ties together all of the created resources by invoking the AWS Lambda functions. It validates that each function has run properly before moving to the next step.
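The validation step comes down to one decision: keep checking until the promoted cluster reports available, then move on. A minimal sketch of that decision follows. The function names are hypothetical, the state names are the ones used in the definition in this post, and the sample response only mimics the shape of a `describe_db_clusters` result.

```python
def extract_status(describe_response):
    """Pull the cluster status out of a describe_db_clusters-style response."""
    return {"Status": describe_response["DBClusters"][0]["Status"]}

def next_state(status_output):
    """Mirror the Choice state: anything other than "available" loops back."""
    if status_output["Status"] != "available":
        return "Check RDS Status"
    return "DRS Failover"

# Hypothetical sample response, trimmed to the one field that matters here.
sample = {"DBClusters": [{"Status": "modifying"}]}
print(next_state(extract_status(sample)))
```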

  1. Navigate to the AWS Step Functions dashboard.
  2. In the left-hand menu, choose State machines.
  3. Choose Create state machine.

a. Choose Write your workflow in code.

b. Under Definition provide the following code:

i. Note that the following lines must be changed to match your environment, with the ARNs created in your environment.

        1. "FunctionName": "arn:aws:lambda:$REGION$:123456789012:function:rds_failover:$LATEST"
        2. "TopicArn": "arn:aws:sns:$REGION$:123456789012:rds-crrr-promoted"
        3. "FunctionName": "arn:aws:lambda:$REGION$:123456789012:function:rds_status_check:$LATEST"
        4. "FunctionName": "arn:aws:lambda:$REGION$:123456789012:function:drs_failover:$LATEST"
        5. "TopicArn": "arn:aws:sns:$REGION$:123456789012:drs-invoked"

{
  "Comment": "A description of my state machine",
  "StartAt": "Failover RDS",
  "States": {
    "Failover RDS": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "OutputPath": "$.Payload",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "arn:aws:lambda:$REGION$:123456789012:function:rds_failover:$LATEST"
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException"
          ],
          "IntervalSeconds": 2,
          "MaxAttempts": 6,
          "BackoffRate": 2
        }
      ],
      "Next": "Notify RDS Failover"
    },
    "Notify RDS Failover": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "Message.$": "$",
        "TopicArn": "arn:aws:sns:$REGION$:123456789012:rds-crrr-promoted"
      },
      "Next": "Check RDS Status"
    },
    "Check RDS Status": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "OutputPath": "$.Payload",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "arn:aws:lambda:$REGION$:123456789012:function:rds_status_check:$LATEST"
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException"
          ],
          "IntervalSeconds": 2,
          "MaxAttempts": 6,
          "BackoffRate": 2
        }
      ],
      "Next": "Choice"
    },
    "Choice": {
      "Type": "Choice",
      "Choices": [
        {
          "Not": {
            "Variable": "$.Status",
            "StringMatches": "available"
          },
          "Next": "Check RDS Status",
          "Comment": "Go back and check the status again if it's not available."
        }
      ],
      "Default": "DRS Failover"
    },
    "DRS Failover": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "OutputPath": "$.Payload",
      "Parameters": {
        "Payload.$": "$",
        "FunctionName": "arn:aws:lambda:$REGION$:123456789012:function:drs_failover:$LATEST"
      },
      "Retry": [
        {
          "ErrorEquals": [
            "Lambda.ServiceException",
            "Lambda.AWSLambdaException",
            "Lambda.SdkClientException"
          ],
          "IntervalSeconds": 2,
          "MaxAttempts": 6,
          "BackoffRate": 2
        }
      ],
      "Next": "SNS Publish"
    },
    "SNS Publish": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "Message.$": "$",
        "TopicArn": "arn:aws:sns:$REGION$:123456789012:drs-invoked"
      },
      "End": true
    }
  }
}

c. Under State machine name, provide a name for this machine.

d. Under Permissions, choose Create new role.

i. This new role is automatically generated based on the requirements of the script provided in the last step.

e. Select Create state machine.
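The placeholder lines called out in step 3.b can be filled in with a small helper before the definition is pasted into the console. This is a sketch only: `fill_definition` is a hypothetical function, and it assumes the definition uses the `$REGION$` marker and the sample account ID `123456789012` exactly as shown in this post.

```python
import json

def fill_definition(template, region, account_id):
    """Replace the $REGION$ and sample account ID placeholders."""
    text = template.replace("$REGION$", region)
    text = text.replace("123456789012", account_id)
    # Parse the result so a broken edit is caught before deployment.
    json.loads(text)
    return text

# Example with a one-line fragment of the definition:
fragment = '{"TopicArn": "arn:aws:sns:$REGION$:123456789012:drs-invoked"}'
print(fill_definition(fragment, "us-west-2", "111122223333"))
```

The `$LATEST` qualifier in the function ARNs is left untouched because only the exact `$REGION$` marker is replaced.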

Invoke State Machine and Validate Steps

For the final steps, invoke the state machine, and validate that it runs through all the processes correctly.

  1. Navigate to the AWS Step Functions console.
  2. Select the State Machine created in the previous step.

a. Choose Start execution on the following page.

b. Choose Start execution again on the following page.

  3. You are now on the Execution page where you can watch as the state machine runs through the process.

a. You can validate that the process has been completed by viewing the Graph inspector. Here you can see if the process has failed or completed successfully.

i. If there is a failure, you can select the red process, and get more information on the exception.
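The same start-and-watch flow could be driven from a script instead of the console. The following is a minimal sketch, assuming a boto3 Step Functions client is passed in; `run_and_wait` and its poll interval are hypothetical choices, not part of the walkthrough.

```python
import time

# Statuses after which an execution no longer changes.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}

def run_and_wait(sfn, state_machine_arn, poll_seconds=10, sleep=time.sleep):
    """Start an execution and poll until it reaches a terminal status."""
    execution_arn = sfn.start_execution(
        stateMachineArn=state_machine_arn,
        input="{}",
    )["executionArn"]
    while True:
        status = sfn.describe_execution(executionArn=execution_arn)["status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)

# Example usage (requires boto3 and AWS credentials):
#   import boto3
#   sfn = boto3.client("stepfunctions")
#   print(run_and_wait(sfn, "<your state machine ARN>"))
```

A result other than SUCCEEDED is the cue to open the execution in the console and inspect the failed step.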

Cleaning up

To avoid incurring future charges, delete the resources created in this post.

Conclusion

Having a complex disaster recovery solution with multiple services can be difficult to manage and invoke during a drill or actual disaster. In this post, we showed you how to automate the DR of both Amazon RDS and Amazon EC2 instances protected with AWS DRS into a single button press. This helps you reduce downtime for your multi-tier applications and control the failover from one place. By following along with this post, you can add to your wider DR playbook to help protect your managed and unmanaged Amazon EC2 instances in case of disaster events.

Thank you for reading this post. If you have any comments or questions, you can add them in the comments section below.

Daniel Covey

Daniel Covey is a Solutions Architect with AWS who has spent the last 8 years helping customers protect their workloads during a Disaster. He has worked with CloudEndure before and after the acquisition by AWS, and continues to offer guidance to customers who want to ensure their data is safe from ransomware and disasters.

Kevin Lewin

Kevin is a Cloud Operations Specialist Solution Architect at Amazon Web Services. He focuses on helping customers achieve their operational goals through observability and automation.