AWS Partner Network (APN) Blog

How Onica’s Elastic Engineering Team Automated Disaster Recovery for Amazon RDS Instances

By Oliver Fletcher, Cloud Architect, Elastic Engineering – Onica, a Rackspace Technology Company


As organizations move to Amazon Web Services (AWS), having an effective disaster recovery (DR) strategy in place to manage outages is paramount.

There are a number of strategies you can adopt to ensure business continuity. To select the DR strategy best suited to their AWS workloads, organizations should first understand the Recovery Point Objective (RPO), the maximum acceptable amount of data loss measured in time, and the Recovery Time Objective (RTO), the maximum acceptable time to restore service, that each workload requires.

By establishing a common understanding of RTO and RPO requirements, organizations can design disaster recovery solutions that adequately meet them.

In this post, I will discuss how Onica’s Elastic Engineering team co-created a solution with a client that leverages serverless architecture and enables an automated backup and restore for Amazon Relational Database Service (Amazon RDS) instances.

A backup and restore DR strategy will back up your data using point-in-time backups to a DR location and restore this data when necessary to recover from a disaster event.

Onica, a Rackspace Technology company, is an AWS Premier Consulting Partner with multiple AWS Competencies and the Amazon RDS Service Delivery validation. Onica is also a member of the AWS Managed Service Provider (MSP) and Well-Architected Partner programs.

Adopt a DR Strategy While Reducing Operational Overheads

Effectively managing a backup and restore strategy can be burdensome, requiring manual input and additional toil for engineers. Through the use of automation, serverless architecture, and proactive monitoring, organizations can reduce operational overhead and focus on unlocking trapped value.

At a high level, the solution developed by Onica’s Elastic Engineering team leverages the RDS instance snapshot feature to create a point-in-time snapshot of an RDS instance in a primary AWS region.

The snapshot is then copied to the DR region and can be restored when a disaster recovery event occurs. To ensure that proliferation of snapshots does not occur, the solution manages the deletion of snapshots in the primary region once copied, and then retains those in the DR region for a set number of retention days only.

The solution uses AWS Lambda functions written in Python 3 that call the AWS SDK for Python (Boto3) to automate the aforementioned tasks.

The solution also incorporates proactive monitoring and alerting so engineers can monitor the backup processes and gain insight into when a DR event occurs in their primary AWS region. This is enabled through the use of Amazon CloudWatch, Amazon Simple Notification Service (SNS), and RDS event subscriptions.

Through the use of AWS CloudFormation and open source tools such as Runway, Stacker, and the Serverless Framework, the solution leverages infrastructure as code (IaC) so it can be treated as application code through source control and deployed in a repeatable fashion.

Solution Workflow and Architecture

The RDS Backup Lambda takes a snapshot of the RDS instance in the primary AWS region based on a cron schedule. The RDS Backup Copy Lambda then copies the RDS instance snapshot to the DR region.

Next, the RDS Backup Cleanup Lambda deletes the snapshot in the primary AWS region based on a schedule, and then deletes the snapshot in the DR region based on a cron schedule and a retention period (number of days).

Finally, the RDS Backup Restore Lambda enables the restore of an RDS instance in the DR region when a disaster recovery event occurs.


Figure 1 – Solution Architecture.

AWS Lambda

AWS Lambda enables organizations to run their code without having to provision and manage the underlying infrastructure required to host their code base for applications.

This design decision helps to reduce the infrastructure footprint of a legacy backup and restore solution, and thus reduces operational overhead and hosting costs. The sections below detail the Lambda functions that enable Onica’s solution.

RDS Backup

The RDS Backup function leverages the Boto3 create_db_snapshot method to create the RDS snapshot.

Each snapshot is created using a naming convention that includes the prefix lambda-dr-snapshot, the RDS instance name, and the date the snapshot is taken. The rds-backup-copy function uses this naming convention to find the snapshot when it runs.

The table below outlines the environment variable that must be defined within the serverless.yml file.

Environment Variable    Description
instances               The names of the RDS instances (e.g. instance0, instance1, instance2)


Figure 2 – RDS Backup Lambda flowchart.
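The snapshot step described above can be sketched as follows. This is a minimal illustration rather than the team's actual handler: the exact layout of the snapshot identifier, the comma-separated format of the instances variable, and the helper names are all assumptions made for the example.

```python
import datetime
import os

SNAPSHOT_PREFIX = "lambda-dr-snapshot"  # prefix named in the post


def snapshot_identifier(instance_name, when):
    """Build '<prefix>-<instance>-<YYYY-MM-DD>' (the exact layout is an assumption)."""
    return f"{SNAPSHOT_PREFIX}-{instance_name}-{when:%Y-%m-%d}"


def parse_instances(raw):
    """Split a comma-separated 'instances' value, ignoring blanks and whitespace."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def create_snapshots(rds_client, instance_names, when):
    """Request one manual snapshot per instance and return the identifiers used."""
    created = []
    for name in instance_names:
        snapshot_id = snapshot_identifier(name, when)
        rds_client.create_db_snapshot(
            DBSnapshotIdentifier=snapshot_id,
            DBInstanceIdentifier=name,
        )
        created.append(snapshot_id)
    return created


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be exercised without AWS

    rds = boto3.client("rds")
    today = datetime.datetime.now(datetime.timezone.utc).date()
    return create_snapshots(rds, parse_instances(os.environ["instances"]), today)
```

Because the date is the only varying part of the identifier in this sketch, it supports at most one snapshot per instance per day; a tighter schedule would need a timestamp with finer resolution in the name.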

RDS Backup Copy

The RDS Backup Copy function leverages the Boto3 SDK to copy the RDS instance snapshots created by the RDS Backup Lambda from the primary region to the DR region. RDS keeps snapshots in Amazon Simple Storage Service (Amazon S3) storage that the service manages on your behalf, so there are no buckets for you to configure. The copy is done by using the copy_db_snapshot method from the Boto3 SDK.

The naming convention for the snapshot that’s created in the DR region follows the same convention as the RDS Backup Lambda. This Lambda also requires a number of environment variables to be set within the serverless.yml file.

Environment Variable    Description
instances               The names of the RDS instances (e.g. instance0, instance1, instance2)
primary_region          The primary AWS region (e.g. us-west-2)
dr_region               The AWS region used for disaster recovery (e.g. us-east-2)
kms_arn                 The ARN for the KMS key in the DR region; this is output when the KMS key is created


Figure 3 – RDS Backup Copy Lambda flowchart.
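A cross-region copy along these lines might look like the sketch below. It is an illustration under assumptions, not the published handler: the helper names are hypothetical, the newest snapshot is chosen by creation time, and pagination of describe_db_snapshots is ignored. The copy_db_snapshot call is issued against the DR region and identifies the source snapshot by its ARN.

```python
import os

SNAPSHOT_PREFIX = "lambda-dr-snapshot"  # prefix named in the post


def latest_dr_snapshot(snapshots):
    """Return the newest completed snapshot that follows the naming convention, or None."""
    candidates = [
        s for s in snapshots
        if s["DBSnapshotIdentifier"].startswith(SNAPSHOT_PREFIX)
        and s.get("Status") == "available"
    ]
    return max(candidates, key=lambda s: s["SnapshotCreateTime"], default=None)


def copy_to_dr(primary_rds, dr_rds, instance_name, primary_region, kms_arn):
    """Copy the newest snapshot of one instance into the DR region, encrypted with kms_arn."""
    response = primary_rds.describe_db_snapshots(
        DBInstanceIdentifier=instance_name, SnapshotType="manual"
    )
    source = latest_dr_snapshot(response["DBSnapshots"])
    if source is None:
        return None  # nothing to copy yet
    dr_rds.copy_db_snapshot(
        SourceDBSnapshotIdentifier=source["DBSnapshotArn"],
        TargetDBSnapshotIdentifier=source["DBSnapshotIdentifier"],  # same name in the DR region
        KmsKeyId=kms_arn,
        SourceRegion=primary_region,
    )
    return source["DBSnapshotIdentifier"]


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be exercised without AWS

    primary_region = os.environ["primary_region"]
    primary = boto3.client("rds", region_name=primary_region)
    dr = boto3.client("rds", region_name=os.environ["dr_region"])
    names = [n.strip() for n in os.environ["instances"].split(",") if n.strip()]
    return [copy_to_dr(primary, dr, n, primary_region, os.environ["kms_arn"]) for n in names]
```

Passing SourceRegion lets Boto3 generate the pre-signed URL that a cross-region copy requires, which keeps the handler short.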

RDS Backup Cleanup

The RDS Backup Cleanup function leverages the Boto3 SDK to delete the RDS instance snapshots created by the RDS Backup Lambda from both the primary region and the DR region. This is done by using the delete_db_snapshot method from the Boto3 SDK.

The RDS snapshots in the primary region are deleted after the RDS Backup Copy function has run. The snapshots that reside in the DR region are retained for the configured number of retention days and then deleted.


Figure 4 – RDS Backup Cleanup Lambda flowchart.

This Lambda also requires a number of environment variables to be set within the serverless.yml file.

Environment Variable    Description
instances               The names of the RDS instances (e.g. instance0, instance1, instance2)
dr_region               The AWS region used for disaster recovery (e.g. us-east-2)
retention_days          The number of days to retain the snapshots (e.g. 14)
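The retention logic for the DR region can be sketched roughly as below. The helper names are hypothetical, and the sketch assumes that only snapshots carrying the lambda-dr-snapshot prefix should ever be deleted; the actual handler may select snapshots differently.

```python
import datetime
import os

SNAPSHOT_PREFIX = "lambda-dr-snapshot"  # prefix named in the post


def expired_snapshot_ids(snapshots, retention_days, now):
    """Return identifiers of prefixed snapshots created more than retention_days before now."""
    cutoff = now - datetime.timedelta(days=retention_days)
    return [
        s["DBSnapshotIdentifier"]
        for s in snapshots
        if s["DBSnapshotIdentifier"].startswith(SNAPSHOT_PREFIX)
        and "SnapshotCreateTime" in s  # absent while a snapshot is still being created
        and s["SnapshotCreateTime"] < cutoff
    ]


def cleanup_dr_region(dr_rds, instance_name, retention_days, now):
    """Delete expired snapshots of one instance in the DR region and return their identifiers."""
    response = dr_rds.describe_db_snapshots(
        DBInstanceIdentifier=instance_name, SnapshotType="manual"
    )
    expired = expired_snapshot_ids(response["DBSnapshots"], retention_days, now)
    for snapshot_id in expired:
        dr_rds.delete_db_snapshot(DBSnapshotIdentifier=snapshot_id)
    return expired


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be exercised without AWS

    dr = boto3.client("rds", region_name=os.environ["dr_region"])
    now = datetime.datetime.now(datetime.timezone.utc)  # Boto3 returns timezone-aware timestamps
    days = int(os.environ["retention_days"])
    names = [n.strip() for n in os.environ["instances"].split(",") if n.strip()]
    return {n: cleanup_dr_region(dr, n, days, now) for n in names}
```

Filtering on the prefix keeps the function from touching manual snapshots that engineers created for other purposes.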

RDS Backup Restore

The RDS Backup Restore function leverages the Boto3 SDK to restore a snapshot, originally created by the RDS Backup Lambda and copied to the DR region, to a new RDS instance in the DR region. This is done by using the restore_db_instance_from_db_snapshot method from the Boto3 SDK.

Several environment variables must be set before the restore can be completed, and the DB Subnet Group must be created before the function is run. The environment variables that need to be defined within the serverless.yml file are detailed below.


Figure 5 – RDS Backup Restore Lambda flowchart.

IMPORTANT NOTE!

  • If a Multi-AZ deployment is not required for the RDS instance being restored, the multi-az variable found in the handler will need to be changed.
  • If the RDS instance being restored needs to be public facing, the rds_public variable found in the handler will need to be changed.

Environment Variable    Description
rds_instance            The name of the RDS instance being restored
rds_instance_type       The type for the RDS instance being restored
rds_subnet_group        The name of the DB Subnet Group the RDS instance will use
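A restore along these lines could be sketched as follows. The helper name, the choice to restore the most recent available snapshot, and the default values for the Multi-AZ and public-access settings (inferred from the note above) are assumptions for illustration; the actual handler may differ.

```python
import os


def restore_latest(dr_rds, instance_name, instance_type, subnet_group,
                   multi_az=True, public=False):
    """Restore the newest available snapshot of instance_name to a new instance in the DR region."""
    response = dr_rds.describe_db_snapshots(
        DBInstanceIdentifier=instance_name, SnapshotType="manual"
    )
    available = [s for s in response["DBSnapshots"] if s.get("Status") == "available"]
    if not available:
        raise RuntimeError(f"no available snapshot found for {instance_name}")
    latest = max(available, key=lambda s: s["SnapshotCreateTime"])
    dr_rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier=instance_name,
        DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
        DBInstanceClass=instance_type,
        DBSubnetGroupName=subnet_group,  # must exist before the restore is requested
        MultiAZ=multi_az,
        PubliclyAccessible=public,
    )
    return latest["DBSnapshotIdentifier"]


def lambda_handler(event, context):
    import boto3  # imported here so the helper above can be exercised without AWS

    dr = boto3.client("rds")  # the function is deployed in the DR region
    return restore_latest(
        dr,
        os.environ["rds_instance"],
        os.environ["rds_instance_type"],
        os.environ["rds_subnet_group"],
    )
```

The restore call returns as soon as the request is accepted; the new instance only becomes usable once its status reaches available.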

The information below outlines the details for each of the auxiliary AWS services that will enable scheduling, encryption, and monitoring and logging for the solution.

Amazon EventBridge

The Lambda functions run on a cron schedule, which you can set to align with your organization’s RTO and RPO. The schedule is implemented with Amazon EventBridge, and the cron expressions for the Lambda functions are defined through the Serverless Framework in the serverless.yml file.
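EventBridge cron expressions have six fields (minutes, hours, day of month, month, day of week, year), are evaluated in UTC, and require a question mark in either the day-of-month or the day-of-week field. The helpers below are hypothetical and only illustrate how a backup interval translates into an expression; the schedules used in the actual solution are not stated in this post.

```python
def daily_cron(hour_utc, minute=0):
    """Return an EventBridge expression that fires once a day at the given UTC time."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    return f"cron({minute} {hour_utc} * * ? *)"


def every_n_hours_cron(n):
    """Return an EventBridge expression that fires on the hour, every n hours from 00:00 UTC."""
    if not 1 <= n <= 23:
        raise ValueError("n must be between 1 and 23")
    return f"cron(0 0/{n} * * ? *)"
```

For example, daily_cron(2, 30) yields cron(30 2 * * ? *), a snapshot at 02:30 UTC each day. The interval between runs is a rough upper bound on the data that could be lost, so a shorter interval supports a lower RPO.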

AWS Key Management Service (KMS)

AWS Key Management Service (KMS) is leveraged in the DR region for the RDS Backup Copy Lambda function. The key that’s provisioned using CloudFormation is used to encrypt the RDS snapshot created in the DR region when the RDS Backup Copy Lambda Function is initiated. This ensures snapshot data is protected in the DR region.

The KMS key requires an AWS Identity and Access Management (IAM) role to be defined within the dev-us-east-2 environment file. The value that needs to be defined is for the KMSIAMRole0 parameter. This role will be granted administrator access for the KMS key in the DR region; details of the actions can be found in the kms.yaml file.

Amazon SNS

SNS topics and subscriptions will be deployed using CloudFormation. The SNS topics will be used to capture RDS events for both the availability and backup categories.

SNS subscriptions will be provisioned to enable email alerts to be sent to DevOps engineers, to ensure proactive monitoring. Emails will be sent if an outage occurs for the RDS instances, and also for the RDS snapshot status when completed.

The SNS subscriptions require an email address to be set within the dev-us-east-2 environment file. The value that needs to be defined is for the RDSSNSEmail parameter.

Amazon RDS Event Subscriptions

Amazon RDS event subscriptions will be deployed with CloudFormation and enable SNS topics to subscribe to the availability and backup categories for the RDS instances that are leveraging this solution.

The RDS event subscriptions require a comma-delimited list of the RDS instances that should be included. This is defined within the dev-us-east-2 environment file, as the value of the DBInstances parameter.
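The solution provisions these subscriptions with CloudFormation. Purely to illustrate what such a subscription contains, the equivalent Boto3 call is sketched below; the function names, the subscription name, and the topic ARN are hypothetical placeholders.

```python
def parse_db_instances(raw):
    """Split a comma-delimited DBInstances value into a list of instance identifiers."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def subscribe_instances(rds_client, subscription_name, topic_arn, db_instances):
    """Send availability and backup events for the given instances to an SNS topic."""
    return rds_client.create_event_subscription(
        SubscriptionName=subscription_name,
        SnsTopicArn=topic_arn,
        SourceType="db-instance",
        EventCategories=["availability", "backup"],
        SourceIds=db_instances,
        Enabled=True,
    )
```

A call such as subscribe_instances(boto3.client("rds"), "rds-dr-events", topic_arn, parse_db_instances("instance0,instance1")) would then route events for both instances to the topic.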

Deployment

Deployment of the services required for the solution will be enabled through the use of Onica’s open source Runway tool. This provides a lightweight wrapper around CloudFormation and the Serverless Framework, enabling both tools to run in concert when deploying the solution to AWS.

Its main goals are to encourage GitOps best practices, avoid convoluted Makefiles/scripts (enabling identical deployments from a workstation or CI job), and enable developers/admins to use the best tool for any given job.

Prerequisites

  • Install Runway using pipenv: pipenv install runway
  • A virtual private cloud (VPC), subnets, and a DB subnet group must be created for the RDS Backup Restore function. Ensure the DB subnet group name is set as the value of the rds_subnet_group environment variable in the serverless.yml file

Stacker Environment Variables

Stacker is an open source tool and library used to create and update multiple CloudFormation stacks. Each of the environments that will be deployed will have their own Stacker environment variable files, and these will need to be updated before deployment.

The variable files are used as input into the CloudFormation templates that will deploy the auxiliary services for the solution; see each of the sections for the auxiliary services for their applicable environment variables.

The table below outlines the standard variables defined for all Stacker environment variable files.

Variable       Description
namespace      Variable defined to uniquely identify the CloudFormation stack
environment    Variable for the environment used for tagging and naming standards
region         Variable for the region used to deploy the resources to
department     Variable for the department used for tagging
description    Variable for the description used for tagging
workload       Variable for the workload used for tagging

The steps below outline the commands that are required to be run to build each of the aforementioned services on AWS. Ensure you have already cloned the GitHub repository.

AWS Lambda Functions

  • Change to the /dev directory and run the command below to deploy each Lambda function:
    • CLI command: DEPLOY_ENVIRONMENT=dev pipenv run runway deploy
    • Modules to deploy:
      • 1: lambdas/onica-rds-dr-backup.sls
      • 2: lambdas/onica-rds-dr-backup-copy.sls
      • 3: lambdas/onica-rds-dr-backup-cleanup.sls
  • Change to the /dr directory and run the command below to deploy the Lambda function:
    • CLI command: DEPLOY_ENVIRONMENT=dr pipenv run runway deploy
    • Module to deploy:
      • 1: lambdas/onica-rds-dr-backup-restore.sls

Amazon SNS

  • Change to the /dev directory and run the command below to deploy the SNS topic and subscriptions:
    • CLI command: DEPLOY_ENVIRONMENT=dev pipenv run runway deploy
    • Module to deploy:
      • 4: cloudformation/onica-rds-dr-sns.cfn

RDS Event Subscriptions

  • Change to the /dev directory and run the command below to deploy the RDS event subscriptions:
    • CLI command: DEPLOY_ENVIRONMENT=dev pipenv run runway deploy
    • Module to deploy:
      • 5: cloudformation/onica-rds-dr-rds-event.cfn

AWS Key Management Service

  • Change to the /dr directory and run the command below to deploy the KMS key used to encrypt the RDS instance snapshots:
    • CLI command: DEPLOY_ENVIRONMENT=dr pipenv run runway deploy
    • Module to deploy:
      • 2: cloudformation/onica-rds-dr-kms.cfn

RPO and RTO Considerations

For RPO, it’s important to mention that the initial snapshot creation needs to finish before the database instance can return to the “available” state. This is the required state for the DB instance to accept a new snapshot creation request.

I highly recommend you schedule the RDS Backup Lambda so that snapshots of the primary RDS instance are taken often enough to meet the minimum RPO achievable in your environment.
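One way to respect the state requirement described above is to check the instance status before requesting a snapshot. The sketch below is illustrative only; the function names are hypothetical and the published handler may deal with this differently.

```python
def ready_for_snapshot(rds_client, instance_name):
    """Return True when the DB instance is in the 'available' state."""
    response = rds_client.describe_db_instances(DBInstanceIdentifier=instance_name)
    return response["DBInstances"][0]["DBInstanceStatus"] == "available"


def snapshot_if_ready(rds_client, instance_name, snapshot_id):
    """Request a snapshot only if the instance can accept one; return whether it was requested."""
    if not ready_for_snapshot(rds_client, instance_name):
        return False  # e.g. a previous snapshot is still in progress
    rds_client.create_db_snapshot(
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceIdentifier=instance_name,
    )
    return True
```

If scheduled runs are frequently skipped because the instance is not yet available, the interval is probably shorter than the time a snapshot takes, and the achievable RPO is bounded by the snapshot duration.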

RTO is determined by many factors within your architecture, such as the size of your database, the distance between the two regions, and the instance type you choose when restoring your database.

Because of the sheer number of variables at play when calculating the “average” RTO, I recommend testing your DR solution to get a valid estimate that is specific to your use case.
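When testing, the restore duration can be measured with the Boto3 waiter that polls until an instance becomes available. This is a minimal sketch with a hypothetical function name; it captures only the time the new instance takes to become available after the restore is requested, not application cutover steps such as DNS changes.

```python
import time


def wait_and_time_restore(rds_client, instance_name, clock=time.monotonic, max_attempts=120):
    """Block until the restored instance is 'available' and return the elapsed seconds."""
    started = clock()
    waiter = rds_client.get_waiter("db_instance_available")
    waiter.wait(
        DBInstanceIdentifier=instance_name,
        WaiterConfig={"Delay": 30, "MaxAttempts": max_attempts},  # poll every 30 s, up to an hour
    )
    return clock() - started
```

Call it immediately after the restore request, and run it from a workstation or CI job rather than from Lambda, since a restore can outlast Lambda's 15-minute maximum execution time.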

Cost Considerations

The two primary factors that contribute to the cost of this solution are storage cost for the snapshots, and data transfer cost to copy the snapshots between the two regions.

As far as storage cost is concerned, there is no additional charge for backup storage of up to 100 percent of your total database storage for a region. Because there is no Amazon RDS instance in the secondary (DR) region, the snapshots copied there are not covered by this allowance, and you incur backup storage cost for them, billed at standard Amazon S3 rates.

Additionally, you incur data transfer cost to copy the snapshot from the primary region to the DR region.

For a comprehensive cost estimate, use the free Simple Monthly Calculator.

Summary

In this post, you have seen how easy it can be to introduce an effective disaster recovery (DR) solution for Amazon RDS instances in your AWS environment. Through the use of both automation and serverless architecture, you can effectively reduce operational overhead and hosting costs, while maintaining business continuity in the event of a disaster.

It’s important to mention that creating a DB snapshot on a single-Availability Zone DB instance results in a brief I/O suspension that can last from a few seconds to a few minutes, depending on the size and class of your DB instance.

If you plan to use a very small RPO and take multiple backups during the day, I recommend you use high-availability (multi-AZ) DB instances. They are not affected by this I/O suspension since the backup is taken on the standby.

Finally, take a look at the current limit on manual snapshots for Amazon RDS, which is currently 100 manual snapshots per account in each region. You can avoid reaching this limit by using the RDS Backup Cleanup Lambda and scheduling it to run using cron.

The content and opinions in this blog are those of the third-party author and AWS is not responsible for the content or accuracy of this post.



Onica – AWS Partner Spotlight

Onica is an AWS Premier Consulting Partner that provides cloud consulting, infrastructure, and managed services, ensuring customers have the best technical solutions to solve their business challenges and deliver value for their organization.
