AWS Storage Blog
Automate data recovery validation with AWS Backup
Your data may be your most valuable asset. Disaster events that affect your workloads can result in a loss of data. A disaster is an event that causes a serious negative impact on your business. Having backups of your data helps minimize the impact of these disaster events by giving you the ability to recover data from your backups. Whether it's source code, intellectual property, or customer data, knowing your data is safe and recoverable is vital to surviving disaster events.
Backing up data and testing the ability to recover your data are best practices in the Reliability pillar of the AWS Well-Architected Framework:
- Perform data backup automatically
- Perform periodic recovery of the data to verify backup integrity and processes
The worst time to discover you can’t recover from backups is in the middle of a disaster. Manually testing data recovery from backups can be a tedious and time-consuming process. AWS Backup is a fully managed service that can help you centralize and automate data backup and data recovery, and you can configure the service to notify you of job failures.
In this post, I cover how you can automate these activities. I demonstrate creating a pipeline to validate backups using AWS Backup, Amazon EventBridge, and AWS Lambda. Having an automated solution to validate data recovery from your backups and ensure that there are no faults will increase your confidence in using these data backups for disaster recovery.
Testing data recovery
It is important to verify that your data backups are reliable, and that you are able to recover data from them in a timely manner. This reduces the risk of unexpected failures that can occur when restoring data from a backup.
Two important considerations for testing data recovery:
- Do you meet the Recovery Time Objective (RTO) for your workload?
Your workloads must complete data recovery processes within the established RTO. Data recovery testing helps identify processes that exceed the RTO for the workload.
- Is the recovered data complete, accurate, and uncorrupted?
Test data recovery from backups to verify that the backups contain all the source data, are accurate, and are not corrupted. How you measure the integrity of recovered data depends on factors such as the data source, data type, backup method, and restore method.
Validation is dependent on a variety of factors that can influence design. Business and technical stakeholders should work together to determine the appropriate workload RTO and data integrity validation criteria.
The focus of this post is to create an automated solution that tests the ability to restore and validate integrity of data from backups. RTO requirements and measurement of time to restore are not included. You can add logic to perform these tasks later.
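As a starting point for adding RTO logic later, the following is a minimal sketch of measuring how long a restore job took. It assumes a dictionary shaped like the response of the AWS Backup DescribeRestoreJob API, which includes CreationDate and CompletionDate timestamps. The function names and the 15-minute threshold are hypothetical, and the restore job duration is only one part of a workload's overall recovery time.

```python
from datetime import datetime, timedelta, timezone


def restore_duration(restore_job: dict) -> timedelta:
    """Return the elapsed time between creation and completion of a restore job."""
    return restore_job["CompletionDate"] - restore_job["CreationDate"]


def meets_rto(restore_job: dict, rto: timedelta) -> bool:
    """Return True if the restore job finished within the given RTO."""
    return restore_duration(restore_job) <= rto


# Hand-written sample values, not real job output.
sample_job = {
    "CreationDate": datetime(2021, 1, 1, 12, 0, tzinfo=timezone.utc),
    "CompletionDate": datetime(2021, 1, 1, 12, 9, tzinfo=timezone.utc),
}
print(meets_rto(sample_job, timedelta(minutes=15)))  # prints True
```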
Solution overview
The solution in this post automates responses for events related to AWS Backup by monitoring AWS Backup events using EventBridge and Lambda. In this post, I create EventBridge rules to monitor events related to backup and restore jobs, then use these events to invoke a Lambda function to automate data recovery testing.
In this example, I use an Amazon EC2 instance as the data source, but you can implement this solution for other data sources supported by AWS Backup. The following architecture diagram provides an overview of the solution.
- AWS Backup creates a backup of the data source based on a schedule or on-demand.
- AWS Backup emits an event after the backup job has completed, which EventBridge captures.
- EventBridge invokes a Lambda function to test data restore.
- The Lambda function makes an API call to AWS Backup to initiate a restore.
- AWS Backup initiates a restore job.
- AWS Backup emits an event after the restore job has completed, which EventBridge captures.
- EventBridge invokes the Lambda function to test recovery and perform cleanup.
- The Lambda function validates recovery of data and performs clean up by terminating the new resource.
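Because a single Lambda function handles both the backup and the restore events, it has to decide which step to run for each invocation. The following sketch shows one way to make that decision from the event fields used in this post. It is an illustration, not the code in the downloadable package, and the function name and return values are hypothetical.

```python
def classify_event(event: dict) -> str:
    """Map an AWS Backup event to the step of the pipeline that should run."""
    if event.get("source") != "aws.backup":
        return "ignore"
    if event.get("detail", {}).get("state") != "COMPLETED":
        return "ignore"

    detail_type = event.get("detail-type")
    if detail_type == "Backup Job State Change":
        # A backup finished: start a test restore from it.
        return "start_restore"
    if detail_type == "Restore Job State Change":
        # A restore finished: validate the new resource, then clean it up.
        return "validate_and_cleanup"
    return "ignore"
```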
Prerequisites
Deploy an AWS CloudFormation stack using this template. It provisions a t2.micro Amazon EC2 instance running a web application in a new Amazon Virtual Private Cloud (VPC) to be used as the data source. For parameters, select the Availability Zone to launch the resources into and keep the default value for LatestAmiId. CloudFormation automatically retrieves the latest Amazon Linux 2 AMI ID for the AWS Region you are deploying the stack in. Note down key-values from the Outputs tab.
Deployment walkthrough
To implement this solution, I create an AWS Lambda function to automate data recovery testing and two Amazon EventBridge rules that serve as triggers for the Lambda function. I also configure permissions for the role used by the Lambda function to make Amazon EC2 and AWS Backup API calls.
Create a Lambda function
- Begin by navigating to the Lambda console. Choose Create function.
- Select Author from scratch. Name the function validate-data-recovery. Use Python 3.8 as the runtime.
- Under Permissions, expand Change default execution role. Choose Create a new role from AWS policy templates and enter the Role name “validate-data-recovery-role.” Leave the Policy templates – optional section blank. Permissions for the role will be added directly from the AWS Identity and Access Management (IAM) console.
- Select Create function at the bottom of the screen. Lambda provisions a new function and IAM role that is used during execution.
- After you have created the function, Lambda displays the function details page. Download this Lambda function package and save it locally. Scroll to the Code source section. Select Upload from and .zip file, then upload the Lambda function package you downloaded.
Configure Lambda trigger
EventBridge captures events emitted by AWS Backup. It triggers the Lambda function when an event matches the defined pattern. The Lambda function will be invoked after a backup or restore job is complete. Create two EventBridge rules, one to verify that a new resource can be provisioned from a backup (restore validation), and another to verify that the new resource contains the data and that it can be accessed (recovery validation).
Restore validation rule
- Navigate to the EventBridge console. Click Create rule.
- Provide a Name for the rule, such as “LambdaRestoreValidationTrigger.”
- Under Define pattern, select Event Pattern.
- For Event matching pattern, select Custom pattern. Paste the following code into the Event pattern text box. Click Save.
{
  "source": ["aws.backup"],
  "detail-type": ["Backup Job State Change"],
  "detail": {
    "state": ["COMPLETED"]
  }
}
- Scroll down to the Select targets section. Choose Lambda function for the Target.
- From the dropdown list for Function, select the Lambda function validate-data-recovery.
- Select Create.
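To make the matching behavior of the event pattern concrete, here is a deliberately simplified matcher. It assumes that each pattern field lists its allowed values and that nested objects are matched recursively. Real EventBridge patterns support more operators (prefix matching, numeric comparisons, array values in events, and so on), so treat this as a reading aid rather than a reimplementation.

```python
def matches(pattern: dict, event: dict) -> bool:
    """Simplified check of whether an event satisfies an event pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: the event value must be an object that matches too.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of allowed values.
            return False
    return True


restore_validation_pattern = {
    "source": ["aws.backup"],
    "detail-type": ["Backup Job State Change"],
    "detail": {"state": ["COMPLETED"]},
}
```

With this pattern, a completed backup job matches, while a failed backup job or a completed restore job does not. Fields that the pattern does not mention are ignored.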
Recovery validation rule
- Navigate to the EventBridge console. Select Create rule.
- Provide a Name for the rule, such as “LambdaRecoveryValidationTrigger.”
- Under Define pattern, select Event Pattern.
- For Event matching pattern, select Custom pattern. Paste the following into the Event pattern textbox, then select Save.
{
  "source": ["aws.backup"],
  "detail-type": ["Restore Job State Change"],
  "detail": {
    "state": ["COMPLETED"]
  }
}
- Scroll down to the Select targets section, then choose Lambda function for the Target.
- From the dropdown list for Function, select the Lambda function validate-data-recovery.
- Click Create.
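If you prefer to script the two rules instead of using the console, the following sketch shows one possible approach with the boto3 EventBridge client (put_rule and put_targets). The helper names and the target Id are hypothetical, and the client is passed in so the function can be exercised without AWS credentials.

```python
import json


def build_rule_request(name: str, detail_type: str) -> dict:
    """Build put_rule arguments for a rule matching completed AWS Backup jobs."""
    pattern = {
        "source": ["aws.backup"],
        "detail-type": [detail_type],
        "detail": {"state": ["COMPLETED"]},
    }
    return {"Name": name, "EventPattern": json.dumps(pattern), "State": "ENABLED"}


def create_rule(events_client, name: str, detail_type: str, function_arn: str) -> str:
    """Create the rule, point it at the Lambda function, and return the rule ARN."""
    rule = events_client.put_rule(**build_rule_request(name, detail_type))
    events_client.put_targets(
        Rule=name,
        # "validate-data-recovery" is an arbitrary target Id chosen for this sketch.
        Targets=[{"Id": "validate-data-recovery", "Arn": function_arn}],
    )
    return rule["RuleArn"]
```

For example, with events_client = boto3.client("events"), you could call create_rule once with "Backup Job State Change" and once with "Restore Job State Change". Unlike the console, the API does not grant EventBridge permission to invoke the function, so you would also need to add a resource-based permission on the Lambda function for the events.amazonaws.com principal.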
Configure IAM role
Configure the IAM role used by the Lambda function by adding permissions to interact with AWS Backup and EC2 for validating data recovery and cleanup.
- Navigate to the IAM console. Select the role validate-data-recovery-role. This is the IAM role created by Lambda.
- On the Permissions tab, select Attach policies. Search for AWSBackupOperatorAccess. Check the box next to the policy name and select Attach policy. This allows the Lambda function to start restore jobs from backups created by AWS Backup.
- The Lambda function needs permission to clean up new resources created as part of the data recovery validation. Since the data source is EC2, the validate-data-recovery-role must provide permissions to make EC2 API calls. To configure this, click on Add inline policy.
- Select the JSON tab and replace the text with the following policy. This allows the Lambda function to describe the new EC2 instance launched by AWS Backup as part of the restore process, and obtain the public IP address that is used for data recovery validation. The policy also allows the Lambda function to terminate the new EC2 instance after validating data recovery, which helps keep the cost of data recovery validation low.
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "EC2Cleanup",
    "Action": [
      "ec2:DescribeInstances",
      "ec2:TerminateInstances"
    ],
    "Resource": "*",
    "Effect": "Allow"
  }]
}
NOTE: The IAM policy allows ec2:DescribeInstances and ec2:TerminateInstances on all EC2 instances in the account. This is because the new EC2 instance (launched by AWS Backup as part of the restore process) does not have the tags that the original EC2 instance did. When AWS Backup supports copying tags to the new EC2 instance (restored from a backup), you can update the policy to use condition keys and configure it for least privilege.
- Select Review policy. Enter a Name for this policy, such as “lambda-validate-data-recovery-cleanup.” Afterward, select Create policy, and then you will see a Summary page.
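The same role configuration could be scripted. The sketch below uses the boto3 IAM client calls attach_role_policy and put_role_policy, with the client passed in as a parameter. The managed policy ARN shown is an assumption based on the usual format for AWS managed policies, so verify it in the IAM console before relying on it. The function name is hypothetical.

```python
import json

# Assumed ARN of the AWS managed policy named in the steps above; verify before use.
BACKUP_OPERATOR_POLICY_ARN = "arn:aws:iam::aws:policy/AWSBackupOperatorAccess"

# Same inline policy as shown in the console steps.
CLEANUP_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "EC2Cleanup",
        "Action": ["ec2:DescribeInstances", "ec2:TerminateInstances"],
        "Resource": "*",
        "Effect": "Allow",
    }],
}


def configure_role(iam_client, role_name: str = "validate-data-recovery-role") -> None:
    """Attach the managed policy and add the inline cleanup policy to the role."""
    iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn=BACKUP_OPERATOR_POLICY_ARN,
    )
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName="lambda-validate-data-recovery-cleanup",
        PolicyDocument=json.dumps(CLEANUP_POLICY),
    )
```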
Test the solution
AWS Backup rules run at configured schedules. To test the solution in this example, I simulate the actions performed by AWS Backup by creating an on-demand backup.
- Navigate to the AWS Backup console and select Protected Resources. Click the Create an on-demand backup button.
- Under Resource, select EC2 for Resource type. Select the EC2 Instance ID created by CloudFormation in the prerequisites section.
- Select Create backup now for Backup window.
- You can choose how long to retain backups by specifying a Retention period. Appropriate retirement of data that is no longer of value can provide cost savings and improve security. For our example, select Days with a value of 1.
- The remaining settings do not need to be changed.
- Select Create on-demand backup.
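The console steps above correspond roughly to a single StartBackupJob API call. The following is a sketch using the boto3 AWS Backup client, passed in as a parameter. The vault name, resource ARN, and IAM role ARN are values you would supply from your own account, and the function name is hypothetical.

```python
def start_on_demand_backup(backup_client, vault_name: str, resource_arn: str,
                           iam_role_arn: str, retention_days: int = 1) -> str:
    """Start an on-demand backup job and return its job ID."""
    response = backup_client.start_backup_job(
        BackupVaultName=vault_name,
        ResourceArn=resource_arn,
        IamRoleArn=iam_role_arn,
        # Mirrors the 1-day retention period chosen in the console steps.
        Lifecycle={"DeleteAfterDays": retention_days},
    )
    return response["BackupJobId"]
```

You would pass boto3.client("backup") as backup_client, along with the ARN of the EC2 instance created by the CloudFormation stack.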
This starts a backup job that creates an AMI of the Amazon EC2 instance. For this example, I used an EC2 instance with a volume size of 8 GB. The backup job typically takes 5–10 minutes to finish (the process may take longer depending on the size of the data source). Select the job ID to view details, and periodically refresh the console until the job status changes to Completed.
AWS Backup emits events to EventBridge in a best-effort manner every 5 minutes. This means it can take up to 5 minutes after the backup job has completed for AWS Backup to emit an event to EventBridge and for the Lambda function to run. You can navigate to the Amazon CloudWatch Logs console and view the Lambda function logs to observe what is happening.
The Lambda function captures the backup job ID from the event and makes an API call to AWS Backup to retrieve the Recovery point ARN, IAM role used, and the Backup vault. It uses this information to make another API call to AWS Backup to retrieve the Recovery point restore metadata, which is used to make an API call to AWS Backup and initiate a restore job. View the restore job under Jobs in the AWS Backup console. Select the job ID to view details, and periodically refresh the console until the restore job status shows as Completed.
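The sequence of API calls described above can be sketched as follows. This is an illustration of the flow, not the code in the downloadable package. It uses the boto3 AWS Backup client calls describe_backup_job, get_recovery_point_restore_metadata, and start_restore_job. The event key backupJobId is an assumption about the event's detail payload, and for some resource types the restore metadata may need adjusting before it is accepted by a restore job.

```python
def start_restore_from_event(backup_client, event: dict) -> str:
    """Start a restore job for the backup named in the event; return the restore job ID."""
    # Assumed key name in the event detail; check a real event before relying on it.
    backup_job_id = event["detail"]["backupJobId"]

    # Look up the recovery point, vault, and IAM role used by the backup job.
    job = backup_client.describe_backup_job(BackupJobId=backup_job_id)

    # Fetch the metadata that describes how to restore this recovery point.
    metadata = backup_client.get_recovery_point_restore_metadata(
        BackupVaultName=job["BackupVaultName"],
        RecoveryPointArn=job["RecoveryPointArn"],
    )

    restore = backup_client.start_restore_job(
        RecoveryPointArn=job["RecoveryPointArn"],
        Metadata=metadata["RestoreMetadata"],
        IamRoleArn=job["IamRoleArn"],
        ResourceType=job["ResourceType"],
    )
    return restore["RestoreJobId"]
```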
A successfully completed restore job confirms that restoration from the backup is possible and that you can launch new resources using the backup. After the restore job is completed, AWS Backup emits an event that EventBridge captures, and this event invokes the Lambda function. The function verifies that the newly created resource contains your data and is accessible as expected.
The Lambda function captures the restore job ID from the event. It makes an API call to AWS Backup to retrieve details of the newly created resource (in this example, an EC2 instance). The data recovery success criterion is that the newly created instance runs the web application just like the original EC2 instance provisioned in the prerequisites section (as part of the CloudFormation stack). The logic to validate recovery is in the Lambda function code. The Lambda function retrieves the instance ID of the new instance from the restore job details and makes an API call to EC2 to retrieve its public IP address. The function then makes an HTTP GET request to the new instance on port 80 and expects an HTTP 200 response. If it receives that response, data recovery validation is a success.
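A minimal sketch of this validation logic follows. It is not the code in the downloadable package. It assumes the restore job details include a CreatedResourceArn that ends in the instance ID, uses the boto3 calls describe_restore_job and describe_instances, and takes the HTTP check as a parameter (get_status, a hypothetical name) so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10) -> int:
    """Return the HTTP status code for a GET request to the URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx responses; report the code instead.
        return error.code


def validate_recovery(backup_client, ec2_client, restore_job_id: str,
                      get_status=http_status):
    """Return (instance_id, passed) for the instance created by a restore job."""
    job = backup_client.describe_restore_job(RestoreJobId=restore_job_id)
    # Assumes an ARN of the form arn:aws:ec2:<region>:<account>:instance/<id>.
    instance_id = job["CreatedResourceArn"].rsplit("/", 1)[-1]

    reservations = ec2_client.describe_instances(InstanceIds=[instance_id])["Reservations"]
    public_ip = reservations[0]["Instances"][0]["PublicIpAddress"]

    # The web application is expected to answer on port 80 with HTTP 200.
    return instance_id, get_status(f"http://{public_ip}/") == 200
```

A check like this only confirms that the application responds. Stricter integrity criteria, such as comparing page content or checksums, would need additional logic.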
After data recovery validation, the Lambda function makes another API call to EC2 to terminate the new instance. This ensures that costs for testing are low, as the EC2 instance launched was for testing and not production. Retrieve the EC2 instance ID from the restore job Details, navigate to the EC2 console, and search for the new EC2 instance. The new EC2 instance should have the instance state of Shutting-down or Terminated. You can review the Lambda function logs to understand the steps involved in the validation and cleanup.
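The cleanup step can be sketched with the boto3 EC2 client call terminate_instances. The function name is hypothetical, and the client is passed in as a parameter.

```python
def terminate_instance(ec2_client, instance_id: str) -> str:
    """Terminate the restored instance and return its new state name."""
    response = ec2_client.terminate_instances(InstanceIds=[instance_id])
    return response["TerminatingInstances"][0]["CurrentState"]["Name"]
```

The returned state name corresponds to the instance state you see in the EC2 console shortly after the call.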
Cleaning up
If you followed along with this solution, complete the following steps to avoid incurring unwanted charges to your AWS account.
Delete the CloudFormation stack
- Navigate to the CloudFormation console.
- Select the stack that was launched as part of the prerequisites. Select Delete.
Delete the Lambda function
- Navigate to the Lambda console.
- Select the function validate-data-recovery, then Delete.
Delete EventBridge rules
- Navigate to the EventBridge console.
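- Select the restore validation rule (LambdaRestoreValidationTrigger in this example), then Delete.
- Repeat for the recovery validation rule (LambdaRecoveryValidationTrigger in this example).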
Delete IAM role
- Navigate to the IAM console. Select Roles.
- Select the role validate-data-recovery-role, then Delete role.
Conclusion
In this blog post, I demonstrated how to create an automated data recovery validation pipeline to test backups created by AWS Backup. I explained how Amazon EventBridge can be used to capture events emitted by AWS Backup. Next, I covered how these events can be used to invoke an AWS Lambda function to perform data recovery validation using an EC2 instance as the data source.
Backing up your data sources is crucial to ensuring you can quickly recover from failures and disaster events. Testing backups to verify data recovery further increases confidence in your data recovery processes. By using AWS Backup, Amazon EventBridge, and AWS Lambda, you can automate this process so that every backup is tested to ensure that data can be successfully recovered in alignment with best practices from the AWS Well-Architected Framework. Automating this process end to end removes the manual effort of recovery testing, keeps testing costs low by cleaning up restored resources, and helps minimize workload downtime that can be harmful to your business.
For more information on using AWS Backup with Amazon EventBridge, check out:
- AWS Backup documentation
- Amazon EventBridge documentation
- Monitoring AWS Backup Events with EventBridge
- AWS Well-Architected best practices on failure management
Hands-on lab
Try out the AWS Well-Architected lab on testing backup and restore of data for a hands-on experience.
Thanks for reading this blog post! If you have any questions or suggestions, please leave your feedback in the comments section.