How to perform non-disruptive tests with AWS Elastic Disaster Recovery

By performing frequent disaster recovery (DR) tests and drills, your organization can prepare for unexpected IT outages caused by ransomware, human error, and other disruptions. Some organizations avoid DR testing because their testing procedures are time-consuming or costly, or because they cannot test without disrupting business. This can mean that they are unprepared to implement their DR plan if an IT outage occurs.

In this post, I review the straightforward steps and best practices for performing non-disruptive DR tests with AWS Elastic Disaster Recovery (DRS). DRS provides a simple mechanism to test recovery of your source environment at scale without affecting the performance of your source servers. You can use the same process to test and recover physical, virtual, and cloud-based source servers on AWS. Testing your DR plan regularly helps you verify that you can meet your recovery objectives and recover applications after an IT disruption.

How AWS Elastic Disaster Recovery testing works

During normal business operation, Elastic Disaster Recovery continuously replicates data on your source servers to a low-cost staging area in your target AWS Region. When you launch source servers for tests or recovery, Elastic Disaster Recovery triggers a highly automated machine conversion process and a scalable orchestration engine. These enable you to quickly spin up your source servers in your target AWS Region. Since data replication to AWS occurs at the block level, you can use the same simple process to test and recover all of your applications and databases running on a wide range of supported OS versions. If a security incident, hardware failure, or other disruption causes an IT outage, you can quickly recover your environment by following the same set of simple steps that you use for DR tests. The servers you launch during Elastic Disaster Recovery tests will operate in the same way on AWS as the servers you launch for recovery on AWS.

Following the AWS Well-Architected Framework approach to DR involves regularly testing and validating your DR plan. When you launch source servers for DR testing, you launch copies of those servers in your target AWS Region from the point in time you select. You can run these tests securely in an isolated environment that you define in the Amazon Elastic Compute Cloud (Amazon EC2) launch template that Elastic Disaster Recovery uses for each source server. The launch template defines the settings and configuration of your launched test instances, so you can isolate them in a separate subnet or with different security groups to avoid network conflicts. You can also launch test instances in a separate AWS account to further isolate your testing and production environments. Elastic Disaster Recovery automatically provisions the resources needed to launch your servers on AWS.
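The isolation described above lives in each source server's EC2 launch template. The following Python sketch shows one way to script it with boto3-style clients (for example `boto3.client("drs")` and `boto3.client("ec2")`): it creates a new launch template version that moves the primary network interface into an isolated subnet with test-only security groups, then makes that version the default. It assumes that Elastic Disaster Recovery launches from the template's default version and that the template configures networking through a network interface rather than instance-level security groups. The function name `isolate_drill_network` is hypothetical.

```python
from typing import Any, Sequence


def isolate_drill_network(drs: Any, ec2: Any, source_server_id: str,
                          subnet_id: str, security_group_ids: Sequence[str]) -> int:
    """Point a source server's DRS launch template at an isolated subnet.

    Creates a new launch template version based on the current default,
    overrides only the primary network interface, makes the new version
    the default, and returns its version number.
    """
    # The DRS launch configuration records which EC2 launch template it uses.
    config = drs.get_launch_configuration(sourceServerID=source_server_id)
    template_id = config["ec2LaunchTemplateID"]

    # Base the new version on the current default so other settings are kept.
    template = ec2.describe_launch_templates(
        LaunchTemplateIds=[template_id])["LaunchTemplates"][0]
    base_version = str(template["DefaultVersionNumber"])

    created = ec2.create_launch_template_version(
        LaunchTemplateId=template_id,
        SourceVersion=base_version,
        VersionDescription="DR drill: isolated test subnet",
        LaunchTemplateData={
            # Only the primary network interface is overridden.
            "NetworkInterfaces": [{
                "DeviceIndex": 0,
                "SubnetId": subnet_id,
                "Groups": list(security_group_ids),
            }],
        },
    )
    new_version = created["LaunchTemplateVersion"]["VersionNumber"]

    # Assumption: DRS launches drill instances from the default version.
    ec2.modify_launch_template(LaunchTemplateId=template_id,
                               DefaultVersion=str(new_version))
    return new_version
```

Because drills and recoveries use the same launch settings, consider reverting to your production network configuration once testing is finished.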

You can use Elastic Disaster Recovery to run a virtually unlimited number of tests, as often as you choose. There are no additional fees for testing beyond the cost of the resources provisioned during tests. You can reduce costs by running some of your DR tests on smaller EC2 instances instead of provisioning full-scale resources.
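A minimal sketch of launching drills on a smaller instance type, assuming boto3-style clients: it sets the DRS instance type right-sizing option to `NONE` (so that, as we understand it, the instance type from the EC2 launch template is used instead of one chosen by DRS) and writes the smaller type into a new default launch template version. The function name `use_drill_instance_type` and the example instance type are hypothetical.

```python
from typing import Any


def use_drill_instance_type(drs: Any, ec2: Any, source_server_id: str,
                            instance_type: str) -> int:
    """Make drills for one source server launch on a fixed, smaller instance type."""
    # With right-sizing set to NONE, DRS should take the instance type from
    # the EC2 launch template rather than sizing it from the source server.
    drs.update_launch_configuration(
        sourceServerID=source_server_id,
        targetInstanceTypeRightSizingMethod="NONE",
    )
    template_id = drs.get_launch_configuration(
        sourceServerID=source_server_id)["ec2LaunchTemplateID"]

    # Copy the current default version and override only the instance type.
    template = ec2.describe_launch_templates(
        LaunchTemplateIds=[template_id])["LaunchTemplates"][0]
    created = ec2.create_launch_template_version(
        LaunchTemplateId=template_id,
        SourceVersion=str(template["DefaultVersionNumber"]),
        VersionDescription=f"DR drill: {instance_type}",
        LaunchTemplateData={"InstanceType": instance_type},
    )
    new_version = created["LaunchTemplateVersion"]["VersionNumber"]
    ec2.modify_launch_template(LaunchTemplateId=template_id,
                               DefaultVersion=str(new_version))
    return new_version
```

For example, `use_drill_instance_type(drs, ec2, server_id, "t3.medium")` before a functional smoke test. Restore the production sizing before a real recovery or a full-scale performance test.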

You should perform DR tests as your final implementation step when you deploy Elastic Disaster Recovery in your environment. Maintain disaster readiness during ongoing operations by performing DR tests anytime you make changes to your replicated source environment, such as when you add new servers or modify resource properties.
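To keep track of which servers still need a test after you change your environment, you could list source servers whose last launch did not succeed. This is a sketch assuming a boto3-style DRS client and the `lastLaunchResult` field of each source server (the value behind the console's Last recovery result column); the helper names are hypothetical.

```python
from typing import Any, Iterator, List


def iter_source_servers(drs: Any) -> Iterator[dict]:
    """Yield every source server, following nextToken pagination."""
    kwargs: dict = {"filters": {}}
    while True:
        page = drs.describe_source_servers(**kwargs)
        yield from page.get("items", [])
        token = page.get("nextToken")
        if not token:
            return
        kwargs["nextToken"] = token


def servers_needing_drill(drs: Any) -> List[str]:
    """Return IDs of source servers whose last launch did not succeed.

    Newly added servers that were never launched have no SUCCEEDED
    result, so they are included too.
    """
    return [server["sourceServerID"] for server in iter_source_servers(drs)
            if server.get("lastLaunchResult") != "SUCCEEDED"]
```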

Prerequisites

For this walkthrough, you should have the following:

Elastic Disaster Recovery set up and configured in the Region you want to recover into.
Servers or EC2 instances of your choice (Windows or Linux) with the Elastic Disaster Recovery agent installed and the initial sync completed.
An isolated subnet in which to run your test drills.
The Amazon EC2 launch template of each machine you want to drill, modified to use the isolated subnet.

Elastic Disaster Recovery testing steps

You can test server recovery after your source servers have completed the initial sync to your target AWS Region.

Sign in to the AWS Management Console and navigate to the Elastic Disaster Recovery page. The following columns indicate whether your source servers can be tested:

The Ready for recovery column indicates that the server is ready for testing.
The Data replication status column indicates that replication is completed.
Machines that have not yet been tested will show a dash under the Last recovery result column.
The Pending actions column shows the next steps to perform in the DR cycle.
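The same information can be read programmatically. This sketch, assuming a boto3-style DRS client, summarizes the replication state and last launch result for selected source servers. Treating the `CONTINUOUS` replication state as "ready" is our approximation of the console's Ready for recovery column, not a documented equivalence; the function name `drill_readiness` is hypothetical.

```python
from typing import Any, Dict, Iterable


def drill_readiness(drs: Any, source_server_ids: Iterable[str]) -> Dict[str, dict]:
    """Summarize replication state and last launch result per source server."""
    kwargs: dict = {"filters": {"sourceServerIDs": list(source_server_ids)}}
    report: Dict[str, dict] = {}
    while True:
        page = drs.describe_source_servers(**kwargs)
        for server in page.get("items", []):
            state = server.get("dataReplicationInfo", {}).get("dataReplicationState")
            report[server["sourceServerID"]] = {
                "replication_state": state,
                # Mirrors the console's Last recovery result column ("-" if absent).
                "last_recovery_result": server.get("lastLaunchResult", "-"),
                # Assumption: CONTINUOUS replication means the initial sync is done.
                "ready": state == "CONTINUOUS",
            }
        token = page.get("nextToken")
        if not token:
            return report
        kwargs["nextToken"] = token
```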

Picture 1 - Source servers

You can then perform the following steps to launch source servers on AWS for DR testing:

1. Select machines to initiate a drill.

Select machines for DR testing by checking the box to the left of each machine. You can select your entire environment, a group of machines comprising one or more applications, or a single machine to launch for testing in your target AWS Region.

Then open the Initiate Recovery Job menu and choose Initiate drill.

Picture 2 - Initiate recovery

Elastic Disaster Recovery launches the selected servers using the subnet and security groups you defined in their EC2 launch templates as part of the prerequisites.
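If you group servers by application with tags, the selection and Initiate drill steps can be scripted. This is a sketch assuming a boto3-style DRS client; the tag key and value (for example `Application=payroll`) and the function name `start_drill_by_tag` are hypothetical. Calling `start_recovery` with `isDrill=True` and no `recoverySnapshotID` corresponds to a drill from the latest recovery point.

```python
from typing import Any, List, Tuple


def start_drill_by_tag(drs: Any, tag_key: str, tag_value: str) -> Tuple[str, List[str]]:
    """Start one drill job for every source server tagged tag_key=tag_value."""
    selected: List[str] = []
    kwargs: dict = {"filters": {}}
    while True:
        page = drs.describe_source_servers(**kwargs)
        for server in page.get("items", []):
            if server.get("tags", {}).get(tag_key) == tag_value:
                selected.append(server["sourceServerID"])
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token

    if not selected:
        raise ValueError(f"no source servers tagged {tag_key}={tag_value}")

    # isDrill=True launches a drill, not a recovery; omitting
    # recoverySnapshotID launches from the latest recovery point.
    job = drs.start_recovery(
        sourceServers=[{"sourceServerID": sid} for sid in selected],
        isDrill=True,
    )["job"]
    return job["jobID"], selected
```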

2. Select a recovery point.

You can select the latest recovery point, which reflects the most recent replicated state of the machine, or a previous point in time from which to launch the target machine.

Testing recovery from a previous point in time enables you to prepare for an IT disruption caused by data corruption. This includes use cases such as ransomware, accidental system changes, or database corruption. It allows you to recover your source machines on AWS to a point in time before data corruption occurred.

After you select a recovery point, select Initiate drill.

Picture 3 - Select point in time
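For a corruption scenario, you might want the newest recovery point from before the incident. This sketch, assuming a boto3-style DRS client, asks `describe_recovery_snapshots` for snapshots up to a cutoff time, newest first, and launches a drill from the first one. We assume the `toDateTime` filter accepts an ISO 8601 string; whether the cutoff is inclusive is not something this sketch relies on. The function name `drill_before` is hypothetical.

```python
from datetime import datetime, timezone
from typing import Any, Tuple


def drill_before(drs: Any, source_server_id: str, before: datetime) -> Tuple[str, str]:
    """Launch a drill from the newest recovery snapshot up to `before`."""
    resp = drs.describe_recovery_snapshots(
        sourceServerID=source_server_id,
        # Assumption: toDateTime accepts an ISO 8601 timestamp string.
        filters={"toDateTime": before.astimezone(timezone.utc).isoformat()},
        order="DESC",  # newest snapshot first
        maxResults=1,
    )
    items = resp.get("items", [])
    if not items:
        raise LookupError(f"no recovery snapshot for {source_server_id} before {before}")
    snapshot_id = items[0]["snapshotID"]

    job = drs.start_recovery(
        sourceServers=[{"sourceServerID": source_server_id,
                        "recoverySnapshotID": snapshot_id}],
        isDrill=True,
    )["job"]
    return snapshot_id, job["jobID"]
```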

3. Verify target machine launch status.

A notification at the top of the page confirms that the target machine launch job has been created:

Picture 5 - Recovery job created

a. You can select View job details to see more information.

b. Alternatively, in the left navigation pane of the DRS console, choose Recovery job history to see the status of the drill launch.

Picture 6 - Recovery job history
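The job status shown in Recovery job history can also be polled. This sketch, assuming a boto3-style DRS client, waits until the job reaches `COMPLETED` and returns the launch status of each participating server. A completed job can still contain failed launches, so check for `LAUNCHED` per server. The function name `wait_for_job` and the default polling interval and timeout are hypothetical choices; `sleep` is injectable so the loop can be tested without waiting.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(drs: Any, job_id: str, poll_seconds: float = 30.0,
                 timeout_seconds: float = 3600.0,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, str]:
    """Poll a DRS job until it completes; return launch status per source server."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        items = drs.describe_jobs(filters={"jobIDs": [job_id]}).get("items", [])
        if not items:
            raise LookupError(f"job {job_id} not found")
        job = items[0]
        # COMPLETED means the job finished, not that every launch succeeded.
        if job["status"] == "COMPLETED":
            return {p["sourceServerID"]: p.get("launchStatus", "UNKNOWN")
                    for p in job.get("participatingServers", [])}
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']}")
        sleep(poll_seconds)
```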

4. Select the Job ID for more details.

Then choose Source servers in the left navigation pane to return to the main Elastic Disaster Recovery console. The Last recovery result column shows when the machine was launched.

Picture 8 - Source servers
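The details behind a job ID come from its log. This sketch, assuming a boto3-style DRS client, collects every log item of a job as `(logDateTime, event, sourceServerID)` tuples and then filters likely failures. Matching on the substring `"FAIL"` (which catches event names such as `LAUNCH_FAILED`) is our heuristic, not a documented rule; the helper names are hypothetical.

```python
from typing import Any, List, Tuple

Event = Tuple[str, str, str]


def job_events(drs: Any, job_id: str) -> List[Event]:
    """Return (logDateTime, event, sourceServerID) for every log item of a job."""
    events: List[Event] = []
    kwargs: dict = {"jobID": job_id}
    while True:
        page = drs.describe_job_log_items(**kwargs)
        for item in page.get("items", []):
            server = item.get("eventData", {}).get("sourceServerID", "")
            events.append((item.get("logDateTime", ""), item["event"], server))
        token = page.get("nextToken")
        if not token:
            return events
        kwargs["nextToken"] = token


def failed_events(events: List[Event]) -> List[Event]:
    # Heuristic: failure events contain "FAIL" in their name.
    return [e for e in events if "FAIL" in e[1]]
```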

5. Open the Amazon EC2 console to see the running instances and perform any necessary validation of your launched applications.

Picture 9 - Running state

If necessary, make any configuration updates and then re-validate.
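Before validating applications, it helps to know which EC2 instance belongs to which source server. This sketch, assuming a boto3-style DRS client, maps each source server to the EC2 instance ID and state of its drill recovery instance, skipping instances launched for an actual recovery via the `isDrill` flag. The function name `drill_instances` is hypothetical.

```python
from typing import Any, Dict, Iterable


def drill_instances(drs: Any, source_server_ids: Iterable[str]) -> Dict[str, dict]:
    """Map each source server to its drill recovery instance's EC2 ID and state."""
    kwargs: dict = {"filters": {"sourceServerIDs": list(source_server_ids)}}
    result: Dict[str, dict] = {}
    while True:
        page = drs.describe_recovery_instances(**kwargs)
        for inst in page.get("items", []):
            if not inst.get("isDrill"):
                continue  # skip instances launched for an actual recovery
            result[inst["sourceServerID"]] = {
                "ec2_instance_id": inst.get("ec2InstanceID"),
                "state": inst.get("ec2InstanceState"),
            }
        token = page.get("nextToken")
        if not token:
            return result
        kwargs["nextToken"] = token
```

Any server whose state is not `RUNNING` is worth investigating before you validate its application.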

Cleaning up

After your DR testing is complete, you can delete your launched test machines through the Elastic Disaster Recovery console to avoid incurring further costs. Elastic Disaster Recovery automatically removes the resources created during tests when you request it or when you launch a new test. You can prevent this cleanup by enabling termination protection.
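Assuming the termination protection mentioned above is the standard EC2 `DisableApiTermination` instance attribute, this sketch looks up the EC2 instance behind a recovery instance and toggles that attribute, using boto3-style DRS and EC2 clients. The function name `protect_drill_instance` is hypothetical.

```python
from typing import Any


def protect_drill_instance(drs: Any, ec2: Any, recovery_instance_id: str,
                           enabled: bool = True) -> str:
    """Toggle EC2 termination protection on a recovery instance; return its EC2 ID."""
    items = drs.describe_recovery_instances(
        filters={"recoveryInstanceIDs": [recovery_instance_id]}).get("items", [])
    if not items:
        raise LookupError(f"recovery instance {recovery_instance_id} not found")
    ec2_id = items[0]["ec2InstanceID"]
    # DisableApiTermination=True blocks termination of the instance via the API or console.
    ec2.modify_instance_attribute(InstanceId=ec2_id,
                                  DisableApiTermination={"Value": enabled})
    return ec2_id
```

Call it again with `enabled=False` when you no longer need to keep the instance.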

1. To clean up the launched drill instances, open the Elastic Disaster Recovery console and choose Recovery instances in the left navigation pane.

Picture 10 - Recovery instances

2. Select the launched machines by checking the box to the left of each one, open the Actions menu, and choose Terminate recovery instances.

Picture 11 - Select Terminate recovery instances

3. Choose Terminate to delete the launched test machines.

Picture 12 - Terminate instance
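The cleanup steps can also be scripted. This sketch, assuming a boto3-style DRS client, collects every recovery instance flagged `isDrill` and terminates them in one `terminate_recovery_instances` call, leaving instances from an actual recovery untouched. The function name `terminate_drill_instances` is hypothetical.

```python
from typing import Any, List, Optional


def terminate_drill_instances(drs: Any) -> Optional[str]:
    """Terminate every drill recovery instance; return the DRS job ID, if any."""
    drill_ids: List[str] = []
    kwargs: dict = {"filters": {}}
    while True:
        page = drs.describe_recovery_instances(**kwargs)
        # Only drill instances are selected; recovery instances are never touched.
        drill_ids += [inst["recoveryInstanceID"] for inst in page.get("items", [])
                      if inst.get("isDrill")]
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token

    if not drill_ids:
        return None
    job = drs.terminate_recovery_instances(recoveryInstanceIDs=drill_ids)["job"]
    return job["jobID"]
```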

Conclusion

You can follow the simple steps I described in this post to perform frequent, non-disruptive DR tests using AWS Elastic Disaster Recovery. The purpose of DR testing is to validate that your organization can maintain business continuity with minimal downtime in the event of an IT disruption. Any operational issue that surfaces during a DR test is an opportunity to resolve it before an actual disruption occurs.

Familiarity with testing and recovery processes enables your organization to verify that you can respond quickly if you must recover workloads on AWS. While this post described DR testing for a single server, you can apply the same process to build scalable DR plans and test many servers at once.
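One way to scale drills to multi-tier applications is to launch them in waves, for example databases before application servers. This is a sketch of that pattern, assuming a boto3-style DRS client: each wave is one `start_recovery` drill job, and the next wave starts only after the previous job completes with every server `LAUNCHED`. The function name, the wave structure, and the polling defaults are hypothetical, not a DRS feature.

```python
import time
from typing import Any, Callable, Dict, List, Sequence


def drill_in_waves(drs: Any, waves: Sequence[Sequence[str]],
                   poll_seconds: float = 30.0, timeout_seconds: float = 3600.0,
                   sleep: Callable[[float], None] = time.sleep) -> List[Dict[str, str]]:
    """Launch drills wave by wave; stop at the first wave with a failed launch."""
    results: List[Dict[str, str]] = []
    for wave in waves:
        job_id = drs.start_recovery(
            sourceServers=[{"sourceServerID": sid} for sid in wave],
            isDrill=True)["job"]["jobID"]
        deadline = time.monotonic() + timeout_seconds
        while True:
            job = drs.describe_jobs(filters={"jobIDs": [job_id]})["items"][0]
            if job["status"] == "COMPLETED":
                break
            if time.monotonic() >= deadline:
                raise TimeoutError(f"wave job {job_id} did not complete")
            sleep(poll_seconds)
        statuses = {p["sourceServerID"]: p.get("launchStatus", "UNKNOWN")
                    for p in job.get("participatingServers", [])}
        results.append(statuses)
        if any(status != "LAUNCHED" for status in statuses.values()):
            break  # later tiers would start without their dependencies
    return results
```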

Thanks for reading this blog post. If you have any comments or questions, don’t hesitate to leave them in the comments section.

AWS Storage Blog