How to test your AWS Elastic Disaster Recovery implementation

Maintaining application and data resilience in the face of an ever-evolving risk landscape is a challenge for applications with legacy architectures. These risks can include ransomware attacks, natural disasters, user error, hardware faults, and many others. Organizations want to recover workloads from an unforeseen event within appropriate timescales and with minimal loss of data. They seek dependable solutions to achieve resilience goals within financial, regulatory, and time-related constraints. For cloud-native applications, these requirements are often addressed through highly available multi-Availability Zone architectures. For legacy applications with monolithic architectures, this is often not possible because they are constrained to a single on-premises data center or AWS Availability Zone.

[Figure: AWS DRS service diagram]

AWS Elastic Disaster Recovery (AWS DRS) minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications using affordable storage, minimal compute, and point-in-time recovery. If you need to recover applications, you can launch recovery instances on AWS within minutes, using the most up-to-date server state or restore to a previous point in time. After your applications are running on AWS, you can choose to keep them running in AWS, or you can initiate data replication back to your primary site when the issue is resolved. You can also use Elastic Disaster Recovery to replicate across AWS Availability Zones or Regions for additional resiliency within AWS.

This blog post outlines items to consider when testing your Elastic Disaster Recovery implementation. It is aimed at IT leadership, service management, application support teams and business continuity teams.

Business Continuity Process and Disaster Recovery Testing

Disaster recovery testing needs to validate the DR tooling, confirm that business applications are recoverable to another facility, and exercise the associated operational processes. Performing these tests also helps you estimate the recovery time objectives (RTOs) you’ll be able to achieve. The testing we discuss here covers these narrow aspects of a wider business continuity planning (BCP) initiative. BCP testing validates that business services can be recovered in the event of disruption as part of a wider exercise, which may also include work area recovery.
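As a minimal sketch of how a drill can feed an RTO estimate, the following compares a measured recovery time against a target. The function names, timestamps, and the one-hour target are hypothetical values chosen for illustration, not figures from AWS DRS.

```python
from datetime import datetime, timedelta


def achieved_recovery_time(declared_at: datetime, restored_at: datetime) -> timedelta:
    """Return the elapsed time between declaring the drill and service restoration."""
    if restored_at < declared_at:
        raise ValueError("restored_at must not be earlier than declared_at")
    return restored_at - declared_at


def meets_rto(achieved: timedelta, target: timedelta) -> bool:
    """True when the measured recovery time is within the target RTO."""
    return achieved <= target


# Hypothetical drill timestamps and target, for illustration only.
declared = datetime(2024, 1, 15, 9, 0)
restored = datetime(2024, 1, 15, 9, 47)
elapsed = achieved_recovery_time(declared, restored)
print(elapsed, meets_rto(elapsed, timedelta(hours=1)))
```

Recording these two timestamps for every drill gives you a history of achieved recovery times to compare against the objective, rather than a single estimate.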

Validating the DR tooling can generally be achieved without interrupting live systems and can be limited to a selection of representative servers. These servers can be failed over within a strictly isolated environment, with testing limited to confirming that the servers boot and can access their locally attached data. To perform application-level testing, test users need a way to access the isolated network that still protects production systems and data.

Elastic Disaster Recovery Service Considerations

For the remainder of this blog post, we will outline items to consider when performing a DR test using AWS DRS.

DR scenarios and blast radius: Prior to implementing your disaster recovery solution, you should understand what you classify as a ‘disaster’. For example, do you classify a disaster as a critical application failure, or are you focused on an entire data center failure? In the latter case, more careful planning will be needed, as shared services and network components will also need to be considered as part of the implementation. Take time to understand the failure and incident types, and to outline the resolution that will be applied to each.

DR solution compatibility: Elastic Disaster Recovery allows RPOs of seconds and RTOs of minutes. However, this is only possible for servers that are compatible with both the AWS Replication Agent and the AWS Cloud. In some circumstances, workloads cannot be protected and require their own DR solution, for example NAS systems and appliances that don’t run DRS-compatible operating systems, or certain network appliances. The AWS Disaster Recovery of On-Premises Applications to AWS Whitepaper can help with suggestions on how to approach these challenges.

Data not on block storage: Elastic Disaster Recovery synchronises data as it’s written to disk. Therefore, it’s important to understand that in-memory workloads and any data that sits in a disk write cache will not be synchronised to your DR site via Elastic Disaster Recovery.

When running a non-isolated DR drill, ensuring that applications are gracefully shut down and the source OS has completed all disk-write operations will help to avoid data loss or corruption. Data held on NFS shares is also not synchronised, because the OS does not view an NFS share as block storage. The AWS Replication Agent will synchronise data stored on SAN volumes presented as a local disk, because these are exposed to the operating system as block storage.
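One way to find data that falls outside block-level replication is to list the NFS mounts on a Linux source server. The sketch below parses text in the `/proc/mounts` format; the function name, the set of filesystem type names, and the sample entries are assumptions for illustration and are not part of Elastic Disaster Recovery.

```python
from typing import List, Tuple

# Assumption: these filesystem type names identify NFS mounts in /proc/mounts.
NFS_TYPES = {"nfs", "nfs4"}


def find_nfs_mounts(mounts_text: str) -> List[Tuple[str, str]]:
    """Return (device, mount_point) pairs for NFS entries in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each entry is: device mount_point fs_type options dump pass
        if len(fields) >= 3 and fields[2] in NFS_TYPES:
            found.append((fields[0], fields[1]))
    return found


# Hypothetical sample; on a Linux source server you could read /proc/mounts instead.
sample = """\
/dev/nvme0n1p1 / xfs rw,relatime 0 0
filer01:/export/app /mnt/app nfs4 rw,relatime 0 0
/dev/sdb1 /data ext4 rw,relatime 0 0
"""
print(find_nfs_mounts(sample))
```

Any mount reported here would need its own protection mechanism, since its contents are not written through the server's local block devices.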

Software licensing: When testing and using Elastic Disaster Recovery, it is important to understand software licensing restrictions which may apply. One scenario to consider is whether your existing licenses allow for temporary concurrent operations during DR events and drills, because you will have twice as many instances running at the same time.

Elastic Disaster Recovery will default to using License Included EC2 instances for Windows operating systems and BYOL (bring your own license) for Linux operating systems. This is configurable within Elastic Disaster Recovery. However, ensuring compliance with licensing terms remains a customer responsibility in line with the AWS Shared Responsibility Model. In addition, you can use AMIs with pay-as-you-go license options from AWS Marketplace as Elastic Disaster Recovery instances. You do this by specifying the AMI ID in the Launch Template for the DRS-protected recovery instance. This can provide another option to address software licensing restrictions.
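The concurrent-operation concern can be checked with simple arithmetic before a drill. This is a rough sketch for bring-your-own-license software only; the function name and the license counts are hypothetical, and your actual licensing terms may count usage differently.

```python
def license_shortfall(licenses_owned: int, source_instances: int, recovery_instances: int) -> int:
    """Return how many additional licenses are needed while source and recovery instances run together."""
    peak_concurrent = source_instances + recovery_instances
    return max(0, peak_concurrent - licenses_owned)


# Hypothetical example: 10 licenses, 8 licensed source servers, all 8 launched in a drill.
print(license_shortfall(licenses_owned=10, source_instances=8, recovery_instances=8))
```

A result greater than zero suggests reviewing the license terms for DR allowances, or considering the License Included or AWS Marketplace options described above.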

Shared services: Access to shared services is often required, for example domain controllers for servers with no local admin accounts, shared network storage systems, load balancers, or license key servers. Consideration should be given to how these will be accessed in various DR scenarios, for both isolated and non-isolated testing.

Isolated vs non-isolated DR drills: The first consideration should be whether the DR drill will impact production traffic. Using Elastic Disaster Recovery, you have the option to test your DR solution without impacting production traffic, by launching copies of your source servers in an isolated network environment. However, Elastic Disaster Recovery does not create this isolated network environment for you – the choice of an isolated or non-isolated drill is defined when you configure the Elastic Disaster Recovery Launch Settings and Launch Templates.

An isolated environment should be designed to mirror the live environment, and would ideally be implemented using an automated approach to help ensure equivalence with the live environment. In this scenario, the isolated network environment needs to be in place prior to a DR drill. When running a drill in an isolated network environment, the live applications and servers are not impacted as communications with the live environment will be blocked.

Elastic Disaster Recovery creates bootable images of your current production servers and launches these into the isolated network environment. You will then have two instances of your applications and servers running, allowing you to perform application testing in the isolated network environment without risk of impacting live services. During an isolated DR drill, Elastic Disaster Recovery continues to replicate your production servers, so that if a real disaster event occurs during your isolated DR drill, you can still perform disaster recovery.
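A drill launch can also be started programmatically. The sketch below assumes a boto3 `drs` client is passed in and uses its `start_recovery` operation with `isDrill` set to true. The helper names and the source server ID shown in the comment are hypothetical, and where the instances launch is still governed by the Launch Settings and Launch Templates you have configured.

```python
from typing import Any, Dict, Iterable


def build_recovery_request(source_server_ids: Iterable[str], is_drill: bool = True) -> Dict[str, Any]:
    """Build the keyword arguments for the DRS start_recovery operation."""
    ids = list(source_server_ids)
    if not ids:
        raise ValueError("at least one source server ID is required")
    return {
        "isDrill": is_drill,
        "sourceServers": [{"sourceServerID": server_id} for server_id in ids],
    }


def start_drill(drs_client: Any, source_server_ids: Iterable[str]) -> str:
    """Start a drill for the given source servers and return the DRS job ID."""
    request = build_recovery_request(source_server_ids, is_drill=True)
    response = drs_client.start_recovery(**request)
    return response["job"]["jobID"]


# Usage, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   job_id = start_drill(boto3.client("drs"), ["s-EXAMPLE"])  # hypothetical server ID
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub before it is pointed at a real account. Building the request with `is_drill=False` marks the launch as a recovery rather than a drill.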

Consideration will need to be given to how your test analysts will gain safe access to the isolated network, applications and servers to perform application acceptance testing. This can be achieved using AppStream or Citrix on AWS, implemented within the isolated network, with tightly configured network firewall rules allowing only the network protocols needed to support remote access.
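To illustrate what "tightly configured" can mean in practice, the sketch below compares a set of ingress rules against an allow-list and reports anything else. The rule structure, the allow-list, and the CIDR ranges are hypothetical; substitute the protocols and ports that your remote access tooling actually requires.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class IngressRule(NamedTuple):
    protocol: str
    port: int
    source_cidr: str


def unexpected_rules(rules: Iterable[IngressRule], allowed: Set[Tuple[str, int]]) -> List[IngressRule]:
    """Return the rules whose (protocol, port) pair is not in the allow-list."""
    return [rule for rule in rules if (rule.protocol.lower(), rule.port) not in allowed]


# Hypothetical allow-list and rules, for illustration only.
ALLOWED = {("tcp", 443)}
rules = [
    IngressRule("tcp", 443, "10.20.0.0/24"),
    IngressRule("tcp", 3389, "10.0.0.0/8"),
]
print(unexpected_rules(rules, ALLOWED))
```

Running a check like this before each drill helps confirm that the isolated network has not drifted to allow paths back to production.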

With Elastic Disaster Recovery, you also have the option to perform a non-isolated ‘disruptive’ DR exercise. This scenario is similar to a real DR situation, where production traffic is redirected to the recovery environment, and is referred to as a “recovery event”.

To fully verify their DR capability, some customers may choose, or be required, to perform a live DR event. We recommend performing this only after routine drills have verified that both the technical DR solution and the associated processes operate as expected.

Microsoft Windows Server Failover Clusters (WSFC): Elastic Disaster Recovery can replicate the nodes and data hosted within a traditional Microsoft Windows Server Failover Cluster, allowing you to recover the application running on a single node. However, manual effort will be required to recreate a clustered environment in AWS.

When considering DR protection in AWS for a Microsoft Windows Server Failover Cluster located on-premises, you could also evaluate alternative approaches for achieving this objective in AWS. For example, with a WSFC hosting databases, consider implementing a new cluster in AWS and using native database replication capabilities to protect the data in AWS. Alternatively, you could set up database transaction log shipping to a managed Amazon RDS for SQL Server instance, which can be configured to use multiple Availability Zones for high availability in AWS. Be sure to identify your on-premises WSFC during your disaster recovery design phase to allow time for the right solutions to be implemented.

Conclusion and Next Steps

We have introduced some of the common scenarios encountered when planning and executing DR tests using the AWS Elastic Disaster Recovery Service. We discussed the following topic areas at a high level to be considered as part of your DR testing strategy:

Isolated vs non-isolated DR drills
Shared services
Software licensing
DR scenarios and blast radius
DR solution compatibility
Data not on block storage
Microsoft Windows Server Failover Clusters (WSFC)

AWS recommends that you perform ongoing DR drills to help maintain your disaster readiness, especially when changes are made which can be either technical (such as adding new servers to the application) or non-technical (such as changing responsibilities of support teams).

AWS Cloud Operations Blog