Building a disaster recovery site on AWS for workloads on Microsoft Azure

Some enterprises run their IT operations using a multi-cloud environment, often for compliance, cost, or redundancy reasons. At times, these enterprises may be required to host a copy of their data, or even a full disaster recovery (DR) solution, on another cloud provider to provide an additional layer of protection.

In this post, we show you how to build a full end-to-end DR solution on AWS, using AWS Elastic Disaster Recovery (AWS DRS), for main workloads that run on Microsoft Azure. We build the necessary networking between Azure and AWS, replicate the virtual machines (VMs) on Azure to AWS, and then fail over from Azure to AWS. After completing the DR testing, we show you how to fail back from AWS to Azure to return to normal operations.

AWS DRS provides scalable, cost-effective application recovery to AWS using a non-disruptive, continuous block-level replication mechanism. This allows customers to achieve a crash-consistent recovery point objective (RPO) of seconds, and a recovery time objective (RTO) that typically ranges from 5 to 20 minutes.

Solution walkthrough

The following are the high-level steps to build this solution.

1. Build connectivity between Azure and AWS.
2. Configure AWS DRS on Azure, replicate source VMs on Azure, and then fail over from Azure to AWS.
3. Once the DR test is complete, return to normal operations by failing back from AWS to Azure.

There are many resources online that you can use to perform Steps 1 and 2. Therefore, to keep the post short, I point you to these references as we go. The focus of this post is the failback section only.

Prerequisites

These prerequisites are required for following along with this post:

You have an account on AWS and on Azure.
You have an Amazon Virtual Private Cloud (Amazon VPC) on AWS and a virtual network (VNet) on Azure with the necessary subnets.
You have a number of VMs on Azure for testing.
You have a dedicated subnet to host the replication server of AWS DRS.

1. Establish connectivity between Azure and AWS

There are several ways to establish network connectivity between Azure and the AWS Cloud. The quickest one is to set up a site-to-site VPN. This method comes with some disadvantages, such as limited throughput and unpredictable routing over the public internet. If you’re looking for more dedicated bandwidth, then you could use a combination of AWS Direct Connect and Azure ExpressRoute to build private connectivity between the two clouds.
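Because VPN throughput is limited, it can help to estimate how long the initial replication might take before choosing a connectivity option. The following is a rough back-of-the-envelope sketch, not a sizing tool: the function name, the 500 GiB data size, the 200 Mbps link speed, and the 0.7 efficiency factor are all illustrative assumptions, not measurements from this demo.

```python
def estimate_sync_hours(data_gib: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Estimate the hours needed to copy data_gib GiB over a link_mbps Mbps link.

    efficiency is the assumed fraction of the nominal bandwidth that
    replication actually achieves (protocol overhead, shared links, and
    internet routing all reduce it). It is a guess, not a measured value.
    """
    if data_gib < 0 or link_mbps <= 0 or not (0 < efficiency <= 1):
        raise ValueError("data_gib must be >= 0, link_mbps > 0, and 0 < efficiency <= 1")
    total_bits = data_gib * 1024**3 * 8  # GiB -> bytes -> bits
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return total_bits / usable_bits_per_second / 3600  # seconds -> hours


if __name__ == "__main__":
    # Illustrative inputs only: 500 GiB of source disks over a 200 Mbps VPN.
    hours = estimate_sync_hours(data_gib=500, link_mbps=200, efficiency=0.7)
    print(f"Estimated initial sync: {hours:.1f} hours")
```

The same arithmetic applies in the opposite direction during failback, where the amount of data written on AWS determines how long the reversed replication takes.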

For the purpose of this demo, I used the steps here to build a site-to-site VPN. After completing the setup, the network architecture should look like this:

AWS Azure network architecture

There are two things to consider about this architecture:

I used a virtual private gateway (VGW) on the AWS side for the demo. If you have multiple networks or multiple Amazon VPCs, then I recommend that you use AWS Transit Gateway.
On AWS, you always get two public IP addresses, one per VPN tunnel. On Azure, this doesn’t happen by default, so you must manually add high availability on the Azure side using Active/Passive mode.

2. Configure AWS DRS on Azure, replicate source VMs, and fail over from Azure to AWS

Complete the following steps to install the AWS Replication Agent on each VM on Azure and fail over to AWS.

2.1. Create an AWS Identity and Access Management (IAM) user on AWS for AWS DRS.
2.2. Install AWS Replication Agent on the source environment on Azure VMs.
2.3. Configure launch settings on AWS DRS.
2.4. Prepare for failover and then perform the failover.

After completing the failover, the servers on AWS DRS console show as Ready for Recovery.

DRS Source Servers Ready for Recovery

3. Return to normal operations and failback from AWS to Azure

Once I complete my DR exercise testing, I must return the replication direction to its original state before the failover. This way, all of the data that was replicated from Azure to AWS during the failover process is replicated back to Azure. This process is called failback, and it consists of the following steps:

3.1. Prepare Failback Client
3.2. Meet the networking requirements
3.3. Create an IAM user and generate IAM credentials for the failback operation
3.4. Boot Failback Client on Azure and complete the replication
3.5. Cutover and switch the replication back to normal

3.1. Prepare Failback Client

The Failback Client is used to boot the server to which my system fails back.

Because the Failback Client is in the livecd.iso format, I can’t use it directly to boot a VM on Azure. To resolve this, I convert it to a virtual hard disk (VHD) that can be used on Azure. To do so, I use VMware Workstation with the failback ISO, attach a disk, and partition it with an ext4 or XFS file system. The conversion process installs the Linux kernel, generates the GRand Unified Bootloader (GRUB) configuration file (grub.cfg), and then installs GRUB on the disk. The output from this conversion process is a virtual disk (/dev/sdb as a VMDK), which I use to create an Azure-compatible image that I can later use to create the failback VM.

The process to convert the failback client is as follows:

3.1.1. Download the failback client. The download link for the client depends on the Region where the Recovery instances are located. I’m using us-east-2. I use this link to download the Failback Client software.

3.1.2. Convert the failback client to a VMDK format. To make this process easier, we created a script that does this conversion. You can download it from here. The solution includes step-by-step instructions.

3.1.3. Convert the VMDK file that results from step 3.1.2 into a format that can be used to boot a VM on Azure (VHD). Detailed steps are here. During this step, you will likely encounter the error “The entry 0 is not a supported disk database entry for the descriptor”. Check the steps here to troubleshoot and resolve the error.
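As one possible illustration of step 3.1.3, the sketch below builds the qemu-img commands for a VMDK-to-VHD conversion. This is an assumption on my part: the linked steps may use a different tool, and the file names here are hypothetical. The sketch follows the commonly documented requirement that an Azure VHD is fixed-size with a virtual size aligned to 1 MiB. It only prints the commands and does not run them.

```python
import shlex
from typing import List

MIB = 1024 * 1024


def round_up_to_mib(size_bytes: int) -> int:
    """Round a size in bytes up to the next whole multiple of 1 MiB."""
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    return ((size_bytes + MIB - 1) // MIB) * MIB


def vmdk_to_raw_command(vmdk_path: str, raw_path: str) -> List[str]:
    """First pass: unpack the VMDK into a raw image so that it can be resized."""
    return ["qemu-img", "convert", "-f", "vmdk", "-O", "raw", vmdk_path, raw_path]


def raw_to_vhd_commands(raw_path: str, vhd_path: str, raw_size_bytes: int) -> List[List[str]]:
    """Second pass: align the raw image to 1 MiB, then write a fixed-size VHD.

    raw_size_bytes is the size of the raw image produced by the first pass,
    for example as returned by os.path.getsize(raw_path).
    """
    aligned = round_up_to_mib(raw_size_bytes)
    return [
        ["qemu-img", "resize", "-f", "raw", raw_path, str(aligned)],
        ["qemu-img", "convert", "-f", "raw", "-o", "subformat=fixed,force_size",
         "-O", "vpc", raw_path, vhd_path],
    ]


if __name__ == "__main__":
    # Hypothetical file names and size, for illustration only.
    commands = [vmdk_to_raw_command("failback.vmdk", "failback.raw")]
    commands += raw_to_vhd_commands("failback.raw", "failback.vhd", 8_589_934_000)
    for command in commands:
        print(shlex.join(command))
```

Splitting the conversion into two passes reflects that the raw image size is only known after the first command has run.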

3.2. Meet the networking requirements

The failback process requires the following connections to be permitted. More details can be found here.

TCP 1500 from the Failback Client on Azure to the Recovery instance on AWS.
TCP 443 from the Failback Client on Azure to the Amazon Simple Storage Service (Amazon S3) endpoint.
TCP 443 from the Failback Client on Azure to the AWS DRS endpoint.

AWS DRS failback replication architecture-networking
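Before starting the failback, it can be useful to confirm from the Failback Client network on Azure that the three connections listed above are actually open. The following is a minimal sketch using only a TCP connect test. The recovery instance IP address is a placeholder, and the regional endpoint host names follow the usual service.region.amazonaws.com pattern, which I assume here rather than take from the linked documentation.

```python
import socket
from typing import List, Tuple


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def failback_targets(region: str, recovery_instance_ip: str) -> List[Tuple[str, str, int]]:
    """List the (label, host, port) connections that the failback process needs."""
    return [
        ("Recovery instance", recovery_instance_ip, 1500),
        ("Amazon S3 endpoint", f"s3.{region}.amazonaws.com", 443),       # assumed host name
        ("AWS DRS endpoint", f"drs.{region}.amazonaws.com", 443),        # assumed host name
    ]


if __name__ == "__main__":
    # 10.0.1.25 is a placeholder for the private IP of your Recovery instance.
    for label, host, port in failback_targets("us-east-2", "10.0.1.25"):
        status = "open" if can_connect(host, port) else "BLOCKED"
        print(f"{label:20} {host}:{port} -> {status}")
```

A successful TCP connect only shows that routing and firewall rules allow the traffic. It does not validate credentials or the replication itself.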

3.3. Create an IAM user and generate IAM credentials for the failback operation

Follow the instructions in Using the Failback Client to create an IAM user for the failback. Note the keys generated, as we utilize them in a later step.

3.4. Boot Failback Client on Azure and complete the replication

In this step, I use the image I created in step 3.1.3 to boot the Failback Client on Azure and pair it with AWS. This reverses the replication direction, so all of the data that was previously replicated from Azure to AWS during the failover process is replicated back from AWS to Azure. Make sure that the VM you use to boot the Failback Client has enough space to accommodate any additional data written to AWS during the failback.

Note the following:

The server on which the Failback Client runs must have at least 4 GB of dedicated RAM.
Make sure to deactivate Secure Boot on the server on which the Failback Client runs.

Make sure that the server to which you are failing back has the same number of volumes or more than the Recovery Instance, and that the volume sizes are equal to or larger than the ones on the recovery instance as shown here.

lsblk show instance volumes
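The volume requirement above can be checked with a short script on the failback server. This is a sketch under stated assumptions: it reads local disks with lsblk in JSON mode, the recovery instance volume sizes are typed in by hand, and excluding sda as the boot disk of the Failback Client is my assumption about the VM layout.

```python
import json
import subprocess
from typing import Iterable, List

GIB = 1024**3


def parse_lsblk_json(text: str, exclude: Iterable[str] = ()) -> List[int]:
    """Return the sizes in bytes of whole disks listed in lsblk JSON output."""
    skipped = set(exclude)
    devices = json.loads(text)["blockdevices"]
    # Older lsblk versions report sizes as strings, so normalize with int().
    return [int(d["size"]) for d in devices
            if d.get("type") == "disk" and d.get("name") not in skipped]


def local_disk_sizes(exclude: Iterable[str] = ()) -> List[int]:
    """Run lsblk on this machine and return the disk sizes in bytes."""
    result = subprocess.run(
        ["lsblk", "--bytes", "--json", "--nodeps", "--output", "NAME,SIZE,TYPE"],
        check=True, capture_output=True, text=True,
    )
    return parse_lsblk_json(result.stdout, exclude)


def volumes_fit(recovery_sizes: List[int], local_sizes: List[int]) -> bool:
    """Check that every recovery volume can be paired with a distinct local
    volume that is at least as large."""
    if len(local_sizes) < len(recovery_sizes):
        return False
    needed = sorted(recovery_sizes, reverse=True)
    available = sorted(local_sizes, reverse=True)
    # Pairing largest with largest succeeds whenever any valid pairing exists.
    return all(have >= need for need, have in zip(needed, available))


if __name__ == "__main__":
    # Hypothetical Recovery instance with one 30 GiB and one 64 GiB volume.
    recovery = [30 * GIB, 64 * GIB]
    local = local_disk_sizes(exclude=("sda",))  # assumes sda is the boot disk
    print("Volumes fit" if volumes_fit(recovery, local) else "Add or resize volumes")
```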

It’s time to start the failback process. Run “start.sh” in the command line. The client asks you to provide the following details.

AWS access key and secret access key: Enter the credentials that you created in step 3.3 into the Failback Client.
Recovery instance ID: The Failback Client tries to map the configuration of the server where the Failback Client is running to the Recovery instances on AWS. If it doesn’t find a matching instance, then it asks you to enter the ID of the Recovery instance from which you’d like to fail back. You can find the instance ID in the AWS DRS console.
Local block device: The failback client tries to map the volumes in the recovery instance and the failback server. If it doesn’t find mapping, then it asks you to manually enter the volume ID where you want the data to be replicated.

start replication

After successfully mapping the volumes and establishing a connection, the Failback Client will download the replication software and start the reversed replication from AWS back to Azure. At this point, the screen shows Replication in progress.

Data replication status

After some time, I see the replication has progressed to 45%.

Recovery instances progress 45 percent

Once the Recovery instances that I plan to fail back show the previous statuses, I select the checkbox to the left of each Instance ID and choose Failback. This stops data replication and starts the conversion process. Furthermore, this finalizes the failback process and creates a replica of each Recovery instance on the corresponding source server.

Continue with failback for 1 instance

This action creates a Job, which you can follow on the Recovery job history page in the AWS DRS console. After the failback is complete, the Failback Client shows that the failback has been completed successfully.

Connectivity established
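The failback state shown in the console can also be read programmatically. The following is a hedged sketch using the boto3 client for AWS DRS. The response field names (items, recoveryInstanceID, failback, state) reflect my reading of the API and should be checked against the current boto3 reference. The sketch reads only the first page of results.

```python
from typing import Any, Dict, List


def failback_states(drs_client: Any) -> List[Dict[str, str]]:
    """Return the failback state of each Recovery instance.

    drs_client is expected to behave like boto3.client("drs"). Field names
    are assumptions to verify against the boto3 documentation.
    """
    response = drs_client.describe_recovery_instances(filters={})
    rows = []
    for item in response.get("items", []):
        rows.append({
            "recovery_instance": item.get("recoveryInstanceID", "unknown"),
            "failback_state": item.get("failback", {}).get("state", "unknown"),
        })
    return rows


if __name__ == "__main__":
    # Imported here so that the function above can be used without boto3.
    import boto3

    client = boto3.client("drs", region_name="us-east-2")
    for row in failback_states(client):
        print(f'{row["recovery_instance"]}: {row["failback_state"]}')
```

Passing the client in as a parameter keeps the function easy to exercise with a stub object before pointing it at a real account.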

3.5. Cutover and switch the replication back to normal

At this point, I have all of the data from the Recovery instance (on AWS) replicated to the new volume (sdb) that I created on the Failback Client VM on Azure. The last step to complete the failback process is to use this volume to create a new image, and then use the image to spin up a new VM on Azure. The details of this step can be found here. This new VM has the original data (that is, the data from before the disaster) and the data written on AWS during the DR test. Then, I boot this VM and use it to return to normal operations.

Solution cost and pricing

The main components that affect your pricing for this solution are as follows:

AWS DRS: this includes the per-hour cost for each server being replicated, plus the use of Amazon Elastic Block Store (Amazon EBS), Amazon EBS snapshots, and Amazon Elastic Compute Cloud (Amazon EC2)
AWS Site-to-Site VPN
Azure egress traffic (for replication and failover)
AWS egress traffic (for reversed-replication and failback)
Other components used on the Azure side

Cleaning up

To avoid incurring unwanted AWS costs after performing these steps, delete the AWS and Azure resources created for this demonstration. These include the VMs and networking components on Azure, the replication instances, Amazon EBS snapshots, and the Recovery Instance created by AWS DRS on AWS.

Conclusion

In this post, I showed you the steps to build a DR solution on AWS Cloud for workloads hosted on Azure using AWS DRS. First, we built connectivity between the two clouds, installed and configured AWS DRS on the Azure side, and completed the replication and failover from Azure to AWS. The second part of this exercise is returning your DR operations to normal after the disaster (or the DR test) is completed. You do that by reversing the replication direction to be from AWS back to Azure. We did this by preparing the AWS Failback Client to make it compatible to boot on Azure VMs. I shared the step-by-step order for this and instructions on a GitHub project to achieve that. Then, we performed the failback on Azure.

Using this solution, you have your data continuously replicated from Azure to AWS, and the target Amazon EC2 instances on AWS are pre-configured using the AWS DRS console. Also, using the conversion script we provided on GitHub, you can convert the Failback Client to an Azure-compatible image and use it to build a VM on Azure to return to normal operations after the DR testing (or the actual outage) is concluded.

There are a few other considerations for a successful failover and failback that must be planned for, such as DNS cutover. We didn’t cover this in this post to keep it simple, but we covered the available options in a similar post about building a DR site on AWS for GCP workloads. You can check it here.