Disaster recovery for VMware Cloud on AWS using AWS Elastic Disaster Recovery

Disasters such as natural calamities, application or service failures, human errors, and ransomware attacks can cause downtime for business applications, along with data loss and lost revenue. To help prepare for potential outages, companies running VMware on-premises are increasingly incorporating VMware Cloud on AWS (VMC) into their hybrid cloud strategy. Disaster recovery (DR) is a vital part of that strategy because it helps ensure business continuity in the event of a disaster.

VMC lets you quickly migrate VMware workloads to a VMware-managed Software-Defined Data Center (SDDC) running in AWS. You can extend your on-premises data centers without replatforming or refactoring applications, and you can deploy VMware SDDCs and consume vSphere workloads on AWS global infrastructure as a managed service. Using native AWS services with the Virtual Machines (VMs) in the SDDC can reduce operational overhead and lower your Total Cost of Ownership (TCO) while increasing your workloads’ agility and scalability.

With AWS Elastic Disaster Recovery (DRS), you can quickly and easily implement new DR plans or migrate existing DR plans onto AWS. The source servers can run on AWS, in your existing physical or virtual data centers, in private clouds, in other public clouds, or in hybrid cloud environments such as VMC. Elastic Disaster Recovery lets you use AWS as an elastic recovery site instead of investing in on-premises DR infrastructure that sits idle until needed. It also lets you pay on an hourly basis, with no long-term contract or commitment to a set number of servers, which is an advantage over on-premises or data center recovery solutions.

In this post, we demonstrate how to use Elastic Disaster Recovery as an AWS native solution that helps customers minimize downtime and data loss while providing cost-effective DR for VMC. We provide a step-by-step configuration for failover from a VMC source machine to a target Amazon Elastic Compute Cloud (Amazon EC2) instance in a recovery Region. We then demonstrate failback from Amazon EC2 to VMC using Elastic Disaster Recovery, which completes the DR cycle.

Solution overview

Setting up VMC requires two accounts. The first is the VMware Cloud SDDC account, an AWS account that runs the SDDC (VMC) resources and is owned and operated by VMware. The second is a customer-owned AWS account. To attach this account to the SDDC, it must contain at least one Amazon Virtual Private Cloud (Amazon VPC), which we call the Connected VPC. When you provision the SDDC by following the steps outlined in deploying VMC SDDC, Elastic Network Interfaces (ENIs) are created in the Connected VPC to provide high-bandwidth, low-latency access to AWS services in that VPC.

Start with a VMware Cloud SDDC account and a Connected VPC account. In the Elastic Disaster Recovery target account, you set up a VPC and connect it, through an AWS Transit Gateway attachment, to a Transit Gateway that the Connected VPC account shares with the target account using AWS Resource Access Manager (AWS RAM). This Transit Gateway is peered with VMware Transit Connect, a white-labeled Transit Gateway provided by VMware. Through this peering, the VPC in the Elastic Disaster Recovery target account can reach the VMware Cloud SDDC.
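The target-account networking described above (the Transit Gateway attachment and the route back to the SDDC) can also be scripted. The following is a minimal Python sketch, assuming a boto3 EC2 client for the Elastic Disaster Recovery target account is passed in as `ec2`; the function names, the prefix-list name, and any IDs or CIDRs you pass in are hypothetical. It keeps one subnet per AZ for the attachment and, on the assumption that the SDDC, Connected VPC, and target VPC use distinct address space, rejects remote ranges that overlap the target VPC before building the prefix list.

```python
import ipaddress


def one_subnet_per_az(subnets):
    """Return one subnet ID per Availability Zone, sorted by AZ name.

    `subnets` is a list of dicts shaped like EC2 DescribeSubnets items, for
    example {"SubnetId": "subnet-aaa", "AvailabilityZone": "us-east-1a"}.
    A Transit Gateway VPC attachment accepts at most one subnet per AZ, so
    only the first subnet seen in each AZ is kept.
    """
    chosen = {}
    for subnet in subnets:
        chosen.setdefault(subnet["AvailabilityZone"], subnet["SubnetId"])
    return [chosen[az] for az in sorted(chosen)]


def prefix_list_entries(remote_cidrs, target_vpc_cidr):
    """Build managed prefix list entries for the SDDC and Connected VPC CIDRs.

    Raises ValueError for malformed CIDRs (including ones with host bits set)
    and for ranges that overlap the DRS target VPC.
    """
    local = ipaddress.ip_network(target_vpc_cidr)
    entries = []
    for cidr in remote_cidrs:
        net = ipaddress.ip_network(cidr)
        if net.overlaps(local):
            raise ValueError(f"{cidr} overlaps the target VPC {target_vpc_cidr}")
        entries.append({"Cidr": str(net), "Description": "VMC SDDC or Connected VPC"})
    return entries


def wire_target_vpc(ec2, tgw_id, vpc_id, subnets, route_table_id,
                    remote_cidrs, target_vpc_cidr):
    """Attach the target VPC to the shared Transit Gateway and add the route.

    This sketch does not wait between calls. In a real run, wait for the
    attachment to become available and for the prefix list to reach
    create-complete before calling create_route.
    """
    ec2.create_transit_gateway_vpc_attachment(
        TransitGatewayId=tgw_id,
        VpcId=vpc_id,
        SubnetIds=one_subnet_per_az(subnets),
    )
    entries = prefix_list_entries(remote_cidrs, target_vpc_cidr)
    prefix_list = ec2.create_managed_prefix_list(
        PrefixListName="vmc-sddc-and-connected-vpc",  # hypothetical name
        Entries=entries,
        MaxEntries=max(len(entries), 5),
        AddressFamily="IPv4",
    )
    ec2.create_route(
        RouteTableId=route_table_id,
        DestinationPrefixListId=prefix_list["PrefixList"]["PrefixListId"],
        TransitGatewayId=tgw_id,
    )
```

For example, `prefix_list_entries(["10.2.0.0/16"], "10.0.0.0/16")` yields a single entry for 10.2.0.0/16, while a range such as 10.0.1.0/24 raises ValueError. Failing early keeps the script from creating an attachment whose routes would be ambiguous.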

Next, deploy the Elastic Disaster Recovery agent on the source VMs in the VMware Cloud SDDC and connect it to Elastic Disaster Recovery in the Elastic Disaster Recovery target account. With this setup, Elastic Disaster Recovery continuously replicates the block storage volumes of the source VMs into a low-cost staging area, which consists of Amazon EC2 replication servers and Amazon EBS volumes in the VPC of the target account. In the event of a disaster, you launch recovery instances from the Elastic Disaster Recovery console and bring them to a fully provisioned state in minutes. When the primary site is available again, you fail back to VMs on VMC.
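The defaults you configure in step 4 of the Setup section are stored as a replication configuration template, which governs this staging area. As a minimal sketch, assuming a boto3 `drs` client and a Region where Elastic Disaster Recovery is already initialized, the keyword arguments for `create_replication_configuration_template` might look like the following. Only the staging subnet and a small replication server instance type come from this post; the disk type, encryption, routing, and point-in-time values are assumptions that you should check against the console defaults and the API reference.

```python
def default_pit_policy():
    """Point-in-time (PIT) snapshot rules.

    Assumed to mirror the console defaults: every 10 minutes kept for 1 hour,
    hourly kept for 1 day, and daily kept for 7 days. Each retentionDuration
    is expressed in the rule's own units.
    """
    return [
        {"ruleID": 1, "units": "MINUTE", "interval": 10, "retentionDuration": 60, "enabled": True},
        {"ruleID": 2, "units": "HOUR", "interval": 1, "retentionDuration": 24, "enabled": True},
        {"ruleID": 3, "units": "DAY", "interval": 1, "retentionDuration": 7, "enabled": True},
    ]


def replication_template_request(staging_subnet_id, instance_type="t3.small",
                                 private_routing=True):
    """Keyword arguments for the DRS CreateReplicationConfigurationTemplate call.

    Values other than the staging subnet and instance type are assumptions,
    not console defaults confirmed by the post.
    """
    return {
        "stagingAreaSubnetId": staging_subnet_id,
        "replicationServerInstanceType": instance_type,
        "useDedicatedReplicationServer": False,
        "associateDefaultSecurityGroup": True,
        "replicationServersSecurityGroupsIDs": [],
        "defaultLargeStagingDiskType": "GP3",
        "ebsEncryption": "DEFAULT",
        "bandwidthThrottling": 0,  # 0 means no throttling
        # Private routing assumes replication traffic takes the Transit Gateway path.
        "dataPlaneRouting": "PRIVATE_IP" if private_routing else "PUBLIC_IP",
        "createPublicIP": not private_routing,
        "stagingAreaTags": {},
        "pitPolicy": default_pit_policy(),
    }


def create_default_template(drs, staging_subnet_id):
    """`drs` is a boto3 "drs" client in the target account and Region."""
    return drs.create_replication_configuration_template(
        **replication_template_request(staging_subnet_id)
    )
```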

Failback is the act of redirecting traffic from your recovery system back to your primary system (source servers). For VMware vCenter environments, Elastic Disaster Recovery supports scalable failback with the DRS Mass Failback Automation client (DRSFA client), which replicates data from your recovery EC2 instances on AWS back to VMs in vCenter. In this post, you use the DRSFA client’s one-click failback to validate failback to a different VMC SDDC VM than the source server used during failover.

The following diagram describes the full setup of Elastic Disaster Recovery with VMC SDDC as described previously:

Figure 1: VMware Cloud on AWS multi-Region DR using the Elastic Disaster Recovery service

Prerequisites

Complete the following prerequisites before you continue.

Follow these Deployment steps to set up the Connected VPC in your AWS account and activate VMC.

When you purchase VMC, specify an email contact for your Organization on the order form submitted to AWS. After the purchase is processed, AWS sends a welcome email to that address. Follow the steps in the email to activate your VMware Cloud on AWS account and access the VMC console.

Follow the steps outlined in Deploying VMware Cloud on AWS SDDC to deploy your SDDC and set up your Connected VPC by connecting it to your AWS account. Then follow the steps outlined in the VMware Transit Connect – Simplifying Networking for VMC VMware post to configure VMware Transit Connect and the SDDC-to-SDDC connectivity model using SDDC Groups. Next, follow the steps outlined in the Getting Started with VMware Transit Connect Intra-Region Peering for VMC VMware post to set up intra-Region peering between Transit Connect and a native Transit Gateway in your Connected VPC account.

In the Connected VPC account, where the peered external Transit Gateway resides, follow the steps here to create a resource share with AWS RAM that allows sharing with external accounts. Then, in your Elastic Disaster Recovery target account, accept the resource share for the Transit Gateway.
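If you prefer to accept the share programmatically, the following minimal Python sketch assumes a boto3 RAM client in the Elastic Disaster Recovery target account and takes the Connected VPC account ID as input (the function names are hypothetical). It accepts only pending invitations sent by that account. An invitation exists because the share targets an external account, as in this post; for brevity, the sketch reads a single page of results and ignores pagination.

```python
def pending_invitation_arns(invitations, sender_account_id):
    """Return ARNs of PENDING RAM invitations sent by the Connected VPC account."""
    return [
        inv["resourceShareInvitationArn"]
        for inv in invitations
        if inv.get("status") == "PENDING"
        and inv.get("senderAccountId") == sender_account_id
    ]


def accept_tgw_share(ram, sender_account_id):
    """Accept every pending share from the Connected VPC account.

    `ram` is a boto3 RAM client in the DRS target account and Region.
    Returns the ARNs of the invitations that were accepted.
    """
    invitations = ram.get_resource_share_invitations()["resourceShareInvitations"]
    accepted = []
    for arn in pending_invitation_arns(invitations, sender_account_id):
        ram.accept_resource_share_invitation(resourceShareInvitationArn=arn)
        accepted.append(arn)
    return accepted
```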

Setup

1. In the Elastic Disaster Recovery target account, create a VPC in the same Region where the Transit Gateway was shared. Navigate to the AWS CloudFormation console and create a stack from the aws-vpcsetup-v1.yml CloudFormation template. This template creates a VPC with public and private subnets in at least two Availability Zones (AZs). It also provisions an AWS Identity and Access Management (IAM) user that is required to set up the Elastic Disaster Recovery replication agent. The private subnets have outbound internet access through a NAT Gateway. The template takes no parameters. After the stack launches successfully, select Stacks on the left panel, select the stack that you created from the aws-vpcsetup-v1.yml template, and select the Outputs tab:

a. Note the values of DRSUserAccessKeyId and DRSUserSecretAccessKey. You’ll need them to install the Elastic Disaster Recovery replication agent in step 5.

2. Follow the steps to create a Transit Gateway attachment between your VPC in the Elastic Disaster Recovery target account and the Transit Gateway shared from the Connected VPC account. Select one subnet in each AZ (for example, vpc1_sn_A1 and vpc1_sn_B1) for the Transit Gateway to use to route traffic.

3. Follow the steps to add routes to a route table, and add a route to the route table associated with the vpc1_sn_A1 subnet. Use a prefix list containing the private IP address ranges of the SDDC and the Connected VPC as the destination, and the shared Transit Gateway as the target. Follow the steps here to get the networking and security details of your SDDC and Connected VPC.

4. Navigate to the Elastic Disaster Recovery console in your Elastic Disaster Recovery target account. Configure the default replication settings that Elastic Disaster Recovery applies to your source VMC SDDC VMs when the replication agent is installed on them.

a. Select Set default replication settings. Under Replication server configuration, select vpc1_sn_A1 as the Staging area subnet and a small instance type (for example, t3.small) as the Replication server instance type.

b. Select Next and keep the default options for the Amazon Elastic Block Store (Amazon EBS) volume type, Amazon EBS encryption, and Security groups on the Volume and security groups tab.

c. Select Next and keep the default options on the Data routing and throttling tab and for the Point in time (PIT) policy.

d. Select Next and then Create default to save your default replication configuration.

5. Log in to VMware Cloud Services here, select Inventory, and navigate to your SDDC. Select Open vCenter and log in to the vCenter web client with your default credentials. Follow the instructions here to create a new VM in VMware vSphere 7.0, or use an existing VM that runs an operating system supported by Elastic Disaster Recovery. In this case, I use Ubuntu Server 20.04 LTS (HVM), as shown in the following figure.

Figure 2: VMware Cloud on AWS SDDC VM to be used as the source machine

a. Download the Elastic Disaster Recovery agent installer for the Region that you replicate into (us-east-1 in this post):

wget -O ./aws-replication-installer-init.py https://aws-elastic-disaster-recovery-us-east-1.s3.amazonaws.com/latest/linux/aws-replication-installer-init.py

b. Run the Elastic Disaster Recovery agent installer:

sudo python3 aws-replication-installer-init.py

c. Enter the following information when the installer prompts for it. After you respond to the prompts, the installer should report that the installation was successful.

i. AWS Region Name: Enter the AWS Region (in our case this is us-east-1) of the Elastic Disaster Recovery service on the target Elastic Disaster Recovery account.

ii. AWS Access Key ID: Enter the value of DRSUserAccessKeyId from step 1a of the Setup section.

iii. AWS Secret Access Key: Enter the value of DRSUserSecretAccessKey from step 1a of the Setup section.

iv. When prompted to Choose the disks you want to replicate, press Enter to select all disks.
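Steps 1a and 5 can be combined into a non-interactive install. The following sketch assumes you already read the stack outputs (for example, with CloudFormation `describe_stacks` from a machine that has AWS credentials) and that the installer accepts the `--region`, `--aws-access-key-id`, `--aws-secret-access-key`, and `--no-prompt` options; confirm them with the installer’s `--help` before use. The function names are hypothetical.

```python
import subprocess


def outputs_from_response(describe_stacks_response):
    """Map OutputKey -> OutputValue for the first stack in a DescribeStacks response."""
    stack = describe_stacks_response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


def installer_command(outputs, region, installer="aws-replication-installer-init.py"):
    """Build a non-interactive installer command line.

    Assumes the option names above. Without an explicit device list, the
    installer is assumed to replicate all disks, matching the
    Enter-to-select-all answer in step 5.
    """
    return [
        "sudo", "python3", installer,
        "--region", region,
        "--aws-access-key-id", outputs["DRSUserAccessKeyId"],
        "--aws-secret-access-key", outputs["DRSUserSecretAccessKey"],
        "--no-prompt",
    ]


def install_agent(outputs, region):
    """Run the installer on the source VM; raises CalledProcessError on failure."""
    subprocess.run(installer_command(outputs, region), check=True)
```

Passing secrets as command-line arguments exposes them to other local users through the process list, so the interactive prompts may be preferable on shared machines.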

Create a VMware user in your VMware environment for the DRSFA client to use when it automates the VMware actions required for failback. Grant the user the permissions documented here (item #16 in the DRSFA prerequisites) through a role assignment in vSphere. To do this, follow the detailed steps in this VMware video or in the VMware vSphere 7.0 documentation for assigning roles and permissions with the vSphere Client 7.0. In our case, the role is named ‘DRSFA’, as shown in the following figure.

Figure 3: VMware Cloud on AWS SDDC role with permissions used by the Elastic Disaster Recovery failback client

Download the latest Elastic Disaster Recovery failback client and upload it to your VMware Cloud on AWS SDDC datastore (‘WorkloadDatastore->drsfa-lab’ in our case), as shown in the following figure.

Figure 4: VMware Cloud on AWS SDDC datastore containing the ISO files used by the Elastic Disaster Recovery failback client

Test and run

In this section, we first demonstrate failover from a VMC source machine to a target Amazon EC2 instance in a recovery Region, and then failback from Amazon EC2 to VMC using Elastic Disaster Recovery.

Failover

Navigate back to the Elastic Disaster Recovery console in the us-east-1 Region and select Source servers on the left panel. Select the source server that corresponds to the VMC SDDC VM to review the initial sync process, which consists of several tasks. You can return to the Elastic Disaster Recovery console at any time to monitor data replication progress. When the initial sync completes, the Data replication status shows Healthy.
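To monitor the initial sync from a script instead of the console, you can poll `describe_source_servers`. This minimal sketch assumes a boto3 `drs` client; treating the `CONTINUOUS` replication state with no reported error as the console’s Healthy status is an assumption, and the polling interval and timeout are arbitrary.

```python
import time


def replication_status(source_server):
    """Summarize a DescribeSourceServers item as HEALTHY, SYNCING, ERROR, or the raw state."""
    info = source_server.get("dataReplicationInfo", {})
    if info.get("dataReplicationError"):
        return "ERROR"
    state = info.get("dataReplicationState", "UNKNOWN")
    if state == "CONTINUOUS":
        return "HEALTHY"  # assumed to correspond to the console's Healthy status
    if state in ("INITIATING", "INITIAL_SYNC", "RESCAN"):
        return "SYNCING"
    return state


def wait_until_healthy(drs, source_server_id, poll_seconds=60, timeout_seconds=6 * 3600):
    """Poll a boto3 "drs" client until the source server reports HEALTHY."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        items = drs.describe_source_servers(
            filters={"sourceServerIDs": [source_server_id]}
        )["items"]
        status = replication_status(items[0]) if items else "NOT_FOUND"
        if status == "HEALTHY":
            return
        if status == "ERROR":
            raise RuntimeError(f"replication error on {source_server_id}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"{source_server_id} did not become healthy in time")
```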

After the initial sync completes, select Source servers on the left panel of the Elastic Disaster Recovery console in the us-east-1 Region. Select the VMware Cloud on AWS VM source server, and select Launch settings.

a. In the General launch settings tab, select Edit, set Instance type right sizing to ‘None’, and then save the settings.

b. In the EC2 launch template tab, select Edit and modify the Amazon EC2 launch template:

i. For Template version description, enter “vmc launch template”.

ii. In the Instance Type tab, select Manually select instance type and select c5.large.

iii. Under the Network settings tab in the Subnet section, select the ‘vpc1_sn_A2’ private subnet. In the Firewall section, select Select existing security group and select the ‘SG_BASTION’ security group.

iv. Scroll down and select Create template version.

v. Select the link to the template version that you just created in the Success notification box. Select Actions, and then select Set default version. Select the version that you just created (described as “vmc launch template”) as the Template version, and select the Set as default version button.
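The launch settings above can also be sketched in Python, assuming boto3 `drs` and `ec2` clients. The subnet and security group IDs you pass in are hypothetical stand-ins for vpc1_sn_A2 and SG_BASTION (the API takes IDs, not names), and `source_version` is the existing template version to copy from. The sketch turns off right sizing, creates a new launch template version with the c5.large instance type, and makes it the default.

```python
def right_sizing_off_request(source_server_id):
    """Keyword arguments for DRS UpdateLaunchConfiguration with right sizing set to None."""
    return {
        "sourceServerID": source_server_id,
        "targetInstanceTypeRightSizingMethod": "NONE",
    }


def launch_template_version_request(template_id, subnet_id, security_group_id,
                                    source_version, instance_type="c5.large",
                                    description="vmc launch template"):
    """Keyword arguments for EC2 CreateLaunchTemplateVersion.

    Top-level settings not listed here are inherited from `source_version`.
    """
    return {
        "LaunchTemplateId": template_id,
        "SourceVersion": source_version,
        "VersionDescription": description,
        "LaunchTemplateData": {
            "InstanceType": instance_type,
            "NetworkInterfaces": [
                {"DeviceIndex": 0, "SubnetId": subnet_id, "Groups": [security_group_id]}
            ],
        },
    }


def apply_launch_settings(drs, ec2, source_server_id, subnet_id,
                          security_group_id, source_version):
    """Apply the launch settings and make the new template version the default."""
    drs.update_launch_configuration(**right_sizing_off_request(source_server_id))
    template_id = drs.get_launch_configuration(
        sourceServerID=source_server_id
    )["ec2LaunchTemplateID"]
    response = ec2.create_launch_template_version(
        **launch_template_version_request(
            template_id, subnet_id, security_group_id, source_version
        )
    )
    new_version = response["LaunchTemplateVersion"]["VersionNumber"]
    ec2.modify_launch_template(LaunchTemplateId=template_id, DefaultVersion=str(new_version))
    return new_version
```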

Perform a failover drill of the VMware Cloud on AWS SDDC VM to a target server in us-east-1 in the Elastic Disaster Recovery target account:

a. Navigate to the Elastic Disaster Recovery console in the us-east-1 Region. Select Source servers on the left panel. Select the VMC SDDC VM source server (vmc-drs-vm). Select Initiate recovery job and then Initiate Drill from the top right panel.

b. On the Select a point in time page, for Points in time, select Use most recent data. Scroll down and select Initiate drill.

c. Select Recovery job history from the left panel of the Elastic Disaster Recovery console to monitor the job log and make sure that your launch is successful.

Figure 5: Job log for the failover in Elastic Disaster Recovery showing details of the recovery process for the VMC SDDC VM

d. After recovery is completed, you can view the final status of the recovery process for the VMC SDDC VM source server.

Figure 6: Recovery dashboard in Elastic Disaster Recovery showing the recovery status for the VMC SDDC VM
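The drill in steps a through d can be expressed with the `start_recovery` and `describe_jobs` calls. This is a minimal sketch assuming a boto3 `drs` client; the function names, polling interval, and timeout are illustrative. Because a job with status COMPLETED has finished but not necessarily succeeded, the sketch also checks each participating server’s launch status.

```python
import time


def drill_request(source_server_id, snapshot_id=None):
    """Keyword arguments for a DRS StartRecovery call run as a drill.

    Omitting recoverySnapshotID corresponds to "Use most recent data".
    """
    server = {"sourceServerID": source_server_id}
    if snapshot_id:
        server["recoverySnapshotID"] = snapshot_id
    return {"sourceServers": [server], "isDrill": True}


def launched_ok(job):
    """True if every participating server in a finished job reached LAUNCHED."""
    servers = job.get("participatingServers", [])
    return bool(servers) and all(s.get("launchStatus") == "LAUNCHED" for s in servers)


def run_drill(drs, source_server_id, poll_seconds=30, timeout_seconds=3600):
    """Start a drill with a boto3 "drs" client and wait for the job to finish."""
    job_id = drs.start_recovery(**drill_request(source_server_id))["job"]["jobID"]
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        job = drs.describe_jobs(filters={"jobIDs": [job_id]})["items"][0]
        if job["status"] == "COMPLETED":  # finished, not necessarily successful
            if not launched_ok(job):
                raise RuntimeError(f"drill job {job_id} finished without a launched instance")
            return job
        time.sleep(poll_seconds)
    raise TimeoutError(f"drill job {job_id} did not finish in time")
```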

Failback

Using your vCenter web client, follow the instructions to create a new VM, or use an existing VMware Cloud Services VM that runs Ubuntu Server 20.04 LTS, a supported operating system for the failback client as outlined in the DRSFA client prerequisites. When your VM is set up and ready, follow the instructions here to install the DRSFA failback client on it, using the same datastore that you uploaded the failback client to in the Setup section.

a. In the Generating IAM credentials and configuring CloudWatch logging section of the instructions, use the AWSElasticDisasterRecoveryFailbackInstallationPolicy managed policy instead of AWSElasticDisasterRecoveryFailbackPolicy.

b. The following figure shows the environment variables that I used to install the DRSFA client in the Running the DRSFA client section of the instructions. Note the syntax of the datastore path and the relative paths of the failback client and seed ISO files.

Figure 7: VMC VM shell showing the variables used for installing the DRSFA failback client

c. Run the failback client:

Figure 8: VMC VM shell showing ready to initiate failback from Amazon EC2 recovery instances

Now that the failback client is installed and paired with Elastic Disaster Recovery, perform a failback. Navigate to the Elastic Disaster Recovery console in the us-east-1 Region, select Recovery instances on the left panel, and select the Instance ID that corresponds to the VMC SDDC VM source server (vmc-drs-vm). Make sure that the Failback state is Ready and the Data replication status is Healthy, and then select Failback.

Figure 9: Perform failback from Elastic Disaster Recovery console by selecting Amazon EC2 recovery instances while making sure that the Failback state is Ready and Data replication status is Healthy
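The readiness check above can be scripted before you start the failback. This minimal sketch assumes a boto3 `drs` client and that the console’s Ready and Healthy correspond to the `FAILBACK_READY_FOR_LAUNCH` failback state and the `CONTINUOUS` replication state; verify that mapping before relying on it. It calls `start_failback_launch` only for recovery instances that pass the check.

```python
def ready_for_failback(recovery_instance):
    """Check a DescribeRecoveryInstances item against the assumed Ready/Healthy states."""
    failback_state = recovery_instance.get("failback", {}).get("state")
    replication_state = (
        recovery_instance.get("dataReplicationInfo", {}).get("dataReplicationState")
    )
    return failback_state == "FAILBACK_READY_FOR_LAUNCH" and replication_state == "CONTINUOUS"


def start_failback(drs, source_server_id):
    """Start failback for the ready recovery instances of one source server."""
    items = drs.describe_recovery_instances(
        filters={"sourceServerIDs": [source_server_id]}
    )["items"]
    ready = [i["recoveryInstanceID"] for i in items if ready_for_failback(i)]
    if not ready:
        raise RuntimeError(f"no recovery instance of {source_server_id} is ready for failback")
    return drs.start_failback_launch(recoveryInstanceIDs=ready)["job"]
```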

Monitor the failback and confirm that the Failback state changes to In Progress and the Data replication status changes to Finalizing sync. At this stage, the source VM is stopped and the conversion process for the recovery EC2 instance starts. This step finalizes the failback process and creates a replica of the recovery instance on the source VM.

Figure 10: Elastic Disaster Recovery console demonstrating monitoring of failback progress and showing the Failback state as ‘In Progress’ and Data replication status as ‘Finalizing sync’

When the failback is complete, the failback client on the source VM shows that the failback completed successfully.

Figure 11: Elastic Disaster Recovery console demonstrating successful completion of failback and showing the Failback state as ‘Completed’ and Data replication status as ‘Completed’

Cleaning up

To avoid recurring charges, and to clean up your account after trying the solution outlined in this post, perform the following:

Follow these steps to terminate the recovery EC2 instance launched during the failover in the us-east-1 Region of the Elastic Disaster Recovery target account.
Delete the CloudFormation stack that you created from the aws-vpcsetup-v1 template in the us-east-1 Region of the Elastic Disaster Recovery target account (the account that contains the VPC you created in step 1 of the Setup section).
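The cleanup can be sketched as an ordered plan, assuming boto3 `drs`, `ec2`, and `cloudformation` clients (the function names are hypothetical). Disconnecting and deleting the source server, and deleting the Transit Gateway attachment from step 2 of the Setup section, go beyond the two steps listed above; they are included on the assumption that you want to stop replication charges and that resources created outside the stack can otherwise block its deletion. Several of these calls are asynchronous, so in practice you may need to wait between steps.

```python
def cleanup_plan(recovery_instance_ids, source_server_id, stack_name,
                 tgw_attachment_id=None):
    """Return (service, operation, kwargs) tuples in the order they should run."""
    plan = []
    if recovery_instance_ids:
        plan.append(("drs", "terminate_recovery_instances",
                     {"recoveryInstanceIDs": list(recovery_instance_ids)}))
    # Disconnecting stops replication; the source server must be disconnected before deletion.
    plan.append(("drs", "disconnect_source_server", {"sourceServerID": source_server_id}))
    plan.append(("drs", "delete_source_server", {"sourceServerID": source_server_id}))
    if tgw_attachment_id:
        plan.append(("ec2", "delete_transit_gateway_vpc_attachment",
                     {"TransitGatewayAttachmentId": tgw_attachment_id}))
    plan.append(("cloudformation", "delete_stack", {"StackName": stack_name}))
    return plan


def run_plan(plan, clients):
    """Execute each step with the matching client, e.g. {"drs": drs, "ec2": ec2, ...}."""
    for service, operation, kwargs in plan:
        getattr(clients[service], operation)(**kwargs)
```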

Conclusion

In this post, we demonstrated how to use Elastic Disaster Recovery as an AWS native solution that helps customers minimize downtime and data loss while providing cost-effective DR for VMC. We first provided a step-by-step configuration for failover from a VMC source machine to a target Amazon EC2 instance in a recovery Region. We then demonstrated failback from Amazon EC2 to VMC using Elastic Disaster Recovery.

Customers who run VMware on-premises are increasingly incorporating VMC into their hybrid cloud strategy to prepare for potential outages. A key benefit of VMC is that customers can quickly migrate VMware workloads to a VMware-managed SDDC running in AWS without replatforming or refactoring applications. In this post, we also showed how customers can use Elastic Disaster Recovery as an AWS native DR solution to provide business continuity for their VMC environment in the event of a disaster.

Thank you for reading this post. If you have any questions, leave them in the comments section.