Configuring Cross-Region DR of your Amazon EC2 workloads with CloudEndure
Any number of events can cause IT outages that adversely affect your business, preventing it from serving customers or causing the loss of valuable enterprise data. These events can result from application errors, human errors, malicious attacks, or infrastructure outages caused by natural disaster or hardware failure. You can mitigate infrastructure outages by running in the cloud, since AWS offers redundancy and resiliency in the event of natural disasters and hardware failures. However, customers still have to plan for other types of unplanned outages that, in some cases, may require them to rapidly change the AWS Region where they run their production workloads.
How long can you afford to be down in the event of an outage, and how much data can you afford to lose? Any disaster recovery (DR) solution should help address business requirements around Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Customers who are running their production workloads in AWS can satisfy the requirement for low RTO and RPO using CloudEndure Disaster Recovery.
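To make the two objectives concrete, here is a minimal sketch that checks a made-up outage timeline against RTO and RPO targets. The timestamps, target values, and function names are illustrative assumptions and do not come from CloudEndure.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_replicated: datetime, outage_start: datetime) -> timedelta:
    """Data-loss window: time between the last replicated write and the outage."""
    return outage_start - last_replicated


def achieved_rto(outage_start: datetime, service_restored: datetime) -> timedelta:
    """Downtime: time between the outage and the service being restored."""
    return service_restored - outage_start


def meets_objectives(rpo: timedelta, rto: timedelta,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True when both the data-loss window and the downtime are within target."""
    return rpo <= rpo_target and rto <= rto_target


# Hypothetical timeline for a single outage.
outage = datetime(2020, 1, 1, 12, 0, 0)
last_write = outage - timedelta(seconds=2)
restored = outage + timedelta(minutes=15)

rpo = achieved_rpo(last_write, outage)
rto = achieved_rto(outage, restored)
ok = meets_objectives(rpo, rto,
                      rpo_target=timedelta(minutes=1),
                      rto_target=timedelta(minutes=30))
```

In this example the business tolerates one minute of lost data and 30 minutes of downtime, so a 2-second data-loss window and a 15-minute recovery satisfy both objectives.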
In this blog post, we walk through how you can use CloudEndure Disaster Recovery to automate the failover and failback of Amazon EC2-hosted workloads from one AWS Region to another.
Overview of CloudEndure Disaster Recovery
CloudEndure Disaster Recovery is a SaaS application that manages the replication of servers from any infrastructure to AWS for DR purposes. The servers can be virtual machines (VM), physical servers, or in this case, Amazon EC2 instances. The source infrastructure can be an on-premises data center, another cloud provider, or in this case, another AWS Region.
Firstly, a lightweight agent installed on each source instance initiates the replication of data. This agent, which communicates securely with CloudEndure Disaster Recovery, scans the source instance for all attached disks and performs block-level replication of those disks to a staging area in the target Region. The staging area is a VPC subnet pre-configured by the customer that hosts resources created by CloudEndure. These resources include Amazon EC2 instances with CloudEndure replication software installed, and a sufficient number of Amazon EBS volumes attached to match the number of continuously replicated disks. After the initial replication is completed, the CloudEndure agent monitors any changes to the source instance and replicates those changes continuously. This ensures the data stored in the target Region is always up to date.
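The idea behind block-level replication can be sketched in a few lines. The example below is a toy model, not how the CloudEndure agent is implemented: it assumes fixed-size blocks, compares hashes to find changed blocks, and ships only those blocks to a replica. The block size and all function names are assumptions chosen for illustration.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Cut a disk image into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def changed_blocks(old: bytes, new: bytes, block_size: int = BLOCK_SIZE) -> Dict[int, bytes]:
    """Return {block_index: new_block} for every block that differs from the old image."""
    old_hashes = [hashlib.sha256(b).digest() for b in split_blocks(old, block_size)]
    delta = {}
    for index, block in enumerate(split_blocks(new, block_size)):
        if index >= len(old_hashes) or hashlib.sha256(block).digest() != old_hashes[index]:
            delta[index] = block
    return delta


def apply_blocks(replica: bytes, delta: Dict[int, bytes], block_size: int = BLOCK_SIZE) -> bytes:
    """Write the changed blocks onto the replica (assumes the disk does not shrink)."""
    blocks = split_blocks(replica, block_size)
    for index, block in sorted(delta.items()):
        if index < len(blocks):
            blocks[index] = block
        else:
            blocks.append(block)
    return b"".join(blocks)


source_v1 = b"AAAABBBBCCCC"
replica = source_v1                     # state after the initial full sync
source_v2 = b"AAAAXXXXCCCC"             # one block is overwritten on the source

delta = changed_blocks(source_v1, source_v2)
replica = apply_blocks(replica, delta)  # only the changed block crosses the network
```

Here only block 1 is transferred, which is why continuous replication after the initial sync can be much cheaper than copying whole disks.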
Once the source and target are fully synced and in continuous replication mode, a customer can choose to trigger failover to Recovery Mode or perform a Recovery Test. Once a customer triggers failover, CloudEndure Disaster Recovery uses Amazon EC2 APIs to launch new target instances into a new or existing VPC created by the customer. CloudEndure also uses the EC2 APIs to attach new Amazon EBS volumes to those instances. CloudEndure uses Amazon EBS snapshots, taken from the volumes in the staging area that are storing the replicated data, to create the new target volumes.
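The snapshot-to-volume step described above can be approximated with the EC2 API. The following is a rough sketch, not CloudEndure's own code: it assumes `ec2` behaves like a boto3 EC2 client for the target Region, and the function name, description text, and return shape are assumptions made for illustration. Launching and configuring the target instances is left out.

```python
from typing import Dict, List


def volumes_from_staging(ec2, staging_volume_ids: List[str], target_az: str) -> Dict[str, str]:
    """Snapshot each staging volume and create a new volume from that snapshot.

    `ec2` is expected to expose create_snapshot, get_waiter, and create_volume
    the way a boto3 EC2 client does. Returns {staging_volume_id: new_volume_id}.
    """
    new_volumes = {}
    for volume_id in staging_volume_ids:
        snapshot = ec2.create_snapshot(
            VolumeId=volume_id,
            Description="DR recovery point (illustrative sketch)",
        )
        snapshot_id = snapshot["SnapshotId"]
        # Wait for the snapshot to complete before creating a volume from it.
        ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])
        volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=target_az)
        new_volumes[volume_id] = volume["VolumeId"]
    return new_volumes
```

With boto3 installed and credentials configured, a client created with `boto3.client("ec2", region_name=...)` could be passed as `ec2`. Passing the client in as a parameter also makes the function easy to exercise with a stub before touching a real account.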
After they have performed a failover or test, customers can use CloudEndure Disaster Recovery to automate the failback process to the original source Region. Failing back transitions the workloads running in the target Region back to the source Region without data loss.
The benefits of using CloudEndure Disaster Recovery to configure recovery of workloads across AWS Regions include the following:
- Customers can achieve low RTO and RPO in a DR scenario since CloudEndure Disaster Recovery can continuously replicate data from one Region to another, ensuring that they can fail over in minutes with minimal data loss.
- CloudEndure Disaster Recovery supports using any pair of AWS Regions as source and target, satisfying any compliance requirements that mandate a minimum distance between production and DR sites.
- Customers can choose to have CloudEndure Disaster Recovery replicate the Virtual Private Cloud (VPC) and networking settings of the source Region when they fail over to the DR Region, simplifying the failover process and shortening the RTO further.
- Customers can choose to modify the specifications of the target instances that are instantiated at failover. This is particularly useful in a DR or test scenario where customers may prefer launching lower-powered but more cost-effective resources.
CloudEndure Disaster Recovery tutorial
For this tutorial, we assume that you already have a CloudEndure Disaster Recovery account. If not, you can register for a new account. Note that this is a different account from your CloudEndure Migration account, and a separate email address is required. You also need a source AWS Region with Amazon EC2-hosted workloads running and a target AWS Region defined.
For this tutorial, you should have the following prerequisites met:
- A CloudEndure IAM user with required credentials, as defined in the CloudEndure documentation.
- A staging area subnet created in the target Region, as defined in the CloudEndure documentation.
- A staging area subnet created in the source Region, as defined in the CloudEndure documentation.
- A VPC and subnet in the target Region where our target instances will be launched.
- Connectivity between the source and target Regions using inter-Region VPC peering or inter-Region AWS Transit Gateway where available.
- The security groups that will be attached to the failed over instances.
- Networks in both Regions prepared as defined in the CloudEndure documentation.
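One network check worth doing before relying on inter-Region VPC peering is confirming that the source and target VPC CIDR blocks do not overlap, since a peering connection cannot be created between VPCs with overlapping CIDRs. The sketch below uses only the Python standard library; the CIDR values are made-up examples.

```python
import ipaddress


def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """True when the two CIDR blocks share at least one address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


# Hypothetical address plans for the two Regions.
source_vpc_cidr = "10.0.0.0/16"
target_vpc_cidr = "10.1.0.0/16"

peering_possible = not cidrs_overlap(source_vpc_cidr, target_vpc_cidr)
```

If the two ranges overlapped, you would need a different address plan for the target VPC before peering the two Regions.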
Setting up CloudEndure
Once you have met the preceding prerequisites and have a CloudEndure Disaster Recovery account, you can configure CloudEndure.
- Log in to the CloudEndure Disaster Recovery console and create a new DR project.
- You can set up the new project in the Setup & Info tab, beginning with entering the credentials for the CloudEndure Disaster Recovery IAM user.
- Go to Replication Settings to finish setting up your project.
- Select your source and target AWS Regions.
- Select a specific instance type for your replication server. For this tutorial, you can use the Default instance type.
- For default disk type, choose Use fast SSD data disks. This speeds up replication for disks that are larger than 500 GB.
- For the subnet, choose the subnet created for your staging area. This can be in a dedicated staging VPC or a VPC shared with other resources.
- In this tutorial, for the Security Group to apply to the replication server, choose Default CloudEndure Security Group.
- You can select whether to use a public or private network for sending the replicated data from the source instances to the staging area. Choose Use VPN or Direct Connect and Disable public IP to direct replication traffic over your inter-Region VPC Peering Connection or AWS Transit Gateway.
- [Optional] You can enable encryption of the Amazon EBS volumes used by the replication instances to store replicated data.
- [Optional] For Staging Area Tags, enter a Key and Value.
- For Network Bandwidth Throttling, select Disabled. Enable this option only if you must control the amount of bandwidth used for replication traffic.
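It can help to record the choices made in Replication Settings as data, so they can be reviewed or compared between projects. The sketch below is only an illustration: the class, field names, and sanity checks are assumptions, not the CloudEndure API, and the subnet ID and Regions are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReplicationSettings:
    source_region: str
    target_region: str
    staging_subnet_id: str
    replication_instance_type: str = "Default"
    use_fast_ssd_disks: bool = True
    use_private_network: bool = True      # "Use VPN or Direct Connect"
    disable_public_ip: bool = True
    encrypt_staging_volumes: bool = False
    bandwidth_throttling_mbps: int = 0    # 0 means throttling is disabled
    staging_tags: Dict[str, str] = field(default_factory=dict)

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the settings look sane."""
        issues = []
        if self.source_region == self.target_region:
            issues.append("source and target Regions must differ for cross-Region DR")
        if self.disable_public_ip and not self.use_private_network:
            issues.append("without a public IP, replication needs a private route")
        if self.bandwidth_throttling_mbps < 0:
            issues.append("bandwidth throttling cannot be negative")
        return issues


# The choices made in this tutorial, with placeholder identifiers.
settings = ReplicationSettings(
    source_region="us-east-1",
    target_region="us-west-2",
    staging_subnet_id="subnet-0123456789abcdef0",
    staging_tags={"Project": "dr-tutorial"},
)
issues = settings.problems()
```

The defaults mirror the options chosen in the steps above; anything you change in the console would be changed here as well.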
Installing the CloudEndure agent and initiating replication
Next, install the CloudEndure agent on each of the instances that will be replicated. Doing so initiates replication to the staging area. After you save the changes made to Replication Settings in the previous steps, the console presents a dialog box with instructions for downloading and installing the CloudEndure agent on Linux and Windows machines.
- Follow the instructions to download and install the agent on each of your source instances. Once installed, the agent automatically scans all disks attached to the source instance and begins replicating the data for all discovered disks, using the configuration defined earlier in Replication Settings.
Once the CloudEndure agent is ready to begin replication, the CloudEndure DR SaaS application launches and configures the needed replication instances in the staging area. In addition, CloudEndure initiates replication from the source instance to the replication instance.
- You can track the replication progress in the Machines page of the CloudEndure console.
When replication is complete, the replication status changes from a progress bar to Continuous Data Protection. This indicates that initial replication has completed and CloudEndure is continuously syncing source instance disks to the disks in the staging area.
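If you want to wait for this state in a script rather than watching the console, a simple polling loop is enough. The sketch below assumes a caller-supplied `get_statuses` function that returns a status string per machine; how those statuses are fetched is deliberately left open, and the "Replicating" label and machine names in the demo are placeholders.

```python
import time
from typing import Callable, Dict

CONTINUOUS = "Continuous Data Protection"


def wait_for_continuous_protection(
    get_statuses: Callable[[], Dict[str, str]],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until every machine reports Continuous Data Protection or the timeout passes."""
    waited = 0.0
    while True:
        statuses = get_statuses()
        if statuses and all(status == CONTINUOUS for status in statuses.values()):
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


# Demo with canned responses standing in for a real status lookup.
_canned = iter([
    {"web-1": "Replicating", "db-1": "Replicating"},
    {"web-1": CONTINUOUS, "db-1": "Replicating"},
    {"web-1": CONTINUOUS, "db-1": CONTINUOUS},
])
done = wait_for_continuous_protection(lambda: next(_canned), sleep=lambda _seconds: None)
```

The `sleep` parameter is injectable so the loop can be tested without actually waiting.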
Performing a Disaster Recovery failover
Once the initial replication is complete, you can conduct a DR failover test or perform the actual DR failover. The steps are the same for both. Before initiating failover, you must configure the blueprint for each target machine. The CloudEndure blueprint defines how the target environment is instantiated by CloudEndure during cutover.
NOTE: AWS recommends that customers perform frequent failover tests to ensure that there will be no issues during the time of actual failover.
- For each target machine, configure the blueprint.
- For Machine Type, choose the instance type for the new target machine. The default is to Copy Source, which prompts CloudEndure Disaster Recovery to use the same instance type as the source instance to launch the target.
- Typically, you choose the same Launch Type as used by the source instance.
- Choose the VPC and subnet where the target machine will be launched. For this tutorial, we are using a VPC and subnet that we created earlier in the target Region.
- Choose the security groups that will be associated with the target machines. For this tutorial, we are using a security group that was already created on the target site.
- For Private IP, choose Copy source if you want to retain the same CIDR range as Source. For this tutorial, we have configured this setting as Create new, since we are using a VPC with a different CIDR than the source.
- For this tutorial, configure CloudEndure Disaster Recovery to create the Elastic IP address and Public IP according to subnet configuration.
- [Optional] For Tags, enter a key and value.
- For Disks, you can choose the same disk type as the source instance or a different disk type.
- After the blueprint is saved for each target machine, you can initiate recovery by selecting each instance you want to recover and then choosing Recovery Mode.
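Two of the blueprint decisions above, the instance type and the private IP setting, can be expressed as small rules. The sketch below is a hypothetical model, not the CloudEndure blueprint API: the class and method names are assumptions. The private IP rule relies on the fact that AWS reserves the first four addresses and the last address of every subnet.

```python
import ipaddress
from dataclasses import dataclass, field
from typing import List


@dataclass
class Blueprint:
    target_subnet_cidr: str
    machine_type: str = "Copy source"     # or an explicit instance type
    security_groups: List[str] = field(default_factory=list)

    def instance_type_for(self, source_instance_type: str) -> str:
        """Use the source instance type unless the blueprint names another one."""
        if self.machine_type == "Copy source":
            return source_instance_type
        return self.machine_type

    def private_ip_setting(self, source_private_ip: str) -> str:
        """Suggest 'Copy source' only if the source IP is usable in the target subnet."""
        subnet = ipaddress.ip_network(self.target_subnet_cidr)
        address = ipaddress.ip_address(source_private_ip)
        # AWS reserves the first four addresses and the last address of each subnet.
        reserved = {subnet[0], subnet[1], subnet[2], subnet[3], subnet[-1]}
        if address in subnet and address not in reserved:
            return "Copy source"
        return "Create new"


# Placeholder values: a smaller instance type for DR and a target subnet
# whose CIDR differs from the source, as in this tutorial.
blueprint = Blueprint(
    target_subnet_cidr="10.1.0.0/24",
    machine_type="t3.medium",
    security_groups=["sg-0123456789abcdef0"],
)
launch_type = blueprint.instance_type_for("m5.large")
ip_setting = blueprint.private_ip_setting("10.0.0.25")
```

Because the source address 10.0.0.25 is outside the target subnet, the sketch suggests Create new, which matches the choice made in this tutorial.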
CloudEndure Disaster Recovery provides the option to choose a specific recovery point when launching the target machine. CloudEndure creates these recovery points using periodic Amazon EBS snapshots.
- Select the desired recovery point to continue with the launch.
CloudEndure Recovery mode initiates snapshots of all the Amazon EBS volumes in the staging area required to launch the selected target machines. The snapshots are used to create new Amazon EBS volumes that will be attached to the new target machines. You can monitor the progress on the Job Progress page in the CloudEndure user console and see when the job is finished.
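Choosing a recovery point usually comes down to picking the newest one taken before the problem started, for example before data corruption was introduced. A minimal sketch of that selection follows; the timestamps and the function name are illustrative assumptions.

```python
from datetime import datetime
from typing import List, Optional


def latest_point_before(recovery_points: List[datetime], cutoff: datetime) -> Optional[datetime]:
    """Return the newest recovery point taken at or before the cutoff, if any."""
    candidates = [point for point in recovery_points if point <= cutoff]
    return max(candidates) if candidates else None


# Hypothetical recovery points and the moment the corruption began.
points = [
    datetime(2020, 1, 1, 10, 0),
    datetime(2020, 1, 1, 11, 0),
    datetime(2020, 1, 1, 12, 0),
]
corruption_started = datetime(2020, 1, 1, 11, 30)

chosen = latest_point_before(points, corruption_started)
```

In this example the 11:00 recovery point is chosen, because the 12:00 point would already contain the corrupted data.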
Once failover is complete, you will be able to access your newly launched target instances. Since the target instance is an exact copy of the source instance, you can access the newly launched instance using the same SSH key used for the source instance.
Prepare for Failback
Note: As a prerequisite, configuring failback is not allowed until all machines in a Project have been failed over.
- Once all machines are failed over, prepare for failback by choosing Prepare for Failback.
A dialog box appears, confirming that data replication has been reversed and that the newly launched target machines now act as source machines.
- Select the machines in question and choose Failback Settings. Configure the settings by following the Replication Settings steps in the “Setting up CloudEndure” section.
The previously replicated CloudEndure agent will already be running. Once the CloudEndure agent is ready to begin replication, the CloudEndure Disaster Recovery SaaS application launches and configures the needed replication instances in the staging area. In addition, CloudEndure initiates replication from the source instance to the replication instance.
As previously mentioned, you can track the replication progress in the Machines page of the CloudEndure Disaster Recovery console.
When replication is complete, the replication status changes from a progress bar to Continuous Data Protection. This indicates that initial replication has completed and that CloudEndure is continuously synchronizing source instance disks to the disks in the staging area.
Performing a Disaster Recovery failback
Once the initial replication is complete, you can conduct a DR failback test or perform the actual DR failback. The steps are the same for both. Before initiating failback, you must configure the blueprint for each target machine. The CloudEndure blueprint defines how the target environment is instantiated by CloudEndure during cutover.
- Follow the steps outlined in the failover and failback process overview to configure the CloudEndure blueprint and to perform failback.
You can track the progress under the Job Progress section.
- The last step once failback has completed is to choose Return to Normal Operation under Project Actions.
When you return to normal operations, CloudEndure reverses data replication again from the failed back source Region to the target Region.
When replication is complete, the replication status changes from a progress bar to Continuous Data Protection. This indicates that initial replication has completed and that CloudEndure is continuously synchronizing source instance disks to the disks in the staging area in the target Region.
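The full cycle of failover, Prepare for Failback, failback, and Return to Normal Operation can be summarized as a small state machine. The sketch below is a simplified model with hypothetical class and state names; it encodes the ordering described in this post, including the rule that all machines must be failed over before failback can be prepared.

```python
from enum import Enum
from typing import Iterable


class State(Enum):
    PROTECTED = "replicating source Region -> target Region"
    REVERSED = "replicating target Region -> source Region"
    FAILED_BACK = "running in source Region, awaiting return to normal"


class Project:
    def __init__(self, machines: Iterable[str]):
        self.machines = set(machines)
        self.pending = set(self.machines)   # machines not yet failed over
        self.state = State.PROTECTED

    def _require(self, expected: State) -> None:
        if self.state is not expected:
            raise RuntimeError("expected state %s, got %s" % (expected.name, self.state.name))

    def fail_over(self, machine: str) -> None:
        self._require(State.PROTECTED)
        self.pending.discard(machine)

    def prepare_for_failback(self) -> None:
        self._require(State.PROTECTED)
        if self.pending:
            raise RuntimeError("fail over all machines first: %s" % sorted(self.pending))
        self.state = State.REVERSED         # target machines now act as sources

    def fail_back(self) -> None:
        self._require(State.REVERSED)
        self.state = State.FAILED_BACK

    def return_to_normal(self) -> None:
        self._require(State.FAILED_BACK)
        self.pending = set(self.machines)
        self.state = State.PROTECTED        # replication runs source -> target again
```

For example, calling `prepare_for_failback()` on a project with two machines raises an error until `fail_over()` has been called for both of them.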
Cleaning up CloudEndure Disaster Recovery
If you want to remove the DR environment after this tutorial, uninstall the CloudEndure Disaster Recovery agent by removing machines from the CloudEndure user console.
- Choose Machine Actions.
- Choose Remove Machines from This Console. It takes up to 60 minutes for CloudEndure Disaster Recovery to clean up the replication instances and volumes in the staging area.
When all of the agents are uninstalled, delete the staging area VPC. This deletes all AWS resources that you created for replication. In addition, remember to delete all the resources created in the target Region during failover.
In this post, we walked through how you can use CloudEndure Disaster Recovery to fail over and fail back Amazon EC2-hosted workloads from one AWS Region to another. Unplanned IT outages will always be a concern when doing business in the digital age. Customers, including those running in the cloud, should have a solution that helps them recover quickly from an outage with minimal data loss. Without a DR solution, a business risks losing data or the ability to serve its customers. CloudEndure Disaster Recovery helps AWS customers achieve their recovery time and recovery point objectives, while letting them focus on business outcomes as AWS takes over the undifferentiated heavy lifting of managing DR software.
Thanks for reading this blog post! If you have any comments or questions, please don’t hesitate to leave them in the comments section.