AWS Cloud Operations & Migrations Blog

Understanding AWS High Availability and Replication for vSphere Administrators

Introduction

vSphere HA is a fundamental and frequently used feature of vSphere: it restarts a virtual machine when any of several failure scenarios occurs. The failure scenarios range from VM or host crashes to unresponsive hosts (for example, due to network isolation or an outage).

Translating vSphere High Availability (HA) to the public cloud can be a perplexing process. Some of the concepts (for example, automated VM restart and VM/host health monitoring) are quite similar, while others (such as admission control) do not apply in the same way to both on-premises and cloud environments. This post will mitigate some of that confusion for experienced VMware administrators who are starting to adopt AWS. We will cover both high availability and replication, which are cornerstones of a robust disaster recovery plan.

High Availability in AWS

Understanding high availability in the AWS cloud requires understanding the AWS concepts of Regions and Availability Zones (AZs). AWS Regions consist of multiple, physically separated, and isolated Availability Zones that are connected with low-latency, high-throughput, and highly redundant networking. Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity, and housed in separate facilities. Many AWS services, such as Amazon Elastic Block Store (Amazon EBS), offer AZ-level availability, while others, such as Amazon Simple Storage Service (Amazon S3), offer regional availability. Every Amazon Elastic Compute Cloud (Amazon EC2) instance has an Amazon EBS volume as its root, and may have additional Amazon EBS volumes attached. Amazon EC2 instances can also attach other forms of storage (NFS exports, CIFS shares, or Amazon S3 buckets, for example). An Amazon EBS volume is durable across an Availability Zone, which means it can be attached to any instance in that Availability Zone (and, for some purposes, even to multiple instances at the same time, analogous to vSphere “multi-writer” mode). If an instance fails, its Amazon EBS root volume is reattached to the replacement instance at boot time.
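To make the Availability Zone scoping of Amazon EBS concrete, here is a minimal sketch using the AWS SDK for Python (boto3). It creates a volume in one AZ and attaches it to an instance in that same AZ; the instance ID, AZ, and volume size shown are illustrative assumptions, and the same volume could not be attached to an instance in a different AZ.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# An EBS volume lives in exactly one Availability Zone...
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=100,               # GiB
    VolumeType="gp3",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# ...and can be attached to any instance in that same Availability Zone.
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",   # hypothetical instance running in us-east-1a
    Device="/dev/sdf",
)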

In the event of the failure of one or more virtual machines (instances), the recovery process is quite similar, regardless of whether you’re on-premises in a vSphere environment or in the AWS cloud. The same is true for the failure of a physical host. There are two significant differences. First, in the cloud, provisioning physical hardware is done for you automatically. Second, with a few exceptions, AWS services are available across an entire Availability Zone. Imagine if your vSAN cluster (or datastore based on a traditional SAN array) spanned every single vSphere host in the entire data center.

Because the public cloud works differently than on-premises infrastructure, it is important to make a distinction between stateless applications and stateful applications. This is because recovery mechanisms differ depending on the application or service state. A stateless application is one that does not maintain or store any information about a user’s previous interactions or session state. Each request to the application is treated independently, without any knowledge of past requests or the user’s context. Take, for instance, a fleet of VMware virtual machines or Amazon EC2 instances running Apache and serving your web site, front-ended by a load balancer. If a compute instance (whether a virtual machine or an Amazon EC2 instance) crashes for any reason, it can be restarted on demand because this service is stateless. Other examples of stateless applications include RESTful APIs and serverless functions. A stateful application is one that maintains or “remembers” the state of a user’s session or interaction over time. Examples include web browsers (cookies, history, etc.), online shopping carts, and email clients.

Some legacy applications have stateless components, but also contain state within a subset of instances or virtual machines. Take the example of a LAMP server used for a blog. If the blog posts are stored in a MySQL database running on the server, then the server is stateful. If the server crashes, recovery must check the database for consistency and return the system to the state it was in prior to the crash. This distinction has repercussions for recovery. For stateless instances, we can rapidly build and start a new instance from a template. For stateful instances, we must restart from the same boot/root disk that the previous instance was using.

High Availability with Stateful Applications

First, let’s consider the stateful case. If a stateful instance fails and Simplified Automatic Recovery (SAR) is enabled, Amazon EC2 will automatically reboot the instance. (SAR is enabled by default.) This is analogous to enabling HA for a vSphere VM. For finer control, we can detect and recover by using Amazon CloudWatch alarms. Two alarm actions are of interest to us. The first is “reboot”. When an instance crashes, it will fail the instance status check. Configuring a CloudWatch alarm on that status check with a reboot action will reboot the instance on the same physical host. You can observe instance status on the console, from the CLI, and via PowerShell.
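As a minimal sketch of the “reboot” action using the AWS SDK for Python (boto3), the following call creates a CloudWatch alarm on the StatusCheckFailed_Instance metric; the instance ID, alarm name, and evaluation period are illustrative assumptions.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Reboot the instance if its instance status check fails for 3 consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="ec2-reboot-on-instance-check-failure",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_Instance",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical instance
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:automate:us-east-1:ec2:reboot"],  # built-in EC2 reboot action
)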

The second alarm action is “recover”, which triggers on host failure. Another important difference between public cloud and on-premises infrastructure is that vSphere HA is subject to admission control. Before rebooting a failed instance, vSphere needs to determine whether any host in the cluster has sufficient resources to boot the instance. If sufficient capacity is unavailable, the instance cannot start. In the cloud, capacity is on-demand, so concepts like admission control are handled by the service itself. AWS will automatically determine a physical host with sufficient capacity to restart your instance when its host fails. In this scenario, the Amazon EC2 service will fail the system status check for that instance, not the instance status check. A failed system status check indicates that the instance itself may have been operational, but the underlying host has failed. To restart the instance on healthy hardware, configure a CloudWatch alarm with a recover action.
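The “recover” action follows the same pattern as the sketch above, alarming on the StatusCheckFailed_System metric instead; again, the instance ID and evaluation period are illustrative assumptions.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Recover the instance onto healthy hardware if the system status check fails.
cloudwatch.put_metric_alarm(
    AlarmName="ec2-recover-on-system-check-failure",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed_System",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # hypothetical instance
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:automate:us-east-1:ec2:recover"],  # built-in EC2 recover action
)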

Figure 1: StatusCheckFailed_Instance alarm triggered when instance crashes; alarm cleared after automatic reboot.

High Availability with Stateless Applications

Now let’s address the case of stateless applications. The simplest way to implement stateless recovery in AWS is with an Amazon EC2 Auto Scaling group. An Auto Scaling group ensures that a specific number of instances are always running. It can increase the number of running instances to address a higher load, or reduce the number of running instances to scale back when the load decreases. For the basic vSphere translation use case, the Auto Scaling group can be configured with the minimum, maximum, and desired instance count all set to 1. This means that the group will always maintain one running instance.

Figure 2: Basic Auto Scaling group for one instance.

Keep in mind that in the event an instance fails, the Auto Scaling group starts a new instance, not the specific instance that crashed. An Auto Scaling group launches the new instance from a launch template that references an Amazon Machine Image (AMI). In traditional on-premises virtualization terms, an AMI is like a vSphere template, and an Amazon EBS volume is similar to a VMDK.

Take the previous example of a fleet of instances running Apache and serving your web site, front-ended by a load balancer. If one instance or the underlying host crashes for any reason, it is replaced by a new instance. Configuring an Auto Scaling group that spans multiple Availability Zones protects against a rack or data center failure. If you configure the Auto Scaling group with subnets in multiple Availability Zones, it will restart your Amazon EC2 instance(s) in a different Availability Zone if an instance in one AZ fails. Because Elastic Load Balancing (ELB) is a regional service, the new instances can handle traffic as soon as they’re up and running.
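As a rough sketch of this pattern with boto3, the following call creates an Auto Scaling group that always maintains exactly one instance across two Availability Zones; the launch template name, subnet IDs, and grace period are illustrative assumptions.

import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-ha",
    LaunchTemplate={                                   # references the AMI to boot from
        "LaunchTemplateName": "web-server-template",   # hypothetical launch template
        "Version": "$Latest",
    },
    MinSize=1,                                         # always keep exactly one instance running
    MaxSize=1,
    DesiredCapacity=1,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # subnets in two different AZs
    HealthCheckType="EC2",
    HealthCheckGracePeriod=120,
)

If the single instance, its underlying host, or even its entire Availability Zone fails, the group launches a replacement in one of the listed subnets.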

Replication Concepts

We have been addressing the direct translation of vSphere availability concepts to Amazon EC2 without any modifications, as is typical of “lift and shift” migrations. In other words, a 1:1 mapping of the on-premises virtual machines that comprise a workload to the Amazon EC2 instances that comprise the same workload running in the cloud. This implies stateless services remain stateless in the cloud, and stateful services remain stateful. Many applications can be made stateless by simple modifications, which can improve scalability, performance, and fault tolerance. One way to make an application stateless is to change stateful components like databases to use highly available cloud-based services. For example, by using a fully managed cloud database such as Amazon Relational Database Service (Amazon RDS) for MySQL or Amazon DynamoDB, you can offload the work of database availability to AWS. This has several other benefits. If your blog gets so popular that the number of comments exceeds the MySQL capacity of your instance, switching from self-managed MySQL to Amazon RDS enables automatic scalability. Instead of spending time and effort resizing your instance and tuning the database, you simply adjust your Amazon RDS settings. Or, if MySQL is running fine but Apache is overloaded, you similarly have two choices. You can add more LAMP servers and point them to the original MySQL instance, which involves more infrastructure work (“undifferentiated heavy lifting”). Or you can separate the database and web service functions for a more scalable design.
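As an illustration of offloading database availability to a managed service, the sketch below creates a Multi-AZ Amazon RDS for MySQL instance with boto3; the database identifier, instance class, storage size, and credentials are placeholder assumptions.

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Multi-AZ keeps a synchronous standby in a second Availability Zone
# and fails over automatically if the primary becomes unavailable.
rds.create_db_instance(
    DBInstanceIdentifier="blog-mysql",            # hypothetical database name
    Engine="mysql",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=20,                          # GiB
    MasterUsername="admin",
    MasterUserPassword="choose-a-strong-password",
    MultiAZ=True,
)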

Now let’s address replication and recovery to a different AZ or Region, accomplished on-premises by vSphere Replication (VR). vSphere Replication has two primary use cases: local replication from one data center to another, and long-distance replication, for example to a different seismic zone. Depending on your RPO and RTO targets, you may choose different AWS services for replication. If your goal is synchronous replication (RPO=0), consider Amazon FSx for NetApp ONTAP. For RPO in seconds to minutes, use AWS Elastic Disaster Recovery. For less stringent RPO and RTO requirements, conventional backup and recovery works well. AWS Backup offers features like scheduling and data retention policies, as well as capabilities such as anomaly detection on your backups. AWS Backup stores backup data in Amazon S3, which is a regional resource spanning multiple Availability Zones. As such, any resource backed up with AWS Backup is resistant to an AZ failure. AWS Backup can also cover the long-distance use case with cross-region replication. The minimum schedule interval for AWS Backup is 1 hour, and AWS Backup uses Amazon EBS incremental snapshots, analogous to vSphere Changed Block Tracking (CBT). Third-party ISV products available on AWS Marketplace can also provide backup and disaster recovery capabilities for on-premises vSphere environments to fail over to an Amazon Virtual Private Cloud (Amazon VPC).
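As a hedged sketch of scheduling, retention, and cross-region copy with AWS Backup, the boto3 example below creates a daily backup plan with 35-day retention and a copy action to a vault in a second Region; the vault ARN, account ID, schedule, and retention values are assumptions for illustration.

import boto3

backup = boto3.client("backup", region_name="us-east-1")

backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "daily-with-cross-region-copy",
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": "Default",
                "ScheduleExpression": "cron(0 5 * * ? *)",   # every day at 05:00 UTC
                "Lifecycle": {"DeleteAfterDays": 35},        # retention policy
                "CopyActions": [
                    {
                        # replicate each recovery point to a vault in another Region
                        "DestinationBackupVaultArn": "arn:aws:backup:us-west-2:111122223333:backup-vault:Default",
                        "Lifecycle": {"DeleteAfterDays": 35},
                    }
                ],
            }
        ],
    }
)

Resources are then assigned to the plan with a backup selection (for example, by tag).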

Figure 3: Create an On-Demand AWS Backup

A recovery in vSphere Replication offers two options familiar to vSphere administrators, “Synchronize recent changes” and “Use latest available data”. The first option requires that the source VM is powered off, then synchronizes source and target; it can be described as a cross-data center cold vMotion. To emulate this pattern in AWS, there is no requirement that the instance be shut down. However, an instance that continues to run will most likely accumulate changes after the most recent backup. If you have cross-region AWS Backups configured, your latest backup may be sufficiently recent to use. In this case you can simply restore from the latest backup in the target region, which corresponds to the “Use latest available data” choice in vSphere. If you have not enabled cross-region backups, you can select the most recent backup in your vault and copy it to the target region.
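If cross-region copies are not already configured, a one-off copy of the most recent recovery point can be scripted along these lines; the vault names, account ID, and IAM role ARN are placeholder assumptions.

import boto3

backup = boto3.client("backup", region_name="us-east-1")

# Find the most recent recovery point in the source vault...
points = backup.list_recovery_points_by_backup_vault(BackupVaultName="Default")["RecoveryPoints"]
latest = max(points, key=lambda p: p["CreationDate"])

# ...and copy it to a vault in the target Region.
backup.start_copy_job(
    RecoveryPointArn=latest["RecoveryPointArn"],
    SourceBackupVaultName="Default",
    DestinationBackupVaultArn="arn:aws:backup:us-west-2:111122223333:backup-vault:Default",
    IamRoleArn="arn:aws:iam::111122223333:role/service-role/AWSBackupDefaultServiceRole",
)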

Figure 4: Copy an AWS backup to a different region

On-premises cross-region replication jobs are frequently constrained by Internet or VPN bandwidth, or by the high cost of dark fiber. Cloud replication runs on the AWS network, not the Internet, and SLAs are available for Amazon S3 replication. Historically, a passive disaster recovery site has been a very cost-inefficient architecture component, forcing customers to choose between two options. One is to configure and pay for enough idle capacity to fail over all critical business applications. The other is to pay for minimal capacity (that’s still idle) but sufficient to support some subset of critical workloads. Using AWS as a disaster recovery site offers an economic advantage in that you configure only as much active capacity as required and do not pay for idle resources.

Backup and recovery products typically store data in their native format. For example, if you back up a vSphere VM, you get a copy of the data contained in the VMDK. But VMDKs do not natively boot or run on AWS infrastructure. For fast recovery of vSphere environments in AWS, you have two choices: either restore to VMware Cloud on AWS, or replicate into Amazon EC2 instances on an ongoing basis using AWS Elastic Disaster Recovery. For recovery where RTO requirements are less stringent, consider using AWS Backup (or a partner backup solution). These solutions use VM Import/Export to convert on-premises virtual machines to Amazon EC2 instances.

Conclusion

In summary, vSphere High Availability and Replication both have close analogues in the public cloud (SAR, CloudWatch alarms, and Auto Scaling groups). The primary differences are the additional resiliency from regional services, the benefits of automation, and the “pay-as-you-go” model. Now that you know how to protect virtual machines once they’ve been migrated to AWS, I encourage you to experiment. Use Auto Scaling groups, snapshots, snapshot replication, and AWS Backup to get a head start on your cloud DR architecture.

About the Author:

William Quigley

William Quigley is a Partner Solutions Architect focused on storage partners at AWS. In his free time, he enjoys sailing as well as tending to his chickens and his two teenagers.