AWS Storage Blog

Deploy a highly available AWS Storage Gateway on a VMware vSphere cluster

Tens of thousands of customers are using AWS Storage Gateway today for business-critical applications. By providing common storage protocols such as iSCSI, NFS, SMB, and iSCSI VTL, Storage Gateway makes it easy for customers to continue using their existing applications while taking advantage of virtually unlimited cloud storage. Customers use Storage Gateway to simplify storage management and reduce costs for three key hybrid cloud storage use cases – to move on‑premises backups to the cloud, to reduce on‑premises storage with cloud-backed file shares, and to provide low latency access to data in AWS for on‑premises applications.

Storage Gateway has launched a number of features this year to help customers meet the increasing demands of their most critical applications. Today we are launching Storage Gateway High Availability (HA) on VMware to meet the operational needs of uninterruptible, latency-sensitive workloads such as media drives, streaming log repositories, and storage for scientific instruments.

Many AWS Storage Gateway customers today deploy and run their gateways on VMware vSphere clusters. With the new high availability feature, these customers can now expand their use of AWS Storage Gateway to applications that have high availability needs and cannot be interrupted. Through a set of health checks integrated with VMware vSphere High Availability (vSphere HA), a Storage Gateway deployed in a VMware environment on‑premises or in VMware Cloud™ on AWS will automatically recover from most service interruptions in under 60 seconds with no data loss. This protects storage workloads against hardware, hypervisor, or network failures, as well as software errors such as connection timeouts and file share or volume unavailability. It also eliminates the need for custom scripts for health checks, auto‑restart, and alerting.

Diving a bit deeper, the new health checks enhance failure detection by leveraging both the vSphere HA VM Monitoring and Application Monitoring services. If either service stops receiving heartbeats from the Storage Gateway virtual machine (VM), the VM will be restarted on another ESXi host to restore service operation. Both types of heartbeats are sent to the host via the virtual machine communication interface (VMCI) channel provided by the open‑vm‑tools VMware Tools package installed on the virtual appliance. Specifically, the VM Monitoring service listens for heartbeats generated by the VMware Tools agent, as well as for network and disk I/O activity in the guest. In addition, the storage server services in Storage Gateway have all been augmented to send heartbeats to the vSphere HA Application Monitoring service via the vSphere Guest SDK. This combination of health checks helps restore service quickly in a failure scenario.

Storage Gateway can be deployed in one of three modes: Tape Gateway, Volume Gateway, and File Gateway. The mode you choose depends upon the needs of your application. The new high availability feature works with new and existing File Gateways, new Volume Gateways, and new Tape Gateways, running on VMware ESXi. Volume and Tape Gateways created prior to November 20, 2019 and running on VMware ESXi will receive software updates for High Availability [starting April 2020].

In this blog post, we will show you how to deploy an AWS File Gateway in an on‑premises VMware vSphere cluster or in a VMware Cloud on AWS software-defined data center (SDDC), and how to configure that environment for high availability. We will also show you how to run a simple test to validate failure recovery of your File Gateway.

Architecture and prerequisites

Architecture diagram of using File Gateway

Using a File Gateway is straightforward. Simply deploy the gateway as a VM in your VMware vSphere cluster or VMware Cloud on AWS SDDC, create an NFS or SMB file share that connects to your S3 bucket, and then present the file share to your applications.

To use the HA feature of Storage Gateway, your VMware environment must provide the following:

  • A cluster with vSphere HA enabled
  • A shared datastore (such as a SAN or vSAN) that is available to all hosts in the cluster. This shared datastore will be used for the operating system and cache storage virtual hard disks by Storage Gateway.

These capabilities are automatically provided and enabled in VMware Cloud on AWS. For on‑premises deployments, you must make sure they are enabled in your VMware vSphere cluster.

Also make sure that your Storage Gateway VM has the correct network ports opened for communication to the AWS Storage Gateway service as well as on‑premises applications and resources. Please see the Port Requirements for File Gateway for full details.
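Once the gateway VM is running, a quick reachability check from a client can catch firewall problems early. The following is a minimal sketch rather than an official tool: it assumes netcat (nc) is installed, and the helper names probe and check_ports are hypothetical. The ports in the usage example cover only the NFS (2049) and SMB (445) file services; the Port Requirements for File Gateway remain the authoritative list.

```shell
# probe HOST PORT: succeed if a TCP connection opens within 1 second.
# Assumes netcat (nc) is installed on the client.
probe() {
    nc -z -w 1 "$1" "$2" >/dev/null 2>&1
}

# check_ports HOST PORT...: print "HOST:PORT open" or "HOST:PORT closed"
# for each port given (hypothetical helper, not part of any AWS tooling).
check_ports() {
    local host="$1"
    shift
    local port
    for port in "$@"
    do
        if probe "$host" "$port"
        then
            echo "$host:$port open"
        else
            echo "$host:$port closed"
        fi
    done
}
```

For example, check_ports <gateway ip address> 2049 445 prints one line per port.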

Deploy the File Gateway VM

Follow the AWS Storage Gateway product documentation to deploy your File Gateway. When prompted in the AWS Management Console, select “VMware ESXi” as your host platform and download the OVA image to your workstation. From there, log in to vCenter and deploy the OVA image. When allocating the root disk and the cache volume to the VM, make sure that the datastore is available to all hosts in the vSphere cluster. This will ensure optimal HA behavior for the VM.

As you are deploying your gateway, consider the following:

Once the gateway has been successfully deployed and activated, you can proceed to create an NFS or SMB file share. The file share will provide standard file protocol access to your applications so they can access the data in your S3 bucket.
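If you prefer to script this step, the AWS CLI provides a create-nfs-file-share command. The sketch below assumes the AWS CLI is installed and configured with credentials. The function name create_nfs_share is hypothetical, and the three ARNs are placeholders you must supply: the gateway, an IAM role that lets the gateway access the bucket, and the S3 bucket itself. It leaves every optional setting at its default, so treat it as a starting point only.

```shell
# create_nfs_share GATEWAY_ARN ROLE_ARN BUCKET_ARN
# Create an NFS file share backed by the given S3 bucket and print the ARN of
# the new share. Assumes the AWS CLI is installed and configured.
create_nfs_share() {
    # The client token is a unique string of your choosing that makes the
    # request idempotent; a timestamp plus the shell PID is used here.
    aws storagegateway create-nfs-file-share \
        --client-token "$(date +%s)-$$" \
        --gateway-arn "$1" \
        --role "$2" \
        --location-arn "$3" \
        --query FileShareARN \
        --output text
}
```

A call would look like create_nfs_share <gateway arn> <role arn> <bucket arn>.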

Configure vSphere HA

Before using your File Gateway, you should verify the HA settings on your vSphere cluster. Log in to vCenter and select the cluster where the File Gateway will be deployed. Under the Configure tab, select vSphere Availability and then click on the Edit… button, as shown below:

Configure vSphere HA

In the “Edit Cluster Settings” dialog box, select the Failures and responses tab to configure how vSphere will respond to VM failures.

Configure the cluster as follows:

  • Host Failure Response: set to “Restart VMs”
  • Response for Host Isolation: set to “Shut down and restart VMs”
  • Datastore with PDL: set to “Disabled”
  • Datastore with APD: set to “Disabled”
  • VM Monitoring: for Storage Gateway to properly respond to hardware failures, this must be set to “VM and Application Monitoring.” We recommend that you also configure a VM Override for the File Gateway VM for defining the custom VM monitoring sensitivity specifications described below. Do not set this to Disabled at either the cluster or VM Override levels because it will prevent the Storage Gateway HA feature from operating properly.

Expand the VM Monitoring section to configure monitoring sensitivity:

Configure monitoring sensitivity as follows:

  • Failure Interval: 30 seconds
  • Minimum uptime: 120 seconds
  • Maximum per-VM resets: 3
  • Maximum resets time window: 1 hour
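To see how the last two settings interact, consider the simplified model below. It treats the limit as a sliding window: a reset is permitted only if fewer than three resets occurred in the preceding hour. This is an illustration under that assumption, and vSphere's exact accounting may differ. The function name reset_allowed is hypothetical.

```shell
# Values taken from the monitoring sensitivity settings above.
MAX_RESETS=3
WINDOW_SECONDS=3600   # 1 hour

# reset_allowed NOW RESET_TIME...
# Succeed if fewer than MAX_RESETS of the given reset times (in epoch seconds)
# fall within the window that ends at NOW. Simplified sliding-window model.
reset_allowed() {
    local now="$1"
    shift
    local count=0
    local t
    for t in "$@"
    do
        if [ $((now - t)) -lt "$WINDOW_SECONDS" ]
        then
            count=$((count + 1))
        fi
    done
    [ "$count" -lt "$MAX_RESETS" ]
}
```

Under this model, a VM that was reset three times within the last hour would not be reset again until the oldest of those resets is more than an hour old.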

If you have other VMs running on the cluster, you may want to set these values specifically for your File Gateway VM. VM-level values can only be configured after the File Gateway has been deployed from the OVA image. Once you have deployed the VM, you can override its settings in the vSphere Client by selecting the cluster and going to “Configure” > “VM Overrides.” You can then add a new VM override to change the values. If the cluster is set to VM Monitoring only, you must also enable “VM and Application Monitoring.”

Verify VMware HA

Once your gateway is deployed and activated, you can verify the VMware HA configuration from the AWS Storage Gateway management console using the steps in the documentation. Note that this procedure will reboot your gateway, resulting in a few minutes of interrupted connectivity.
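The same test can also be started from the AWS CLI, which exposes the start-availability-monitor-test and describe-availability-monitor-test commands. The sketch below assumes the AWS CLI is installed and configured. The function name run_ha_test and the 10-second polling interval are our own choices, and the gateway ARN is a placeholder. As with the console procedure, running it will reboot your gateway.

```shell
# run_ha_test GATEWAY_ARN
# Start the availability monitor test, poll until it leaves the PENDING
# state, and print the final status reported by the service.
# Assumes the AWS CLI is installed and configured. This reboots the gateway.
run_ha_test() {
    local gateway_arn="$1"
    local status
    aws storagegateway start-availability-monitor-test \
        --gateway-arn "$gateway_arn" >/dev/null || return 1
    while true
    do
        status="$(aws storagegateway describe-availability-monitor-test \
            --gateway-arn "$gateway_arn" \
            --query Status --output text)" || return 1
        if [ "$status" != "PENDING" ]
        then
            echo "$status"
            return 0
        fi
        sleep 10   # polling interval is an arbitrary choice
    done
}
```

A call would look like run_ha_test <gateway arn>; a final status of COMPLETE indicates that the test finished.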

If everything was set up and configured correctly, then you should see the following message in the console:

Testing

To make sure that your gateway recovers as expected, you can use the following steps to test a failure and recovery scenario:

  1. Create an NFS file share on your gateway and connect it to your S3 bucket.
  2. Mount the NFS file share on a Linux machine.
  3. Run the following command to walk the directory tree of the S3 bucket, where “<mount point>” is the directory on your Linux machine where the NFS file share was mounted. This command will build up the metadata cache on the gateway.
[user@client1 ~]$ find <mount point> -type f

NOTE: Depending upon the number of files in your S3 bucket, the above command could take some time to run.
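For step 2 above, the mount can be wrapped in a small helper such as the sketch below. It assumes an NFS client is installed, that you have sudo rights, and that the share is exported under the S3 bucket name. The Storage Gateway console shows the exact mount command for your share, so prefer that if it differs. The function name mount_share is hypothetical.

```shell
# mount_share GATEWAY_IP SHARE_NAME MOUNT_POINT
# Create the mount point if needed and mount the NFS file share on it.
# Assumes the share is exported as /SHARE_NAME (typically the bucket name).
mount_share() {
    mkdir -p "$3" &&
        sudo mount -t nfs -o nolock,hard "$1:/$2" "$3"
}
```

A call would look like mount_share <gateway ip address> <bucket name> <mount point>.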

Once the find command has completed, copy the following bash script to your Linux machine. This script will use the netcat (nc) command to continuously check connectivity with the File Gateway, printing a message when connectivity is lost and when it is restored. If it is not already installed on your Linux machine, you must install the netcat utility.

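#!/bin/bash
# Usage: ./test.sh <gateway ip address> <port>
# Reports when TCP connectivity to the gateway is lost and when it is restored.

if [ $# -ne 2 ]
then
    echo "Usage: $0 <gateway ip address> <port>"
    exit 1
fi

# Verify that the gateway is reachable before starting the test.
nc -zw 1 "$1" "$2" &>/dev/null
status=$?
if [ $status -ne 0 ]
then
    echo "Error or gateway not responding"
    exit 1
else
    echo "Gateway is responding"
    date
fi

while true
do
    # Poll until a connection attempt fails. Any non-zero exit status from
    # nc is treated as a failure.
    nc -zw 1 "$1" "$2" &>/dev/null
    status=$?
    if [ $status -ne 0 ]
    then
        echo "Gateway not responding"
        date
        # Keep polling until the gateway accepts connections again.
        while [ $status -ne 0 ]
        do
            nc -zw 1 "$1" "$2" &>/dev/null
            status=$?
            if [ $status -eq 0 ]
            then
                echo "Gateway back online"
                date
                exit
            fi
        done
    fi
done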

Assuming you saved the script as test.sh, make it executable (for example, with chmod +x test.sh) and then execute it as follows:

[user@client1 ~]$ ./test.sh <gateway ip address> 2049

This will test for connectivity to your File Gateway on port 2049, which is the TCP port used by the NFS protocol. While the script is executing, reset the File Gateway VM through the vCenter console. This will simulate a VM failure and will provide you with an approximation of how much time will be spent during VM recovery. Once recovery completes, the script will finish and you should see output similar to the following:

[user@client1 ~]$ ./test.sh 172.20.10.180 2049
Gateway is responding
Fri Nov 15 16:58:43 UTC 2019
Gateway not responding
Fri Nov 15 16:58:56 UTC 2019
Gateway back online
Fri Nov 15 16:59:07 UTC 2019

In this case, it took about 11 seconds for our File Gateway to recover from the VM reset.
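If you run the test repeatedly, you may want to compute the recovery time rather than read it off the timestamps. The sketch below subtracts the two timestamps printed by the script. It relies on the -d option of GNU date, so it will not work as written with other date implementations, and the function name elapsed_seconds is hypothetical.

```shell
# elapsed_seconds START END
# Print the number of seconds between two timestamps given in the format
# that the date command prints. Requires GNU date (for the -d option).
elapsed_seconds() {
    local start end
    start="$(date -d "$1" +%s)" || return 1
    end="$(date -d "$2" +%s)" || return 1
    echo $((end - start))
}

# Timestamps taken from the sample output above.
elapsed_seconds "Fri Nov 15 16:58:56 UTC 2019" "Fri Nov 15 16:59:07 UTC 2019"
```

With the sample timestamps, this prints 11, matching the recovery time noted above.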

You can also take a look at the CloudWatch logs for your gateway to get more information on VMware HA events.

Conclusion

In this blog post we’ve shown you how to deploy an AWS File Gateway with enhanced HA capabilities. We also highlighted some of the key considerations to remember when deploying Storage Gateway in VMware environments. We then showed you how to run a simple test to validate recovery times for your gateway.

Although we focused on using File Gateway in this blog post, remember that the new HA feature applies to Tape Gateway and Volume Gateway as well.

For more information, including use cases, customer stories, and helpful videos, check out the AWS Storage Gateway product page.

Next steps to consider

  • Deploy an AWS Storage Gateway VM in your vSphere cluster or VMware Cloud on AWS SDDC today
  • Test the new Storage Gateway HA feature for yourself with the script provided above

Jeff Bartley

Jeff is a Solutions Architect at AWS, focused on Hybrid Cloud Storage and Data Transfer. He enjoys helping customers tackle their biggest storage challenges through cloud-scale architectures. A native of Southern California, Jeff loves to get outdoors whenever he can.

Troy Lindsay

Troy Lindsay is a Senior Partner Solutions Architect at AWS facilitating the engineering partnership with VMware. He is passionate about helping customers solve problems via product development, architecture, open source, and automation. He enjoys reading, dogs, and kakuro puzzles.