Migrating your Microsoft Windows clusters to AWS using CloudEndure Migration

UPDATE (7/16/2021): This blog post describes CloudEndure Migration. AWS Application Migration Service, the next generation of CloudEndure Migration, is now the recommended service for lift-and-shift migrations to AWS.

Many organizations leverage Windows Server Failover Cluster (WSFC) with shared storage to build a group of servers that work together. This enables them to increase the availability of their applications and services, ensuring that they serve their own customers effectively and that they consistently remain fully operational. Customers often migrate their clusters to AWS to take advantage of all that AWS has to offer, in terms of fully managed and cost-efficient services across a vast portfolio of categories. I frequently get questions from customers on best practices when migrating these types of clusters to AWS, and things they should consider in terms of storage, replication, and cutover options to ensure a successful and consistent migration.

In this blog post, I share some best practices to migrate Windows clusters to AWS using CloudEndure Migration, and provide recommendations for steps to consider before, during, and after migration.

Migrating a standalone server to AWS

CloudEndure Migration is a block-level replication service that simplifies the process of migrating applications from physical, virtual, and cloud-based servers to AWS. The process of migrating a standalone server to AWS using CloudEndure Migration is straightforward. These are the high-level steps (check here for more details):

Establish connectivity between the source environment and AWS.
Complete the networking requirements for CloudEndure Migration.
Register for a CloudEndure account, create a migration project, and get an agent installation token.
Install the CloudEndure agent on the source server.
Configure the Blueprint for the target server in the CloudEndure console.
Wait for the replication to complete.
Launch a target machine using Test mode.
Launch a target machine in Cutover mode.
Shut down the source machine.

In addition to the preceding steps, migrating a cluster involves further considerations, because multiple servers in the cluster share the same storage. In a Windows environment, High Availability (HA) is typically deployed across multiple nodes that are members of a Windows Server Failover Cluster (WSFC). This setup can be implemented in multiple HA configurations, such as a File Server Cluster or a SQL Server Always On Failover Cluster Instance. Apply the considerations in the following sections in cases where WSFC and shared storage are used.

Shared storage considerations

Your cluster likely has storage that is shared between all cluster nodes. In the most common cases, the shared storage is implemented as a SAN (Storage Area Network). Since CloudEndure Migration supports only block-level replication, it replicates only SAN clusters. In SAN clusters, the storage disks appear to the server as local drives, and you must install the CloudEndure agent locally on each cluster node (see different types of installation in the next section).

In cases where you implement your shared storage using Network Attached Storage (NAS), the volumes appear to the server as shared volumes over the network, using protocols like Server Message Block (SMB). If you want to migrate volumes implemented using NAS, then you must install the CloudEndure agent on the NAS server. This replicates the volumes from the NAS server only. The CloudEndure agent will not pick up the volumes on the other servers (the servers that use the NAS as their storage).

Another consideration is the way the volume is presented to the cluster node. Is it mapped as an actual drive letter on the SAN? Or is it a mount point in another folder? In some cases, if it’s mapped to a folder, you must run the CloudEndure agent installation using a special parameter: --force-volumes. This overrides the agent installer's automatic detection of physical disks to replicate, and replicates the exact list of drives that you provide as part of the command. If you have such a case, I recommend that you reach out to AWS Support for guidance on using this solution in your specific environment.

For example, on a server with three volumes, you can use the --force-volumes parameter to force the CloudEndure agent to pick up the volumes:

Installer_win.exe --force-volumes --no-prompt -t <installation token> --drives="\\.\PHYSICALDRIVE0,\\.\PHYSICALDRIVE1,\\.\PHYSICALDRIVE2,\\.\PHYSICALDRIVE3"

You can retrieve the installation token from the project in the CloudEndure console. Check here for more information.
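When the list of drives is long, assembling the command by hand is error-prone. The following is a minimal Python sketch that builds the same command line from a list of physical drive numbers. The function name and its parameters are hypothetical helpers for illustration; only the installer name and the flags shown in the preceding command come from this post, and the script only prints the command rather than running it.

```python
def build_force_volumes_command(token, drive_numbers, installer="Installer_win.exe"):
    """Return the agent installer command line as a single string.

    token         -- the installation token from the CloudEndure console
    drive_numbers -- physical drive numbers to replicate, e.g. [0, 1, 2, 3]
    """
    if not drive_numbers:
        raise ValueError("at least one drive number is required")
    # Windows physical-disk device paths have the form \\.\PHYSICALDRIVE<n>.
    drives = ",".join(rf"\\.\PHYSICALDRIVE{n}" for n in drive_numbers)
    return f'{installer} --force-volumes --no-prompt -t {token} --drives="{drives}"'


# Reproduces the example command above with a placeholder token.
print(build_force_volumes_command("<installation token>", [0, 1, 2, 3]))
```

Review the printed command against the disks that Windows reports on the node before you run it, because --force-volumes replicates exactly the drives you list.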

Replication considerations

The replication speed from the source environment to the CloudEndure Staging Area depends on several factors: the uplink speed and available bandwidth from the source server to the replication server, the total amount of disk storage, the rate of change on the disks during replication, and the I/O speed of the local disk or SAN storage. When migrating a cluster, in addition to the previous factors, you must plan for the case where the primary node of the cluster fails during the replication. The failure could be caused by a network disruption, storage failure, or operating system problems, to name a few.

Depending on how you need to proceed in case of a failover event, such as network, operating system or hardware failures, here are the scenarios I recommend you consider:

Scenario 1: Replication time is short (a few hours)

Given the preceding replication speed factors, if you have a few TBs of data and a network bandwidth of 1 Gbps or higher, you should be able to complete your replication in a few hours. In this approach, you install the CloudEndure agent on every node that makes up your cluster. If the primary node fails over during the replication, ownership of the shared storage moves to the secondary node, and the storage is removed from the replication. At this point, you reinstall the agent on the new node, and the replication starts from the beginning on the secondary node. This is the default behavior and the easiest to implement, as long as it’s acceptable to lose the few hours of replication that had already been completed on the primary node before it failed.
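To check whether your environment fits this scenario, a back-of-the-envelope estimate can help. The sketch below is only a lower bound under simplifying assumptions: it divides the data size by the usable bandwidth and ignores disk I/O limits and data that changes during replication. The function name and the efficiency parameter (the fraction of the link you assume is actually available) are illustrative assumptions, not CloudEndure settings.

```python
def estimate_replication_hours(data_tb, bandwidth_gbps, efficiency=1.0):
    """Estimate the initial replication time in hours.

    data_tb        -- data to replicate, in decimal terabytes (1 TB = 10**12 bytes)
    bandwidth_gbps -- uplink speed in gigabits per second
    efficiency     -- assumed usable fraction of the link, between 0 and 1
    """
    if data_tb < 0:
        raise ValueError("data_tb must not be negative")
    if bandwidth_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("bandwidth must be positive and efficiency in (0, 1]")
    total_bits = data_tb * 1e12 * 8
    usable_bits_per_second = bandwidth_gbps * 1e9 * efficiency
    return total_bits / usable_bits_per_second / 3600


# 2 TB over a fully available 1 Gbps link: about 4.4 hours.
print(f"{estimate_replication_hours(2, 1):.1f} hours")
# The same data when only half of the link is available: about 8.9 hours.
print(f"{estimate_replication_hours(2, 1, efficiency=0.5):.1f} hours")
```

If the estimate is already in the range of days rather than hours, the restart-from-scratch behavior of this scenario is likely too costly, and one of the following scenarios may be a better fit.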

Scenario 2: Replication time is long (hours to days) and cluster’s failback time is short

Unlike the previous scenario, you may have a case where you can’t tolerate the extra time to restart the replication from scratch on a secondary node after a failover incident. In this case, I recommend that you stop the replication on the primary node, where the failure occurred, and fail over to the secondary node. After the cluster becomes healthy again (for example, after you resolve the root cause), you can fail back to the primary node and restart the CloudEndure service. This triggers a rescan of the disks and replicates only the data that changed since replication stopped on the primary.

These are the steps at a high level:

Install the CloudEndure agent on the primary node of the cluster.
Disable the CloudEndureVolumeUpdater service on the primary node of the cluster. This service is responsible for adjusting the replication volumes when a disk size on the source machine is changed. The service is also responsible for removing any disks not attached to the machine from replication. Disabling this service will enable you to change the default behavior of CloudEndure and force the replication not to move to the secondary node when the primary fails.
If you have a failover incident, the primary node fails and the storage ownership will be moved to a secondary cluster node.
Resolve the problem on the primary node and fail back from the secondary to the primary node.
At this point, the cluster is healthy again.
Restart the CloudEndure service on the primary node. This will trigger the CloudEndure agent to rescan the disks and replicate the incremental data.
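The service operations in these steps can be sketched with the Windows sc.exe utility. The Python snippet below only builds and prints the commands; it does not run them. It assumes that CloudEndureVolumeUpdater is the registered Windows service name, and that "disabling" means stopping the service and setting its start type to disabled. The agent service name is left as a placeholder because this post does not state it; confirm both names on your node (for example, in the Services console) before using anything like this.

```python
def disable_service_commands(service_name):
    """Commands that stop a Windows service and prevent it from starting again."""
    return [
        ["sc.exe", "stop", service_name],
        # sc.exe expects the option as "start=" followed by a separate value.
        ["sc.exe", "config", service_name, "start=", "disabled"],
    ]


def restart_service_commands(service_name):
    """Commands that stop and then start a Windows service."""
    return [
        ["sc.exe", "stop", service_name],
        ["sc.exe", "start", service_name],
    ]


VOLUME_UPDATER = "CloudEndureVolumeUpdater"          # name taken from this post
AGENT_SERVICE = "<CloudEndure agent service name>"   # placeholder, not a real name

# Before a failover: keep the volume updater from dropping disks.
for command in disable_service_commands(VOLUME_UPDATER):
    print(" ".join(command))

# After failing back to the primary node: trigger the rescan.
for command in restart_service_commands(AGENT_SERVICE):
    print(" ".join(command))
```

Run the printed commands from an elevated prompt on the primary node. To execute them from Python instead, each list could be passed to subprocess.run, but printing first makes it easier to review what will change.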

Scenario 3: Replication time is long and cluster’s failback time is long

This scenario assumes that you have a failure event on the cluster that requires an unknown and potentially long time to fix. In that case, you don’t want to hold your replication until the cluster fails back and becomes healthy. In this method, you force CloudEndure Migration to ignore the disk changes caused by failover and failback on all nodes, and you control the replication process manually. This is similar to Scenario 2, except that you don’t wait for the cluster to become healthy before continuing the replication. Instead, when there is a failover to a secondary node, you continue the replication from that node for the incremental data (from the last time this node was active). In other words, from the replication perspective, you treat your cluster as if it were a group of standalone nodes.

These are the steps at a high level:

Install the CloudEndure agent on every node of the cluster.
Disable the CloudEndureVolumeUpdater service on every node of the cluster (for more details on the CloudEndureVolumeUpdater service, see Scenario 2).
Assume you have a cluster failover incident at some point during the replication.
Switch to the active node of the cluster and restart the CloudEndure service.
The restart triggers a disk rescan and replicates the incremental data (from the last time this node was active).

There is a disadvantage to this approach. If the source cluster is X machines that have Y storage each, CloudEndure Migration maintains the Y storage X times on the cloud for replication (resulting in X times Y used storage).
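To put numbers on this trade-off, here is a small Python sketch of the X times Y calculation. The function names are hypothetical, and the result covers only the replicated shared storage implied by the formula above; it does not include any other staging area resources.

```python
def staging_storage_tb(node_count, shared_storage_tb):
    """Storage kept in the cloud when every node replicates the shared disks."""
    if node_count < 1 or shared_storage_tb < 0:
        raise ValueError("need at least one node and a non-negative storage size")
    return node_count * shared_storage_tb


def extra_storage_tb(node_count, shared_storage_tb):
    """Storage used beyond the single copy that replicating one node would need."""
    return staging_storage_tb(node_count, shared_storage_tb) - shared_storage_tb


# A 3-node cluster with 2 TB of shared storage keeps 6 TB in the cloud,
# which is 4 TB more than replicating from a single node.
print(staging_storage_tb(3, 2), extra_storage_tb(3, 2))
```

Weigh this extra storage against the replication time you would lose by waiting for the cluster to become healthy, as in Scenario 2.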

Cutover considerations

The last phase of your cluster migration, after you confirm that the replication has reached a Continuous Data Protection (CDP) state, is to launch a target machine in Test mode, and then in Cutover mode. Bringing block-level shared storage to AWS requires adjustments to your architecture. If you try to bring up a WSFC service on EC2 target instances before transitioning to the new architecture, it will fail. Therefore, at this point, you must decide how to implement your HA requirements on AWS. A popular approach is to break the cluster and use alternative approaches for HA that don’t require shared storage. Since CloudEndure Migration supports only lift-and-shift migration, the changes you make at this phase are done outside of the CloudEndure console.

Depending on your application, there are multiple options for shared file storage on Windows. For example, if your application requires shared storage on AWS, you can use Amazon FSx for Windows File Server, which provides fully managed, highly reliable, and scalable file storage that uses SMB. If you use SQL Server Always On Failover Cluster Instances (FCI) on-premises (or a simple file server cluster), you can deploy a similar architecture on AWS using Amazon FSx. For details, check out the blog post “Simplify your Microsoft SQL Server high availability deployments using Amazon FSx for Windows File Server.” Another alternative is to move to SQL Server Always On Availability Groups on EC2. For details, check the SQL Server with WSFC on AWS Quick Start. For additional details, check out the whitepaper on best practices to deploy MS SQL Server on AWS.

Conclusion

Many customers use Windows Server Failover Cluster (WSFC) to ensure that their applications and services are highly available. Often, organizations using these clusters seek to migrate them to AWS to take advantage of the extensive AWS portfolio of fully managed and cost-efficient services, like Amazon FSx for Windows File Server. Migrating Windows clusters with shared storage to AWS requires a few considerations, such as how you install the CloudEndure agent and which replication method you choose in case of a failover event. In this blog post, I shared some prescriptive guidance on bringing your Windows cluster to AWS using CloudEndure Migration. I walked through three replication scenarios based on time considerations, and I discussed different ways to bring your cluster online on AWS, such as breaking the cluster or using Amazon FSx.

If you have a comment or a question, please leave it in the comments section. I look forward to hearing from you.