Running SQL Server Stretch Failover Cluster Instance on Amazon EC2

In this blog post, we explore how to implement Microsoft SQL Server Failover Cluster Instance (FCI) on Stretch Cluster Replication in Amazon Elastic Compute Cloud (EC2). We are going to use Amazon Elastic Block Store (EBS) Multi-Attach on io2 volumes with Microsoft Windows Server Storage Replica technology as a shared storage solution across Availability Zones (AZs).

Introduction

Microsoft Stretch Cluster technology extends failover clusters across two locations, offering disaster recovery and business continuity. It uses active-active configurations to meet compliance standards and optimize resources.

Organizations can leverage Amazon EBS Multi-Attach on io2 and Storage Replica for cost-effective shared storage in Multi-AZ environments. This stretch cluster solution offers technical benefits supporting high availability.

Solution overview

The deployment we propose matches the architecture in Figure 1. You can deploy this solution using either Multi-AZ or Multi-Region strategies based on the network requirements. Evaluate what best fits your environment and the Storage Replica synchronization mode (synchronous or asynchronous).

Architecture diagram covering the proposed solution, illustrating two pairs of EC2 instances. Each pair on one availability zone, and having io2 multi-attach with Storage Replica between the pairs.

Figure 1 – Architecture diagram covering the proposed solution

To facilitate the nodes identification throughout the post, we have named them as:

AZ-A	AZ-B
SRVSQL01	SRVSQL03
SRVSQL02	SRVSQL04

Prerequisites

Deploy an Amazon Virtual Private Cloud (VPC) with a minimum of two subnets per each AZ.
Deploy a Microsoft Active Directory (AD) to enable domain-joining for Amazon EC2 instances and support Windows Server Failover Cluster installation. This can be either self-managed or an instance of AWS Managed Microsoft Active Directory. We used the corp.local domain in our setup.
An AD user with the required permissions to deploy a cluster and an AD service account for the SQL service. This SQL service account will be required when running the SQL Installation setup.
Two Amazon EC2 instances built on the AWS Nitro System, in different subnets, running Windows Server 2016 or later in each AZ . The instances should be launched in spread placement groups, to ensure they are placed across distinct underlying hardware.
SQL Server Enterprise edition installation media (or Developer edition for non-production environment).
Windows Server Failover Cluster (WSFC) to be deployed among the four nodes, using the multi-subnet architecture. Check our best practices and recommendations for SQL Server clustering on Amazon EC2 documentation to know more.
Each Amazon EC2 instance on the cluster should have one Elastic Network Interface (ENI) assigned that includes three private IP addresses on the subnet: a primary IP address, a cluster IP address, and a SQL instance IP address. Check our SQL cluster IP addresses assignment documentation for more details.
At least two io2 volumes with the Multi-Attach feature enabled and attached to each pair of nodes. One must be used as the Storage Replica log store. While the other used to store the SQL Server data and log files. We used three volumes attached to each pair of nodes, as described:
- Storage Replica logs (50 GiB).
- SQL data disk (100 GiB).
- SQL logs disk (150 GiB).
- To increase storage replication performance and reduce cost, for TempDB , we used the instance store available on instance r5ad.xlarge.
An SMB file share to set up the file share witness (most useful for multisite environments). We used Amazon FSx for Windows File Server.

Walkthrough

This section will cover the steps to deploy the proposed solution.

1. Prepare the nodes to work with EBS Multi-Attach on io2

Check the How to deploy a SQL Server failover cluster with Amazon EBS Multi-Attach on Windows Server blog post and follow steps 2 to 4 to prepare the Windows Server nodes to support EBS Multi-Attach on io2, which comprises:

Create the EBS io2 volumes on each AZ, enabling the multi-attach feature, and attach to the pairs of Amazon EC2 instances. The AZ must be the same as where the Amazon EC2 instances are running.
Open a Remote Desktop Protocol (RDP) connection into SRVSQL01, initialize the disks as GPT and format the volumes. Take note of the letters applied to the volumes as they will be required moving forward. Do the same on SRVSQL03 (use the same drive letters).
Ensure the latest AWS NVMe drivers are installed on each node and configure SCSI persistent reservations for Windows instances.

2. Set up the Windows Server Failover Cluster

Follow the steps below to set up the cluster:

1. On each Amazon EC2 instance, install the required Windows features running a PowerShell command prompt as Administrator:

Install-WindowsFeature -Name Failover-Clustering,Storage-Replica,FS-FileServer,RSAT-AD-Tools -IncludeManagementTools -Restart

2. After rebooting the Amazon EC2 instances, RDP into SRVSQL01, open the Windows PowerShell ISE as Administrator, and run the following script. The output should be like Figure 2.

$nodes = "node1", "node2", "node3", "node4" 
$vips = "cluster_ip_1", "cluster_ip_2", "cluster_ip_3", "cluster_ip_4" 
$ClusterNm = "ClusterName" 
New-Cluster -Name $ClusterNm -Node $nodes -StaticAddress $vips

Picture highlighting the cluster creation using PowerShell script.

Figure 2 – Cluster creation using PowerShell script

Note: You may see a warning within the output. This is likely the quorum witness setting, that is unable to consider part of the disks. We are going to fix it in the next steps.

3. Once the cluster is created, run the following PowerShell script as Administrator. This is going to create the Fault Domains (Figure 3).

New-ClusterFaultDomain -Name Site1 -Type Site -Description "Description1" -Location "Location1"
New-ClusterFaultDomain -Name Site2 -Type Site -Description "Description2" -Location "Location2"

Set-ClusterFaultDomain -Name node1 -Parent Site1
Set-ClusterFaultDomain -Name node2 -Parent Site1
Set-ClusterFaultDomain -Name node3 -Parent Site2
Set-ClusterFaultDomain -Name node4 -Parent Site2

(Get-Cluster).PreferredSite="Site1"
(Get-Cluster).AutoAssignNodeSite=1

Figure showing the output of the PowerShell script that setups the Fault Domains within the cluster.

Figure 3 – Fault Domains setup

4. To ensure the cluster is considering all the available disks, run the PowerShell command:

Get-ClusterAvailableDisk -All | Add-ClusterDisk

5. Run the following PowerShell script to set up the file share witness, like in Figure 4.

$Clusterfqdn = "clustername.domain.local"
$FSWitness = '\\fileserver\witness' # defines the file-share location for the quorum configuration
Set-ClusterQuorum -Cluster $Clusterfqdn -FileShareWitness $FSWitness #Defines the cluster quorum - this would be a File-Share Quorum

Setting the file share witness.

Figure 4 – Setting the file share witness

6. When opening the Failover Cluster Manager snap-in, you will notice part of the disks are offline and with an information message saying, Clustered storage is not connected to the node (Figure 5). This will change once the Storage Replica is enabled. Rename the online disks to friendly names.

Figure showing the cluster disks as available storage.

Figure 5 – Cluster disks as available storage

7. Using the Failover Cluster Manager snap-in, validate the cluster running all tests. Expect warnings regarding the multi-subnet architecture and the offline disks (Figure 6).

Figure showing the cluster validation result with expected warnings.

Figure 6 – Cluster validation result

3. Install SQL Server Failover Cluster Instance

For the next steps, we are going to install the SQL Server instance on the first node of the environment. With that, we can set up the Storage Replica and then install SQL Server on the remaining nodes.

1. Using the Active Directory Users and Computers snap-in, add Create Computer Objects permission to the cluster computer object to the organizational unit it is at. This permission is needed to allow the cluster to create the SQL Server FCI computer object (Figure 7).

Figure showing the final step of adding the Active Directory permission to the cluster computer object.

Figure 7 – AD permission to the cluster computer object

2. On SRVSQL01, run the SQL Server setup and select a New SQL Server failover cluster installation, following the installation instructions. On Cluster Disk Selection, expect warnings about the disks that cannot be brought online (Figure 8). Click cancel on each dialog to continue the setup. Select the data and logs disks available and click on Next (Figure 9).

Warning dialog about the offline disks

Figure 8 – Warning dialog about the offline disks

Figure showing the disk selection to be used by the SQL instance.

Figure 9 – Select the disks to be used by the SQL instance

3. Move through the end of the SQL Installation setup. While the installation is running, you will see the warnings about the offline disks again. Click on Cancel on each dialog and continue the setup (Figure 10). After a few minutes, you should see the Complete page (Figure 11). Do not install SQL on the other nodes yet.

Figure showing the offline disks validation warnings.

Figure 10 – Offline disks validation warning

Figure showing the SQL installation complete page.

Figure 11 – SQL installation complete

4. Set up the Storage Replica feature

Once we have the SQL installation complete on the first node, the data and log disks will be assigned to the SQL Server role. We can now enable the Storage Replica feature to the second AZ volumes.

1. On SRVSQL01, run the following PowerShell command-line as Administrator. Change the drives letters as applied to your environment (Figure 12).

New-SRPartnership -SourceComputerName node1 -SourceRGName rg01 -SourceVolumeName "D:","E:" -SourceLogVolumeName R: -DestinationComputerName node3 -DestinationRGName rg02 -DestinationVolumeName "D:","E:" -DestinationLogVolumeName R: -ReplicationMode Asynchronous -EnableConsistencyGroups

Figure showing the PowerShell output when creating the Storage Replica partnership.

Figure 12 – PowerShell output when creating the Storage Replica partnership

2. Now you should see all the disks online (rename them to friendly names) and the initial block copy (Figure 13). SRVSQL03 owned disks are assigned to SR Group 1, the Storage Replica group.

Figure showing the Storage Replica setup working and running the initial block copy.

Figure 13 – Storage Replica setup

3. Use the Disks section from the Failover Cluster Management snap-in to check the block copy status, as shown in the previous Figure 13. Once the initial block copy is complete, run the add node to a SQL Server failover cluster installation wizard on the remaining nodes.

5. Failover the SQL service between nodes and Availability Zones

After installing SQL Server to the remaining nodes and having the replica complete, we can failover the SQL instance between nodes and AZs.

1. Connect to the SQL instance through the Client Access Name using SQL Server Management Studio. In the Additional Connection Parameters, add multisubnetfailover=true parameter (Figure 14).

Figure showing the connection to the SQL Server Failover Cluster Instance using the multisubnetfailover parameter.

Figure 14 – Connecting to the SQL Server FCI using the multisubnetfailover parameter

2. Use the following T-SQL statements to see the current owner while failing over the SQL service. Figure 15 shows SRVSQL01 as the current node, before failing over.

SELECT @@servername;
GO

SELECT NodeName, status, status_description, is_current_owner   
FROM sys.dm_os_cluster_nodes
ORDER BY NodeName;
GO

Figure showing the current SQL Server Failover Cluster Instance node owner.

Figure 15 – Current SQL Server FCI node owner

3. On the Failover Cluster Manager snap-in, right-click the SQL Server role, Move and Select Node…, select SRVSQL02, and click on OK (Figure 16). Seconds later, the instance should be running on the second node (Figure 17).

Figure showing the manual process to failover the SQL Server instance.

Figure 16 – Manually failover the SQL Server instance

Figure showing the SQL Server Failover instance running on the second node of the first AZ.

Figure 17 – SQL Server FCI running on the second node of the first AZ

4. Repeat the steps manually failing over to SRVSQL03 (Figure 18) and SRVSQL04 (Figure 19). Note how the workload recognizes the disks same as when running on the first two nodes (Figure 20).

Figure showing the SQL Server Failover instance running on the first node of the second AZ.

Figure 18 – SQL Server FCI running on the first node of the second AZ

Figure showing the SQL Server Failover instance running on the second node of the second AZ.

Figure 19 – SQL Server FCI running on the second node of the second AZ

Figure of the Failover Cluster Manager snap-in showing the second node of the second AZ as the current owner of the replicated disks.

Figure 20 – Second node of the second AZ as the current owner of the replicated disks

Best Practices and Considerations

To ensure optimal performance, reliability, and security of your SQL Server Stretch Failover Cluster Instance on Amazon EC2, consider the following best practices:

Application

To ensure seamless failover operations, it is recommended to include the multisubnetfailover=true parameter in your application’s connection string. This setting optimizes client connectivity during cluster node transitions, particularly in multi-subnet environments, resulting in faster detection of failovers and improved application responsiveness.
Cross-AZ Failover Performance: During failover, workloads must migrate between AZs, which may introduce a temporary performance impact or service disruption if your application is not hosted in the same AZ as the database. If you have a low latency application requirement, consider failing over the application to the same AZ as the active database node.

Cost Optimization

Right-size your Amazon EC2 instances based on workload requirements.
Use AWS Cost Explorer to track cross-AZ data transfer costs.
Consider Reserved Instances for long-term cost savings.
Optimize io2 volume sizes and IOPS based on actual usage patterns.
To reduce data replication, only use replicated volumes for SQL Server database data and log files.
Evaluate the best Microsoft licensing fit for your environment. This blog post covered a 4-node architecture using SQL Server Enterprise edition (or Developer), and the same instructions can be used to deploy a 3-node architecture. SQL Server Standard edition can be used if your requirements fit its resources limitations and if you need up to two nodes only.

Security

Use Amazon EBS volume encryption.
Implement least privilege access for cluster management.
Regularly audit cluster and SQL Server security settings.

Disaster Recovery

Maintain regular backups of cluster configuration.
Document cluster rebuild procedures.
Test disaster recovery procedures periodically.

Performance

Before committing to this architectural design, thoroughly test its performance under various conditions including load, stress, and endurance scenarios. Measure response times, throughput, IOPS, and resource usage while simulating real-world usage patterns and failure conditions. This validation will help prevent costly architectural changes later and ensure the system meets performance requirements.

Clean up

To avoid ongoing charges and remove all resources created in this walkthrough, run the following PowerShell code or manually complete these steps in order:

1. Remove SQL Server FCI (repeat this process on all remaining nodes):

# Run on the active node
$instanceName = "SQL_INSTANCE_NAME"
Setup.exe /Action=RemoveNode /InstanceName=$instanceName /Quiet

2. Remove the Storage Replica Partnership:

# Run on the source node
Get-SRGroup | Remove-SRPartnership
Get-SRGroup | Remove-SRGroup

3. Remove the Windows Server Failover Cluster:

# Run on any node
Remove-Cluster -Force -CleanupAD

4. Detach and Delete EBS Volumes
5. Terminate the Amazon EC2 Instances
6. Clean up Active Directory (service accounts, computer objects, and permissions granted)
7. Remove the file share witness

Note: These steps will permanently delete all resources and data. Before performing, ensure you have:

Backed up any important data
Documented any configuration settings you might need in the future
Verified that no production workloads are depending on these resource

Conclusion

In this blog post, we documented the steps to deploy a SQL Server Stretch FCI on Amazon EC2. Customers that have similar storage replications requirements within Windows Server and SQL Server environments may leverage the solution to ensure their workloads are high available across two AZs or even two Regions.

AWS has significantly more services, and more features within those services, than any other cloud provider, making it faster, easier, and more cost effective to move your existing applications to the cloud and build nearly anything you can imagine. Give your Microsoft applications the infrastructure they need to drive the business outcomes you want. Visit our .NET on AWS and AWS Database blogs for additional guidance and options for your Microsoft workloads. Contact us to start your migration and modernization journey today.

Select your cookie preferences

Microsoft Workloads on AWS