Microsoft Workloads on AWS
Running SQL Server Stretch Failover Cluster Instance on Amazon EC2
In this blog post, we explore how to implement Microsoft SQL Server Failover Cluster Instance (FCI) on Stretch Cluster Replication in Amazon Elastic Compute Cloud (EC2). We are going to use Amazon Elastic Block Store (EBS) Multi-Attach on io2 volumes with Microsoft Windows Server Storage Replica technology as a shared storage solution across Availability Zones (AZs).
Introduction
Microsoft Stretch Cluster technology extends failover clusters across two locations, offering disaster recovery and business continuity. It uses active-active configurations to meet compliance standards and optimize resources.
Organizations can leverage Amazon EBS Multi-Attach on io2 and Storage Replica for cost-effective shared storage in Multi-AZ environments. This stretch cluster solution offers technical benefits supporting high availability.
Solution overview
The deployment we propose matches the architecture in Figure 1. You can deploy this solution using either Multi-AZ or Multi-Region strategies based on the network requirements. Evaluate what best fits your environment and the Storage Replica synchronization mode (synchronous or asynchronous).
Figure 1 – Architecture diagram covering the proposed solution
To facilitate the nodes identification throughout the post, we have named them as:
AZ-A | AZ-B |
SRVSQL01 | SRVSQL03 |
SRVSQL02 | SRVSQL04 |
Prerequisites
- Deploy an Amazon Virtual Private Cloud (VPC) with a minimum of two subnets per each AZ.
- Deploy a Microsoft Active Directory (AD) to enable domain-joining for Amazon EC2 instances and support Windows Server Failover Cluster installation. This can be either self-managed or an instance of AWS Managed Microsoft Active Directory. We used the corp.local domain in our setup.
- An AD user with the required permissions to deploy a cluster and an AD service account for the SQL service. This SQL service account will be required when running the SQL Installation setup.
- Two Amazon EC2 instances built on the AWS Nitro System, in different subnets, running Windows Server 2016 or later in each AZ . The instances should be launched in spread placement groups, to ensure they are placed across distinct underlying hardware.
- SQL Server Enterprise edition installation media (or Developer edition for non-production environment).
- Windows Server Failover Cluster (WSFC) to be deployed among the four nodes, using the multi-subnet architecture. Check our best practices and recommendations for SQL Server clustering on Amazon EC2 documentation to know more.
- Each Amazon EC2 instance on the cluster should have one Elastic Network Interface (ENI) assigned that includes three private IP addresses on the subnet: a primary IP address, a cluster IP address, and a SQL instance IP address. Check our SQL cluster IP addresses assignment documentation for more details.
- At least two io2 volumes with the Multi-Attach feature enabled and attached to each pair of nodes. One must be used as the Storage Replica log store. While the other used to store the SQL Server data and log files. We used three volumes attached to each pair of nodes, as described:
- Storage Replica logs (50 GiB).
- SQL data disk (100 GiB).
- SQL logs disk (150 GiB).
- To increase storage replication performance and reduce cost, for TempDB , we used the instance store available on instance r5ad.xlarge.
- An SMB file share to set up the file share witness (most useful for multisite environments). We used Amazon FSx for Windows File Server.
Walkthrough
This section will cover the steps to deploy the proposed solution.
1. Prepare the nodes to work with EBS Multi-Attach on io2
Check the How to deploy a SQL Server failover cluster with Amazon EBS Multi-Attach on Windows Server blog post and follow steps 2 to 4 to prepare the Windows Server nodes to support EBS Multi-Attach on io2, which comprises:
- Create the EBS io2 volumes on each AZ, enabling the multi-attach feature, and attach to the pairs of Amazon EC2 instances. The AZ must be the same as where the Amazon EC2 instances are running.
- Open a Remote Desktop Protocol (RDP) connection into SRVSQL01, initialize the disks as GPT and format the volumes. Take note of the letters applied to the volumes as they will be required moving forward. Do the same on SRVSQL03 (use the same drive letters).
- Ensure the latest AWS NVMe drivers are installed on each node and configure SCSI persistent reservations for Windows instances.
2. Set up the Windows Server Failover Cluster
Follow the steps below to set up the cluster:
1. On each Amazon EC2 instance, install the required Windows features running a PowerShell command prompt as Administrator:
2. After rebooting the Amazon EC2 instances, RDP into SRVSQL01, open the Windows PowerShell ISE as Administrator, and run the following script. The output should be like Figure 2.
Figure 2 – Cluster creation using PowerShell script
Note: You may see a warning within the output. This is likely the quorum witness setting, that is unable to consider part of the disks. We are going to fix it in the next steps.
3. Once the cluster is created, run the following PowerShell script as Administrator. This is going to create the Fault Domains (Figure 3).
Figure 3 – Fault Domains setup
4. To ensure the cluster is considering all the available disks, run the PowerShell command:
5. Run the following PowerShell script to set up the file share witness, like in Figure 4.
Figure 4 – Setting the file share witness
6. When opening the Failover Cluster Manager snap-in, you will notice part of the disks are offline and with an information message saying, Clustered storage is not connected to the node (Figure 5). This will change once the Storage Replica is enabled. Rename the online disks to friendly names.
Figure 5 – Cluster disks as available storage
7. Using the Failover Cluster Manager snap-in, validate the cluster running all tests. Expect warnings regarding the multi-subnet architecture and the offline disks (Figure 6).
Figure 6 – Cluster validation result
3. Install SQL Server Failover Cluster Instance
For the next steps, we are going to install the SQL Server instance on the first node of the environment. With that, we can set up the Storage Replica and then install SQL Server on the remaining nodes.
1. Using the Active Directory Users and Computers snap-in, add Create Computer Objects permission to the cluster computer object to the organizational unit it is at. This permission is needed to allow the cluster to create the SQL Server FCI computer object (Figure 7).
Figure 7 – AD permission to the cluster computer object
2. On SRVSQL01, run the SQL Server setup and select a New SQL Server failover cluster installation, following the installation instructions. On Cluster Disk Selection, expect warnings about the disks that cannot be brought online (Figure 8). Click cancel on each dialog to continue the setup. Select the data and logs disks available and click on Next (Figure 9).
Figure 8 – Warning dialog about the offline disks
Figure 9 – Select the disks to be used by the SQL instance
3. Move through the end of the SQL Installation setup. While the installation is running, you will see the warnings about the offline disks again. Click on Cancel on each dialog and continue the setup (Figure 10). After a few minutes, you should see the Complete page (Figure 11). Do not install SQL on the other nodes yet.
Figure 10 – Offline disks validation warning
Figure 11 – SQL installation complete
4. Set up the Storage Replica feature
Once we have the SQL installation complete on the first node, the data and log disks will be assigned to the SQL Server role. We can now enable the Storage Replica feature to the second AZ volumes.
1. On SRVSQL01, run the following PowerShell command-line as Administrator. Change the drives letters as applied to your environment (Figure 12).
Figure 12 – PowerShell output when creating the Storage Replica partnership
2. Now you should see all the disks online (rename them to friendly names) and the initial block copy (Figure 13). SRVSQL03 owned disks are assigned to SR Group 1, the Storage Replica group.
Figure 13 – Storage Replica setup
3. Use the Disks section from the Failover Cluster Management snap-in to check the block copy status, as shown in the previous Figure 13. Once the initial block copy is complete, run the add node to a SQL Server failover cluster installation wizard on the remaining nodes.
5. Failover the SQL service between nodes and Availability Zones
After installing SQL Server to the remaining nodes and having the replica complete, we can failover the SQL instance between nodes and AZs.
1. Connect to the SQL instance through the Client Access Name using SQL Server Management Studio. In the Additional Connection Parameters, add multisubnetfailover=true parameter (Figure 14).
Figure 14 – Connecting to the SQL Server FCI using the multisubnetfailover parameter
2. Use the following T-SQL statements to see the current owner while failing over the SQL service. Figure 15 shows SRVSQL01 as the current node, before failing over.
Figure 15 – Current SQL Server FCI node owner
3. On the Failover Cluster Manager snap-in, right-click the SQL Server role, Move and Select Node…, select SRVSQL02, and click on OK (Figure 16). Seconds later, the instance should be running on the second node (Figure 17).
Figure 16 – Manually failover the SQL Server instance
Figure 17 – SQL Server FCI running on the second node of the first AZ
4. Repeat the steps manually failing over to SRVSQL03 (Figure 18) and SRVSQL04 (Figure 19). Note how the workload recognizes the disks same as when running on the first two nodes (Figure 20).
Figure 18 – SQL Server FCI running on the first node of the second AZ
Figure 19 – SQL Server FCI running on the second node of the second AZ
Figure 20 – Second node of the second AZ as the current owner of the replicated disks
Best Practices and Considerations
To ensure optimal performance, reliability, and security of your SQL Server Stretch Failover Cluster Instance on Amazon EC2, consider the following best practices:
Application
- To ensure seamless failover operations, it is recommended to include the multisubnetfailover=true parameter in your application’s connection string. This setting optimizes client connectivity during cluster node transitions, particularly in multi-subnet environments, resulting in faster detection of failovers and improved application responsiveness.
- Cross-AZ Failover Performance: During failover, workloads must migrate between AZs, which may introduce a temporary performance impact or service disruption if your application is not hosted in the same AZ as the database. If you have a low latency application requirement, consider failing over the application to the same AZ as the active database node.
Cost Optimization
- Right-size your Amazon EC2 instances based on workload requirements.
- Use AWS Cost Explorer to track cross-AZ data transfer costs.
- Consider Reserved Instances for long-term cost savings.
- Optimize io2 volume sizes and IOPS based on actual usage patterns.
- To reduce data replication, only use replicated volumes for SQL Server database data and log files.
- Evaluate the best Microsoft licensing fit for your environment. This blog post covered a 4-node architecture using SQL Server Enterprise edition (or Developer), and the same instructions can be used to deploy a 3-node architecture. SQL Server Standard edition can be used if your requirements fit its resources limitations and if you need up to two nodes only.
Security
- Use Amazon EBS volume encryption.
- Implement least privilege access for cluster management.
- Regularly audit cluster and SQL Server security settings.
Disaster Recovery
- Maintain regular backups of cluster configuration.
- Document cluster rebuild procedures.
- Test disaster recovery procedures periodically.
Performance
- Before committing to this architectural design, thoroughly test its performance under various conditions including load, stress, and endurance scenarios. Measure response times, throughput, IOPS, and resource usage while simulating real-world usage patterns and failure conditions. This validation will help prevent costly architectural changes later and ensure the system meets performance requirements.
Clean up
To avoid ongoing charges and remove all resources created in this walkthrough, run the following PowerShell code or manually complete these steps in order:
1. Remove SQL Server FCI (repeat this process on all remaining nodes):
2. Remove the Storage Replica Partnership:
3. Remove the Windows Server Failover Cluster:
4. Detach and Delete EBS Volumes
5. Terminate the Amazon EC2 Instances
6. Clean up Active Directory (service accounts, computer objects, and permissions granted)
7. Remove the file share witness
Note: These steps will permanently delete all resources and data. Before performing, ensure you have:
- Backed up any important data
- Documented any configuration settings you might need in the future
- Verified that no production workloads are depending on these resource
Conclusion
In this blog post, we documented the steps to deploy a SQL Server Stretch FCI on Amazon EC2. Customers that have similar storage replications requirements within Windows Server and SQL Server environments may leverage the solution to ensure their workloads are high available across two AZs or even two Regions.
AWS has significantly more services, and more features within those services, than any other cloud provider, making it faster, easier, and more cost effective to move your existing applications to the cloud and build nearly anything you can imagine. Give your Microsoft applications the infrastructure they need to drive the business outcomes you want. Visit our .NET on AWS and AWS Database blogs for additional guidance and options for your Microsoft workloads. Contact us to start your migration and modernization journey today.