Microsoft Workloads on AWS

Reduce failover times for SQL Server on Amazon EC2 instances using Network Load Balancer

This blog post shows you how to use the AWS Network Load Balancer (NLB) to reduce outage time for legacy clients in response to a SQL Server Always On availability group failover.

A popular option for achieving SQL Server high availability (HA) is to implement SQL Server multi-subnet clustering for Always On availability groups (AO-AG) across Availability Zones (AZs). In this scenario, the cluster registers the AO-AG Listener in DNS with two or more IP addresses—one for each node in the cluster.

Modern drivers and clients support the MultiSubnetFailover option. With this option enabled, the application attempts to connect to all IP addresses in parallel. If a connection attempt succeeds, the driver discards any pending connection attempts.
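For example, a modern ADO.NET client turns this behavior on with a single connection-string keyword (the server name below is a placeholder for your AO-AG Listener):

```
Server=tcp:aoag-listener.example.com,1433;Database=MyDatabase;Integrated Security=True;MultiSubnetFailover=True
```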

However, legacy SQL Server client applications, including many commercial off-the-shelf (COTS) products, do not support this feature. Consider a two-node cluster that registers both Listener IP addresses in DNS. A client that does not support the MultiSubnetFailover option will query DNS and then iterate sequentially through the returned IP addresses. Roughly half of the requests will connect to the active node first, while the other half will first attempt the passive node; those attempts may fail or time out.

We can mitigate this situation to an extent by setting the cluster parameter RegisterAllProvidersIP=0, which forces the cluster to register only the active IP address for the AO-AG Listener in DNS. In this case, legacy SQL Server clients can recover from an AO-AG failover, but their outage time depends on the time to re-register the new IP address in DNS, the DNS replication delay if multiple DNS servers are used, and the DNS record Time-to-Live (TTL). Depending on your HostRecordTTL setting and your cross-site DNS replication schedule, the client's recovery time after a failover could be 15 minutes or longer. With this configuration, the outage time equally affects legacy and modern clients.
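For reference, this traditional mitigation is applied with the failover cluster PowerShell cmdlets. The resource name AG1_Listener below is a placeholder for your own listener's network name resource:

```powershell
Import-Module FailoverClusters

# Register only the active IP address in DNS and shorten the DNS record TTL
Get-ClusterResource "AG1_Listener" | Set-ClusterParameter RegisterAllProvidersIP 0
Get-ClusterResource "AG1_Listener" | Set-ClusterParameter HostRecordTTL 60

# Restart the listener resource so the new settings take effect
Stop-ClusterResource "AG1_Listener"
Start-ClusterResource "AG1_Listener"
```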

The solution we’re presenting in this blog post, based on using an AWS Network Load Balancer for SQL Server AO-AG running on Amazon EC2, reduces outage time for legacy clients to approximately one minute with our recommended health check configuration. Modern drivers and clients connecting to the AO-AG Listener are not affected.

Architecture overview

Figure 1. Solution Architecture

Figure 1 captures the architecture of the proposed solution. We have a cluster of two SQL Servers representing the primary and secondary AO-AG nodes. These two servers are configured as targets in our NLB target group. Client applications connect to the active AO-AG node through the NLB, which routes client requests to the target group, which, in turn, routes requests to the targets that pass the health check. With the SQL Server configuration changes, which we will discuss later, only the primary AO-AG node can pass the health check, so client connection requests are always routed to the primary node.

Another benefit of this approach is that the SQL Server listener, which we omitted from Figure 1, is still available, allowing clients that support multi-subnet failover to connect to SQL Server AO-AG through the listener, while legacy clients connect through the NLB.

For more details on how to deploy SQL Server on AWS and create a multi-AZ Always-On availability group, please refer to the following documentation:


Prerequisites

To complete the walkthrough presented in this blog post, you need a SQL Server Always On availability group deployed on Amazon EC2 instances running in multiple subnets, in the same or separate Availability Zones. You can automatically deploy and configure SQL Server on Amazon EC2 by using AWS Launch Wizard for SQL Server.


We will walk you through the following steps:

  1. Enable a health check on AO-AG nodes
  2. Configure the target group using the Amazon EC2 console
  3. Create the Network Load Balancer
  4. Test the connection to the NLB

Step 1: Enable a health check on AO-AG nodes

First, choose an available port for the health check probe. In this example, we will use port 59999 as the probe port for the health check. This is just our choice; you may select any other available port.

Figure 2 shows the PowerShell script you will run to configure the health check on the AO-AG nodes. The $ListenerProbePort parameter sets the port for the health check. The script changes the cluster parameters of the listener's network name so that the health check can detect which node is currently active. It also checks and, if necessary, updates the Windows firewall rules to ensure that the selected probe port is open on the firewall.

Import-Module FailoverClusters
[int]$ListenerProbePort = 59999

# Identify the cluster network for this node
$NodeName = $env:COMPUTERNAME
$NetworkName = (Get-ClusterNetworkInterface | Where-Object { $_.Node -eq $NodeName }).Network.Name

# Locate the listener's IP Address resource on that network
$ClstrObj = Get-ClusterResource |
	Where-Object { $_.ResourceType -eq "IP Address" } |
	Where-Object { $_.OwnerGroup -ne "Cluster Group" } |
	Where-Object { (Get-ClusterParameter -InputObject $_ -Name Network).Value -eq $NetworkName }
$ListenerIP = (Get-ClusterParameter -InputObject $ClstrObj -Name Address).Value

# Open the probe port in the Windows firewall if no rule for it exists yet
$PortFilter = Get-NetFirewallPortFilter | Where-Object { $_.LocalPort -eq $ListenerProbePort }
if ($Null -eq $PortFilter) {
	New-NetFirewallRule -DisplayName "SQL AO-AG Probe Port" `
		-Direction "Inbound" `
		-Action "Allow" `
		-Protocol "TCP" `
		-LocalPort $ListenerProbePort
}

# Set the probe port on the listener's IP Address resource
Set-ClusterParameter -InputObject $ClstrObj -Multiple `
	@{"Address" = $ListenerIP; "ProbePort" = $ListenerProbePort; "SubnetMask" = ""; "Network" = $NetworkName; "EnableDhcp" = 0 }

# Restart the resource (and its owner group) so the new probe port takes effect
if ($ClstrObj.State -eq "Online") {
	Stop-ClusterResource -InputObject $ClstrObj;
	Start-ClusterResource -InputObject $ClstrObj;
	Start-ClusterResource -InputObject (Get-ClusterResource | Where-Object { $_.Name -eq $ClstrObj.OwnerGroup })
}

Figure 2. PowerShell to configure Health Check

This script needs to be executed on every node in the AO-AG cluster. If you execute it on the primary node, the availability group cluster role will briefly go offline. The script detects this situation and brings the role back online, but the cluster may become temporarily unavailable. We therefore recommend running the script on all secondary nodes first, performing a failover, and then running it on the former primary node.
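After the script has run on a node, a quick check confirms that the probe port is set on the listener's IP address resources:

```powershell
Get-ClusterResource |
	Where-Object { $_.ResourceType -eq "IP Address" } |
	Get-ClusterParameter -Name ProbePort
```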

Step 2: Configure the target group using the Amazon EC2 console

In our example, the target group contains two SQL Server instances. The health checks verify which SQL Server instance currently owns the availability group listener using the ListenerProbePort defined and configured in Step 1 above.

AWS documentation details how to create and configure a target group for a Network Load Balancer. Creating and configuring a target group starts with choosing a target type, as shown in Figure 3. The console offers several target types, but in our case, only Instances and IP addresses apply. In our example, we selected Instances.

Figure 3. Selecting the target type for the Target Group

In Figure 4, you set the Target Group name – we selected TgSqlAOAG, but feel free to choose a name that fits your pattern. For the Protocol, select TCP; for the Port, enter the port your SQL Server is configured for. We used default port 1433. The VPC selection will be preloaded with your default VPC; you may select another VPC depending on your network configuration. We left it at the default value.

Figure 4. Basic configuration – Name, Port, and VPC

Figure 5 shows the configuration of the health checks for our target group. The Health check protocol selection is preset to TCP, as we selected the TCP protocol in the previous step. For the health check Port, we set port 59999, which we configured in Step 1 of the walkthrough. If you selected a different probe port, use your value in this field.

The Healthy threshold and Interval define how quickly the health check reacts to a change in the AO-AG primary node. We set the Healthy threshold to 2 and the Interval to 10 seconds. The Unhealthy threshold and Timeout are preset for a TCP health check to 2 and 10 seconds, respectively. With these settings, the NLB initiates a health check roughly every 10 seconds and requires two consecutive successful or failed checks to change the status of a target, waiting up to 10 seconds for a node to reply. Thus, it may take about 20 seconds for a node to transition from Unhealthy to Healthy and about 40 seconds to transition from Healthy to Unhealthy.
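A quick back-of-the-envelope check of these numbers, using the interval, timeout, and threshold values chosen above:

```shell
# Health check settings from Figure 5
interval=10            # seconds between checks
timeout=10             # seconds to wait for a TCP reply
healthy_threshold=2    # consecutive successes to mark a target Healthy
unhealthy_threshold=2  # consecutive failures to mark a target Unhealthy

# Successful checks return quickly, so Unhealthy -> Healthy is roughly
# threshold * interval
echo "Unhealthy -> Healthy: ~$(( healthy_threshold * interval )) seconds"

# A failed check is only declared after the timeout expires, so
# Healthy -> Unhealthy is roughly threshold * (interval + timeout)
echo "Healthy -> Unhealthy: ~$(( unhealthy_threshold * (interval + timeout) )) seconds"
```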

Ensure that the ports you specify for your SQL Server connection in Figure 4 and the health check probe port in Figure 5 are open on the security groups of the instances and accessible for the NLB.

Figure 5. Configuring Health check

Now you will add targets to the target group. Because we selected Instances in Figure 3, we are presented with the list of instances in our VPC. Select the SQL Server instances that comprise the AO-AG, as shown in Figure 6, and select the Include as pending below button.

Figure 6. Registering targets with the Target Group.

Finally, confirm your selections and select the Create target group button, as shown in Figure 7.

Figure 7. Confirmation screen
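If you prefer to script Step 2, the same target group can be sketched with the AWS CLI. The VPC ID, instance IDs, and ARN below are placeholders for your own resources:

```shell
# Create the target group with the TCP health check on the probe port
aws elbv2 create-target-group \
    --name TgSqlAOAG \
    --protocol TCP \
    --port 1433 \
    --vpc-id vpc-0123456789abcdef0 \
    --target-type instance \
    --health-check-protocol TCP \
    --health-check-port 59999 \
    --health-check-interval-seconds 10 \
    --healthy-threshold-count 2 \
    --unhealthy-threshold-count 2

# Register the two SQL Server instances as targets
aws elbv2 register-targets \
    --target-group-arn <target-group-arn> \
    --targets Id=i-0primarynode00000 Id=i-0secondarynode000
```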

Step 3: Create the Network Load Balancer

Now you are ready to create the NLB. Please check the AWS documentation for a detailed outline of all the required steps. You will create the NLB by setting up the Basic configuration, as shown in Figure 8. We named our NLB NLB-SQLServer; please select a name that matches your naming conventions.

Figure 8. NLB Basic configuration

Now you will configure the NLB network mapping. Our sample configuration is displayed in Figure 9. Because we selected Internal as the NLB scheme in Figure 8, we need to choose private subnets for our NLB, preferably at least two.

Figure 9. Configuring NLB network mapping.

Now you can configure the listeners and routing for your NLB. In Figure 10, we selected TCP for the Protocol and the default SQL Server port 1433 for the Port. If your SQL Server is configured with a different port, use your port for this field. The Default action for the NLB is to route requests to the target group TgSqlAOAG, which we created in the previous step.

Figure 10. Configuring listeners and routing for NLB

Optionally, you may configure Tags for your NLB to facilitate accounting and management. After reviewing your NLB configuration, select the Create load balancer button.

There is one more step to complete the configuration of your NLB so that it successfully routes requests from the clients to your active SQL Server AO-AG node. On the Amazon EC2 console, navigate to Load Balancers, select your load balancer, and then open the Actions menu, as shown in Figure 11. From the menu, choose Edit attributes.

Figure 11. Editing NLB attributes.

In the Edit load balancer attributes dialog, select Enable for the Cross-zone load balancing property, as shown in Figure 12, and then select the Save button.

Figure 12. Selecting Cross-zone load balancing.
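Step 3 can likewise be scripted. The subnet IDs and ARNs below are placeholders; note that load balancer names may contain only alphanumeric characters and hyphens:

```shell
# Create an internal Network Load Balancer across two private subnets
aws elbv2 create-load-balancer \
    --name NLB-SQLServer \
    --type network \
    --scheme internal \
    --subnets subnet-0aaa1111111111111 subnet-0bbb2222222222222

# Forward TCP 1433 to the target group created in Step 2
aws elbv2 create-listener \
    --load-balancer-arn <nlb-arn> \
    --protocol TCP \
    --port 1433 \
    --default-actions Type=forward,TargetGroupArn=<target-group-arn>

# Enable cross-zone load balancing
aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn <nlb-arn> \
    --attributes Key=load_balancing.cross_zone.enabled,Value=true
```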

The Network Load Balancer we just created performs the health check specified on the target group and routes client requests on port 1433 to the healthy targets in the target group.

Step 4: Test the connection to the NLB

Test the connection to SQL Server over NLB by completing the following steps:

  1. Obtain the DNS name of the load balancer you created: on the Amazon EC2 console, select Load Balancers, select your load balancer, and copy the DNS name.
  2. Use Remote Desktop Protocol (RDP) to connect to a SQL Server instance that is in the same virtual network but does not currently host the primary replica. This can be the other SQL Server instance in the cluster.
  3. Use the sqlcmd utility to test the connection. For example, the following command establishes a sqlcmd connection to the primary replica through the load balancer using Windows authentication:
sqlcmd -S <Load Balancer DNS Name> -E

The sqlcmd connection automatically connects to the SQL Server instance that hosts the primary replica.
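To see which node actually answered, you can ask the server for its name; after a failover, the same command should return the name of the new primary:

```shell
# @@SERVERNAME reports the instance that served the connection
sqlcmd -S <Load Balancer DNS Name> -E -Q "SELECT @@SERVERNAME"
```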


Cleanup

To remove the NLB, go to the Load Balancers section of the Amazon EC2 console, select the NLB you created, select Actions, select Delete, and then confirm that you want to delete the NLB by selecting Yes, Delete.

Then go to Target Groups, select the target group you created, select Actions, then select Delete, and then confirm by selecting Yes, Delete.
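The same cleanup can be performed with the AWS CLI (the ARNs are placeholders; delete the load balancer before its target group):

```shell
aws elbv2 delete-load-balancer --load-balancer-arn <nlb-arn>
aws elbv2 delete-target-group --target-group-arn <target-group-arn>
```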


Conclusion

This blog post showed how to set up and use the AWS Network Load Balancer to reduce outage time for legacy clients in response to a SQL Server Always On availability group failover.

The traditional approach to solving this issue for legacy clients is based on setting the cluster parameter RegisterAllProvidersIP=0 so that the cluster registers only the IP address of the primary node in DNS. This approach results in a failover time of approximately 15 minutes for legacy clients. Using an NLB to address this issue, as presented in this blog post, reduces the failover time for legacy clients to about 70-80 seconds, and it does not prevent modern clients from using the SQL Server AO-AG listener to handle failover.

AWS can help you assess how your company can get the most out of cloud. Join the millions of AWS customers that trust us to migrate and modernize their most important applications in the cloud. To learn more about modernizing Windows Server or SQL Server, visit Windows on AWS. Contact us to start your modernization journey today.

Rafet Ducic

Rafet Ducic is a Senior Solutions Architect at Amazon Web Services (AWS). He applies his more than 20 years of technical experience to help Global Industrial and Automotive customers transition their workloads to the cloud cost-efficiently and with optimal performance. With domain expertise in Database Technologies and Microsoft licensing, Rafet is adept at guiding companies of all sizes toward reduced operational costs and top performance standards.

Paolo Barone

Paolo Barone is a Principal Solutions Architect with Amazon Web Services. He works with large enterprise clients helping them define and implement their cloud and digital transformation strategies. Paolo has more than 30 years' experience working mostly with global enterprise customers in Europe and the US.

Alex Zarenin

Alex Zarenin is a Principal Solutions Architect with Amazon Web Services. He works with financial services companies designing and implementing a broad array of technical solutions. With domain expertise in Microsoft technologies, Alex has more than 30 years of technical experience in both the commercial and public sectors. Alex holds a Ph.D. in Computer Science from NYU.