Multi-region SQL Server deployment using distributed availability groups

A multi-region architecture for SQL Server is often a topic of interest that comes up when working with our customers. The fundamental reasons why customers adopt a multi-region architecture approach for SQL Server deployments is:

Business continuity and disaster recovery
Geographically distributed customer base and improving latency for end-users

This post explains the architecture patterns that you can follow to effectively design a highly available SQL Server deployment, which spans two or more AWS Regions. You also learn how to use the multi-region approach to scale out the read workloads and improve the latency for your globally distributed end-users.

Architectures: traditional and optimal

This post discusses two architectures: the traditional approach of a multi-region Always On availability group, and the optimal approach of a multi-region Distributed availability group.

Multi-region Always On availability group

With the traditional approach, you can establish an inter-region VPC peering connection, span your single Windows Server Failover Cluster (WSFC) across two Regions, and build an Always On availability group deployment with nodes in those two Regions. The following diagram illustrates this architecture.

In this architecture, Region A hosts the primary replica. The synchronous secondary replica is also hosted in Region A. The primary replica is in AZ1 of Region A, and the synchronous secondary replica is in AZ2. Data transfer is synchronous between the replicas, with automatic failover as the failover mode. In the event of a connectivity failure or other failure of AZ1, automatic failover kicks in and the AZ2 node becomes the primary replica. This is no different than the behavior that occurs if any single database on the server that participates in the Always On availability group fails.

As a best practice, the applications should connect via the Always On availability group listener with the latest supported drivers and key parameters (such as MultiSubNetFailover=True) to facilitate the failover and make sure that the applications seamlessly connect to the new replica without any errors or timeouts. There are also quorum-related settings that you must consider, which this post explains in a later section.

Region B in this architecture is considered as the secondary Region. A second and third secondary replica is configured in this region and is asynchronously synched from the primary, which is hosted in Region A. The failover mode is configured as manual because the data transfer mode is set as asynchronous. Asynchronous transfer is the recommended practice between Region-to-Region communications. If there is a catastrophic regional failure in Region A, a manual failover can be triggered and applications can connect to the newly promoted primary replica in Region B via the availability group listener.

One of the key drawbacks of this approach is related to the data synchronization aspect for the secondary replicas for Region B. The primary replica in Region A is responsible for feeding all the secondary replicas and for every secondary replica in Region B, the primary replica must send data separately over the network. Multiple data transfers can add up the overall data transfer charges so it’s not a cost-optimized strategy.

One mitigation for this is to only have a single node in secondary Region B, and add a new node in the event of a regional failover. The other drawback concerns security. When you span your WSFCs across Regions, you must open up a large set of ports for your security groups. This includes the TCP/UDP port for the cluster service, TCP port for RPC, UDP for Cluster Administrator, ICMP, TCP for SMB, and randomly allocated high UDP ports. Opening a large range of ports is often a concern that internal security teams flag. For more information about the ports that are involved, see Service overview and network port requirements for Windows on the Windows support website.

Multi-region distributed availability group

An architecture with distributed availability groups is an optimal approach for multi-region SQL Server deployment. A distributed availability group is a special type of availability group that spans two separate availability groups. You can think of it as an availability group of availability groups. The underlying availability groups are configured on two different WSFC clusters.

You can also use the distributed availability group approach to design a hybrid SQL Server deployment architecture with SQL Server nodes deployed in on-premises data centers and on AWS. For more information, see How to architect a hybrid Microsoft SQL Server solution using distributed availability groups.

The following diagram illustrates the multi-region distributed availability group architecture.

In this architecture, Region A hosts the primary replica. There is a standalone WSFC cluster named WSFC1 for Region A and doesn’t span Regions like the previous architecture. The synchronous secondary replica is also hosted in Region A, preferably in a second Availability Zone. The primary replica is in AZ1 of Region A, and the synchronous secondary replica is in AZ2. Both the replicas are part of a standalone availability group named AG1. Data transfer is synchronous between the replicas with automatic failover as the failover mode. If you have a failure, automatic failover kicks in and the AZ2 node becomes the primary replica.

As a best practice, the applications should connect via the Always On availability group listener with the latest supported drivers and key parameters (such as MultiSubNetFailover=True) to facilitate the failover and make sure that the applications seamlessly connect to the new replica without any errors or timeouts. There are also quorum-related settings to consider, which this post explains in a later section.

Region B in this architecture is considered as the secondary Region. A second secondary replica is configured in this Region, and is asynchronously synched from the primary, which is hosted in Region A. This replica is called the forwarder. The concept of a forwarder is new, and it’s one of the core functionalities for the distributed availability group. The forwarder is responsible for synchronizing the other replicas in Region B.

The forwarder is one of the key advantages of this optimal architecture. The primary replica in Region A sends data asynchronously to the forwarder only, and the forwarder sends data synchronously or asynchronously to the other replicas in Region B. This reduces the overall data transfer that happens from Region A to Region B in case you have multiple nodes in Region B. Because WSFC1 and WSFC2 are standalone and independent clusters, you also don’t need to open a large set of ports. This reduces the security risk.

Scaling out reads

You can further extend the distributed availability group architecture to provide read replicas in Region B. This approach is useful if you have users of the applications in two different Regions, and they can consume the reads from a Region that’s closer to them. This improves the latency for read-only SQL queries. The following diagram illustrates this architecture.

This architecture has three additional replicas configured for Region B. The forwarder is responsible for the data synchronization of these three replicas. You can use additional replicas for scaling out the reads. Each availability group supports one primary replica and up to eight secondary replicas. Essentially, you can have up to 18 read replicas with this configuration.

If you are interested in a simpler highly available single region deployment of SQL Server, then you can leverage AWS Launch Wizard for SQL Server. AWS Launch Wizard for SQL Server is a simple, intuitive and free to use wizard-based experience that enables quick and easy deployment of highly available SQL Server solutions on AWS. The wizard walks you through an end-to-end deployment experience of always on availability groups using prescriptive guidance. With AWS Launch Wizard, you can specify your application’s components ranging from SQL Server node settings and AMIs to Virtual Private Cloud (VPC), Active Directory (AD), and EC2 instances to effortlessly create a production-ready deployment.

Things to consider

When choosing your deployment strategy, you should consider bandwidth, quorum settings, and automatic seeding.

Bandwidth

Bandwidth is a key consideration for a multi-region deployment. For example, assume you have set a multi-region SQL Server deployment using distributed availability groups across US East (N. Virginia) and US West (Oregon). The network latency for these two Regions is in the range of 75–80 milliseconds. If your workload is a high OLTP system, you should conduct stress and performance benchmarking tests to make sure that your secondary replicas in the other Region are being synchronized without a huge lag. A significant delay to synchronize the secondary replicas has an adverse impact on your Recovery Point Objective (RPO) SLAs.

Quorum settings

With distributed availability groups, because you have two different WSFC, you treat the quorum settings separately. You must set the quorum votes based on your node count. A file share witness is a recommended witness type, and you can use the fully managed service Amazon FSx for Windows File Server for the file share witness requirements.

You can also use Amazon FSx to design Always On failover cluster instances (FCI). For more information, see Simplify your Microsoft SQL Server high availability deployments using Amazon FSx for Windows File Server.

Automatic seeding

SQL Server 2016 introduced automatic seeding of availability groups. When you create an availability group with automatic seeding, SQL Server automatically creates the secondary replicas for every database in the group. You no longer have to backup and restore secondary replicas manually. This feature can come handy if you deal with a relatively smaller set of databases. If you work with larger databases, automatic seeding isn’t recommended. For more information about automatic seeding, see Use automatic seeding to initialize an Always On availability group on the Microsoft documentation website.

Conclusion

A multi-region strategy for your mission-critical SQL Server deployments is key. This post focused on how to achieve that optimally by using distributed availability groups. You also learned about other benefits such as read scale outs by using distributed availability groups.

Stay tuned for further posts about optimizing SQL Server deployment.

About the Author

Anup Sivadas is a Solutions Architect at Amazon Web Services. He works with our customers to provide architectural guidance and technical assistance on AWS services, helping them improving the value of their solutions when using AWS.