Microsoft Workloads on AWS

Implement a SQL Server HA/DR solution on AWS Outposts

In highly regulated industries, such as financial services, in-country data residency requirements can often challenge cloud adoption, especially in countries without an AWS Region. In such cases, AWS Outposts can be a good solution for achieving durability and availability whilst complying with in-country data requirements. Outposts, which provides a pool of compute and storage capacity installed in the customer's data center, runs as an extension of the Availability Zone it is anchored to, connected via a service link.

Some workloads have durability requirements to sustain large failure scenarios, such as the failure of an Outposts rack or of a complete data center. For these workloads, customers should deploy multiple Outposts in separate data centers, such as production and DR, and architect their workloads to provide disaster recovery capability. Each Outposts is anchored to a separate Availability Zone and connected to the Region through its own service link.

Whilst Outposts supports Amazon Relational Database Service (Amazon RDS) for SQL Server, only single-instance deployments with no automatic failover were available at the time this blog post was written. Alternatively, self-managed Microsoft SQL Server on Amazon Elastic Compute Cloud (Amazon EC2) instances gives full control over the setup of the infrastructure and the database environment, and can provide high availability (HA) and disaster recovery (DR) through several deployment options from Microsoft SQL Server.

In this blog post, I will provide guidance for architecting an HA and DR solution for a Microsoft SQL Server deployment on AWS Outposts. To enable HA and DR, we will leverage SQL Server Always On Availability Groups (AGs) on self-managed Amazon EC2 instances across two Outposts hosted in separate customer data centers.

Challenges building a Windows Server Failover Cluster across multiple AWS Outposts

Whilst Outposts are connected to the AWS Region through their service links, resources on two different Outposts cannot communicate with each other by transiting traffic through the Region. The traffic between the two Outposts must flow through the customer network.

As shown in Figure 1 below, AWS Outposts is interconnected to the customer network via a local gateway (LGW), which performs routing and network address translation (NAT) with IPs from a customer-owned IP pool (CoIP). The Outposts VPC CIDRs are not advertised to the customer network; only the CoIP pool CIDR is known within the customer network. The LGW works in a similar way to an internet gateway (IGW), where IPs are translated to Elastic IPs when traversing the IGW.

Figure 1. High-level AWS Outposts interconnection with the on-premises network using CoIP

To enable connectivity between two Outposts through the customer's internal network, the communication needs to go through the LGWs, with the IPs being translated to CoIPs. Therefore, when an instance A on Outposts #1 wants to talk to an instance B on Outposts #2, instance A needs to reach out to the CoIP of instance B. Instance A's own address is also NATed when going through the LGW.

Refer to the Connecting AWS Outposts to on-premises data sources blog post, which provides additional information on the integration between Outposts and an on-premises network.

In our scenario, we consider a multi-site architecture, with HA on the primary Outposts in the production site and DR on a secondary Outposts hosted in a different site.

SQL Server Always On AGs require all nodes to be part of a Windows Server Failover Cluster (WSFC). Whilst a WSFC on a single Outposts allows HA within that Outposts, spanning a WSFC across two Outposts is not possible, because the network communication needs to go through the LGWs and is therefore NATed. The nodes joining the WSFC are not "aware" of the other nodes' CoIPs (the IP translation occurs on the LGW), making communication between WSFC nodes on two Outposts impossible.

SQL Server provides a special type of AG, called a distributed availability group. A distributed AG spans an availability group across multiple AGs, each under its own WSFC, removing the need to have all nodes under the same WSFC.

Whilst different approaches can be considered for SQL Server DR, including log shipping or SQL Server replication, a distributed AG is preferred here, as it supports synchronous replication within the production Outposts for HA and asynchronous replication across Outposts for DR, with low RPO and RTO.

With a distributed AG, whilst failover and failback require manual intervention, there is no need for additional configuration or a manual database re-synchronization once the distributed AG is re-established, assuming all instances across the sites come back up with the same configuration and connectivity is restored. This greatly simplifies operations during a DR scenario.

Overview of the solution using Always On distributed AGs

We want to achieve HA within the production Outposts and DR to a second Outposts in the DR site.

Figure 2. SQL Server Always On distributed AG architecture on AWS Outposts

Figure 2 describes the solution architecture to achieve HA and DR across two AWS Outposts hosted in different data centers, one for production and the second for DR.

Each Outposts is interconnected to the customer network via a local gateway, which performs routing and network address translation (NAT) with IPs from a customer-owned IP pool (CoIP). This enables communication between the two Outposts via the customer's network.

Each Outposts hosts a SQL Server Always On AG configured on its own WSFC. Each AG provides HA in its own data center, with two replicas: a primary that handles both reads and writes, and a secondary that can be configured either as a standby or to serve read-only traffic.

If no HA is required in the DR site, we could configure the DR AG with only one replica.

A distributed AG spans the two AGs, providing the DR capability. As the communication between the production and DR AGs goes through the LGWs, the connections need to be established through the AG listener CoIPs.

Please note that in a distributed AG, whilst each AG has a primary replica, only one copy of the database accepts write queries. In this architecture, only the primary replica in AG1 accepts writes.

Walkthrough

Prior to configuring an Always On distributed AG, we need to prepare the environments. Let's first prepare the production environment on Outposts #1:

  • Create the VPC and subnets, and configure security groups (see the sketch after this list)
  • Install and configure a WSFC and SQL Server Always On AGs. You can leverage the AWS Knowledge Center information to configure the clusters.
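As a minimal sketch of the first item above, the following AWS CLI commands create an Outposts subnet in an existing VPC and a security group for the SQL Server nodes. The VPC ID, Outpost ARN, Availability Zone, and security group name are hypothetical placeholders; the subnet CIDR comes from Table 1 below.

# Hypothetical IDs/ARN -- create the production Outposts subnet from Table 1
aws ec2 create-subnet \
  --vpc-id vpc-0123456789abcdef0 \
  --cidr-block 192.168.0.0/28 \
  --availability-zone me-south-1a \
  --outpost-arn arn:aws:outposts:me-south-1:111122223333:outpost/op-0123456789abcdef0

# Security group for the SQL Server and WSFC traffic (rules are added later)
aws ec2 create-security-group \
  --group-name sql-wsfc-sg \
  --description "SQL Server Always On AG and WSFC traffic" \
  --vpc-id vpc-0123456789abcdef0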

When creating an AG via the "New Availability Group Wizard", the Database Mirroring endpoint is created on each node joining the AG, with encryption enabled by default using the AES encryption algorithm and Windows authentication.

However, if you are configuring the AGs manually, you need to create the Database Mirroring endpoint on each node to ensure connections between the AGs in the distributed AG.

You can execute the following T-SQL script to create the mirroring endpoint:

CREATE ENDPOINT [Hadr-endpoint]
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
    FOR DATABASE_MIRRORING (
        ROLE = ALL,
        AUTHENTICATION = WINDOWS NEGOTIATE,
        ENCRYPTION = REQUIRED ALGORITHM AES
    );

Figure 3. Always On AG in Microsoft SQL Server Management Studio on the production Outposts

At this point, we have achieved HA on the production Outposts by configuring a two-node SQL Server Always On AG. The AG has its own AG listener (denoted with "1" in Figure 3) and a Database Mirroring endpoint (denoted with "2" in Figure 3).

Repeat the same steps to create the DR Always On AG on Outposts #2. In my example, the DR AG has only a single node.

Please note that for the DR AG, no database is required during the AG creation, as the database will be seeded by the distributed AG.

Figure 4. Always On AG in Microsoft SQL Server Management Studio on the DR Outposts

We now have a second SQL Server Always On AG with a single node on the DR Outposts, as shown in Figure 4.

Please also note that, at this stage, the instances on Outposts #1 can't communicate with the instances on Outposts #2, as no CoIPs have been assigned yet.

I will now detail the remaining steps:

  • Allocating and assigning CoIPs
  • Configuring DNS records
  • Configuring the distributed AG
  • Testing failover and failback between production and DR

Allocate and assign CoIPs to the EC2 instance IPs

As part of the AG configuration, two secondary IPs were allocated to each EC2 instance: one for the WSFC and the other for the AG listener. To enable communication between the production and DR Outposts through the customer network, we need to allocate and assign CoIPs for all EC2 instance IPs, including the secondary IPs.
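If you script that earlier step, assigning the two secondary private IPs to the production primary node (sql-prd-01 from Table 1) via the AWS CLI could look like the following sketch; the network interface ID is a hypothetical placeholder.

# Hypothetical ENI ID -- add the WSFC IP (192.168.0.6) and the AG listener IP
# (192.168.0.7) as secondary private IPs on sql-prd-01's network interface
aws ec2 assign-private-ip-addresses \
  --network-interface-id eni-0123456789abcdef0 \
  --private-ip-addresses 192.168.0.6 192.168.0.7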

To help configure the solution, I have listed in Table 1 below an example of the environment's requirements, including instance IPs, CoIP mappings, and DNS names. Please replace the example values with specific values from your environment.

| | Production Outposts: EC2 Primary | Production Outposts: EC2 Secondary | DR Outposts: EC2 Primary | DR Outposts: EC2 Secondary (optional) |
|---|---|---|---|---|
| VPC CIDR | 192.168.0.0/24 | 192.168.0.0/24 | 192.168.1.0/24 | 192.168.1.0/24 |
| Private subnet | 192.168.0.0/28 | 192.168.0.16/28 | 192.168.1.0/28 | 192.168.1.16/28 |
| EC2 primary IP | 192.168.0.5 | 192.168.0.20 | 192.168.1.5 | 192.168.1.20 |
| EC2 DNS name | sql-prd-01 | sql-prd-02 | sql-dr-01 | sql-dr-02 |
| EC2 primary CoIP (mapped to the primary IP) | 10.100.100.5 | 10.100.100.20 | 10.200.100.5 | 10.200.100.20 |
| EC2 WSFC IP | 192.168.0.6 | 192.168.0.21 | 192.168.1.6 | 192.168.1.21 |
| EC2 SQL Server AG IP | 192.168.0.7 | 192.168.0.22 | 192.168.1.7 | 192.168.1.22 |
| SQL Server AG listener DNS name | prd-listener | prd-listener | dr-listener | dr-listener |
| EC2 SQL Server AG CoIP (mapped to the SQL Server AG IP) | 10.100.100.7 | 10.100.100.22 | 10.200.100.7 | 10.200.100.22 |
| SQL Server AG listener CoIP DNS name | prd-coip-listener | prd-coip-listener | dr-coip-listener | dr-coip-listener |
| DNS domain | mydomain.com | mydomain.com | mydomain.com | mydomain.com |

Table 1. Configuration details with local IPs, CoIPs, and DNS names

The configuration of CoIPs on EC2 instances is currently done manually via the AWS Management Console or the EC2 API. As we want to allocate specific IP addresses from the CoIP pool, we will use the AWS CLI.

To configure a CoIP, you would need to:

  1. Retrieve the CoIP pool ID with the following command:

aws ec2 describe-coip-pools

Figure 5. Retrieve the CoIP pool ID

  2. Allocate the specific IP address from the CoIP pool by running the following command, replacing the example values with values from your environment:

aws ec2 allocate-address --address 10.100.100.7 --customer-owned-ipv4-pool ipv4pool-coip-092738572a73e5874

Figure 6. Allocate an IP address from a CoIP pool

  3. Associate this new IP with the instance's specific private IP address by running the following command, replacing the example values with values from your environment:

aws ec2 associate-address --allocation-id eipalloc-046f4f653fbeb092e --instance-id i-03b78d35eaff9d391 --private-ip-address 192.168.0.7

Figure 7. Associate a CoIP with an instance

  4. Repeat steps 2 and 3 to associate every EC2 instance IP with its respective CoIP across the two Outposts; the script sketch below shows one way to automate this.
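To avoid running these commands by hand for every IP, you could script the allocation and association. Below is a minimal sketch, assuming the CoIP pool ID retrieved in step 1 and the sql-prd-01 mappings from Table 1; extend the list with the remaining mappings for the secondary node and the DR Outposts.

#!/bin/bash
# CoIP pool ID retrieved in step 1 -- replace with your own
POOL_ID="ipv4pool-coip-092738572a73e5874"

# "instance-id private-ip coip" mappings from Table 1 (sql-prd-01 shown;
# add the remaining mappings for both Outposts)
MAPPINGS=(
  "i-03b78d35eaff9d391 192.168.0.5 10.100.100.5"
  "i-03b78d35eaff9d391 192.168.0.7 10.100.100.7"
)

for m in "${MAPPINGS[@]}"; do
  read -r INSTANCE_ID PRIVATE_IP COIP <<< "$m"
  # Allocate the specific CoIP from the pool and capture its allocation ID
  ALLOC_ID=$(aws ec2 allocate-address \
    --address "$COIP" \
    --customer-owned-ipv4-pool "$POOL_ID" \
    --query AllocationId --output text)
  # Associate the CoIP with the instance's private IP
  aws ec2 associate-address \
    --allocation-id "$ALLOC_ID" \
    --instance-id "$INSTANCE_ID" \
    --private-ip-address "$PRIVATE_IP"
done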

Once all CoIPs are associated with the EC2 instance IPs, we need to create DNS records, especially for the two SQL Server AG listeners.

DNS records configuration

A distributed AG is established by connecting the AGs via their listeners. As the communication between the two AGs goes through the Outposts local gateways and gets NATed with CoIPs, "CoIP" listener records are required to establish the communication for the distributed AG. We therefore need to create DNS records pointing to the CoIPs of the listeners.

It's also important to ensure that all nodes can communicate with the listeners across the two AGs, as well as with the listeners' CoIPs.

In this example, we create the DNS records with the addresses and names from Table 1. In your environment, the addresses and record names may differ.

prd-coip-listener A 10.100.100.7 #(sql-prd-01 SQL AG CoIP)
prd-coip-listener A 10.100.100.22 #(sql-prd-02 SQL AG CoIP)
dr-coip-listener A 10.200.100.7 #(sql-dr-01 SQL AG CoIP)
dr-coip-listener A 10.200.100.22 #(sql-dr-02 SQL AG CoIP)
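Before configuring the distributed AG, you can verify from each node that the remote listener records resolve and that the mirroring endpoint port is reachable through the LGWs. A quick check from a production node using Windows PowerShell, with the names from Table 1, might look like this:

# Verify the DR listener CoIP record resolves from this node
Resolve-DnsName dr-coip-listener.mydomain.com

# Verify the mirroring endpoint port (5022) is reachable through the LGWs
# (this succeeds only once the endpoint is created and security rules allow it)
Test-NetConnection -ComputerName dr-coip-listener.mydomain.com -Port 5022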

Configuration of the distributed AG

We will assume that all security groups and firewalls (if any on your internal network) allow traffic from the SQL Server instances (primary CoIPs) to the SQL Server AG listener CoIPs and the mirroring endpoint port (5022 in this blog post) across the two Outposts.
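For example, with the addressing from Table 1, the production-side rule could allow inbound mirroring traffic from the DR CoIP range as follows; the security group ID is a hypothetical placeholder, and the CIDR is derived from the DR CoIPs in Table 1.

# Hypothetical security group ID -- allow inbound mirroring endpoint traffic
# (TCP 5022) from the DR Outposts CoIP range to the production SQL Server nodes
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 5022 \
  --cidr 10.200.100.0/24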

To create the distributed AG, we will use T-SQL to implement the following two steps:

  1. Create the distributed AG on the production cluster.

On the production AG cluster, launch SQL Server Management Studio, connect using the listener DNS name (prd-listener.mydomain.com in this example), and create the distributed AG using the following script:

CREATE AVAILABILITY GROUP [distributedag] 
WITH (DISTRIBUTED) 
AVAILABILITY GROUP ON 
'cluster-ag-prd' WITH 
( 
LISTENER_URL = 'tcp://prd-listener.mydomain.com:5022', 
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, 
FAILOVER_MODE = MANUAL, 
SEEDING_MODE = AUTOMATIC 
), 
'cluster-ag-dr' WITH 
( 
LISTENER_URL = 'tcp://dr-coip-listener.mydomain.com:5022', 
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, 
FAILOVER_MODE = MANUAL, 
SEEDING_MODE = AUTOMATIC 
); 

  2. On the DR cluster, run the following T-SQL query to join the distributed AG:
ALTER AVAILABILITY GROUP [distributedag] 
JOIN 
AVAILABILITY GROUP ON 
'cluster-ag-prd' WITH 
( 
LISTENER_URL = 'tcp://prd-coip-listener.mydomain.com:5022', 
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, 
FAILOVER_MODE = MANUAL, 
SEEDING_MODE = AUTOMATIC 
), 
'cluster-ag-dr' WITH 
( 
LISTENER_URL = 'tcp://dr-listener.mydomain.com:5022', 
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, 
FAILOVER_MODE = MANUAL, 
SEEDING_MODE = AUTOMATIC 
); 

We have now configured the SQL Server Always On distributed AG. Figure 8 presents the Microsoft SQL Server Management Studio Object Explorer view of our AG configuration:

Figure 8. View of the production cluster in Object Explorer

Testing failover/failback between production and disaster recovery

We can now test the failover between the production and DR Outposts. For a controlled failover with no data loss, the distributed AG can be switched to synchronous replication prior to the failover (the corresponding T-SQL is shown in step 1 of the Failback section). The commands below focus only on the actual failover and failback.

Failover

  1. To check the synchronization status across production and DR nodes, run the following SQL query:
    -- Run this query on the Global Primary and the forwarder
    -- Check the results to see if synchronization_state_desc is SYNCHRONIZED,
    -- and the last_hardened_lsn is the same per database on both the global
    -- primary and forwarder 
    -- If not, rerun the query on both sides every 5 seconds until it is the case
    --
    SELECT ag.name
    , drs.database_id
    , db_name(drs.database_id) as database_name
    , drs.group_id
    , drs.replica_id
    , drs.synchronization_state_desc
    , drs.last_hardened_lsn 
    FROM sys.dm_hadr_database_replica_states drs 
    INNER JOIN sys.availability_groups ag on drs.group_id = ag.group_id;
    

Figure 9. View of synchronization status across all AGs from the production cluster

  2. On the primary cluster, run the following query to switch the role to secondary:
    ALTER AVAILABILITY GROUP distributedag SET (ROLE = SECONDARY); 
  3. On the DR node, execute the following SQL query to fail over to the DR AG. In case of a real failure on the primary side (production Outposts down), this is the only query that would need to be executed:
    ALTER AVAILABILITY GROUP distributedag FORCE_FAILOVER_ALLOW_DATA_LOSS; 

Let’s now run the query to check the synchronization status from the DR cluster:

Figure 10. View of synchronization status across all AGs from the DR cluster

Failback

The failback to the production AG is done in a similar way to the failover. For failback, the best practice is to ensure the data is fully in sync; therefore, enable synchronous replication before failing back.

  1. To ensure no data is lost, temporarily stop all SQL transactions to the DR AG and execute the SQL query below (the AG names might need to be updated to reflect your environment):
    -- sets the distributed availability group to synchronous commit 
     ALTER AVAILABILITY GROUP [distributedag] 
     MODIFY 
     AVAILABILITY GROUP ON
     'cluster-ag-prd' WITH 
      ( 
      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT 
      ), 
      'cluster-ag-dr' WITH  
      ( 
      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT 
      );
  2. Execute the SQL query below to check the synchronization status across the production and DR nodes:
    -- Run this query on the Global Primary and the forwarder
    -- Check the results to see if synchronization_state_desc is SYNCHRONIZED, and the last_hardened_lsn is the same per database on both the global primary and forwarder 
    -- If not, rerun the query on both sides every 5 seconds until it is the case
    --
    SELECT ag.name
    , drs.database_id
    , db_name(drs.database_id) as database_name
    , drs.group_id
    , drs.replica_id
    , drs.synchronization_state_desc
    , drs.last_hardened_lsn 
    FROM sys.dm_hadr_database_replica_states drs 
    INNER JOIN sys.availability_groups ag on drs.group_id = ag.group_id;
  3. On the DR node (which is now the global primary), execute the SQL query below to switch the role to secondary:
    ALTER AVAILABILITY GROUP distributedag SET (ROLE = SECONDARY);
  4. On the production node, execute the SQL query below to fail back:
    ALTER AVAILABILITY GROUP distributedag FORCE_FAILOVER_ALLOW_DATA_LOSS;

Once the failover is complete, the replication between production and DR can be switched back to asynchronous, as shown in the sketch below.
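This change mirrors the synchronous-commit query from step 1 of the failback; here is a sketch using the AG names from this blog post:

-- Sets the distributed availability group back to asynchronous commit
ALTER AVAILABILITY GROUP [distributedag]
MODIFY
AVAILABILITY GROUP ON
'cluster-ag-prd' WITH
  (
  AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT
  ),
'cluster-ag-dr' WITH
  (
  AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT
  );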

Refer to the Microsoft documentation for additional information on failover.

Conclusion

In this blog post, I explained how to configure SQL Server with HA and DR capabilities on AWS Outposts via the local gateway and CoIPs, by leveraging Microsoft SQL Server Always On distributed AGs.

With this solution, you can manually fail over SQL Server databases to DR should a disaster strike your production site. You can extend this solution by adding a level of automation to trigger the failover to DR with PowerShell scripts (using the Switch-SqlAvailabilityGroup cmdlet from the SqlServer PowerShell module), as sketched below.
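As a minimal sketch, the forced failover from the Failover section could be wrapped in PowerShell as follows. The node name, instance path, and AG name are the ones assumed in this blog post, and your own monitoring logic would decide when to invoke it:

# Requires the SqlServer PowerShell module (Install-Module SqlServer)
Import-Module SqlServer

# Hypothetical path: default SQL Server instance on the DR node,
# distributed AG name from this blog post
$agPath = "SQLSERVER:\Sql\sql-dr-01\DEFAULT\AvailabilityGroups\distributedag"

# Force the failover to the DR AG -- equivalent to the
# FORCE_FAILOVER_ALLOW_DATA_LOSS T-SQL used in the Failover section
Switch-SqlAvailabilityGroup -Path $agPath -AllowDataLoss -Force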



Sebastien Linsolas

Sebastien is a Senior Solutions Architect who works with our Financial Services customers in the Middle East and North Africa (MENA) region. He is passionate about helping customers build application architectures that are scalable, reliable, secure, and cost effective.