AWS Storage Blog

Implementing HA and DR for SQL Server Always-On Failover Cluster Instance using Amazon FSx for NetApp ONTAP

One of the most difficult challenges organizations face today is ensuring data availability and accessibility across their locations. Maintaining high availability and accessibility for SQL Server databases across different AWS Regions is a top data protection requirement for enterprises considering moving to the cloud.

Amazon FSx for NetApp ONTAP offers a quick way to fail over your SQL Server environment to a different AWS Region in order to meet recovery time objective (RTO) and recovery point objective (RPO) requirements. FSx for NetApp ONTAP is a storage service that allows customers to launch and run complete, fully managed ONTAP file systems in the cloud. With this service, you can build a highly available SQL Server Always On Failover Cluster Instance (FCI) architecture, as explained in detail in the SQL Server High Availability Deployments Using Amazon FSx for NetApp ONTAP post. In addition to high availability, customers across different industry verticals are looking for business continuity and disaster recovery strategies for their business-critical workloads.

In this post, we explain an architecture pattern that you can follow to design a high availability and disaster recovery (HADR) SQL Server Failover Cluster Instance architecture. The solution spans two AWS Regions and leverages NetApp SnapMirror's native data replication capabilities within the FSx for NetApp ONTAP file system.

Solution overview

We will demonstrate how to evaluate FSx for NetApp ONTAP's SnapMirror capability for consistency, performance, and reliability against our business outcome. To demonstrate how we can achieve the defined targets of an RPO of less than 10 minutes and an RTO of less than 1 hour, we prepare two environments: the primary site in US East (N. Virginia), us-east-1, and a warm standby disaster recovery site in US East (Ohio), us-east-2. The architecture is designed to support recovering Microsoft SQL Server user databases running on FSx for ONTAP volumes using SnapMirror replication to meet low RPO and RTO objectives. To implement this solution, we recommend that you have a good understanding of and basic familiarity with NetApp data management capabilities. The following diagram illustrates the solution architecture:

Figure 1: Multi-region FSx for NetApp ONTAP with SnapMirror replication for SQL Server DR

Prerequisites

Prepare both the primary and DR site environments, as described in the following list, in any AWS Regions of your choice that support FSx for NetApp ONTAP.

In our example setup, we use the US East (N. Virginia) and US East (Ohio) AWS Regions.

  • AWS Identity and Access Management (IAM) permissions to create VPCs and the VPC Peering connection between them.
  • A VPC that spans two Availability Zones (AZs) and includes two private subnets to host the primary site in us-east-1.
  • Another VPC with a single AZ and a private subnet to host the DR site in us-east-2.
  • An existing or new deployment of Active Directory (AD) with network access to support the Windows Failover Cluster and SQL Deployment. AD can be deployed on Amazon Elastic Compute Cloud (Amazon EC2) with AWS Launch Wizard for Active Directory or through our Quick Start for Active Directory.
  • Deploy the SQL Server instances at both the primary and secondary sites. For this demo, we deploy a two-node SQL FCI cluster at the primary site and an additional third SQL Server node at the DR site, as shown in the architecture diagram.
  • AD Domain administrator account and SQL Server service account credentials.
  • All three SQL Server nodes join the same AD domain and the same Windows Failover Cluster, and they use a common AD service account for the SQL Server service.
  • Provision an FSx for ONTAP file system with the appropriate SSD capacity for your requirements at the primary site (us-east-1).
  • Provision another FSx for ONTAP Multi-AZ or Single-AZ file system at the DR site (us-east-2) with solid state drive (SSD) capacity equal to or greater than that provisioned at the primary site.

Walkthrough

To implement the solution, complete the following high-level steps:

  1. Deploy two-node SQL FCI Cluster using FSx for ONTAP system at the primary site (us-east-1).
  2. Configure SnapMirror relationship with primary FSx for ONTAP system at the DR Site (us-east-2).
  3. Simulate a real-world disaster scenario at the primary site.
  4. Initiate the DR drill and track timestamp.
  5. Configure SnapMirror relationship for failback.

1. Deploy two-node SQL FCI Cluster using FSx for ONTAP file system at the primary site

You can either follow the step-by-step instructions in the manual deployment post listed next or use the AWS Launch Wizard service for an automated deployment of the SQL Server Always On Failover Cluster Instance (FCI):

Manual deployment

SQL Server High Availability Deployments Using Amazon FSx for NetApp ONTAP

Automated deployment

AWS Launch Wizard now supports SQL Server deployments using Amazon FSx for NetApp ONTAP

Figure 2: Choosing a deployment model for your SQL Server Failover Cluster Instance

2. Configure SnapMirror relationship with primary FSx for ONTAP system at the DR Site (us-east-2)

Before you begin, read the Migrating to FSx for ONTAP using NetApp SnapMirror post to understand the details of setting up SnapMirror replication between two FSx for ONTAP file systems, which the following steps outline.

2.1. Create a disaster recovery volume on the Single-AZ FSx for ONTAP file system at the DR site.

volume create -vserver svm-dr -volume sqldr -type DP -size 8192GB -aggregate aggr1

2.2. Create cluster peering between the primary and DR FSx for ONTAP file systems.

SnapMirror uses intercluster LIFs to facilitate data transfer between the source and destination clusters.

2.2.1. Use the following command to gather the intercluster LIFs of both ONTAP file systems. These will be required to create cluster peering.

network interface show -role intercluster

2.2.2. Once you have gathered the LIF IP addresses, create the cluster peering request from the primary FSx for ONTAP file system using the LIFs of the DR FSx for ONTAP file system. You will need to provide a passphrase to secure the cluster peering relationship.

cluster peer create -address-family ipv4 -peer-addrs <intercluster LIFs of the DR site>

An example command is shown next.

cluster peer create -address-family ipv4 -peer-addrs 11.0.45.160,11.0.36.120


2.2.3. Then, create the cluster peering request from the DR FSx for ONTAP file system using the LIFs of the primary FSx for ONTAP file system to complete the process. Provide the same passphrase created in the previous step.

cluster peer create -address-family ipv4 -peer-addrs <intercluster LIFs of the primary site>

An example command is shown here:

cluster peer create -address-family ipv4 -peer-addrs 10.0.3.148,10.0.16.16

2.2.4. Then, verify the peering was successful using the following command.

cluster peer show

2.3. Create SVM peering between the primary and DR file systems.

2.3.1. Create the SVM peering request from the DR FSx for ONTAP file system.

vserver peer create -vserver svm-dr -peer-vserver svm-prod -peer-cluster <primary site ONTAP file system ID> -applications snapmirror -local-name PRIMARY_SITE

An example command is shown here.

vserver peer create -vserver svm-dr -peer-vserver svm-prod -peer-cluster FsxId06b9ca2edc5756c9c -applications snapmirror -local-name PRIMARY_SITE

2.3.2. Accept the SVM peering request on the primary FSx for ONTAP file system.

vserver peer accept -vserver svm-prod -peer-vserver svm-dr -local-name DR_SITE

2.3.3. Verify the SVM peering status.

vserver peer show

2.4. Create the SnapMirror relationship between the primary and DR file systems.

After we establish both cluster and SVM peering, the next step is to create the SnapMirror relationship on the DR FSx for ONTAP file system. You can optionally use the -throttle flag, which sets the maximum bandwidth (in kilobytes per second) that the SnapMirror relationship can use.

snapmirror create 
             -source-path <localname for primary>:<source vol> 
             -destination-path <svm-dr>:<DR volume> 
             -vserver <DR SVM> 
             -throttle unlimited

An example command is shown here:

snapmirror create
             -source-path PRIMARY_SITE:sqlprod
             -destination-path svm-dr:sqldr
             -vserver svm-dr
             -throttle unlimited
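If you do set a throttle instead of using unlimited, note that the -throttle flag takes a value in kilobytes per second. A quick conversion from a gigabit-per-second bandwidth target can be sketched as follows (an illustrative helper, not an ONTAP tool):

```python
def throttle_kb_per_sec(gbps):
    """Convert a desired replication cap in Gbps to the kilobytes-per-second
    value that the -throttle flag expects (1 Gbps = 125,000 KB/s)."""
    return int(gbps * 1_000_000_000 / 8 / 1000)

# Example: cap replication at 1 Gbps or 2.5 Gbps
print(throttle_kb_per_sec(1))    # 125000
print(throttle_kb_per_sec(2.5))  # 312500
```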

2.5. Modify the SnapMirror relationship to use a five-minute schedule.

After creating the SnapMirror relationship, we can now set up the schedule for the replication.

2.5.1. Run the following command on the DR FSx for ONTAP system to list all the available schedules.

job schedule show


2.5.2. Run the following command to modify the SnapMirror relationship to use the 5min schedule.

snapmirror modify -destination-path svm-dr:sqldr -schedule 5min

2.6. Initialize the SnapMirror relationship.

2.6.1. Run the following command on the DR FSx for ONTAP system to initialize the SnapMirror replication.

snapmirror initialize
                   -destination-path svm-dr:sqldr
                   -source-path PRIMARY_SITE:sqlprod

2.6.2. Run the following command to check the status of the SnapMirror relationship.

snapmirror show


3. Simulate a real-world disaster scenario at the primary site

To simulate load on the source database, we run the HammerDB load generation tool with 377 virtual users against a 30,000-warehouse TPCC database that is approximately 3.5 TB in size.

3.1. Open SQL Server Management Studio (SSMS).

3.2. Connect to the Source SQL Server database instance at Primary Site (us-east-1).

3.3. Choose New Query and create an RPO tracker table within the TPCC database to record timestamps.

Use TPCC
GO
CREATE TABLE RPOTracker (
    Date date,
    RPO time,
    Status varchar(255)
);

3.4. Next, run the following script to insert a timestamp in a loop with a 10-second delay. This will help us validate the data loss once we recover the database volume at the destination site.

DECLARE @DateTime datetime2(7) = GETDATE();
DECLARE @Date date = CAST(@DateTime AS date);
DECLARE @RPO time = CAST(@DateTime AS time(7));
    
WHILE 1 = 1
BEGIN
    INSERT INTO [TPCC].[dbo].[RPOTracker](Date,RPO,Status)
    VALUES(@Date, @RPO,'Success');
    
    WAITFOR DELAY '00:00:10';
        
    SET @DateTime = GETDATE();
    SET @Date = CAST(@DateTime AS date);
    SET @RPO = CAST(@DateTime AS time(7));
END
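The T-SQL loop above is simply a heartbeat: one 'Success' row every 10 seconds. The same idea can be sketched in Python (illustrative only, not part of the deployment) to reason about what the tracker table looks like at any point in time:

```python
import datetime as dt

def heartbeat_rows(start, count, interval_s=10):
    """Generate (date, time, status) rows the way the RPOTracker loop
    does: one 'Success' row per interval, starting at `start`."""
    rows, t = [], start
    for _ in range(count):
        rows.append((t.date(), t.time(), "Success"))
        t += dt.timedelta(seconds=interval_s)
    return rows

# The newest surviving row on the DR copy bounds observed data loss to
# the replication lag plus at most one heartbeat interval.
rows = heartbeat_rows(dt.datetime(2023, 1, 1, 16, 0, 0), 3)
print(rows[-1][1])  # 16:00:20
```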

4. Initiate the DR drill and track timestamp

Next, we run the following query at the source SQL Server to see the latest timestamp recorded in the RPO tracker table. The latest record shows the 04:37:38 timestamp.

SELECT top(5) Date, RPO, status FROM [TPCC].[dbo].[RPOTracker]
Order By RPO desc

Figure 3: SQL Query output to track last recorded timestamp on source SQL Server at primary site

At 04:37:38 PM, we either declare the DR event or manually break the SQL Server configuration at the primary site to simulate a failure. Refer to the whitepaper Designing highly available distributed systems on AWS to learn the terminology and ways to measure MTTR (mean time to recovery, repair, respond, or resolve) and MTBF (mean time between failures) in your DR drills.

At 04:37:52 PM, we record the SnapMirror lag and the newest snapshot on the DR site FSx for ONTAP file system. As we can see, the newest snapshot is from 04:35:03, with a lag time of 0:2:48.

To list the lag time and newest snapshot timestamp, run the following command (refer to the snapmirror show command documentation):

snapmirror show -fields lag-time,newest-snapshot-timestamp
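Given the timestamps from the drill (failure declared at 04:37:38 PM, newest replicated snapshot at 04:35:03 PM), the observed RPO is simply their difference. A minimal sketch of that arithmetic (the calendar date is a placeholder):

```python
from datetime import datetime, timedelta

def observed_rpo(failure_time, newest_snapshot):
    """Worst-case data loss: the gap between the last snapshot that
    reached the DR site and the moment of failure."""
    return failure_time - newest_snapshot

# Timestamps from the drill; the date itself is illustrative.
failure  = datetime(2023, 1, 1, 16, 37, 38)  # 04:37:38 PM, failure declared
snapshot = datetime(2023, 1, 1, 16, 35, 3)   # 04:35:03 PM, newest snapshot on DR
print(observed_rpo(failure, snapshot))       # 0:02:35, well under the 10-minute target
```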


At 04:38:59 PM, we successfully quiesce and break the SnapMirror relationship between the two file systems.

The snapmirror quiesce command disables future transfers for a SnapMirror relationship. If there is no transfer in progress, the relationship becomes “quiesced”. If there is a transfer in progress, it is not affected, and the relationship becomes “quiescing” until the transfer completes.

snapmirror quiesce -destination-path svm-dr:sqldr

The snapmirror break command breaks a SnapMirror relationship between a source and destination endpoint of a data protection mirror.

snapmirror break -destination-path svm-dr:sqldr


At 04:39:59 PM, once the mirror relationship is broken, the LUN on the DR volume becomes available to map to the DR SQL Server node.

Steps to map the iSCSI LUN to the DR SQL Server node:

  1. Map the iSCSI LUN on the replicated DR volume to the correct initiator group to provide access to the DR SQL Server instance.
  2. Once the LUN appears on the DR SQL Server instance, review file and folder permissions on the SQL Server data files.
  3. Next, use SQL Server Management Studio (SSMS) to connect to the DR SQL Server and attach the SQL database files to bring the database online.
  4. Then, use the following SQL query to review the latest recorded timestamp to determine how much data loss (RPO) occurred.
select top 5 * from tpcc.dbo.RPOTracker with (nolock) order by rpo desc

Figure 4: SQL Query output to track last recorded timestamp on DR SQL server after cutover

To summarize, several factors can affect the RPO during the cutover, such as the change rate at the source SQL Server. Therefore, we recommend that you follow the SnapMirror documentation for configuration best practices. SnapMirror replication is bound to a single TCP flow, which caps throughput at around 5 Gbps, and there can be only one SnapMirror relationship per volume. To achieve aggressive RPO and RTO targets, configure an appropriate replication schedule; in our case, we used the five-minute schedule, which is the shortest pre-configured option.
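As a back-of-the-envelope check on whether a schedule can keep up, compare the time needed to ship one interval's changed data over the single roughly 5 Gbps SnapMirror flow against the interval itself. This is illustrative arithmetic, not an ONTAP sizing tool, and the change-rate figure is hypothetical:

```python
def transfer_seconds(changed_gb, bandwidth_gbps=5.0):
    """Seconds to replicate one interval's changed data over a single
    TCP flow at the given bandwidth (GB converted to gigabits)."""
    return changed_gb * 8 / bandwidth_gbps

def schedule_keeps_up(changed_gb, interval_s=300, bandwidth_gbps=5.0):
    """A five-minute (300 s) schedule can only hold the RPO if each
    interval's changes transfer in less than the interval."""
    return transfer_seconds(changed_gb, bandwidth_gbps) < interval_s

# Hypothetical 50 GB of change per five-minute interval:
print(transfer_seconds(50))    # 80.0 seconds
print(schedule_keeps_up(50))   # True
```

If the change per interval cannot transfer within the interval, replication falls behind and the effective RPO grows beyond the schedule.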

5. Configure SnapMirror relationship for failback

To perform failback to the primary FSx for ONTAP system after failover, you must reverse the replication from the DR FSx for ONTAP system back to the primary ONTAP system.

5.1. Delete the SnapMirror relationship between the primary and secondary FSx for ONTAP systems. An example command is shown next:

snapmirror delete -source-path PRIMARY_SITE:sqlprod -destination-path svm-dr:sqldr

5.2. Re-establish the SnapMirror relationship from the DR FSx for ONTAP system (the new primary), as shown in this example command:

snapmirror resync -destination-path svm-dr:sqldr


5.3. Change the schedule to run every five minutes.

snapmirror modify -destination-path PRIMARY_SITE:sqlprod -schedule 5min

Cleaning up

Delete resources that you are no longer using so that you don't incur unintended charges. You can clean up the resources that you created for this tutorial, including the two FSx for ONTAP file systems, the SQL Server EC2 instances, the Active Directory deployment, and the VPCs and VPC peering connection.

Conclusion

In this post, we showed how to configure SnapMirror replication between two FSx for ONTAP file systems in a multi-Region architecture to meet disaster recovery requirements with low RPO and RTO. The SnapMirror feature in FSx for ONTAP makes it easy to replicate volume data between FSx for ONTAP systems running in two different AWS Regions for disaster recovery purposes. With Amazon FSx for ONTAP, you can now implement an Always On SQL Server Failover Cluster Instance (FCI) with SQL Server Standard edition to help reduce licensing costs. With this architecture, customers using SQL Server Enterprise edition for large databases can also reduce their storage costs for both HA and DR.

For more detailed information and guidance, we encourage you to read the NetApp SnapMirror Configuration & best practices guide.

Sudhir Amin

Sudhir Amin is a Sr. Solutions Architect at Amazon Web Services. In his role based out of New York, he provides architectural guidance and technical assistance to enterprise customers across different industry verticals, accelerating their cloud adoption. He is a big fan of snooker, combat sports such as boxing and UFC, and loves traveling to countries with rich wildlife reserves where he gets to see the world's most majestic animals up close.

Nishanth Charlakola

Nishanth Charlakola is an Associate Director at S&P Global Market Intelligence in the Data Services Engineering group. He has over 14 years of industry experience in designing, administrating, and maintaining large-scale SQL Server systems and has over 5 years of cloud experience. At S&P Global Market Intelligence, he is primarily responsible for SQL Server workloads migration to the cloud. Nishanth lives in Hyderabad, India and in his free time you will find him supporting Chelsea FC in the English Premier League.