Why S&P Global chose Amazon FSx for NetApp ONTAP to achieve high availability and disaster recovery for SQL Server

Organizations have a requirement to build high availability and disaster recovery (HADR) solutions for their complex SQL Server infrastructure to prevent data loss and protect against corruption. With the rapid pace of cloud adoption, businesses across different industry verticals have realized the value of a successful proof-of-concept for any technical project that migrates existing environments to the cloud. For companies of any size, it is important to set standards, minimize risks, and conduct business and technical validation while maintaining speed for successful implementation.

S&P Global Market Intelligence is a leading provider of actionable intelligence on the global financial markets and the companies and industries that make up those markets, offering Environmental, Social, and Governance (ESG) solutions, deep data, and insights on critical economic, market, and business factors. S&P Global Market Intelligence has been providing essential intelligence that unlocks opportunity, fosters growth, and accelerates progress for more than 160 years. In this post, we explain how S&P Global Market Intelligence completed the technical evaluation to meet business continuity and DR goals for their Microsoft SQL Server workloads by implementing a multi-Region Amazon FSx for NetApp ONTAP architecture with a successful proof-of-concept.

Business challenge and current state of architecture

The Data Services Engineering team at the Market Intelligence division of S&P Global, which manages a large SQL Server estate, needed a cost-optimized, cloud-native solution to run and protect their largest backend processing platform, built on Microsoft SQL Server, on AWS. This data processing instance has over 100 SQL Server databases with a total storage of over 100 terabytes and an annual growth rate of approximately 10%. The backend processing system serves as the source database server for all data ingestion and data distribution to various external client-facing products. Data ingestion happens via automated processes using the content ingestion UI tools. The content users who work on these content ingestion UI tools are spread all over the world. From a data distribution point of view, this system acts as a SQL Server publisher, publishing hundreds of publication databases with thousands of tables.
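
As a rough illustration of what the stated growth rate implies for capacity planning, the following Python sketch compounds the 100 TB starting footprint at 10% per year. The function name and the five-year horizon are assumptions made for illustration; the post states only the current size and the approximate growth rate.

```python
def projected_capacity_tb(current_tb: float, annual_growth: float, years: int) -> float:
    """Compound the current footprint forward at a fixed annual growth rate."""
    if years < 0:
        raise ValueError("years must be non-negative")
    return current_tb * (1 + annual_growth) ** years


if __name__ == "__main__":
    # 100 TB and 10% come from the text; the 5-year horizon is an assumption.
    for year in range(6):
        print(f"year {year}: {projected_capacity_tb(100, 0.10, year):.1f} TB")
```

Under these assumptions the footprint would reach roughly 161 TB after five years, which is worth keeping in mind when sizing storage at both the primary and DR sites.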

At the infrastructure level, both the primary and DR environments have two-node Windows Server Failover Clustering (WSFC) clusters connected to high-speed SAN storage systems that provide shared storage to host the SQL Server databases. SAN replication configured between the storage systems in the two datacenters adds geo-redundancy to the failover clustering solution, as shown in the following figure.

Figure 1: Logical components of SQL Server HADR Architecture on-premises

In the event of a failover at the primary site, the WSFC service transfers ownership of the primary instance’s resources to a designated failover node within the site.

In the event of a catastrophic disaster affecting the primary datacenter, client application traffic is steered to the DR site to access the replicated database instances. The storage system at the DR site is connected to two Windows failover cluster nodes that provide access to databases on the replicated storage volume, so that availability is maintained across all four nodes in the two datacenters. Database administrators (DBAs) perform a manual failover and break the mirror relationship between the two storage volumes to bring the databases online at the DR site.

In addition, to provide data services to numerous internal and external applications, Microsoft SQL Server transactional replication is configured with hundreds of publication databases and thousands of articles on the SQL Server instance. Client applications use the SQL Server network name within the Windows failover cluster to access the database service. Therefore, domain name resolution and networking between the two sites are critical for seamless failover within the targeted RPO of 10 minutes and RTO of one hour.

Technical requirements

The solution had to meet the following technical requirements:

Implement a multi-Region SQL Server Failover Cluster Instance (FCI) on a shared storage solution with storage-based replication, to support an environment with numerous databases.
The storage solution deployed at both sites must support continuous, asynchronous, and bidirectional replication to achieve an RPO of 10 minutes or less and an RTO of one hour.
The storage solution must support storage capacity over 100 terabytes in size with high IOPS and throughput for the SQL Server cluster.
Implement a single highly available Windows Server Failover Cluster across different AWS Regions that uses asymmetric shared storage to support SQL Server.
Implement resilient Windows infrastructure with Active Directory (AD) and reliable DNS services to support seamless failover for client applications.

Solution overview

To deploy a SQL Server Always On FCI in a multi-Region architecture, shared storage along with data replication between the two Regions was critical. To meet the aggressive RPO and RTO, FSx for ONTAP provides highly available and durable storage with SnapMirror replication for cross-Region DR. As a fully managed service, FSx for ONTAP makes it easier to launch and scale reliable, high-performing, and secure shared file storage in the cloud in a cost-effective way as compared to self-managed storage.

Figure 2: Multi-Region FSx for NetApp ONTAP with SnapMirror Replication for DR

High-level steps involved in the solution deployment

For this proof-of-concept, the team deployed a Windows Server Failover Clustering (WSFC) architecture that spanned the Northern Virginia (us-east-1) and Ohio (us-east-2) Regions. The team followed these steps to prepare the POC environment.

Configured VPC peering between two Regions

A VPC that spans two Availability Zones (AZs) and includes two private subnets to host the primary site in us-east-1.
Another VPC using a single AZ with a private subnet to host the DR SQL Server node at the DR site in us-east-2.
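
VPC peering requires that the CIDR blocks of the two VPCs do not overlap, so it can be worth checking the address plan before creating the peering connection. The sketch below uses Python's standard ipaddress module. The CIDR ranges and helper names are hypothetical placeholders, not the addresses used in the proof-of-concept.

```python
import ipaddress


def cidrs_overlap(cidr_a: str, cidr_b: str) -> bool:
    """Return True if the two CIDR blocks share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


def subnets_fit(vpc_cidr: str, subnet_cidrs: list[str]) -> bool:
    """Return True if every subnet lies entirely inside the VPC block."""
    vpc = ipaddress.ip_network(vpc_cidr)
    return all(ipaddress.ip_network(s).subnet_of(vpc) for s in subnet_cidrs)


# Hypothetical address plan (assumption, not taken from the post).
PRIMARY_VPC = "10.0.0.0/16"                       # us-east-1, two AZs
PRIMARY_SUBNETS = ["10.0.1.0/24", "10.0.2.0/24"]  # one private subnet per AZ
DR_VPC = "10.1.0.0/16"                            # us-east-2, single AZ
DR_SUBNETS = ["10.1.1.0/24"]

if __name__ == "__main__":
    print("VPC blocks overlap:", cidrs_overlap(PRIMARY_VPC, DR_VPC))
    print("primary subnets fit:", subnets_fit(PRIMARY_VPC, PRIMARY_SUBNETS))
    print("DR subnets fit:", subnets_fit(DR_VPC, DR_SUBNETS))
```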

Enabled Network communication to support AD service between two Regions

An existing deployment of AD with network access to support the Windows Failover Cluster and SQL Deployment in both Regions.
Configured security groups to allow SQL Server node-to-node communication and client access.

Implemented SQL Servers in both Regions

Deployed SQL Server instances at both the primary and secondary sites. For initial testing, the team used standalone SQL Server instances at the two sites, and later transitioned to a two-node SQL Server FCI cluster at the primary site with a third node joined to the same cluster at the DR site.
Provisioned SQL Server service account credentials with appropriate permissions to create the Windows failover cluster and to provision the SQL Server Failover Cluster Instance.
All three SQL Servers joined the common AD domain and used a common service account for SQL Server.
The cluster quorum should be configured so that the DR site does not participate in voting and the primary site has an odd number of votes.
Because this architecture has two nodes in the primary site, a file share witness is needed to form a Node and File Share Majority quorum. For more information, refer to Quorum configuration and Quorum considerations for disaster recovery configurations.
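
The quorum arrangement described above can be illustrated with a simplified vote-counting model in Python. This is a static-majority sketch only: it ignores WSFC features such as dynamic quorum, and the node names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voter:
    name: str
    site: str
    votes: int  # 0 means the member is excluded from quorum voting


# Two primary nodes plus a file share witness give three votes at the
# primary site; the DR node carries no vote. Names are hypothetical.
CLUSTER = [
    Voter("sql-node-1", "primary", 1),
    Voter("sql-node-2", "primary", 1),
    Voter("file-share-witness", "primary", 1),
    Voter("sql-node-dr", "dr", 0),
]


def has_quorum(voters: list[Voter], online: set[str]) -> bool:
    """Return True if the online members hold a strict majority of all votes."""
    total = sum(v.votes for v in voters)
    present = sum(v.votes for v in voters if v.name in online)
    return present * 2 > total


if __name__ == "__main__":
    # Losing one primary node still leaves 2 of 3 votes.
    print(has_quorum(CLUSTER, {"sql-node-2", "file-share-witness", "sql-node-dr"}))
    # Losing connectivity to the DR site does not affect quorum at all.
    print(has_quorum(CLUSTER, {"sql-node-1", "sql-node-2", "file-share-witness"}))
    # The DR node alone holds no votes, so it cannot form quorum by itself.
    print(has_quorum(CLUSTER, {"sql-node-dr"}))
```

In this model the DR node cannot form quorum on its own, which is consistent with the manual failover steps described for a loss of the primary site.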

Implemented FSx for ONTAP file systems in both Regions

Provisioned FSx for ONTAP file system with appropriate SSD Capacity as per the requirement in the primary site (us-east-1).
Provisioned another FSx for ONTAP Single AZ file system at the DR site (us-east-2) with equal or more SSD capacity than provisioned at the primary site.
The two nodes in the primary Region get their storage from the primary FSx for ONTAP file system, and that storage is visible only to the nodes in the primary Region, not to the node in the secondary Region. This is called an asymmetric storage configuration.
Similarly, the DR node in the secondary Region accesses storage from the secondary FSx for ONTAP file system, and that storage is visible only to the node in the secondary Region. The storage in the secondary Region will not be mapped to Windows nodes. Storage in the secondary Region is in the read-only state and is changed to read-write during failover.
FSx for ONTAP SnapMirror async replication is established between the two file systems and, when the failover happens to the secondary Region, replication flow is reversed from the secondary Region to the primary Region.
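
The failover and reversal flow described above can be pictured as a small state model. The Python sketch below is conceptual only: the class and method names are hypothetical and do not correspond to any ONTAP CLI command or API.

```python
from dataclasses import dataclass, field
from enum import Enum


class Access(Enum):
    READ_WRITE = "read-write"
    READ_ONLY = "read-only"


@dataclass
class MirrorPair:
    """Conceptual model of an async replication relationship between two sites."""
    source: str
    destination: str
    mirrored: bool = True
    access: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # The source serves writes; the destination stays read-only while mirrored.
        self.access = {
            self.source: Access.READ_WRITE,
            self.destination: Access.READ_ONLY,
        }

    def fail_over(self) -> None:
        """Break the mirror and make the destination writable."""
        if not self.mirrored:
            raise RuntimeError("relationship is already broken")
        self.mirrored = False
        self.access[self.destination] = Access.READ_WRITE

    def reverse(self) -> None:
        """Re-establish replication in the opposite direction after a failover."""
        if self.mirrored:
            raise RuntimeError("break the relationship before reversing it")
        self.access[self.source] = Access.READ_ONLY  # old source becomes the target
        self.source, self.destination = self.destination, self.source
        self.mirrored = True


if __name__ == "__main__":
    pair = MirrorPair(source="us-east-1", destination="us-east-2")
    pair.fail_over()
    pair.reverse()
    print(pair.source, "->", pair.destination, pair.access)
```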

Note: Zero data loss is possible only with synchronous replication. Synchronous replication requires a round-trip time (RTT) of 10 milliseconds or less (a distance of about 150 km). For disaster recovery, asynchronous replication is ideal to overcome the distance limitation and replicate to the remote site with minimal data loss.

To learn more about SnapMirror Synchronous and SnapMirror Asynchronous, read the NetApp blog article How to Devise an Effective Data Protection and Disaster Recovery Approach.
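
The rule of thumb in the note above can be written as a small Python helper. The 10 ms threshold comes from the note; the function name and the sample RTT values are assumptions for illustration, and a real decision would also weigh cost and application tolerance.

```python
SYNC_MAX_RTT_MS = 10.0  # threshold quoted in the note above


def replication_mode(rtt_ms: float) -> str:
    """Suggest 'sync' when the measured RTT allows it, otherwise 'async'."""
    if rtt_ms < 0:
        raise ValueError("RTT cannot be negative")
    return "sync" if rtt_ms <= SYNC_MAX_RTT_MS else "async"


if __name__ == "__main__":
    # Sample RTT values are hypothetical, not measurements from the post.
    for rtt in (2.0, 10.0, 12.5):
        print(f"RTT {rtt} ms -> {replication_mode(rtt)}")
```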

Technical validation

To verify data consistency and availability, the team validated that they can bring the databases online in a clean state on the secondary file system after breaking the SnapMirror relationship. They also failed over the SQL Server instance in the middle of a large transaction and validated that the in-flight transactions rolled back successfully on the secondary file system. Each DR drill performed during the testing phase included monitoring of the Microsoft SQL Server transactional replication status and application connectivity to confirm data accessibility for end users. The implementation and data validation steps are explained in the post Implementing HA and DR for SQL Server Always-On Failover Cluster Instance using Amazon FSx for NetApp ONTAP. This can help you plan your SQL Server HA and DR architecture with SnapMirror-enabled multi-Region FSx for ONTAP file systems.
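
One way to record a DR drill against the stated objectives is sketched below in Python. The 10-minute RPO and one-hour RTO come from the post; the class, field names, and timestamps are hypothetical and do not describe how the team actually measured its drills.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

RPO = timedelta(minutes=10)  # objective stated in the post
RTO = timedelta(hours=1)     # objective stated in the post


@dataclass(frozen=True)
class DrillResult:
    last_replicated_at: datetime   # newest data present at the DR site
    outage_started_at: datetime    # moment the primary became unavailable
    service_restored_at: datetime  # moment clients could connect at the DR site

    @property
    def data_loss_window(self) -> timedelta:
        return self.outage_started_at - self.last_replicated_at

    @property
    def recovery_time(self) -> timedelta:
        return self.service_restored_at - self.outage_started_at


def meets_objectives(result: DrillResult) -> bool:
    """Return True if the drill stayed within both the RPO and the RTO."""
    return result.data_loss_window <= RPO and result.recovery_time <= RTO


if __name__ == "__main__":
    # Hypothetical drill timestamps.
    start = datetime(2024, 1, 1, 12, 0)
    drill = DrillResult(
        last_replicated_at=start - timedelta(minutes=6),
        outage_started_at=start,
        service_restored_at=start + timedelta(minutes=40),
    )
    print("objectives met:", meets_objectives(drill))
```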

Outcome

The internal content and data processing applications are the backbone for delivering data to the company’s external customer-facing products. Using FSx for ONTAP’s SnapMirror capability, S&P Global Market Intelligence implemented a SQL Server Failover Cluster Instance architecture between two Regions to support HADR. SnapMirror replication made it easy to achieve the aggressive HADR objectives of a 10-minute RPO and a one-hour RTO. This improved the reliability and resiliency of their application. Native network compression also reduced bandwidth utilization, accelerating data transfers between the FSx for ONTAP file systems.

Conclusion

In this post, we presented a solution implemented by S&P Global to deploy and validate a seamless cross-Region HADR capability in AWS using FSx for ONTAP for critical content and data processing applications. With FSx for ONTAP, you can unburden your storage team from the operational overhead of managing and maintaining storage systems, allowing them to focus more on testing and validating data consistency and reliability for their application users. In addition, you can explore features such as FlexClone to build lower environments for development and ETL environments for reporting and analytics, implement the SnapCenter plugin to perform space-efficient database backup and restore, and implement data tiering wherever possible to drive cost savings by moving snapshots and infrequently accessed storage blocks to the elastic capacity pool tier.

Thanks for reading this post! If you have any comments or questions, don’t hesitate to leave them in the comments section.

AWS Storage Blog
