Architect a disaster recovery for SQL Server on AWS: Part 1
In today’s world, it’s just a matter of time before disaster happens, and when it happens it’s essential to recover your SQL Server databases and bring the systems online with minimal data loss and downtime. To respond to and recover from an outage of SQL Server database access, high availability (HA) and disaster recovery (DR) solutions are utilized. High availability efforts entail what to do to prevent an outage, disaster recovery addresses what is done to re-establish high availability after an outage. AWS can helps you lower the cost and complexity of a HA and DR solutions for SQL Server workloads.
Many customers seek our guidance to choose the right HA or DR solution for SQL Server on AWS, but there is no one-size-fits-all solution. There are trade-offs involving potential for data loss, time to recovery, operational complexity, and operational cost. In some cases, you can even combine these technologies to implement a solution that has both HA and DR capabilities.
For many general-purpose use cases, customers can work with Amazon Relational Database Service (Amazon RDS) for SQL Server which is a managed service, that makes it easier to setup, operate and scale SQL Server on AWS. Customers can also run and manage their own SQL Server databases on Amazon Elastic Compute Cloud (Amazon EC2).
In this blog series, we compare and contrast the DR solutions available for SQL Server on Amazon EC2 and help you to understand the nature of these trade-offs, and the cost and complexity of implementing DR for SQL Server workloads on AWS. We plan to write a similar series for Amazon RDS for SQL Server, you can learn more about Amazon RDS resilience in the User Guide.
This post introduces the technical terminology and DR solutions available for different SQL Server versions and editions. In part 2, we explore SQL Server backup and restore, SQL Server log shipping, and SQL Server database mirroring. In part 3, we dive deep on SQL Server Always on Failover Cluster Instances (FCIs) and Always On availability groups. Finally, in part 4, we introduce AWS Database Migration Service (AWS DMS), CloudEndure Disaster Recovery, and we wrap up all technologies discussed in the series.
We use the following terms throughout this series of posts:
- High availability (HA) – HA protects against data center, Availability Zone, server, network, and storage subsystem failures to keep your business running without downtime.
- Disaster recovery (DR) – DR is about preparing for and recovering from a disaster. Any event that has a negative impact on your business continuity or finances could be termed a disaster. This could be hardware or software failure, a network outage, a power outage, physical damage to a building like fire or flooding, human error, or some other disruptive incident.
- Recovery Time Objective (RTO) – RTO is the maximum acceptable delay (defined by your organization) between the interruption of service and restoration of service. This determines what is considered an acceptable time window when service is unavailable.
- Recovery Point Objective (RPO) – RPO is the maximum acceptable amount of time (defined by your organization) since the last data recovery point. This determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
- Failover – The process where the service is automatically transitioned to standby node in case the primary node is no longer accessible to applications. Failover does not require any user intervention.
- Switchover – In contrast to failover, a switchover is the process where the service is intentionally transitioned from primary node to a standby node (new primary node) for testing and scheduled maintenance. Switchover does require user intervention.
- Failback – Failback is the reverse of failover or switchover once the original node recovered to healthy state and ready to resume service. Failback is always user initiated.
- Active-passive – With an active-passive cluster configuration where service is always delivered by a single node (primary node) unless there is an outage. The service transfers to dedicated standby node(s) when outage is detected. This transfer of service could be via automated failover or user-initiated switchover
- Active-active – With an active-active cluster configuration, where service is delivered by both primary and secondary nodes at the same time. In this configuration doesn’t require a failover if only one of the active nodes becomes unavailable. If one active node is unavailable, the other active nodes continues to be available.
- Pilot light – Replicate your data from one region to another and provision a copy of your core workload infrastructure. Resources required to support data replication and backup such as databases and object storage are always on. Other elements such as application servers are loaded with application code and configurations, but are switched off and are only used during testing or when Disaster Recovery failover is invoked.
- Warm standby – A warm standby maintains a scaled-down but fully functional version of your workload always running in the DR Region. Business-critical systems are fully duplicated and are always on, but with a scaled-down fleet. When the time comes for recovery, the system is scaled up quickly to handle the production load. The more scaled up the warm standby is, the lower RTO and control plane reliance. When scaled up to full scale, this is known as a hot standby.
- Cold standby – A cold standby maintains a redundancy of your primary node. The standby node is in shutdown mode and It needs to be turned on manually to keep sync with the primary node.
- Synchronous replication – Replication is copying of data changes from one database to another database. Most synchronous replication products write data to primary storage and the replica simultaneously. As such, the primary copy and the replica should always remain synchronized.
- Asynchronous replication – Asynchronous replication products copy the data to the replica after the data is already written to the primary storage. The replication process runs with minimum latency in relation to a secondary replica. If the primary is configured as asynchronous mode, it only allows manual failover. In case of a failover event, some data loss may occur.
The following table summarizes what you should expect from HA and DR.
|High Availability||Disaster Recovery|
|Goal/Intention||Retaining Service/Service Availability||Retaining Data/Service Continuity|
SQL Server version and edition requirements
SQL Server installation requirements vary based on your application needs. The different editions of SQL Server accommodate the unique performance, runtime, and price requirements of organizations and individuals. The SQL Server components that you install also depend on your specific requirements. The following tables help you understand how to make the best choice among the editions and components available in SQL Server.
The following table summarizes DR solutions by SQL Server version. We will explore these options in this blog series.
|Solution||SQL 2008||SQL 2008 R2||SQL 2012||SQL 2014||SQL 2016||SQL 2017||SQL 2019|
|SQL Server Backup and Restore||Yes||Yes||Yes||Yes||Yes||Yes||Yes|
|SQL Server Log Shipping||Yes||Yes||Yes||Yes||Yes||Yes||Yes|
|SQL Server Database Mirroring (Deprecated)||Yes||Yes||Yes||Yes||Yes||Yes||Yes|
|SQL Server Always On Failover Cluster Instances||Yes||Yes||Yes||Yes||Yes||Yes||Yes|
|SQL Server Always On Availability Groups||No||No||Yes (4 replicas)||Yes (8 replicas)||Yes (8 replicas)||Yes (8 replicas)||Yes (8 replicas)|
|SQL Server Distributed Always On Availability Groups||No||No||No||No||Yes (max 18 replicas between 2 AGs)||Yes (max 18 replicas between 2 AGs)||Yes (max 18 replicas between 2 AGs)|
|CloudEndure Disaster Recovery||Yes||Yes||Yes||Yes||Yes||Yes||Yes|
The following table summarizes DR solutions by SQL Server edition.
|Solution||HA||DR||Enterprise Edition||Standard Edition||Web Edition|
|SQL Server Backup and Restore||No||Yes||Yes||Yes||Yes|
|SQL Server Log Shipping||No||Yes||Yes||Yes||Yes|
|SQL Server Database Mirroring (Deprecated)||Yes||Yes||Yes||Yes (Full safety only)||Yes (witness only)|
|SQL Server Always On Failover Cluster Instances||Yes||No||Yes (max 16 replicas)||Yes (2 replicas)||No|
|SQL Server Always On Availability Groups||Yes||Yes||Yes (max 8 replicas)||Yes (2 replicas as Basic AG)||No|
|SQL Server Distributed Always On Availability Groups||Yes||Yes||Yes (max 18 replicas between 2 AGs)||No||No|
|CloudEndure Disaster Recovery||No||Yes||Yes||Yes||Yes|
The choice of disaster recovery solution depends on your requirements and budget, as well as on the version and edition of SQL Server product you install. In some use cases, it’s even beneficial to combine multiple DR technologies.
In this post, we discussed DR terminology, and also SQL versions and editions to implement DR capabilities with AWS. In our next post, we talk through SQL backup and restore, SQL log shipping, and SQL database mirroring technologies for DR use cases.
About the Author
Ganapathi Varma Chekuri is a Database Specialist Solutions Architect at AWS. Ganapathi works with AWS customers providing technical assistance and designing customer solutions on database projects, helping them to migrate and modernize their existing databases to AWS cloud.
Baris Furtinalar is a Senior Solutions Architect who is part of specialists in Microsoft architectures team at AWS. He is passionate about cloud computing and he believes this shift to the cloud helps companies transform their businesses, gain agility, and improve operational responsiveness. He comes from a diverse background including SQL database administration, virtualization and system security. He designed, implemented and supported Windows/SQL server deployments since 2000.