Architect a disaster recovery for SQL Server on AWS: Part 4
In this series of posts (Part 1, Part 2, Part 3 and Part 4) , we compare and contrast the disaster recovery (DR) solutions available for SQL Server on Amazon Elastic Compute Cloud (Amazon EC2) and help you understand the nature of these trade-offs, and the cost and complexity of implementing DR for SQL Server workloads on AWS. This post introduces two additional methods for implementing DR for SQL Server on AWS: AWS Database Migration Service (AWS DMS) and CloudEndure Disaster Recovery. Finally, we wrap up all the technologies we discussed in these posts to give you a holistic view of your DR options for SQL Server on AWS.
AWS Database Migration Service
AWS DMS supports live migration of data to and from AWS, including a different Region. You can use this feature to set up a separate SQL Server instance in a different Region to serve as a DR database. Although you can use other options to set up a DR site for SQL Server on EC2, such as log shipping and Always On availability groups (AGs), AWS DMS provides a low-cost, simple, all-native approach. AWS DMS manages the complexities of the migration process, including automatically replicating data changes that occur in the source database to the target database. AWS DMS supports full load migration and change data capture (CDC), thereby enabling continuous data replication. Another benefit when using AWS DMS is that unlike physical replication, which works on the database instance level, AWS DMS is a logical replication solution and you can migrate a subset of your database such as schemas or a set of tables.
RPO and RTO
The Recovery Point Objective (RPO) of the DR strategy depends on how quickly AWS DMS consumes the captured changes that are available from the source SQL Server instance and applies them to the target. In the DR strategy, the target SQL Server is active so that you could have a near-zero Recovery Time Objective (RTO).
- During ongoing replication, it’s critical to identify the network bandwidth between your source database system and your AWS DMS replication instance. Make sure that the network doesn’t cause any bottlenecks during ongoing replication.
- It’s also important to identify the rate of change and archive log generation per hour on your source database system. Doing this can help you understand the throughput that you might get during ongoing replication.
- If you intend to run multiple concurrent replication tasks, consider using one of the larger replication instances. A larger instance can be a good idea because the service consumes a fair amount of memory and CPU.
- If you plan to use ongoing replication, set the Multi-AZ option when you create your replication instance. By choosing Multi-AZ, you get high availability and failover support for the replication instance. However, this option can have an impact on performance and can slow down replication while applying changes to the target system.
- For a use case in which AWS DMS replicates the data reading changes from backup only and when you’re taking log backups on the secondary replica in the Always On AG, you need to set an extra connection attribute. This enables AWS DMS to poll all the nodes in the Always On cluster for transaction backups. For example, see the following code:
For more information about performance considerations, see Improving the performance of an AWS DMS migration.
The following diagram illustrates the architecture of this solution.
In this scenario, you create an AWS DMS instance in the AWS account. Then you provide database connection information to connect to both the on-premises and AWS database instances. Next, you select which tables, schemas, or databases you want to migrate, and AWS DMS loads the data and keeps it in sync on an ongoing basis. Finally, when you’re ready to cut over, you simply change the application configuration pointed to the new AWS database, instead of the existing on-premises database.
Failover and failback
Failover in AWS DMS is a manual process. After the source and target database are synchronized and when the application is ready for cutover, the replication task is stopped and applications are simply pointed to connect with SQL Server running on Amazon EC2.
If a failback is required, your production databases are no longer in sync with the DR databases. Any changes that occurred on the target database post-cutover are ignored or lost, or those changes have to be manually re-applied on the source database. You have to reconfigure the replication task from the DR database to source database.
This solution has the following benefits:
- Simple to use
- Migration with minimal downtime
- Supports widely used databases
- Allows replication between different SQL Server editions and versions
- Allows replication between different AWS accounts
- Ongoing replication lowers the risk of data loss
Highly resilient and self-healing
AWS DMS is a low-cost service. You only pay for the compute resources used during the migration process and any additional log storage. If you’re running a SQL Server database using Enterprise edition and have a requirement to offload read workloads to secondary read replicas, you can use AWS DMS to replicate data to a secondary database using SQL Standard edition to reduce licensing cost.
CloudEndure Disaster Recovery
You can use different options like log shipping, Always On AGs, and AWS DMS to set up a disaster recovery for SQL Server databases.
However, these options require duplication of infrastructure, operating systems, and SQL Server software, rendering it a highly expensive IT project that is also a challenge to maintain. In addition to incurring large expenses, this approach has another disadvantage: it can replicate only the database itself, not the other components of the application (critical to recovering the entire application). On the other hand, CloudEndure Disaster Recovery is a software as a service (SaaS) application that manages the replication of servers from any infrastructure to AWS for DR purposes.
CloudEndure Disaster Recovery is an agent-based solution that replicates entire virtual machines, including the operating system, all installed applications, and all databases into a staging area. The staging area contains low-cost resources automatically provisioned and managed by CloudEndure Disaster Recovery. This greatly reduces the cost of provisioning duplicate resources. Because the staging area doesn’t run a live version of your workloads, you don’t need to pay for duplicate software licenses or high-performance compute. Rather, you pay for low-cost compute and storage. The fully provisioned recovery environment, with the right-sized compute and higher-performance storage required for recovered workloads, is only launched during a disaster or drill.
RPO and RTO
CloudEndure Disaster Recovery recovers each source machine running Microsoft SQL Server databases to a single EC2 instance, with subsecond RPO and an RTO of minutes.
CloudEndure Disaster Recovery helps achieve your RPO goals by continuously replicating the machine to the staging environment and allowing the failover target to launch with the last consistent state of the machine before a disaster. CloudEndure Disaster Recovery recovers each source machine running Microsoft SQL Server databases to a single EC2 instance, with subsecond recovery point objectives (RPOs). RTO is the time it takes for a target instance to launch and become available after a disaster. It recovers with an RTO of minutes.
Testing your DR plan in action is essential, but is often neglected. Many organizations don’t test on a regular basis because their failover procedures are too complex and there are concerns that failover tests will lead to a disruption of their production environment or even data loss. Despite these concerns, it’s important to schedule regular failover tests to your DR site. With the highly automated solution of CloudEndure Disaster Recovery, you can quickly and easily conduct frequent, non-disruptive drills that launch your SQL Server databases and all other workloads in your target Region without impacting your source environment.
For more information about best practices, see Best practices for migrating Microsoft SQL Server databases to Amazon EC2 using CloudEndure Migration.
The following diagram illustrates the architecture for this solution.
In this scenario, CloudEndure Disaster Recovery continuously replicates data on your on-premises source servers to a low-cost staging area in your target Region. When you launch source servers for DR testing, you’re launching a copy of your servers in your target Region from the point in time you select. You can perform secure tests in an isolated environment in your CloudEndure Disaster Recovery Blueprint, which defines the settings and configuration of your launched test instances. CloudEndure Disaster Recovery automatically provisions the resources needed to launch your servers on AWS. To perform actual failover, you should launch the instance in recovery mode.
Failover and failback
After the source infrastructure and target Region are fully synced and in continuous replication mode, you can choose to trigger failover to recovery mode or perform a recovery test. When you trigger failover, CloudEndure Disaster Recovery uses Amazon EC2 APIs to launch new target instances into a new or existing VPC. CloudEndure Disaster Recovery also uses the EC2 APIs to attach new Amazon EBS volumes to those instances. CloudEndure Disaster Recover uses Amazon EBS snapshots, taken from the volumes in the staging area that are storing the replicated data, to create the new target volumes.
In case of actual DR, CloudEndure Disaster Recovery triggers a highly automated machine conversion process and a scalable orchestration engine that can spin up machines in the target Region within minutes. This process enables you to achieve RTO in minutes.
When a disaster is over, CloudEndure Disaster Recovery provides an automated failback process to the original source Region. Failing back transitions the workloads running in the target Region back to the source Region without data loss.
Using CloudEndure Disaster Recovery has the following benefits:
- One disaster recovery solution for all SQL Server versions and editions, third-party software, and operating systems
- Crash-consistent technology keeps data consistent 100% of the time
- No need to pre-build DB infrastructure
- Significant savings on DR
- Subsecond RPO and RTO of minutes
- Increased reliability and high tolerance for network instability
- Support of all source infrastructure, including on-premises physical and virtual infrastructure, other public clouds, and other Regions
- Enterprise-grade security
The DR machines aren’t started, and you don’t need to pay for duplicate software licenses or high-performance compute.
The staging area contains low-cost EBS volumes, which reduces the cost of provisioning duplicate resources.
The choice of disaster recovery solution depends on your requirements and budget, as well as on the version and edition of SQL Server you install. For some use cases, it’s beneficial to combine multiple DR technologies. Testing your DR technology and performing failover and failback tests regularly after these technologies are implemented in production is crucial to find and fix any issues and to make sure your desired RTO and RPO values are met.
The following table summarizes the RPO and RTO metrics that you can attain with DR technologies for SQL Server on Amazon EC2.
|DR Technology||RPO||RTO||Automatic Failover||Readable Secondaries||Failback||Ease of Access||Endpoint Change in Failover**||Resource|
|SQL Server Backup and Restore||Hours||Hours/Days||No (Restore on target SQL Server)||No||Reconfigure||Available in all SQL editions||Yes||Database level|
|SQL Server Log Shipping||Minutes/Hours||Minutes/Hours||No||Yes (with standby)||Reconfigure||All except Express edition||Yes||Database level|
|SQL Server Database Mirroring – High Performance Mode (Asynchronous)||Seconds/Minutes||Minutes||No (Simplified)*||No||Simplified*||Standard edition||No||Database level|
|SQL Server Always On Failover Cluster Instances||Zero***||Seconds/Minutes||Yes||No||Simplified*||Standard and Enterprise editions||No||Instance level|
|SQL Server Always On Availability Group – Asynchronous Commit||Seconds/Minutes||Minutes||No (Simplified)*||Yes||Simplified*||Enterprise edition||No||Database level|
|SQL Server Always On Availability Group – Synchronous Commit||Zero***||Seconds||Yes||Yes||Simplified*||Enterprise edition||No||Database level|
|SQL Server Always On Distributed Availability Groups – Asynchronous Commit||Seconds/Minutes||Minutes||No (Simplified)*||Yes||Simplified*||Enterprise edition||Yes||Database level|
|AWS DMS||Seconds/Minutes||Minutes||No||Yes||Reconfigure||All SQL Editions||Yes||Table level|
|CloudEndure Disaster Recovery||Seconds/Minutes||Minutes||No (Simplified)||No||Simplified*||All SQL editions||Yes||Block level|
*Operation does not require additional configuration or setup
**Application reconfiguration is required to connect to the database after failover
***Not guaranteed by vendor
+Crash recovery times not included—typically a few milliseconds to a few minutes depending on workloads
In this post, we discussed how you can use AWS for SQL Server to enhance your DR capabilities. We reviewed several DR technologies and use cases to achieve the desired business objectives under certain scenarios. In a future blog series, we will do the same analysis for Amazon RDS for SQL Server.
About the Author
Ganapathi Varma Chekuri is a Database Specialist Solutions Architect at AWS. Ganapathi works with AWS customers providing technical assistance and designing customer solutions on database projects, helping them to migrate and modernize their existing databases to AWS cloud.
Baris Furtinalar is a Senior Solutions Architect who is part of specialists in Microsoft architectures team at AWS. He is passionate about cloud computing and he believes this shift to the cloud helps companies transform their businesses, gain agility, and improve operational responsiveness. He comes from a diverse background including SQL database administration, virtualization and system security. He designed, implemented and supported Windows/SQL server deployments since 2000