AWS Storage Blog
Designing sustainable disaster recovery strategies
Disaster recovery (DR) is an important part of resilience and defines the process of preparing for and recovering from a disaster. A disaster can be any event that causes a serious negative impact on your business. How you respond to these unforeseen events has traditionally been a tradeoff between the cost of the solution, the complexity of operations, the amount of tolerable data loss, known as the Recovery Point Objective (RPO), and the amount of acceptable downtime, known as the Recovery Time Objective (RTO). In recent years, there has been an increased focus on incorporating sustainability into the design of DR strategies in order to account for the environmental impact of our decisions.
At AWS, we strive to build a sustainable business for our users and for the world we share. Our sustainability efforts include enhancing energy efficiency, transitioning to renewable energy, reducing embodied carbon, and using water responsibly. We focus on efficiency across all aspects of our infrastructure, from the design of our data centers and hardware to modeling the performance of our operations. By continuously improving our efficiency, we can reduce the amount of energy needed to operate our data centers. However, that is not enough, as sustainability is a shared responsibility between AWS and you, our customers. We optimize for sustainability of the cloud by delivering efficient, shared infrastructure, practicing water stewardship, and sourcing renewable power. You, our customers, are responsible for sustainability in the cloud by optimizing workloads and resource usage and by minimizing the amount and types of resources that must be deployed for your business needs.
In 2020, Peter DeSantis, VP of AWS Global Infrastructure, stated: “The greenest energy is the energy we don’t use.” In this blog, we review how designing and implementing a DR strategy can impact sustainability. To reduce the amount of energy used to operate your DR solution, you need to use fewer resources and use those resources more efficiently. We show key decision points that can improve your sustainability posture, compare and contrast the sustainability characteristics of various DR strategies, and discuss the sustainability benefits offered by AWS Elastic Disaster Recovery. At AWS, we believe that a DR strategy can both meet your business continuity needs and reduce your impact on the environment.
Note: For simplification, we consider cost to be a proxy for energy expenditure throughout this blog. By tracking and measuring cost, we can compare and analyze resource and usage efficiency of our design choices.
Integrating sustainability into DR planning
AWS provides several resources that can help you integrate sustainability into your DR planning. Among these are the design principles for sustainability in the cloud and the sustainability pillar of the AWS Well-Architected Framework. Sustainability is an important discipline that should be considered throughout your application or workload's lifecycle. When it comes to DR, there are a few key decision points that can have an outsized impact on your sustainability goals and outcomes.
- On-premises as opposed to cloud: The first decision point for DR is where you plan to host your recovery site: on premises or in the cloud. This decision can have a large impact on achieving your sustainability goals. A recovery site hosted on premises requires you to procure, install, power, and operate the infrastructure needed to meet your DR objectives, even when there is no impairment or disaster present. Provisioning secondary site resources in advance can result in a poor sustainability posture, as these sites are often built for maximum capacity and load. There are several advantages to designing and hosting your secondary site in the cloud. AWS offers advanced expertise and capabilities in operating energy-efficient infrastructure at scale. Another benefit is the ability to provision resources on demand, using the elasticity of the cloud. With this model, you create resources only when needed, such as during an actual disaster event or when performing a recovery drill.
- Region selection: You should choose AWS Regions for your DR workloads based on both business requirements and sustainability goals. For business requirements, consider proximity to your users, regulatory and compliance requirements, and the ability to withstand a disruption that may impact your primary site. Your primary responsibility is to make sure that your DR strategy provides the ability to recover your workload in a recovery site if your primary site becomes unavailable. You can factor sustainability into your design process by placing DR workloads closer to Amazon renewable energy projects or in AWS Regions with low published carbon intensity. In 2023, 100% of the electricity consumed by Amazon was matched with renewable energy sources.
- Multi-Availability Zone (AZ) as opposed to multi-Region: In AWS, an AZ is an isolated location within an AWS Region, with redundant power, networking, and connectivity. AZs give users the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center. A DR strategy across multiple AZs within a single AWS Region can mitigate disruptions such as fires, floods, and major power outages. If you must protect against the unlikely event that an entire AWS Region becomes unavailable, you can opt for a DR strategy that uses multiple AWS Regions. The choice between a multi-AZ and a multi-Region DR strategy can impact sustainability. A multi-Region strategy requires data and application traffic to travel a longer distance and hence requires a greater amount of infrastructure and power to support it.
- Other factors: You should consider the following when developing your DR strategy with sustainability in mind.
- Reducing idle resources and maximizing use: Depending on your approach to DR, you might have the opportunity to right-size resources with AWS Cost Explorer. When you decide on an instance type, consider the requirements of your workload, with the goal of avoiding over-provisioning. There is no need to create production resources in the recovery site until an actual disaster is declared or a recovery drill is performed. Use AWS Compute Optimizer for right-sizing recommendations for workloads in your recovery site.
- Scaling resources dynamically: With certain DR approaches, such as warm standby or active-active, you can dynamically create resources to meet demand rather than statically provisioning them in advance. Consider setting up Auto Scaling to manage resources dynamically and optimize usage based on demand fluctuations (see the Auto Scaling sketch after this list). Make this design choice with the understanding that it introduces a dependency on control plane operations to request and provision new resources. If you use AWS serverless services, AWS automatically scales them based on demand, making sure that resources are only used when needed. This dynamic scaling helps organizations avoid paying for idle resources, reducing both environmental footprint and costs.
- Consider the right storage tier: Understanding and classifying your data in levels or tiers based on business importance is a crucial step in defining RPO and RTO objectives. Using different storage tiers is a way to optimize storage for sustainability. You can optimize your storage footprint by storing less volatile data on technologies designed for efficient long-term storage (see the lifecycle configuration sketch after this list). In general, you make a tradeoff between resource efficiency, access latency, and reliability when considering these storage mechanisms. Moreover, consider data that can be recreated when designing your backup plan. If your data is easy to recreate, you may not need to back up or replicate it at all, but keep in mind that recreating the data could have its own impact on your sustainability posture. You can find detailed guidance for different storage types in the post Optimizing your AWS Infrastructure for Sustainability, Part II: Storage.
- Deployment processes: Infrastructure-as-Code (IaC) can help reduce energy consumption by automating the process of provisioning and deprovisioning infrastructure. This helps make sure that resources are only used when they are needed and are not left running idle. Use tools such as AWS CloudFormation or the AWS Cloud Development Kit (AWS CDK) to define your IaC (see the CDK sketch after this list).
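As a minimal sketch of the dynamic scaling idea above, the following example attaches a target tracking scaling policy to an Auto Scaling group so capacity follows demand instead of being provisioned statically. The group name dr-app-asg and the 60% CPU target are illustrative assumptions, not values from this post.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Track average CPU so the fleet grows and shrinks with demand,
# keeping instances busy instead of sitting idle.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="dr-app-asg",   # hypothetical Auto Scaling group
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,             # assumed utilization target
    },
)
```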
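To illustrate storage tiering for backup data, here is a hedged sketch of an Amazon S3 lifecycle configuration that moves older backup objects to colder, more efficient storage classes and eventually expires them. The bucket name, prefix, and day thresholds are assumptions for illustration; align them with your own retention requirements.

```python
import boto3

s3 = boto3.client("s3")

# Transition aging backups to colder storage classes and expire them
# once they fall outside the retention window.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-dr-backup-bucket",   # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-backups",
                "Filter": {"Prefix": "backups/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "GLACIER"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "Expiration": {"Days": 365},  # assumed retention period
            }
        ]
    },
)
```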
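Finally, a minimal AWS CDK (Python) sketch of the deployment-process idea: keep only the always-on pieces of the recovery site (here, a VPC) defined as code, so compute can be created on demand during a drill or failover and torn down afterwards. The stack name and target Region are assumptions.

```python
from aws_cdk import App, Environment, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct


class RecoveryNetworkStack(Stack):
    """Always-on core of the recovery site; compute is added only at failover."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # A small VPC is the only resource kept running between drills.
        ec2.Vpc(self, "RecoveryVpc", max_azs=2)


app = App()
RecoveryNetworkStack(
    app,
    "RecoveryNetworkStack",
    env=Environment(region="us-west-2"),  # assumed recovery Region
)
app.synth()
```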
Understanding the sustainability impact of DR strategies
When designing a DR plan across multiple sites, the AWS Well-Architected Framework identifies four resilience strategies: Backup and Restore, Pilot Light, Warm Standby, and Multi-site Active-Active. These approaches are listed in increasing order of cost and complexity and decreasing order of RTO and RPO (where lower is better). At AWS, we believe cost is a close proxy for measuring and comparing sustainability outcomes: a solution that provisions more resources in advance also has a higher energy expenditure. To determine which resilience strategy to use for a given workload, we recommend you start with the recovery objectives. Identify what RPO and RTO values are acceptable for the workload and choose the strategy that can meet those objectives. Keep in mind that you can use all four of these strategies in tandem across multiple workloads in the same organization to meet business objectives and application requirements.
- Backup and restore: a strategy that provides an RPO measured in hours and typically provides an RTO of 24 hours or less. With this DR strategy, you back up your data and applications into a recovery Region (see the AWS Backup sketch after this list). You can use automated or continuous backups to lower RPO to a few minutes and support point-in-time recovery (PITR), which helps you mitigate the risks of data corruption and ransomware. In the event of a disaster, you deploy your infrastructure (using IaC to reduce RTO), deploy your code, and restore the backed-up data to recover the workload in the recovery Region. Backup and restore is the least complex strategy to implement, but it requires more time and effort to restore the workload, leading to higher RTO and RPO. When it comes to sustainability, backup and restore has the advantage of requiring only a limited amount of infrastructure to deploy and maintain until an actual disaster event. However, even when you minimize infrastructure with a backup and restore approach, you still need to consider sustainability factors such as data retention, data tiering, and the amount of data backed up.
- Pilot light: typically provides an RPO of minutes and an RTO measured in tens of minutes. With a pilot light approach, you provision a copy of your core workload infrastructure in the recovery Region (or secondary site) and replicate your data into that Region. Resources required to support data replication and backup, such as databases and object storage, are always on. Other elements, such as application servers or serverless compute, are not deployed, but you can create them when needed with the necessary configuration and application code. Pilot light can help reduce carbon emissions by limiting active resources to only those needed to make sure that data is replicated and maintained in the recovery site. Pilot light benefits greatly from running on AWS, where infrastructure is provisioned on an as-needed basis, compared to a secondary site located on premises where the resources are provisioned in advance. A service such as Elastic Disaster Recovery can further enhance the pilot light approach by providing better RPO and RTO than traditional pilot light implementations while also providing an improved sustainability outcome (see the following section).
- Warm standby: can provide an RPO measured in seconds and an RTO measured in minutes. With warm standby, you maintain a scaled-down but fully functional version of your workload always running in the recovery site, with data replicated and live. When the time comes for recovery, you scale the system up quickly to handle the production load (see the scale-up sketch after this list). The more scaled-up the warm standby is, the lower the RTO, and when fully scaled it is known as hot standby. From a sustainability point of view, you create a better outcome if you limit the amount of infrastructure that sits active with low use. As a result, hot standby is one of the least sustainable options, because the infrastructure is fully scaled up while carrying little or no active load.
- Multi-Region (multi-site) active-active: can provide an RPO near zero and an RTO of potentially zero. In this approach, your workload is deployed to, and actively serving traffic from, multiple AWS Regions. Users often select this strategy for reasons other than DR, such as increasing availability or improving performance for a global audience. As a DR strategy, you can mitigate an impairment or disaster affecting one AWS Region by redirecting users away from the impacted AWS Region and serving them from one of the remaining AWS Regions (see the routing sketch after this list). Multi-site active-active is the most operationally complex of the DR strategies, and you should only select it when business requirements necessitate it. From a sustainability point of view, multi-Region active-active potentially consumes the greatest amount of infrastructure. However, some design choices can make it more efficient. First, with a multi-Region active-active design, all AWS Regions are actively serving traffic, so resources are actively in use, unlike a hot standby option. In addition, Auto Scaling can help make sure that supply and demand are balanced to avoid waste and increase efficiency. Finally, by serving traffic closer to users, you limit the distance traveled and the infrastructure required to serve those requests.
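As a hedged sketch of the backup and restore strategy above, the following example uses AWS Backup to schedule daily backups, tier older recovery points to cold storage, and copy them to a vault in a recovery Region. The vault names, account ID, schedule, and retention values are illustrative assumptions.

```python
import boto3

backup = boto3.client("backup")

# Daily backups in the primary Region, tiered to cold storage after 30 days
# and copied to a vault in the recovery Region.
backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "dr-daily-backups",          # hypothetical plan name
        "Rules": [
            {
                "RuleName": "daily-with-cross-region-copy",
                "TargetBackupVaultName": "primary-vault",  # assumed vault
                "ScheduleExpression": "cron(0 3 * * ? *)",
                "Lifecycle": {
                    "MoveToColdStorageAfterDays": 30,
                    "DeleteAfterDays": 365,
                },
                "CopyActions": [
                    {
                        # Assumed recovery-Region vault ARN (placeholder account ID).
                        "DestinationBackupVaultArn": (
                            "arn:aws:backup:us-west-2:111122223333:"
                            "backup-vault:recovery-vault"
                        ),
                        "Lifecycle": {"DeleteAfterDays": 90},
                    }
                ],
            }
        ],
    }
)
```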
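For warm standby, the recovery-time scale-up can be as simple as raising the capacity of the standby Auto Scaling group from its scaled-down size to production size. This sketch assumes a hypothetical group named warm-standby-asg in the recovery Region and illustrative capacity values.

```python
import boto3

# Assumed recovery Region for the standby fleet.
autoscaling = boto3.client("autoscaling", region_name="us-west-2")

# During failover, grow the scaled-down standby fleet to production capacity.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="warm-standby-asg",  # hypothetical group
    MinSize=4,
    DesiredCapacity=8,
    MaxSize=16,
)
```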
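For multi-Region active-active, traffic steering is often handled with DNS. The following sketch upserts a latency-based Amazon Route 53 alias record for one Region; with EvaluateTargetHealth enabled, traffic shifts away from a Region whose endpoint becomes unhealthy. The hosted zone IDs, domain name, and load balancer DNS name are placeholders, not values from this post.

```python
import boto3

route53 = boto3.client("route53")

# Latency-based alias record for the us-east-1 endpoint; create a similar
# record per Region so users are served from the closest healthy one.
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",            # hypothetical hosted zone
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "A",
                    "SetIdentifier": "us-east-1",
                    "Region": "us-east-1",
                    "AliasTarget": {
                        "HostedZoneId": "Z0987654321EXAMPLE",  # placeholder ELB zone ID
                        "DNSName": "primary-alb-123.us-east-1.elb.amazonaws.com",
                        "EvaluateTargetHealth": True,
                    },
                },
            }
        ]
    },
)
```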
Sustainable DR with AWS Elastic Disaster Recovery
If you are considering either a pilot light or warm standby strategy for DR, Elastic Disaster Recovery may provide an alternative approach with improved benefits. Elastic Disaster Recovery can offer RPO and RTO values similar to warm standby while maintaining the lower cost and more sustainable footprint of pilot light. Elastic Disaster Recovery replicates your data from your primary AWS Region to your recovery AWS Region, using continuous, block-level data protection to achieve an RPO measured in seconds and an RTO measured in minutes. Only the resources required to replicate the data are deployed in the recovery AWS Region, which keeps costs down and drives a more sustainable approach, similar to the pilot light strategy. When using Elastic Disaster Recovery, the service coordinates and creates compute resources only when a recovery is initiated as part of a failover or drill, making sure that idle resources are limited.
Elastic Disaster Recovery is designed with sustainability and cost savings in mind. With Elastic Disaster Recovery, the staging infrastructure is designed to optimize compute usage. Replication servers are lightweight Amazon Elastic Compute Cloud (Amazon EC2) instances used to replicate data between your source servers and AWS, and they are automatically launched and terminated as needed. By default, Elastic Disaster Recovery provisions a t3.small instance as the replication server, and the ratio of volumes to replication servers can be as high as 15:1. Elastic Disaster Recovery attempts to use the fewest compute resources possible in order to be efficient and cost effective. From a storage perspective, Elastic Disaster Recovery defaults to automatically selecting the most cost-effective Amazon Elastic Block Store (Amazon EBS) volume type for each disk based on the disk size and type. Elastic Disaster Recovery also lets you select the number of days for which point-in-time (PIT) snapshots are retained; the default retention policy is seven days, but you can reduce it. Finally, Elastic Disaster Recovery supports both multi-AZ and multi-Region approaches for DR.
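To show how compute stays idle until it is actually needed, here is a hedged sketch that launches a recovery drill with the Elastic Disaster Recovery API. The drill launches recovery instances from replicated data without affecting the source servers; the source server ID and recovery Region are hypothetical placeholders.

```python
import boto3

# Client for AWS Elastic Disaster Recovery in the assumed recovery Region.
drs = boto3.client("drs", region_name="us-west-2")

# Launch drill instances for a replicated source server; no recovery compute
# runs in the recovery Region until this call is made.
response = drs.start_recovery(
    isDrill=True,
    sourceServers=[{"sourceServerID": "s-1234567890abcdef0"}],  # hypothetical ID
)
print("Started recovery job:", response["job"]["jobID"])
```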
In summary, Elastic Disaster Recovery provides a cloud-native approach to DR that minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications, using affordable storage, minimal compute, PIT recovery, and lower carbon emissions.
Conclusion
Disaster recovery (DR) plays an important role in resilience by defining the process for preparing for and recovering from a disaster. When choosing a DR strategy, we often consider the tradeoff between the cost of the solution, the complexity of operations, the amount of tolerable data loss (measured as RPO), and the amount of acceptable downtime (measured as RTO). Adding sustainability practices to your DR plan is important for serving your users long term and contributes to a better future. We recommend using cost as a close proxy to measure and understand the sustainability impact of your decisions.
In this post, we reviewed the impact on sustainability when designing and implementing a DR plan. We showed how to reduce the amount of energy spent on DR by using fewer resources where possible, or by using resources more efficiently. We shared key decision points that can improve your sustainability posture, compared and contrasted the sustainability characteristics of various DR strategies, and concluded by showing the sustainability benefits offered by AWS Elastic Disaster Recovery.
Now is the time to integrate sustainable practices into your resilience strategy for a more environmentally conscious future. It’s never too early to review your current or future DR needs while also considering their impact on our environment. Start by using the energy-efficient infrastructure of AWS and adopting resource optimization strategies that can significantly reduce your environmental impact.