AWS Cloud Operations Blog
Learn how the Flexibility of AWS Opens New Doors for Business Continuity
A guide for IT practitioners
The “criticality” of technology that impacts our day-to-day lives is more pertinent and broader reaching than ever before. We’ve become accustomed to reliable, always-on systems, and we see the impact on our lives when things go wrong. Therefore, to meet customer expectations in the face of uncertainty, we, as information technology (IT) practitioners, have to make the systems we build resilient. AWS offers the flexibility to build systems that meet the varying resilience requirements for each customer experience.
Building resilient technology comes at a cost, but technology disruptions are not just inconvenient, they’re costly too. IDC reports $1.25 billion to $2.5 billion in annual downtime cost for the Fortune 1000, and an average cost of $500,000 to $1 million per hour for a critical application failure. Faster recovery from disruptions results in lower business impact cost (see Figure 1, solid line) but requires higher recovery cost (dotted line). This blog post provides a method to understand and balance cost and other impacts with resilience requirements.
Building resilient systems starts with understanding business processes. It’s important to collaborate with business stakeholders. This will assist in understanding risks and in developing technology solutions that support the experiences customers expect. Business Continuity Planning (BCP) is a method to document business processes and create a plan to maintain the processes in the event of a disruptive incident. An important outcome of BCP is to determine a Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each system. Recovery Time Objective (RTO) is the maximum acceptable delay between the interruption of service and restoration of service. This objective determines what is considered an acceptable time window when service is unavailable. Recovery Point Objective (RPO) is the maximum acceptable amount of time since the last data recovery point. This objective determines what is considered an acceptable loss of data between the last recovery point and the interruption of service.
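To make the two objectives concrete, here is a minimal Python sketch that checks a single incident against an RTO and an RPO. The `Incident` class, the timestamps, and the objective values used with it are illustrative assumptions, not values from this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_recovery_point: datetime  # most recent backup or replica checkpoint
    interrupted_at: datetime       # when service stopped
    restored_at: datetime          # when service resumed


def downtime(incident: Incident) -> timedelta:
    """Time the service was unavailable (compared against the RTO)."""
    return incident.restored_at - incident.interrupted_at


def data_loss_window(incident: Incident) -> timedelta:
    """Time between the last recovery point and the interruption (compared against the RPO)."""
    return incident.interrupted_at - incident.last_recovery_point


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> bool:
    """True only if both the downtime and the data loss window are within target."""
    return downtime(incident) <= rto and data_loss_window(incident) <= rpo


# Hypothetical incident: backup at 02:00, outage at 02:45, service restored at 05:15.
example = Incident(
    last_recovery_point=datetime(2024, 1, 1, 2, 0),
    interrupted_at=datetime(2024, 1, 1, 2, 45),
    restored_at=datetime(2024, 1, 1, 5, 15),
)
```

With an assumed RTO of 4 hours and RPO of 1 hour, `meets_objectives(example, timedelta(hours=4), timedelta(hours=1))` passes. Tightening the RTO to 2 hours makes the same incident a miss.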
At Amazon, we work backwards from customer expectations to arrive at an optimal technical design to meet those expectations. The same principle applies to resilience. In this post you’ll learn to use three BCP tools to determine the proper RTO and RPO for a system. By understanding your customer’s experience and using tools such as Risk Assessment (RA), Business Impact Analysis (BIA), and System Impact Analysis (SIA), you can create technology solutions that will meet customer expectations in the face of risks such as power outages or cyberattacks.
Risk Assessment (RA)
Risk Assessment is the first step as you develop a business continuity plan. Begin by creating a list of business processes with your business partners, and document the sub-processes, inputs, and outputs. Include key activities or critical customer journeys. Once the processes have been documented, the next step is to conduct an RA, a process that involves the identification, analysis, and estimated likelihood of all possible risks, hazards, and threats to each process at a high level. These risks could be natural, man-made, or environmental.
It is important to establish the nature of each risk and threat regardless of its type or category. Factors include, but are not limited to:
- Information Technology – Loss of Connectivity, Hardware Failure, Lost/Corrupted Data, Application Failure, Cyber threats
- Utility Outage – Communications, Electrical Power, Water, Gas, Steam, Heating/Ventilation/Air Conditioning, Pollution Control System, Sewage System
- Fire/Explosion – Fire (Structure, Wildland), Explosion (Chemical, Gas, or Process failure)
- Hazardous Materials – Hazardous Material spill/release, Radiological Accident, Hazmat Incident off-site, Transportation Accidents, Nuclear Power Plant Incident, Natural Gas Leak Supply
- Vendor Risk – Supplier Failure, Supply Chain Interruption
To conduct an RA, you use an RA tool that includes fields for describing risks, likelihood, and impact. The US Department of Homeland Security (DHS) provides guidance and tools for conducting an RA (see Figure 2). Similar tools are available from AWS Professional Services and AWS Partners. When using the DHS RA tool, you enter the business operation/process in the first column and the potential hazard in the second column, then complete the remaining columns using the DHS instructions to arrive at an overall hazard rating in the last column. The output of an RA is a list of business processes and related risk data along with the overall hazard rating.
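The shape of an RA output can be sketched as a small risk register in Python. This is not the DHS scoring method: the 1-to-3 scales and the likelihood-times-impact rule are assumptions chosen only to show how a hazard rating lets you rank process and hazard pairs.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    process: str     # business operation/process
    hazard: str      # potential hazard
    likelihood: int  # assumed scale: 1 (low) to 3 (high)
    impact: int      # assumed scale: 1 (low) to 3 (high)

    @property
    def hazard_rating(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact (range 1 to 9).
        return self.likelihood * self.impact


def rank_risks(risks: list[Risk]) -> list[Risk]:
    """Return risks ordered from highest to lowest hazard rating."""
    return sorted(risks, key=lambda r: r.hazard_rating, reverse=True)


# Hypothetical entries for illustration only.
register = [
    Risk("Office visitor check-in", "Hardware failure", likelihood=2, impact=1),
    Risk("Online banking", "Application failure", likelihood=2, impact=3),
    Risk("Online banking", "Electrical power outage", likelihood=1, impact=3),
]

for risk in rank_risks(register):
    print(f"{risk.process:25} {risk.hazard:25} {risk.hazard_rating}")
```

The highest-rated rows are the ones to carry forward into a more detailed analysis.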
Business Impact Analysis (BIA)
The next step is the Business Impact Analysis. Now that you have a list of business processes with hazard ratings you can dive deep into each process and scenario that resulted in higher hazard ratings. You will conduct a BIA for each of the processes identified in the RA (see Figure 2, column 1).
The purpose of the BIA is to understand in more detail the potential impact of any disruption to each business process. BIA determines the potential impact of a disruption from a financial, reputation, operational, customer, and legal/regulatory standpoint for a business process. Additionally, BIA helps predict the consequences of disruption of a business operation/process and gathers information needed to develop recovery strategies.
Use a BIA tool to document the impact over time for a business process disruption. DHS provides guidance and tools for conducting a BIA (see Figure 3). You should complete a separate BIA form for each business process you entered in the risk assessment table. In the first column enter the timing/duration identified in the RA (see Figure 2, column 3), and the corresponding impacts in the second and third columns. The output is a table showing the impact over time for a business process disruption.
Once the impact of a disruption is determined, you can classify business processes into tiers ranging from mission-critical to non-critical (see Figure 4). For example, processes such as online banking with significant customer, reputation, regulatory, and financial impacts might be classified as critical. Lower impact processes such as office visitor check-in might be classified as non-critical. A common approach is to sort the impacts from all the business processes and establish tiers.
The output of the BIA process is a list of business processes and their corresponding impact tiers (see Figure 5).
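Sorting impacts and establishing tiers can be sketched in a few lines of Python. The sketch assumes each process has been reduced to a single estimated impact in USD per hour, which is a simplification because a real BIA also weighs reputation, regulatory, and customer impacts. The threshold values, the intermediate tier names, and the sample figures are hypothetical.

```python
# Assumed tier thresholds, in estimated impact (USD per hour of disruption).
TIER_THRESHOLDS = [
    (100_000, "mission-critical"),
    (10_000, "critical"),
    (1_000, "important"),
]
DEFAULT_TIER = "non-critical"


def assign_tier(impact_per_hour: float) -> str:
    """Return the first tier whose threshold the impact meets or exceeds."""
    for threshold, tier in TIER_THRESHOLDS:
        if impact_per_hour >= threshold:
            return tier
    return DEFAULT_TIER


def tier_processes(impacts: dict[str, float]) -> list[tuple[str, float, str]]:
    """Sort processes by impact, highest first, and attach a tier to each."""
    ordered = sorted(impacts.items(), key=lambda item: item[1], reverse=True)
    return [(name, impact, assign_tier(impact)) for name, impact in ordered]


# Hypothetical BIA results.
bia_impacts = {
    "Office visitor check-in": 50,
    "Online banking": 250_000,
    "Monthly reporting": 2_500,
}

for name, impact, tier in tier_processes(bia_impacts):
    print(f"{name:25} {impact:>10,} {tier}")
```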
System Impact Analysis (SIA)
Now that you have a list of business processes with tiers, you can dive deep into each individual IT system that supports a critical process. You will use the business processes and tiers from the BIA you just did as inputs into the SIA (see Figure 5).
The next step is to understand in more detail the potential impact to each business process of any disruption to an IT system. SIA determines the potential impact of a disruption from a financial, reputation, operational, customer, and legal/regulatory standpoint for an IT system. Additionally, SIA helps predict the consequences of disruption of an IT system and gathers information needed to develop recovery strategies.
Use an SIA tool to document the IT systems that support a business process, list the potential financial, reputational, operational, customer, and/or legal and regulatory impacts, and finally, present recovery options for the system (see Figure 6). You will enter the IT system in the first column and in the next four columns enter the business processes, tiers, and impacts from the BIA (see Figure 5). The columns for cost can be completed using estimated cost from the publicly available AWS Pricing Calculator or actual cost from AWS Cost Explorer or AWS Cost and Usage Report.
The degree of resilience for an Information Technology (IT) system varies and can range from High Availability (HA) workloads with redundant components, to workloads using Disaster Recovery (DR) architecture patterns to recover in minutes to days. Workloads requiring faster recovery utilize strategies and architecture patterns that are more complex and costly. IT systems consist of many components including hardware, operating systems, code, and services comprising the application stack. It’s important to understand the dependencies between different systems and how they impact one another. For example, a data lake disruption will likely impact multiple business processes with cumulative consequential impacts. However, with BIA, the required business justification for the cost is established.
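The cumulative effect of a shared dependency can be sketched with a small dependency map. The system names, process names, and dollar figures below are hypothetical. The sketch only shows how one failed system (here a data lake) can reach several business processes through the systems that depend on it.

```python
# Hypothetical map: each system lists the systems it depends on.
DEPENDS_ON = {
    "reporting-service": ["data-lake"],
    "fraud-detection": ["data-lake"],
    "online-banking-app": ["fraud-detection"],
    "data-lake": [],
}

# Hypothetical map: business process -> (supporting system, impact in USD per hour).
PROCESS_IMPACT = {
    "Monthly reporting": ("reporting-service", 2_500),
    "Online banking": ("online-banking-app", 250_000),
}


def affected_systems(failed: str) -> set[str]:
    """Return the failed system plus every system that depends on it, directly or transitively."""
    affected = {failed}
    changed = True
    while changed:  # repeat until no new dependent system is found
        changed = False
        for system, deps in DEPENDS_ON.items():
            if system not in affected and any(d in affected for d in deps):
                affected.add(system)
                changed = True
    return affected


def cumulative_impact(failed: str) -> int:
    """Sum the hourly impact of every process whose supporting system is affected."""
    hit = affected_systems(failed)
    return sum(impact for system, impact in PROCESS_IMPACT.values() if system in hit)


print(cumulative_impact("data-lake"))          # both processes are affected
print(cumulative_impact("reporting-service"))  # only monthly reporting is affected
```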
The output of the SIA is a table showing the impact over time for an IT system disruption along with the cost for varying levels of recovery for each IT system. For example, a database with replicas in multiple Regions will have a higher cost than a database in a single Region. Note that you can substitute RTO and RPO values for recovery patterns if you wish to perform a more granular analysis (see Figure 7).
Now you can use the impact and cost data from the SIA to choose a level of recovery for any given IT system. For example, if you have a system that shows an impact of $100,000/hr, would you spend $500/hr for a pilot light DR solution or $5,000/hr for a hot standby solution? With the SIA you and business stakeholders have the information to make a decision that best fits the desired business outcomes.
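One way to frame that decision is to compare the total annual cost of each option: the standing cost of the DR solution plus the business impact incurred while recovering. The sketch below uses the $100,000/hr impact and the $500/hr and $5,000/hr costs from the example. The recovery times (4 hours and 15 minutes) and the number of incidents per year are assumptions added for illustration, as is the idea that the hourly DR cost is paid around the clock.

```python
HOURS_PER_YEAR = 24 * 365

IMPACT_PER_HOUR = 100_000  # business impact from the SIA example (USD per hour)

# Cost per hour comes from the example; recovery_hours values are assumptions.
OPTIONS = {
    "pilot light": {"cost_per_hour": 500, "recovery_hours": 4.0},
    "hot standby": {"cost_per_hour": 5_000, "recovery_hours": 0.25},
}


def annual_cost(cost_per_hour: float, recovery_hours: float, incidents_per_year: float) -> float:
    """Standing DR cost for the year plus business impact during recoveries."""
    standing = cost_per_hour * HOURS_PER_YEAR
    disruption = IMPACT_PER_HOUR * recovery_hours * incidents_per_year
    return standing + disruption


def compare(incidents_per_year: float) -> dict[str, float]:
    """Total annual cost of each option for an assumed incident rate."""
    return {
        name: annual_cost(o["cost_per_hour"], o["recovery_hours"], incidents_per_year)
        for name, o in OPTIONS.items()
    }


def break_even_incidents() -> float:
    """Incidents per year at which hot standby costs the same as pilot light."""
    pilot, hot = OPTIONS["pilot light"], OPTIONS["hot standby"]
    extra_standing = (hot["cost_per_hour"] - pilot["cost_per_hour"]) * HOURS_PER_YEAR
    saved_per_incident = IMPACT_PER_HOUR * (pilot["recovery_hours"] - hot["recovery_hours"])
    return extra_standing / saved_per_incident


print(compare(incidents_per_year=1))
print(break_even_incidents())
```

Under these assumptions, pilot light is the cheaper choice unless the system suffers roughly 105 incidents a year. The result is sensitive to the assumed recovery times and to impacts that are not captured in dollars, so treat it as an input to the conversation with business stakeholders rather than an answer.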
How Is Resilience Different with AWS?
Up to this point, the concepts and tools introduced in this blog can be applied to IT systems either on-premises or on AWS. Typically, building redundant on-premises architecture offers limited recovery options and often requires obtaining and maintaining standby critical components such as UPS systems, cooling systems, and backup generators to assure operations can continue even if a component fails. Also, when creating architectures for on-premises data centers, recovery options are limited by the storage, compute, and network technologies in place. For example, RPO may be determined by what the storage vendor offers for data replication or what a datacom provider offers for latency and bandwidth.
With AWS you can tailor architectures to meet your desired customer experience. One approach is to create system tiers (see Figure 8). Each system architecture can be composed of resources and risk mitigation techniques to meet specific RTO/RPO targets. Note that system tiers differ from business tiers: business tiers define the impact of disruption to a business process, while system tiers define the RTO/RPO for an IT system.
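A system tier table can be represented as plain data and queried when placing a workload. In the sketch below, the tier names and their RTO/RPO targets are assumptions for illustration and are not taken from Figure 8. The function returns the least stringent tier that still satisfies a required RTO and RPO.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SystemTier:
    name: str
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable data loss window


# Assumed targets, ordered from most to least stringent.
SYSTEM_TIERS = [
    SystemTier("Gold", rto=timedelta(minutes=15), rpo=timedelta(minutes=1)),
    SystemTier("Silver", rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    SystemTier("Bronze", rto=timedelta(hours=24), rpo=timedelta(hours=24)),
]


def least_stringent_tier(required_rto: timedelta, required_rpo: timedelta) -> Optional[SystemTier]:
    """Pick the least stringent tier whose targets still satisfy the requirement."""
    for tier in reversed(SYSTEM_TIERS):
        if tier.rto <= required_rto and tier.rpo <= required_rpo:
            return tier
    return None  # no predefined tier is strict enough


# A system that must be back within 8 hours, losing at most 2 hours of data.
choice = least_stringent_tier(timedelta(hours=8), timedelta(hours=2))
print(choice.name if choice else "custom architecture needed")
```

Returning `None` when no tier is strict enough is a deliberate choice here: it signals a workload that needs an architecture outside the predefined tiers.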
AWS offers many resources to create architectures that support the RTO/RPO of these tiers.
- The AWS Cloud spans more than 30 Regions, each made up of multiple, physically separated Availability Zones (AZs). Each AZ consists of one or more data centers with redundant power, cooling, and connectivity. AZs are connected with low-latency, high-throughput, and highly redundant networking to allow for synchronous data replication within the Region. You can choose to build your workload across multiple AZs or multiple Regions.
- Amazon Simple Storage Service (S3) is designed for 99.999999999% (11 9’s) of durability, and offers lifecycle management and S3 Cross-Region Replication (S3 CRR).
- Amazon Relational Database Service (RDS) supports up to 15 replicas within a Region.
- Amazon DynamoDB is a fully managed, serverless, key-value NoSQL database with built-in redundancy and supports on-demand backup, point in time recovery, and cross-region replication with Global Tables.
- AWS Backup is a cost-effective, fully managed, policy-based service that supports backup, copy, and restore for many AWS resources.
- AWS Elastic Disaster Recovery (DRS) minimizes downtime and data loss with fast, reliable recovery of on-premises and cloud-based applications using affordable storage, minimal compute, and point-in-time recovery.
- Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service.
- Amazon Route 53 Application Recovery Controller (Route 53 ARC) gives you insights about whether your applications and resources are ready for recovery, and helps you move traffic across AWS Regions or away from Availability Zones for application disaster recovery.
- AWS Resilience Hub provides a central place to define, validate, and track the resilience of your applications on AWS.
These resources can be provisioned and configured using the AWS console, Command Line Interface (CLI), or Application Programming Interface (API). Each can be used individually or in combination to meet the RTO/RPO you require. For example, a simple application architecture might consist of a load balancer, EC2 instances, and an RDS database deployed in a single Region (multi-AZ). If the SIA places the application in the Silver tier, then AWS Backup can be used to back up the instances and database to another Region. In the event of a disruption, you can use the AWS API to create a new load balancer and restore the EC2 instances and RDS database in the other Region. Once the architecture is running, you can use Route 53 to redirect traffic to the other Region, thus meeting your desired RTO/RPO.
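The final step, redirecting traffic with Route 53, can be sketched as building the change batch that the Route 53 `ChangeResourceRecordSets` API accepts. The helper name, the record name, and the load balancer DNS name are hypothetical. The sketch assumes a simple CNAME record, although an alias record is another common choice for load balancers. It only builds the request body and does not call AWS.

```python
def build_failover_change_batch(record_name: str, recovery_dns_name: str, ttl: int = 60) -> dict:
    """Build a Route 53 change batch that points a CNAME at the recovery Region's load balancer."""
    return {
        "Comment": "Redirect traffic to the recovery Region",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite the existing one
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,  # a short TTL lets clients pick up the change sooner
                    "ResourceRecords": [{"Value": recovery_dns_name}],
                },
            }
        ],
    }


# Hypothetical names; substitute your own record and load balancer DNS name.
change_batch = build_failover_change_batch(
    "app.example.com.",
    "recovery-alb-123456.us-west-2.elb.amazonaws.com",
)

# With boto3 installed and credentials configured, the batch could be applied with:
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="<your-hosted-zone-id>", ChangeBatch=change_batch
#   )
```

Keeping the request construction separate from the API call makes the failover step easy to review and test before an actual disruption.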
As your organization becomes more mature in cloud operations, you can leverage the flexibility AWS offers to create architectures with unique combinations of resources and risk mitigation techniques to meet the specific needs of each IT system and business process. By leveraging AWS Cost Management tools, you can understand the cost to operate each IT system and determine the cost of varying levels of recovery. You can then use this information to optimize the level of recovery against the impact. For example, suppose you are running a web service that, based on the SIA, is in the Bronze tier and backed up daily. A new critical business process is implemented that has a dependency on the service. Because the service now supports a critical process, the financial impact of service disruption has increased. With AWS you have many options available to mitigate the risk of disruption. Options range from increasing backup frequency to deploying a complete copy of the service in another AZ or Region. This flexibility and variety of options allow you to move away from a small set of service tiers and transition to “dialing in” the right level of recovery for each workload based on cost and impact (see Figure 9).
Conclusion
In this blog post, you were introduced to the concepts of Risk Assessment, Business Impact Analysis, and System Impact Analysis within Business Continuity Planning. You learned how Risk Assessment and Business Impact Analysis predict the consequences of disruption of a business process. You also learned how System Impact Analysis gathers the information needed to develop recovery strategies. Finally, you discovered how AWS provides the flexibility to create the optimal level of recovery required in system architecture.
Call to Action
Learn how AWS Professional Services can help guide your organization through the BCP process.
Explore resilience on AWS:
- Creating a scalable disaster recovery plan with AWS Elastic Disaster Recovery
- Disaster Recovery of Workloads on AWS: Recovery in the Cloud
- Disaster Recovery (DR) Architecture on AWS blog series