Understand resiliency patterns and trade-offs to architect efficiently in the cloud
Architecting workloads to achieve your resiliency targets can be a balancing act. Firms designing for resilience on cloud often need to evaluate multiple factors before they can decide on the optimal architecture for their workloads. Example Corp has multiple applications with varying criticality, and each application has different needs in terms of resiliency, complexity, and cost. They have many choices to architect their workloads for resiliency and cost, but which option suits their needs best? Will they have to make any sacrifices to implement one over another? How and why should they choose one pattern over another?
To help answer these questions, we’ll discuss the five resilience patterns in Figure 1 and the trade-offs to consider when implementing them: 1) design complexity, 2) cost to implement, 3) operational effort, 4) effort to secure, and 5) environmental impact. This will help you achieve varying levels of resiliency and make decisions about the most appropriate architecture for your needs.
What is resiliency? Why does it matter?
The AWS Well-Architected Framework defines resilience as having “the capability to recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and failure of any component in the workload’s components.”
To meet your business’ resilience requirements, consider the following core factors as you design your workloads:
- Design complexity – Usually, the more complex your workload becomes, the more complicated your resilience requirements will be. Each individual workload component has to be resilient, and you’ll need to eliminate single points of failure across people, process, and technology elements.
- Cost to implement – Costs often significantly increase when you implement higher resilience because there are new software and infrastructure components to operate.
- Operational effort – Deploying and supporting highly resilient systems requires more complex operational processes and advanced technical skills. Before you decide to implement higher resilience, evaluate your operational competency to confirm you have the required level of process maturity and skillsets.
- Effort to secure – Security complexity is less directly correlated to resilience. However, there are generally more components to secure for highly resilient systems. AWS Security best practices can help customers achieve their security objectives for such complex deployments.
- Environmental impact – An increased deployment footprint for resilient systems might increase your consumption of cloud resources. However, you can use trade-offs like approximate computing and slower response times to reduce resource consumption.
P1 – Multi-AZ
P1 is a cloud-based architecture pattern (Figure 2) that introduces Availability Zones (AZs) into your architecture to increase your system’s resilience. The P1 pattern uses a Multi-AZ architecture where applications operate in multiple AZs within a single AWS Region. This allows your application to withstand AZ-level impacts.
As shown in Figure 2, Example Corp deploys their internal employee applications using the P1 pattern. These applications are low business impact and therefore have lower requirements for resiliency.
Example Corp deploys these applications on Amazon Elastic Compute Cloud (Amazon EC2), which uses health checks to automatically detect faults. If an AZ fails, Amazon EC2 prompts an Amazon EC2 Auto Scaling group to recreate their application in another unaffected AZ.
P1 is low effort in several categories, but this comes at the expense of application recovery. If an AZ is down, end users’ access to the application is disrupted while new resources are provisioned in another AZ. This is known as bimodal behavior: the system operates one way under normal conditions and a different way during a failure.
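To make the bimodal behavior described above concrete, here is a minimal sketch that models serving capacity over time for a P1-style deployment. The function name, the instance counts, and the number of time steps needed to relaunch are all illustrative assumptions, not measurements from Example Corp or from AWS.

```python
from typing import List


def p1_capacity_timeline(
    healthy: int,
    lost_on_failure: int,
    failure_step: int,
    relaunch_steps: int,
    total_steps: int,
) -> List[int]:
    """Return the number of serving instances at each time step.

    Hypothetical model: an AZ fails at `failure_step`, taking
    `lost_on_failure` instances with it, and replacement instances
    only start serving after `relaunch_steps` further steps.
    """
    timeline = []
    for step in range(total_steps):
        in_recovery = failure_step <= step < failure_step + relaunch_steps
        timeline.append(healthy - lost_on_failure if in_recovery else healthy)
    return timeline


# Assumed numbers: 2 instances, 1 lost with the AZ, 3 steps to relaunch.
timeline = p1_capacity_timeline(
    healthy=2, lost_on_failure=1, failure_step=2, relaunch_steps=3, total_steps=8
)
print(timeline)  # [2, 2, 1, 1, 1, 2, 2, 2]
```

The dip in the middle of the timeline is the second mode: for as long as recovery takes, the workload runs with less capacity than it was sized for.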
P2 – Multi-AZ with static stability
P2 uses multiple AZs within a Region to increase resilience, but it uses static stability to prevent bimodal behavior. Statically stable systems remain stable and operate in one mode irrespective of changes to their operating environment.
As shown in Figure 3, Example Corp has a customer-facing website that has a lower tolerance for downtime. Any time the website is down, it could result in lost revenue. Because of this, the website, which needs two EC2 instances to serve its traffic, has two instances provisioned in each of two AZs (four in total). This way, if an AZ becomes impaired, the website can continue operating and does not require Example Corp to detect the fault or launch new infrastructure.
P2 must be weighed against cost concerns. P1 is less expensive because it provisions less compute capacity and relies on launching new instances in case of a failure. However, P1’s bimodal behavior might affect your customers during large-scale events.
You could go further and deploy your workload to three AZs in the Region. This will reduce costs associated with over-provisioning because you only have to provision three instances (one per AZ, so two remain if an AZ is lost) versus the four needed to keep two instances available with only two AZs.
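The arithmetic behind the two-AZ and three-AZ comparison can be sketched as follows. This is a simplified sizing rule that assumes identical instances and tolerance for the loss of exactly one AZ; the function name is illustrative.

```python
import math
from typing import Tuple


def statically_stable_fleet(required_instances: int, az_count: int) -> Tuple[int, int]:
    """Return (instances per AZ, total instances) so that the workload still
    has `required_instances` running after losing any single AZ.

    Assumption: capacity is spread evenly and only one AZ fails at a time.
    """
    if az_count < 2:
        raise ValueError("static stability across AZs needs at least two AZs")
    # The surviving (az_count - 1) AZs must carry the full required load.
    per_az = math.ceil(required_instances / (az_count - 1))
    return per_az, per_az * az_count


print(statically_stable_fleet(required_instances=2, az_count=2))  # (2, 4)
print(statically_stable_fleet(required_instances=2, az_count=3))  # (1, 3)
```

With two AZs the fleet is provisioned at 200% of what the website needs; with three AZs it drops to 150%, which is where the cost saving comes from.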
P3 – Application portfolio distribution
The P3 pattern uses multiple Regions to increase functional resilience. It distributes different critical applications across Regions so that no single Regional event affects all of them.
Example Corp provides banking services like credit balance checks to consumers on multiple digital channels. These services are available to consumers via a mobile application, contact center, and web-based applications. If the Region where the mobile application is deployed fails, customers can still access services via the other channels deployed in other Regions. Regional disruptions are rare, but implementing this pattern helps your users retain access to business-critical services during disruptions.
Operating an application portfolio that spans multiple Regions requires significant operational planning and management. Isolated functional elements may depend on common downstream systems and data sources that are deployed in a single Region. Therefore, Region-wide events might still cause disruption; however, the impact surface area is significantly reduced.
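A small sketch can show both effects described above: distributing channels limits the impact of a Regional event, while a shared single-Region dependency widens it again. The Region names and the shared dependency of the web application are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Channel:
    name: str
    region: str  # Region where the channel itself is deployed
    depends_on_regions: FrozenSet[str] = frozenset()  # Regions of downstream systems


def available_channels(channels: List[Channel], failed_region: str) -> List[str]:
    """Return the channels that keep working when `failed_region` is lost."""
    return [
        c.name
        for c in channels
        if c.region != failed_region and failed_region not in c.depends_on_regions
    ]


# Hypothetical placement: the web application relies on a data source in region-a.
CHANNELS = [
    Channel("mobile application", "region-a"),
    Channel("contact center", "region-b"),
    Channel("web application", "region-c", frozenset({"region-a"})),
]

print(available_channels(CHANNELS, "region-b"))  # ['mobile application', 'web application']
print(available_channels(CHANNELS, "region-a"))  # ['contact center']
```

Losing region-b removes one channel, but losing region-a removes two because of the shared dependency, which is why mapping downstream systems matters when you plan this pattern.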
P4 – Multi-AZ deployment (multi-Region disaster recovery)
Example Corp operates several business-critical services, such as the ability for consumers to make bank payments, that have very low tolerance for disruptions. Example Corp uses the following sub-patterns for these applications:
- Pilot Light – This pattern works for applications that require RTO/RPO of tens of minutes. Data is actively replicated and application infrastructure is pre-provisioned in the disaster recovery (DR) Region. Cost optimization is a key driver here because the application infrastructure is kept switched off and only switched on during the restore event.
- Warm Standby – This pattern improves restore times significantly compared to pilot light by keeping your applications running in the DR Region but with reduced capacity. Application infrastructure is scaled up during a DR event, and this can typically be automated with minimal manual effort. This pattern can achieve RTO/RPO of minutes if implemented correctly.
The Disaster Recovery of Workloads on AWS: Recovery in the Cloud whitepaper documents these patterns in detail.
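As a rough illustration of how the two sub-patterns relate to recovery targets, the sketch below picks the least costly option whose typical recovery time fits a given RTO. The numeric thresholds (10 minutes and 1 minute) are assumptions standing in for "tens of minutes" and "minutes"; they are not AWS-published limits, and your own measured recovery times should replace them.

```python
# Assumed thresholds, for illustration only.
PILOT_LIGHT_MIN_RTO_MINUTES = 10   # stands in for "tens of minutes"
WARM_STANDBY_MIN_RTO_MINUTES = 1   # stands in for "minutes"


def choose_dr_pattern(rto_target_minutes: float) -> str:
    """Suggest a pattern for a recovery time objective given in minutes."""
    if rto_target_minutes < 0:
        raise ValueError("RTO target cannot be negative")
    if rto_target_minutes >= PILOT_LIGHT_MIN_RTO_MINUTES:
        return "pilot light"
    if rto_target_minutes >= WARM_STANDBY_MIN_RTO_MINUTES:
        return "warm standby"
    # Tighter targets fall outside the DR sub-patterns described here.
    return "multi-Region active-active (P5)"


print(choose_dr_pattern(30))   # pilot light
print(choose_dr_pattern(5))    # warm standby
print(choose_dr_pattern(0.5))  # multi-Region active-active (P5)
```

The function checks the cheapest option first, which mirrors the cost-driven reasoning behind pilot light.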
Regional DR patterns increase deployment complexity because infrastructure changes need to be synchronized across Regions. Testing is also significantly more complex and should include scenarios such as the loss of a Region and the rerouting of traffic to the DR Region. Using Infrastructure as Code to automate deployments can help alleviate these issues.
P5 – Multi-Region active-active
Example Corp’s core banking and Customer Relationship Management applications have zero tolerance for Regional disruption. They use the P5 pattern for deploying these applications because it offers an RTO that is effectively real time and an RPO of near-zero data loss. With this pattern, they run their workload simultaneously in multiple Regions, which allows them to serve traffic from all Regions.
Multi-active ecosystems are generally complex because they include multiple applications that collaborate to deliver required business services. If you implement this pattern, you’ll need to consider the fact that you’re introducing asynchronous replication for data across Regions and the impact that has on data consistency.
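The consistency impact of asynchronous replication can be illustrated with a toy in-memory model. This is not how any AWS data service is implemented; the class, the key name, and the explicit `replicate_to` step are hypothetical devices for showing that a read in one Region can miss a write that another Region has already accepted.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple


class RegionReplica:
    """Toy data store for one Region with an outbox of unreplicated writes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}
        self.outbox: Deque[Tuple[str, str]] = deque()

    def write(self, key: str, value: str) -> None:
        # The write is acknowledged locally before any peer has seen it.
        self.data[key] = value
        self.outbox.append((key, value))

    def read(self, key: str) -> Optional[str]:
        return self.data.get(key)

    def replicate_to(self, peer: "RegionReplica") -> None:
        # Stands in for the asynchronous replication stream catching up.
        while self.outbox:
            key, value = self.outbox.popleft()
            peer.data[key] = value


region_a = RegionReplica("region-a")
region_b = RegionReplica("region-b")

region_a.write("balance:alice", "100")
stale = region_b.read("balance:alice")   # replication has not run yet
region_a.replicate_to(region_b)
fresh = region_b.read("balance:alice")   # replication has caught up

print(stale, fresh)  # None 100
```

The sketch deliberately ignores conflicting writes to the same key in both Regions. A real multi-active design also needs a rule for resolving those conflicts, which is part of the complexity this pattern introduces.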
Operating this pattern requires a very high level of process maturity, so we recommend customers build towards it gradually, starting with the deployment patterns described earlier.
In this blog post, we introduced five resilience patterns and the trade-offs to consider when implementing them. We showed you how Example Corp evaluated these options and how they applied to their business needs to help you decide on the most efficient architecture to implement.
- AWS Well-Architected Framework – Reliability Pillar
- Building Resilient Well-Architected Workloads Using AWS Resilience Hub
- Disaster Recovery of Workloads on AWS: Recovery in the Cloud
- Disaster Recovery (DR) Architecture on AWS, Architecture Blog series