IT Resilience Within AWS Cloud, Part II: Architecture and Patterns

In Part I of this two-part blog, we outlined best practices to consider when building resilient applications in hybrid on-premises/cloud environments. We also showed you how to adapt mindsets and organizational culture. In Part II, we’ll provide technical considerations related to architecture and patterns for resilience in AWS Cloud.

Considerations on architecture and patterns

The AWS Well-Architected Framework defines resilience as having “the capability to recover when stressed by load (more requests for service), attacks (either accidental through a bug, or deliberate through intention), and failure of any component in the workload’s components.” Resilience is an overarching concern that is highly tied to other architecture attributes. Executive builders should center their resilience strategies around availability, performance, and disaster recovery (DR). Let’s evaluate architectural patterns that enable this capability.

1. Recalibrate your resilience architecture

Planning for resilience in on-premises environments is tightly coupled to the physical location of compute resources. On-premises resilience is often achieved using two data centers (Figure 1).

Figure 1. Two data center model for on-premises resilience strategies

Let’s break down the model in Figure 1 based on the concerns executive builders try to address in improving their environment’s resilience, as follows:

Performance: Reducing latency and sustaining performance under heavy load.
Availability: Achieving high availability (HA) and DR.

Now let’s discuss how we can address these concerns in the AWS Cloud. First, the AWS Well-Architected Framework covers important architecture principles and related services that help you with availability and performance. Second, you can use AWS global infrastructure to improve your resilience through various deployment models.

AWS infrastructure enables resilience

AWS operates 24 Regions worldwide, each consisting of multiple Availability Zones (AZs). Each AZ has redundant resources and uses separate physical facilities located a meaningful distance apart from other AZs. AWS also offers mechanisms to move resources closer to end users, including AWS Local Zones, AWS Wavelength Zones, AWS Outposts, and over 220 points-of-presence (PoPs). AWS Direct Connect and managed VPN services provide links between the AWS network and on-premises networks (Figure 2).

Figure 2. AWS global infrastructure

Performance using AWS services

The most commonly used AWS services for reducing latency are Amazon CloudFront and Amazon API Gateway. These services cache static and dynamic content and API responses in PoPs.

Availability and disaster recovery

Availability requires evaluating your goals and conducting a risk assessment according to probability, impact, and mitigation cost (Figure 3). Do not automatically translate a DR goal into “deploy into two AWS Regions.” For example, if you define DR as being able to “withstand the loss of a physical facility,” deploying across multiple AZs in a single AWS Region meets that goal. Multi-AZ deployments, along with Regional services like Amazon Simple Storage Service (Amazon S3), provide more availability than a two data center footprint.

Figure 3. Risk classification matrix for infrastructure resilience

2. Classify workloads by availability tiers

Once you move away from an on-premises environment, do not apply a single recovery time objective and recovery point objective (RTO/RPO) to your entire IT estate in the cloud. Instead, use the four-tier classification strategy in Figure 4 to categorize your workloads and develop a strategy by business criticality using RTO/RPO.
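As a minimal sketch of this kind of classification, the snippet below sorts workloads into four tiers by their RTO/RPO targets. The tier boundaries, the `Workload` type, and the workload names are hypothetical placeholders for illustration. They are not the values from Figure 4, so substitute the thresholds your business agrees on.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Workload:
    name: str
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of data loss


# Hypothetical tier boundaries, strictest first: (tier, max RTO, max RPO).
# These numbers are assumptions, not the values from Figure 4.
TIER_LIMITS = [
    (1, timedelta(minutes=1), timedelta(minutes=1)),
    (2, timedelta(hours=1), timedelta(minutes=15)),
    (3, timedelta(hours=4), timedelta(hours=1)),
]


def classify(workload: Workload) -> int:
    """Return the availability tier (1 = most critical, 4 = least)."""
    for tier, max_rto, max_rpo in TIER_LIMITS:
        # A workload lands in a tier if either objective is that strict.
        if workload.rto <= max_rto or workload.rpo <= max_rpo:
            return tier
    return 4


# Hypothetical IT estate used only to exercise the function.
estate = [
    Workload("checkout", timedelta(seconds=30), timedelta(0)),
    Workload("catalog", timedelta(minutes=45), timedelta(hours=1)),
    Workload("internal-wiki", timedelta(hours=3), timedelta(hours=12)),
    Workload("reporting", timedelta(hours=24), timedelta(hours=24)),
]
tiers = {w.name: classify(w) for w in estate}
```

Classifying on the stricter of the two objectives is one possible design choice. It keeps a workload with a relaxed RTO but a tight RPO from being placed in a tier that cannot protect its data.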

Figure 4. Resilience classification matrix

3. Multi-Region disaster recovery patterns

After classifying your workloads, choose an appropriate DR pattern if you need a multi-Region footprint. Backup solutions like database and Amazon S3 replication can provide RPO of a few minutes at most, but RTO will vary considerably. There are four common multi-Region DR patterns, as shown in the following sections.

Backup and restore (Tier 4)

Back up critical data to another Region. In a DR scenario, recover data and deploy your application. This option is the least expensive, but full recovery can take several hours. This pattern may incur more data loss depending on backup schedules.
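To make the link between backup schedules and data loss concrete, here is a small sketch that computes how much recent data would be lost when restoring from the latest backup completed before a failure. The function name, the six-hour schedule, and the timestamps are illustrative assumptions.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Return the span of data lost when restoring from the most
    recent backup that completed before the failure."""
    usable = [t for t in backup_times if t <= failure_time]
    if not usable:
        raise ValueError("no backup completed before the failure")
    return failure_time - max(usable)


# Hypothetical schedule: a backup every six hours on one day.
backups = [datetime(2024, 1, 1, hour) for hour in (0, 6, 12, 18)]
failure = datetime(2024, 1, 1, 17, 30)

loss = data_loss_window(backups, failure)
print(loss)  # 5:30:00
```

In the worst case the failure happens just before the next backup, so the achievable RPO for this pattern is roughly the backup interval.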

Pilot light (Tier 3)

Maintain essential components, like replicated databases and Amazon S3 buckets, in another Region, along with scaled-down application servers. In a DR scenario, deploy the remaining components, like in-memory caches. RTO typically takes an hour, allowing time for the operations teams to detect and respond to the failure, and for new infrastructure to roll out.

Warm standby (Tier 2)

Run scaled-down versions of applications in a second Region and scale up for a DR scenario. During failover, scale up resources and increase traffic to the Region. RTO typically takes an hour, allowing time for operations teams to detect and respond to a failure. This pattern works well for applications that must respond quickly but don’t need immediate full capacity.

Active-active (Tier 1)

In this pattern, you actively serve traffic from multiple Regions using either DNS (Amazon Route 53) or AWS Global Accelerator to handle multi-Region routing. This option offers the best RTO, possibly within seconds depending on the time to detect failure and redirect traffic by updating DNS. It’s also more expensive. Besides the cost of maintaining some minimal capacity as in the warm standby approach, you scale each Region to have some extra capacity so that it can absorb traffic from a failed Region.

Handling data stores is complex in this pattern, as we will discuss more in the next section.

4. Putting it all together

You’ve decoupled your business requirements for performance and availability from the legacy infrastructure footprint. So now you might choose to use Amazon CloudFront for low-latency access for end users. This may include a Multi-AZ architecture with a “backup and restore” strategy for availability (Figure 5).

Figure 5. AWS infrastructure providing performance and resilience

Considerations on critical technology domains

AWS gives you the tools to architect for the level of resilience you need. As your enterprise architects decide how to apply these tools, they should focus on three vital technical domains. We’ll illustrate each one with an anecdote from an AWS customer in the video streaming media industry.

1. Network and traffic management

Now that you’ve set up a DR site in a second AWS Region, how do you route traffic to it if there’s a failover? How do you avoid extra latency if the DR Region is farther away from your customers?

AWS provides Route 53 and Global Accelerator, services that can act as global load balancers, with fixed entry points and traffic routing between Regions.
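The routing decision itself can be sketched in a few lines. The following is a simplified, local model of health-aware, latency-based routing between Regions. It does not call Route 53 or Global Accelerator, and the `RegionEndpoint` type, Region names, and latency figures are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RegionEndpoint:
    region: str
    latency_ms: float  # latency measured from the client's location
    healthy: bool      # result of the most recent health check


def pick_region(endpoints):
    """Route to the healthy Region with the lowest latency."""
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy Region available")
    return min(healthy, key=lambda e: e.latency_ms).region


# Hypothetical two-Region deployment.
primary = RegionEndpoint("us-east-1", latency_ms=20.0, healthy=True)
secondary = RegionEndpoint("eu-west-1", latency_ms=95.0, healthy=True)

before_failure = pick_region([primary, secondary])

# Simulate a failed health check in the primary Region.
primary.healthy = False
after_failure = pick_region([primary, secondary])
```

The example also shows the latency trade-off raised above: after failover, traffic still flows, but from a Region that is farther from the client.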

To reduce latency for end users, you can use Global Accelerator or CloudFront. This will get customer traffic onto the AWS edge network as quickly as possible, providing more predictable latency compared to the open internet.

Our video streaming customer uses CloudFront for content delivery for improved video streaming performance. They also use Global Accelerator to onboard customer traffic to microservices deployed in two Regions.

2. Data storage

In a three-tier web application, the database tier is the most difficult to manage in an active-active scenario. Relational databases offer strong guarantees (that is, ACID transactions) that come at a cost: most relational databases funnel all writes through a single writer node.

Consider whether you can relax the consistency requirements or shard data in your data storage layer. NoSQL databases like Amazon DynamoDB with global tables will let you write data in multiple Regions. In addition, most NoSQL databases have tunable consistency models that you can optimize for availability.
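To illustrate what relaxed consistency means in a multi-writer setup, the sketch below models two Regional replicas that accept writes independently and reconcile with a last-writer-wins rule, which is the approach DynamoDB global tables take for conflicting writes. This is an in-memory simulation, not the DynamoDB API. The `Replica` class, the keys, and the integer timestamps are illustrative assumptions, and tie-breaking for equal timestamps is omitted.

```python
class Replica:
    """In-memory stand-in for one Regional copy of a table."""

    def __init__(self, region):
        self.region = region
        self.items = {}  # key -> (timestamp, value)

    def _apply(self, key, timestamp, value):
        # Last writer wins: keep the value with the newest timestamp.
        current = self.items.get(key)
        if current is None or timestamp > current[0]:
            self.items[key] = (timestamp, value)

    def write(self, key, value, timestamp):
        self._apply(key, timestamp, value)

    def read(self, key):
        entry = self.items.get(key)
        return entry[1] if entry else None

    def replicate_to(self, other):
        # Asynchronous replication, modeled here as an explicit step.
        for key, (timestamp, value) in self.items.items():
            other._apply(key, timestamp, value)


east = Replica("us-east-1")
west = Replica("us-west-2")

# The same item is updated in both Regions before replication runs.
east.write("viewer#1", "paused at 10:00", timestamp=1)
west.write("viewer#1", "paused at 12:30", timestamp=2)

stale_read = east.read("viewer#1")  # east has not seen the newer write yet

east.replicate_to(west)
west.replicate_to(east)
converged = (east.read("viewer#1"), west.read("viewer#1"))
```

Until replication completes, a reader in one Region can see an older value. If your application can tolerate that window, as with a viewer's playback position, multi-Region writes become much simpler to adopt.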

Our video streaming customer adopted DynamoDB global tables for multi-Region microservices, because they realized they could tolerate eventual read consistency.

3. Monitoring and operational response

Can you detect if your customers are seeing unusual failure rates and latency while accessing your application? Do you know why it’s happening and whether you need to initiate a failover event?

Monitoring and observability services like Amazon CloudWatch and AWS X-Ray help you see the health metrics and the flow of requests for your application. Automation tooling helps you automate response runbooks. Game days will surface problems with monitoring and response plans.
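As a rough sketch of the decision logic behind an automated runbook, the function below signals a failover only after the error rate stays above a threshold for several consecutive monitoring periods, so that a single noisy data point does not trigger a Region switch. The function name, the 5% threshold, and the three-period requirement are assumptions to tune for your own workload. The sketch does not call CloudWatch.

```python
def should_fail_over(error_rates, threshold=0.05, breaches_required=3):
    """Return True once the error rate exceeds the threshold for
    `breaches_required` consecutive monitoring periods."""
    consecutive = 0
    for rate in error_rates:
        # Reset the counter whenever a period is back under the threshold.
        consecutive = consecutive + 1 if rate > threshold else 0
        if consecutive >= breaches_required:
            return True
    return False


# Hypothetical per-minute error rates seen by end users.
transient_blip = [0.01, 0.20, 0.01, 0.02]
sustained_outage = [0.01, 0.08, 0.12, 0.30]

print(should_fail_over(transient_blip))    # False
print(should_fail_over(sustained_outage))  # True
```

Game days are a good opportunity to test whether thresholds like these are too sensitive or too slow for real failure scenarios.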

Our video streaming customer relies on AWS CloudFormation for automated infrastructure deployment, CloudWatch for monitoring, and CloudWatch synthetic monitoring to provide an end user view of the system’s performance.

Conclusion

Executive builders need guidance on how to lead their teams to implement and operate resilient cloud environments. In this blog, we gave guidance on architecture and pattern considerations, as well as the technology domains to focus on.

AWS Architecture Blog