AWS Architecture Blog

What’s New in the Well-Architected Reliability Pillar?

The new version of the Reliability pillar for AWS Well-Architected includes expanded content across all areas of reliability. Guidance on distributed system architecture has been reorganized and expanded, and new best practices have been added as part of the Well-Architected Review. There is a sharper focus on chaos engineering, with more explanation and examples. We’ve also added more detail on using fault isolation to protect your workloads, with Availability Zones and beyond.

In the AWS Well-Architected Tool, new reliability best practices have been added, and existing ones updated. We have completely updated the Reliability Pillar whitepaper to align to the questions and best practices found in the tool. Additionally, we added the latest guidance on implementing the best practices using the newest AWS resources and partner technologies, such as AWS Transit Gateway, AWS Service Quotas, and CloudEndure Disaster Recovery.

The whitepaper provides clearer definitions to help you better understand the relationships among reliability, resiliency, and availability. The focus remains on resiliency, and how to design this into your workloads so that they are able to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions, such as misconfigurations or transient network issues.

Launched at re:Invent 2019, Amazon Builders’ Library shares in-depth articles on how Amazon builds and runs resilient workloads. Our updated Reliability pillar draws extensively on this information, incorporating it across multiple best practices and linking back to specific Amazon Builders’ Library articles. The AWS Well-Architected hands-on reliability labs now include Implementing Health Checks and Managing Dependencies to Improve Reliability, which lets you exercise firsthand the practices demonstrated in the library’s Implementing health checks article. We also expanded the suite of Well-Architected Reliability labs with new labs on data backup, data replication, and automated infrastructure deployment.

The new Implementing Health Checks and Managing Dependencies to Improve Reliability lab shows you how to implement practices to detect dependency failures and remain resilient despite them.
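As a rough illustration of the idea (not the lab's actual code), the sketch below wraps a dependency call so that a failure produces a static fallback response instead of an error, and exposes a health check that reports the dependency problem without marking the server itself as unhealthy. The names `get_recommendations`, `health_check`, and `STATIC_FALLBACK` are hypothetical and chosen only for this example.

```python
from typing import Callable, Dict, List

# Hypothetical default response served when the dependency cannot be reached.
STATIC_FALLBACK: List[str] = ["popular-item-1", "popular-item-2"]


def get_recommendations(fetch: Callable[[], List[str]]) -> List[str]:
    """Call a dependency and degrade gracefully if it fails."""
    try:
        return fetch()
    except Exception:
        # The dependency is unavailable: return a static response
        # rather than failing the whole request.
        return list(STATIC_FALLBACK)


def health_check(fetch: Callable[[], List[str]]) -> Dict[str, str]:
    """Detect a dependency failure while still reporting the server as healthy.

    The server can keep serving the fallback, so in this sketch a failed
    dependency is flagged as degraded instead of taking the server out of service.
    """
    try:
        fetch()
        dependency = "ok"
    except Exception:
        dependency = "degraded"
    return {"status": "healthy", "dependency": dependency}
```

Whether a health check should fail when a dependency fails is a design decision that depends on the workload; this sketch assumes the fallback response is acceptable to callers.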

Prior to this version of the Reliability pillar, we had identified three best practice areas: Foundations, Change Management, and Failure Management. In this new version, we added a fourth area:

  • Workload Architecture: Specific patterns to follow as you design and implement software architecture for your distributed systems.

This new area covers best practices related to service-oriented architecture, microservices architectures, and distributed systems. We also added these to the AWS Well-Architected Tool, so that you can review your workloads and understand whether they follow these architectural best practices. The whitepaper content for these practices has also been expanded, and draws on Amazon Builders’ Library articles, including Challenges with distributed systems and Timeouts, retries, and backoff with jitter.
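To make the retry guidance concrete, here is a minimal sketch of retries with capped exponential backoff and jitter. It is one common variant (a random delay between zero and the capped exponential value), not code taken from the whitepaper or the Builders' Library article. The function names and the default values for `base`, `cap`, and `max_attempts` are assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on any exception with jittered backoff.

    The last failure is re-raised once max_attempts is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Randomized delay spreads retries out so that many clients
            # do not retry at the same instant.
            sleep(backoff_delay(attempt))
    raise AssertionError("loop always returns or raises")
```

In a real client, the operation would also carry its own timeout, and only errors known to be transient would be retried; both are omitted here to keep the sketch short.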

The previous version helped you to understand the important role of Availability Zones in a reliable architecture. In the new version, we expanded on this by adding more detail on using bulkhead architectures, such as cell-based architecture (used across AWS), where each cell is a complete, independent instance of the service.
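One way to picture a cell-based architecture is a thin routing layer that maps each customer to exactly one cell, so that a failure in one cell affects only the customers assigned to it. The sketch below is a simplified assumption of how such a mapping could work, using a stable hash; the cell names and the `cell_for` function are hypothetical, and real routing layers typically need to handle cell migration and resizing as well.

```python
import hashlib
from typing import Sequence

# Hypothetical cell identifiers; each would be a complete,
# independent instance of the service.
CELLS = ("cell-0", "cell-1", "cell-2", "cell-3")


def cell_for(customer_id: str, cells: Sequence[str] = CELLS) -> str:
    """Map a customer to one cell using a stable hash of the customer ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def affected_customers(failed_cell: str, customer_ids: Sequence[str]) -> list:
    """List the customers whose requests are routed to the failed cell."""
    return [c for c in customer_ids if cell_for(c) == failed_cell]
```

Because the mapping is deterministic, the same customer always lands in the same cell, which is what keeps the impact of a single cell failure contained.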

Best practices on how you implement change have always been an important part of the Reliability pillar. We now have more practical guidance on reliable deployment, including runbooks and pipeline tests. The new best practice on immutable infrastructure expands on our previous guidance on deployment automation using canary deployment or blue/green deployment.

We’ve also expanded coverage of Chaos Engineering. You can’t consider your workload to be resilient until you hypothesize how your workload will react to failures, inject those failures to test your design, and then compare your hypothesis to the testing results. While Chaos Monkey popularized the constructive use of chaos in 2010, Amazon has been purposely injecting failures since the early 2000s to increase resiliency and ensure readiness under the most adverse of circumstances. This history and experience are all the more applicable today in the cloud, where you can both design for recovery and test those designs. This is an often-overlooked best practice, but our most successful resiliency customers recognize it as a necessary and powerful tool.
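The hypothesize, inject, and compare loop can be sketched with a toy in-memory workload. Everything here is hypothetical and for illustration only: `Replica`, `Workload`, and `run_experiment` stand in for real infrastructure and real fault-injection tooling, and the hypothesis is expressed as a minimum request success rate.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Replica:
    name: str
    healthy: bool = True

    def handle(self, request_id: int) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served request {request_id}"


@dataclass
class Workload:
    replicas: List[Replica]

    def handle(self, request_id: int) -> str:
        # Try each replica in turn; fail only if none can respond.
        for replica in self.replicas:
            try:
                return replica.handle(request_id)
            except ConnectionError:
                continue
        raise ConnectionError("no healthy replica")


def success_rate(workload: Workload, requests: int = 100) -> float:
    """Send requests and return the fraction that succeeded."""
    succeeded = 0
    for request_id in range(requests):
        try:
            workload.handle(request_id)
            succeeded += 1
        except ConnectionError:
            pass
    return succeeded / requests


def run_experiment(
    workload: Workload, victim_index: int, hypothesis_min_rate: float
) -> Tuple[float, bool]:
    """Inject a replica failure, measure, and compare against the hypothesis."""
    victim = workload.replicas[victim_index]
    victim.healthy = False  # inject the failure
    try:
        observed = success_rate(workload)
    finally:
        victim.healthy = True  # always roll the failure back
    return observed, observed >= hypothesis_min_rate
```

For example, with three replicas the hypothesis "losing one replica does not reduce the success rate" holds in this toy model, while the same experiment against a single-replica workload disproves it.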

This update to the Reliability pillar of the AWS Well-Architected Framework gives you and your teams the tools and information you need to understand your workload reliability. Use it together with the AWS Well-Architected Tool to start creating a plan today, and continue to learn, measure, and improve your cloud workloads.

A huge thank you to everyone who gives us feedback on the tool and whitepapers, and a special thank you to Stephen Beck, Adrian Hornsby, Mahanth Jayadeva, Krupakar Pasupuleti, Jon Steele, and Jon Wright for their help with this update.

Learn more about the new version of Well-Architected and its pillars

Seth Eliot

As a Principal Developer Advocate with AWS, and before that a Principal Reliability Solutions Architect, Seth helps guide AWS customers in how they architect and build resilient, scalable systems in the cloud. He draws on 11 years of experience in multiple engineering roles across the consumer side of Amazon.com, where, as Principal Solutions Architect, he worked hands-on with engineers to optimize how they use AWS for the services that power Amazon.com. Previously, he was Principal Engineer for Amazon Fresh and International Technologies. Seth joined Amazon in 2005 and soon after helped develop the technology that would become Prime Video. You can follow Seth on Twitter @setheliot, or on LinkedIn at https://www.linkedin.com/in/setheliot/.