Establishing Feedback Loops Based on the AWS Well-Architected Framework Review
The AWS Well-Architected Framework helps customers build secure, high-performing, resilient, and efficient infrastructure for their applications and workloads. The AWS Well-Architected Tool (AWS WA Tool), introduced in 2018, helps customers assess their workloads against the best practices defined by the framework. The report generated by a self-assessment in the AWS WA Tool provides recommendations for improving your workloads.
Many of these recommendations involve establishing processes and continuously improving them using collected data, such as performance metrics or operational logs. A key aspect is defining and documenting these processes.
In this blog post, we show you how to improve your overall architecture by setting up Feedback Loops based on the results of an AWS Well-Architected Review.
Introduction to the AWS Well-Architected Framework and the recommendations
The AWS Well-Architected Framework was originally divided into five pillars: Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization; a sixth pillar, Sustainability, was added in 2021. Each pillar comes with a set of general design principles and architectural best practices. When customers conduct an AWS Well-Architected Review using the AWS WA Tool, they answer a set of questions, and the tool provides recommendations for achieving the best practices based on their responses.
Many of these recommendations relate to setting up a business process. For example, in the Selection area of the Performance Efficiency pillar, the best practice “Define a process for architectural choices” asks customers to define a process that encourages experimentation and benchmarking with the services that could be used for their workload.
Introduction to Feedback Loops in the context of the AWS Well-Architected Framework
At a high level, Feedback Loops provide a mechanism to measure outcomes and evaluate them against expected baselines, so that appropriate action can be taken in response to the feedback.
A Feedback Loop consists of four steps:
- Writing documentation or playbooks that define the requirements the workload must fulfill. This step also includes reviewing the manual or automated workflows that should be part of workload management.
- Setting up a system to gather the required data. The documentation and playbooks define what is monitored and which metrics or logs are required.
- Generating events and alerts when the collected data breaches the threshold values defined in the documentation. These events initiate an automated or manual process and are also used for reporting.
- Reacting as documented in the playbooks. The results of these actions are used to improve the documentation and playbooks from the first step, which closes the Feedback Loop.
A new iteration of the Feedback Loop is triggered by failed procedures as well as by infrastructure or code changes.
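To make the four steps concrete, here is a minimal, self-contained Python sketch of one iteration of a Feedback Loop. It is not an AWS API: the `Playbook` class, the metric names, the threshold values, and the reaction text are hypothetical and only illustrate how documented thresholds drive events and how the outcome of each reaction is written back into the documentation.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    """Step 1: documented requirements (hypothetical thresholds) and the documented reaction."""
    thresholds: dict[str, float]   # metric name -> maximum acceptable value
    reaction: str                  # documented remediation step
    lessons: list[str] = field(default_factory=list)  # outcomes fed back into the documentation


def gather(samples: dict[str, float], playbook: Playbook) -> dict[str, float]:
    """Step 2: keep only the metrics that the playbook says to monitor."""
    return {name: value for name, value in samples.items() if name in playbook.thresholds}


def detect(metrics: dict[str, float], playbook: Playbook) -> list[str]:
    """Step 3: emit one event for every metric that breaches its documented threshold."""
    return [
        f"{name}={value} exceeds {playbook.thresholds[name]}"
        for name, value in metrics.items()
        if value > playbook.thresholds[name]
    ]


def react(events: list[str], playbook: Playbook) -> None:
    """Step 4: apply the documented reaction and record the outcome in the playbook."""
    for event in events:
        # In a real workload this would be a manual or automated remediation.
        playbook.lessons.append(f"{event}: applied '{playbook.reaction}'")


def run_iteration(samples: dict[str, float], playbook: Playbook) -> list[str]:
    """Run steps 2 to 4 once and return the events that were raised."""
    events = detect(gather(samples, playbook), playbook)
    react(events, playbook)
    return events
```

For example, with thresholds for `cpu_percent` and `p95_latency_ms`, a sample containing an unmonitored `disk_percent` value is ignored, because the playbook does not ask for it. That gap would itself be a lesson for the next iteration of the documentation.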
Illustrating the Feedback Loop for the Performance Efficiency pillar of the AWS Well-Architected Framework
The Performance Efficiency pillar of the AWS Well-Architected Framework focuses on using computing resources efficiently to meet system requirements, and on maintaining that efficiency as demand changes and technologies evolve. In this section, we go through the best practices of this pillar and map them to the steps of the Feedback Loop.
The first questions for the Performance Efficiency pillar concentrate on the Selection process for the preferred architecture and solution:
- PERF 1: How do you select the best performing architecture?
- PERF 2: How do you select your compute solution?
- PERF 3: How do you select your storage solution?
- PERF 4: How do you select your database solution?
- PERF 5: How do you configure your networking solution?
The AWS Well-Architected Framework emphasizes the importance of a data-driven selection process. AWS solutions architects, reference architectures, and the AWS Partner Network can help during the information-gathering phase. For your compute solution, consider the workload's performance and cost requirements and how to keep meeting them as demand changes. For storage and database solutions, consider the data access patterns and choose the solution that best fits how the data is stored and accessed. Because the network connects all components, it has a significant impact on performance; consider bandwidth, jitter, latency, and throughput to meet your workload requirements.
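A data-driven selection can be as simple as filtering candidates by a measured requirement and then optimizing for cost. The following Python sketch assumes you have already benchmarked each option yourself; the `Candidate` fields, the option names, and the single latency requirement are hypothetical, and a real selection would usually weigh more dimensions than latency and cost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str              # hypothetical label for an option under evaluation
    p95_latency_ms: float  # measured in your own benchmark
    hourly_cost: float     # taken from your own pricing research


def select(candidates: list[Candidate], max_p95_latency_ms: float) -> Optional[Candidate]:
    """Return the cheapest candidate that meets the latency requirement, or None if none does."""
    eligible = [c for c in candidates if c.p95_latency_ms <= max_p95_latency_ms]
    return min(eligible, key=lambda c: c.hourly_cost, default=None)
```

With placeholder numbers such as `Candidate("option-a", 180.0, 0.40)`, `Candidate("option-b", 240.0, 0.20)`, and `Candidate("option-c", 310.0, 0.10)` and a 250 ms requirement, `select` returns `option-b`: `option-c` is cheaper but fails the requirement. Returning `None` instead of a fallback makes an unmet requirement explicit, which is itself input for the next iteration of the documentation.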
Most of the selection process consists of researching options and choosing a solution for deploying your workload. Document the reasoning behind these decisions. Because requirements change and new services and features are launched, selection is an ongoing process, so review this documentation regularly.
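One way to keep the documented reasoning from going stale is to attach a review date to each decision. This Python sketch is an illustration, not an AWS feature: the `DecisionRecord` structure and the 90-day default review interval are assumptions you would adapt to your own documentation process.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class DecisionRecord:
    component: str        # e.g. "compute", "storage", "database", "network"
    choice: str           # the selected solution
    reasoning: str        # why it was chosen (benchmarks, access patterns, constraints)
    decided_on: date
    review_every: timedelta = timedelta(days=90)  # assumed cadence; choose your own

    def next_review(self) -> date:
        return self.decided_on + self.review_every


def due_for_review(records: list[DecisionRecord], today: date) -> list[DecisionRecord]:
    """Return the decisions whose scheduled review date has been reached."""
    return [r for r in records if r.next_review() <= today]
```

Running `due_for_review` on a schedule turns the review into an event, so it can enter the same Feedback Loop as operational alerts.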
Set up performance metrics for compute, storage, database, and network so you can verify that the requirements are met. Additionally, create a dashboard for monitoring these metrics and send alerts when thresholds are breached or errors occur. Use that data to adjust the documentation in the next iteration of the Feedback Loop. The Performance Efficiency pillar also recommends benchmarking and load testing to confirm that the results meet expectations.
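As a sketch of the alerting step, the following Python function builds the keyword arguments for the Amazon CloudWatch `PutMetricAlarm` API (`put_metric_alarm` in boto3) for the `CPUUtilization` metric of an EC2 instance. Building a plain dictionary keeps the example testable without AWS credentials. The instance ID, topic ARN, alarm name, and the 80 percent default threshold are hypothetical placeholders; in a Feedback Loop the threshold should come from your documentation.

```python
def build_alarm_request(instance_id: str, topic_arn: str, threshold_percent: float = 80.0) -> dict:
    """Keyword arguments for CloudWatch PutMetricAlarm on an EC2 instance's CPU utilization."""
    return {
        "AlarmName": f"{instance_id}-high-cpu",  # hypothetical naming convention
        "AlarmDescription": "CPU above the documented threshold; see the performance playbook",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds; matches EC2 basic monitoring granularity
        "EvaluationPeriods": 3,   # alarm only after 15 minutes above the threshold
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
        "AlarmActions": [topic_arn],  # notify the SNS topic when the alarm fires
    }


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**build_alarm_request(instance_id, topic_arn))
```

The same pattern applies to storage, database, and network metrics by changing the namespace, metric name, and dimensions.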
The question in the Review area is:
- PERF 6: How do you evolve your workload to take advantage of new releases?
Addressing this question starts with researching where and when new releases are announced, and then defining a process in the playbook for handling these announcements. Next, set up alerts or events so that you are informed about relevant new services and features. Document the results in the next iteration of the Feedback Loop.
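If announcements are published as an RSS feed (for example, the AWS What's New feed), a small filter can turn new releases into events for the services your workload uses. This Python sketch only parses RSS 2.0 XML that you have already downloaded; the keyword matching, the function name, and the sample items used for testing are assumptions, and the feed location should be confirmed before you rely on it.

```python
import xml.etree.ElementTree as ET


def relevant_announcements(rss_xml: str, keywords: set[str]) -> list[str]:
    """Return titles of RSS <item> elements whose title or description mentions a keyword."""
    root = ET.fromstring(rss_xml)
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        text = f"{title} {description}".lower()
        # Case-insensitive substring match against the services named in the playbook.
        if any(keyword.lower() in text for keyword in keywords):
            matches.append(title)
    return matches
```

You could download the feed with `urllib.request.urlopen`, decode the body, and run this filter on a schedule, sending any matches to the team that owns the playbook.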
The next question in the Performance Efficiency pillar is:
- PERF 7: How do you monitor your resources to ensure they are performing?
This question is already covered by the Feedback Loops set up for the Selection process. This illustrates a side benefit of Feedback Loops: following the recommendations in one area can also address aspects of related recommendations.
The last question is:
- PERF 8: How do you use tradeoffs to improve performance?
This question concentrates on trading consistency, durability, and space against time or latency to deliver higher performance. After you adjust the workload, use the metrics already set up during the selection process to monitor the impact of these changes. Additionally, load test your workload to verify that it meets the new requirements.
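To decide whether a tradeoff paid off, compare the same statistic before and after the change and check it against the documented requirement. This Python sketch uses the 95th percentile latency computed with `statistics.quantiles`. The choice of p95 as the comparison statistic and the function names are assumptions; the samples would come from your load tests or CloudWatch metrics.

```python
import statistics


def p95(samples_ms: list[float]) -> float:
    """95th percentile of latency samples (quantiles(n=100) returns 99 cut points)."""
    return statistics.quantiles(samples_ms, n=100)[94]


def evaluate_tradeoff(baseline_ms: list[float], after_change_ms: list[float],
                      requirement_ms: float) -> dict:
    """Compare p95 latency before and after a tradeoff against the documented requirement."""
    before, after = p95(baseline_ms), p95(after_change_ms)
    return {
        "p95_before_ms": before,
        "p95_after_ms": after,
        "improved": after < before,
        "meets_requirement": after <= requirement_ms,
    }
```

Reporting both `improved` and `meets_requirement` matters: a change can improve latency and still miss the requirement, which is a signal to revisit the documented tradeoff in the next iteration.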
Example architecture for remediation of an Amazon EC2 instance in a faulty state
Our example architecture consists of an Amazon Elastic Compute Cloud (Amazon EC2) instance running an application. We describe a Feedback Loop based on a question from the Operational Excellence pillar: “OPS 10: How do you manage workload and operations events?”
We define a faulty state for that workload and how to recover from it. Amazon EC2 instances report host metrics to Amazon CloudWatch by default. In addition, we set up custom metrics for the application. CloudWatch alarms on these metrics generate events when the workload enters the faulty state, and we configure CloudWatch to send these events to an Amazon Simple Notification Service (Amazon SNS) topic, which notifies an operator when manual remediation is required. After reviewing the results of the remediation, update the documentation. Once the documentation has evolved into a detailed step-by-step guide, you can automate the remediation so that events are handled without human intervention. In this example architecture, an AWS Lambda function runs the automated remediation steps.
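The automated remediation could look like the following Python sketch of a Lambda function subscribed to the SNS topic. Several parts are assumptions: rebooting the instance stands in for whatever your playbook documents, the function names are hypothetical, and the parser assumes the CloudWatch alarm notification format in which the SNS `Message` is a JSON string whose `Trigger.Dimensions` entries use lowercase `name` and `value` keys, so verify it against a real notification first. The EC2 client is passed in so the logic can be tested without AWS access.

```python
import json


def instance_ids_from_alarm(event: dict) -> list[str]:
    """Extract EC2 instance IDs from SNS-delivered CloudWatch alarm notifications."""
    ids = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("NewStateValue") != "ALARM":
            continue  # only remediate when the alarm has entered the ALARM state
        for dimension in message.get("Trigger", {}).get("Dimensions", []):
            if dimension.get("name") == "InstanceId":
                ids.append(dimension["value"])
    return ids


def remediate(event: dict, ec2_client) -> list[str]:
    """Run the documented remediation step (here, assumed to be a reboot)."""
    ids = instance_ids_from_alarm(event)
    if ids:
        ec2_client.reboot_instances(InstanceIds=ids)
    return ids


def lambda_handler(event, context):
    # Imported lazily so the parsing logic above can be tested without boto3 installed.
    import boto3
    return {"rebooted": remediate(event, boto3.client("ec2"))}
```

The function's execution role would need permission for `ec2:RebootInstances`. Logging each remediation, for example to CloudWatch Logs, provides the data for reviewing and updating the playbook in the next iteration.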
In this article, you saw how to approach the AWS Well-Architected Review recommendations by setting up Feedback Loops. To learn more about the Well-Architected Framework and its best practices, see AWS Well-Architected Framework Overview. Follow Get started with the AWS Well-Architected Tool to get recommendations for your own workload and start defining your own Feedback Loops.