AWS Cloud Operations Blog

How to perform a Well-Architected Framework Review - Part 3

In previous blog posts, we discussed the first two phases of running a Well-Architected Framework Review, or WAFR. The first phase is to Prepare and the second phase is to conduct the Review. In this blog post, we dive deep into the third phase: Improve.


Figure 1: WAFR Phases

What is the Improve phase?

At this point in reviewing your workload’s architecture against AWS best practices, you should have done the necessary preparation for your review, as we discussed earlier, and completed the actual review following these recommendations. As a result, you should have identified architecture risks based on the answers you collected during the review. We call these risks High Risk Issues (HRIs) and Medium Risk Issues (MRIs); more on that later. During the Improve phase, you create an Improvement Plan (sometimes called a Treatment Plan), which means creating a list of these risks, understanding their impact on your business, finding solutions, and, lastly, implementing these solutions according to your organization’s priorities.

The following cycle shows you the main steps included in the Improve phase of WAFR. We will dive into the details of each step.

[Figure: WAFR Improve steps]

1- Identify risks (aka: improvement opportunities)

[ETA: 1 day]

There are two categories of risks used in the WAFR context: High Risk Issues (HRIs) and Medium Risk Issues (MRIs). HRIs are architectural and operational choices that may cause significant negative impact to a business. They may affect organizational operations, assets, and individuals. An example of an HRI on the Security Pillar is not securing your AWS account. MRIs might also impact your business negatively, but to a lesser extent than HRIs. An example of an MRI on the Security Pillar is not auditing and rotating credentials periodically.

Generating HRIs/MRIs report

The first step to visually identify HRIs and MRIs is to generate a report that shows the risks for each workload you reviewed. The AWS Well-Architected Tool (AWS WA Tool) dashboard gives you access to your workloads and their associated HRIs and MRIs. You can also include workloads that have been shared with you. Using the dashboard, you can filter the issues by workload, pillar, or severity (high or medium).

This diagram shows you an example of a dashboard with several sample workloads.

[Figure: AWS WA Tool dashboard]

If I scroll down, I see a list of HRIs/MRIs. I can filter by pillar or severity. For example, this is a list of HRIs/MRIs found on the Reliability Pillar. Once I select an improvement item, it takes me directly to the best practices associated with it from the Well-Architected Framework. From there, I can read about the recommended action I need to take to remediate the issue, along with the necessary resources.

To combine all these findings in one report, you can select Generate report from the WA Tool Dashboard.
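If you want the same filtered list outside the console, the dashboard filters described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not a documented procedure from this post: it assumes a client object that exposes `list_lens_review_improvements` the way the boto3 `wellarchitected` client does, and the helper name `list_risks`, the `<workload-id>` placeholder, and the `reliability` pillar ID in the usage comment are illustrative.

```python
from typing import Any, Dict, Iterator, Optional


def list_risks(
    client: Any,
    workload_id: str,
    pillar_id: Optional[str] = None,
    risk: Optional[str] = None,
    lens_alias: str = "wellarchitected",
) -> Iterator[Dict[str, Any]]:
    """Yield improvement items for one workload, optionally filtered.

    `client` is assumed to behave like boto3.client("wellarchitected").
    `risk` is "HIGH" or "MEDIUM"; None yields both HRIs and MRIs.
    """
    wanted = {risk} if risk else {"HIGH", "MEDIUM"}
    params: Dict[str, Any] = {"WorkloadId": workload_id, "LensAlias": lens_alias}
    if pillar_id:
        params["PillarId"] = pillar_id
    while True:
        page = client.list_lens_review_improvements(**params)
        for item in page.get("ImprovementSummaries", []):
            if item.get("Risk") in wanted:
                yield item
        # Follow pagination until the service stops returning a token.
        token = page.get("NextToken")
        if not token:
            break
        params["NextToken"] = token


# Hypothetical usage (requires boto3 and AWS credentials):
# import boto3
# client = boto3.client("wellarchitected")
# for item in list_risks(client, "<workload-id>", pillar_id="reliability", risk="HIGH"):
#     print(item["QuestionTitle"], item["ImprovementPlanUrl"])
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.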

I recommend that you share this report with the review team in your organization. I usually send a recap email to my customers summarizing what we did, the key findings, and the suggested improvement plan to prepare them for the next step.

2- Understand risks

[ETA: 2-3 weeks]

Before addressing a risk, it is important to understand its potential severity and impact on your business, the value that addressing it brings to your organization, and the effort your team needs to implement the improvement.

When evaluating the level of risk to your business based on the HRI and MRI definitions, consider asking these questions:

  • What is the likelihood of the risk resulting in an impact?
  • What would be the customer impact?
  • What would be the business impact as a result?
  • Can the risk be removed entirely, or only mitigated?
  • Who owns the risk?
  • Who owns the improvement work to remove or mitigate?

Having key stakeholders or business owners answer these questions will help create a list of the most important risks to focus on along with the projected time to address them.
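One way to keep the answers to these questions comparable across risks is to record them in a small risk register. The sketch below is an illustration, not part of the WAFR process itself: the 1-5 scoring scale, the `exposure` heuristic (likelihood times the worse of the two impacts), and all field names are assumptions you should adapt to your own organization.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RiskEntry:
    risk_id: str            # for example "REL1"
    severity: str           # "HRI" or "MRI"
    likelihood: int         # 1 (rare) to 5 (almost certain); assumed scale
    customer_impact: int    # 1 (negligible) to 5 (severe); assumed scale
    business_impact: int    # 1 (negligible) to 5 (severe); assumed scale
    removable: bool         # True if removable, False if it can only be mitigated
    risk_owner: str
    improvement_owner: str

    def exposure(self) -> int:
        """Likelihood times the worse of the two impacts (an assumed heuristic)."""
        return self.likelihood * max(self.customer_impact, self.business_impact)


def rank(entries: List[RiskEntry]) -> List[RiskEntry]:
    """Highest exposure first; HRIs ahead of MRIs when exposure ties."""
    return sorted(entries, key=lambda e: (-e.exposure(), e.severity != "HRI"))
```

Sorting by a single number is a simplification. Stakeholders can still override the order, but the register makes the reasoning behind each position visible.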

Let me use my fictional workload to show you an example.

After I have a conversation with my team about HRI/MRI and the risks they bring to the business, I identify the following HRIs that need to be addressed.

[Figure: Understand risks]

3- Determine prescriptive solutions

[ETA: 4-5 weeks]

Once risks and improvement opportunities are understood in your organization’s context, you need to work with the teams to determine the right prescriptive solution for each risk. In this step, each team works on the HRIs found in its area and determines a prescriptive solution to address each one. This step may require additional research, discussion, or building proofs of concept. It’s important not to jump to the implementation details of a solution in this step. You will do that later if you decide that the HRI in question is a priority for your workload, as discussed in the next step. The purpose of this step is to understand the complexity of the solution and what resources it requires so you can factor them in when building the priority list in step 4.

In my example, I determine the following solutions for the three HRIs.

[Figure: Risks and solutions]

4- Implementation and tracking

[ETA: 3-6 months]

Prioritize first. No organization has unlimited time and resources. Trying to address all HRIs/MRIs identified by a WAFR at once might not be the right way to get the most out of it. I always recommend that my customers start with a select number of HRIs/MRIs that have a big impact on the business and are not too hard to implement. Implement the solution, track the improvement, and then iterate on that approach.

But how do you prioritize the most important items to implement?

One tool that can help you visualize solution priority is an Eisenhower-style plot. There are different ways to use the tool. When evaluating, consider both the importance of the improvement, meaning how much value it brings to your business, and the effort to implement the improvement, meaning hours required, complexity to implement, or headcount.

After doing the analysis, you will have a set of risks that have the most impact on your business, and at the same time, they are not complex to implement. These will be good candidates to start implementing in the first iteration.

Let’s apply this model to our example.

Reviewing the HRIs identified in our example, I determine the following.

This is what my analysis looks like using the plot. After I decide that my priorities are REL1, COST1, and OPS4, I start the implementation, and I repeat the process for the next set of HRIs/MRIs.


Figure 9: Solutions prioritization considering impact/complexity
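The importance-versus-effort sort can also be sketched in code. This is a rough illustration of the idea rather than the exact analysis used here: the 1-10 scores are invented, the quadrant labels and the threshold of 5 are assumptions, and `SEC2` and `PERF3` are hypothetical IDs added only to populate the other quadrants.

```python
from typing import Dict, List, Tuple

# risk_id -> (importance, effort), each scored 1-10. All scores are made up.
Scores = Dict[str, Tuple[int, int]]


def quadrant(importance: int, effort: int, threshold: int = 5) -> str:
    """Place one item in a quadrant of the importance/effort plot."""
    high_value = importance > threshold
    low_effort = effort <= threshold
    if high_value and low_effort:
        return "do first"
    if high_value:
        return "plan"
    if low_effort:
        return "fill in"
    return "defer"


def first_iteration(scores: Scores, threshold: int = 5) -> List[str]:
    """IDs in the high-importance, low-effort quadrant, most important first."""
    picks = [
        (imp, eff, rid)
        for rid, (imp, eff) in scores.items()
        if quadrant(imp, eff, threshold) == "do first"
    ]
    picks.sort(key=lambda p: (-p[0], p[1], p[2]))
    return [rid for _, _, rid in picks]


sample: Scores = {
    "REL1": (9, 3),
    "COST1": (8, 4),
    "OPS4": (7, 2),
    "SEC2": (9, 8),   # hypothetical: valuable but expensive
    "PERF3": (3, 2),  # hypothetical: cheap but low value
}
print(first_iteration(sample))  # ['REL1', 'COST1', 'OPS4']
```

The threshold is the part most worth debating with your team, because it decides what counts as "not complex" for your organization.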

Solution characteristics

When selecting a solution for an identified risk, consider the following:

  • S.M.A.R.T: Think of the solutions from a SMART perspective. A good solution should have a Specific outcome and be Measurable, Achievable, Relevant to the issue, and Time-bound.
  • Owners: For every solution, identify an owner.
  • Simple and not complex: Complex solutions can work but they make the improvement harder and longer. Always choose simplicity over complexity.
  • Two-way door solution: Solutions should be extensible and designed to improve and evolve over time. When possible, avoid static solutions that cannot adapt as your architecture develops.
  • Pattern-based: Target solutions that can be codified, reused, and re-shared. Don’t reinvent the wheel. Check here for some examples.

Timeline

You might be asking: What is a typical timeline to go through these steps? There is no single answer to that. Every organization is different and has its own unique challenges. However, from what I have seen in successful WAFRs with many customers, I recommend that this phase take 90-180 days. If your list of HRIs/MRIs takes longer, I recommend that you prioritize them and come up with a shorter list so that you can start practicing the process and see some improvement. Then you can repeat the process on the remaining items.
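To see how a long list breaks into iterations that fit this window, you can split a priority-ordered list by estimated effort. The sketch below makes simplifying assumptions that are not from this post: effort is measured in calendar days, items are worked one after another, and the function name `plan_iterations` and its default 90-day window are hypothetical.

```python
from typing import List, Tuple


def plan_iterations(
    items: List[Tuple[str, int]], window_days: int = 90
) -> List[List[str]]:
    """Split a priority-ordered list of (risk_id, effort_days) into iterations.

    Items keep their priority order. A new iteration starts when the next
    item would push the current one past `window_days`. An item larger than
    the window gets an iteration of its own.
    """
    iterations: List[List[str]] = []
    current: List[str] = []
    used = 0
    for risk_id, effort_days in items:
        if current and used + effort_days > window_days:
            iterations.append(current)
            current, used = [], 0
        current.append(risk_id)
        used += effort_days
    if current:
        iterations.append(current)
    return iterations
```

If the first iteration still looks too large, shrink the window or trim the list, and treat the later iterations as a backlog to revisit.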

Summary

In this blog post, I walked you through the steps you take to develop an improvement plan to address the HRIs/MRIs identified in your architecture as a result of conducting a WAFR. Before developing the improvement plan, you need to understand and analyze the risks, determine their solutions, and then decide on a priority approach to target the most impactful ones. I shared some tools and resources to help you achieve that. I also shared some characteristics that make good solutions. Your next step is to talk to your team about the importance of conducting a Well-Architected Framework Review (WAFR) for a few workloads in your organization.

About the author

Ebrahim (EB) Khiyami

Ebrahim (EB) Khiyami is a Senior Solutions Architect with AWS. He is a specialist on the AWS Well-Architected Framework and an SME on Migration and Disaster Recovery. When he is not working, he’s often found playing, watching or coaching soccer.