AWS Cloud Operations & Migrations Blog

Manage workload risks using the AWS Well-Architected Tool and AWS Systems Manager

You can use the AWS Well-Architected Tool (AWS WA Tool) to identify and remediate risks in your workloads that map to the five pillars of the AWS Well-Architected Framework: operational excellence, security, reliability, performance efficiency, and cost optimization. The AWS WA Tool helps you identify and address vulnerabilities before they negatively impact your business. As the number of workloads increases, it can be a challenge to manage and prioritize which risks to address first.

By tracking all risks in a single location, you can better understand which risks are related, prioritize them, and implement the best practices to mitigate them. A single location also provides an audit trail, so you know when the best practices are implemented and risks are mitigated. This information comes in handy during a compliance audit. By automating the process of tracking these risks and updating workloads in the AWS WA Tool when you’ve implemented the best practices, you have a single source of truth. With the launch of APIs for the AWS WA Tool, you can programmatically access the tool to extend the best practices, measurements, and lessons into your own workflows.

In this post, we show you how to use the AWS WA Tool API to create OpsItems in AWS Systems Manager OpsCenter to track the best practices missing from the workloads. We automate this process by using an AWS Lambda function and use Amazon DynamoDB to maintain state and prevent duplicate OpsItems from being created. You can then view, investigate, and resolve the OpsItems in a single location and automatically update the risk status of the workload in the AWS WA Tool.

Architecture

Numbered arrows show the interaction between the services used in the solution, including Lambda, DynamoDB, OpsCenter, and Amazon SNS. The flow is described in the post.

Figure 1: Interaction between services used in the solution

  1. A Lambda function is invoked periodically using Amazon EventBridge.
  2. This function makes API calls to the AWS WA Tool to retrieve workload details, such as the number of high-risk issues (HRIs) and medium-risk issues (MRIs), missing best practices, and improvement plans.
  3. Using this information, the Lambda function creates OpsItems in OpsCenter to facilitate the tracking of risks for all workloads in an AWS account, in the AWS Region where the function is deployed.
  4. The Lambda function updates DynamoDB, which is used to maintain state so that the duplicate OpsItems are not created for the same missing best practice in a workload.
  5. After implementing the best practices, you can set the status of the OpsItem to resolved in OpsCenter. This triggers a notification to an Amazon Simple Notification Service (Amazon SNS) topic.
  6. The message is used to invoke a second Lambda function.
  7. The second Lambda function makes an API call to the AWS WA Tool to update the workload and reflect the implementation of the best practices.
  8. The function also updates the workload state maintained in the DynamoDB database.
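
As an illustration of steps 3 and 4, the following minimal sketch shows one way to make the duplicate check atomic with a DynamoDB conditional write. It is not the code in the downloadable package. The table name and the WorkloadId and ChoiceId key attributes are assumptions made for this example, and the client is passed in (for example, boto3.client("dynamodb")) so the logic can be tested without AWS access.

```python
def record_if_new(ddb_client, table_name, workload_id, choice_id):
    """Return True if this (workload, best practice) pair was not tracked yet.

    The conditional put succeeds only when no item with the same primary key
    exists, so two overlapping runs cannot both claim the same best practice.
    """
    try:
        ddb_client.put_item(
            TableName=table_name,
            Item={
                "WorkloadId": {"S": workload_id},  # hypothetical partition key
                "ChoiceId": {"S": choice_id},      # hypothetical sort key
            },
            ConditionExpression="attribute_not_exists(WorkloadId)",
        )
        return True
    except ddb_client.exceptions.ConditionalCheckFailedException:
        # The item already exists, so an OpsItem was created on an earlier run.
        return False
```

A caller would create an OpsItem only when record_if_new returns True.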

Note

  • The implementation of the solution described in this post covers the AWS Well-Architected Framework lens in a single Region only, but you can extend it to retrieve workload data for other lenses available in the AWS WA Tool from multiple Regions and accounts.
  • If some best practices do not apply to your workload, you can mark them as not applicable in the AWS WA Tool. This step is not part of this solution.
  • Depending on the number of workloads defined in the AWS WA Tool and the number of best practices missing in each workload, this solution might create numerous OpsItems. If you are only testing the solution, be sure to follow the cleanup steps at the end of the blog post to prevent unnecessary charges in your account.

Prerequisites

Deploy an AWS CloudFormation stack using the risk_management.yaml template. For instructions, see Creating a stack on the AWS CloudFormation console. The stack creates a DynamoDB table that is used to maintain state, an SNS topic that forwards messages from OpsCenter to Lambda, and an AWS Identity and Access Management (IAM) role that the Lambda functions use at runtime. Make a note of the keys and values in the outputs of the CloudFormation stack. You will need them later.

Define and document the workload state for one or more workloads in the AWS WA Tool. You can skip this step if you already have workloads defined and documented in the tool.

Walkthrough

Workload risk data is retrieved programmatically by a Lambda function that calls the AWS WA Tool API. The function uses the ListWorkloads, GetWorkload, GetAnswer, and ListLensReviewImprovements API actions to retrieve the workload risk data that is used to create OpsItems. To prevent duplicate OpsItems from being created for the same missing best practices in a workload, the Lambda function writes best practices data to a DynamoDB table to maintain state.
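
The retrieval described above could look roughly like the following sketch. It is an illustration under assumptions, not the function in the download: the helper names and the shape of the returned dictionaries are made up for this example, and wa_client stands for a client such as boto3.client("wellarchitected"). GetWorkload is left out because ListWorkloads already returns the workload name and ARN.

```python
def paginate(call, result_key, **kwargs):
    """Yield every item from a list-style call, following NextToken."""
    while True:
        page = call(**kwargs)
        yield from page.get(result_key, [])
        token = page.get("NextToken")
        if not token:
            return
        kwargs["NextToken"] = token


def missing_best_practices(wa_client, workload_id, question_id, lens_alias):
    """Return the choices of a question that are not selected for the workload."""
    answer = wa_client.get_answer(
        WorkloadId=workload_id, LensAlias=lens_alias, QuestionId=question_id
    )["Answer"]
    selected = set(answer.get("SelectedChoices", []))
    # Caveat: an unselected "None of these" choice is returned as well;
    # a fuller implementation would probably skip it.
    return [c for c in answer.get("Choices", []) if c["ChoiceId"] not in selected]


def collect_risks(wa_client, lens_alias="wellarchitected"):
    """Collect one finding per missing best practice on HIGH or MEDIUM risk questions."""
    findings = []
    for workload in paginate(wa_client.list_workloads, "WorkloadSummaries"):
        improvements = paginate(
            wa_client.list_lens_review_improvements,
            "ImprovementSummaries",
            WorkloadId=workload["WorkloadId"],
            LensAlias=lens_alias,
        )
        for item in improvements:
            if item.get("Risk") not in ("HIGH", "MEDIUM"):
                continue
            for choice in missing_best_practices(
                wa_client, workload["WorkloadId"], item["QuestionId"], lens_alias
            ):
                findings.append({
                    "workload_id": workload["WorkloadId"],
                    "workload_name": workload.get("WorkloadName"),
                    "workload_arn": workload.get("WorkloadArn"),
                    "pillar": item.get("PillarId"),
                    "question_id": item["QuestionId"],
                    "risk": item["Risk"],
                    "choice_id": choice["ChoiceId"],
                    "best_practice": choice.get("Title"),
                    "improvement_plan": item.get("ImprovementPlanUrl"),
                })
    return findings
```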

Create the Lambda function

  1. In the AWS Lambda console, choose Create function.
  2. Choose Author from scratch. For the function name, enter wa-risk-tracking. Choose Python 3.8 for the runtime.
  3. Under Permissions, expand Change default execution role. Choose Use an existing role and then choose wa-risk-tracking-lambda-role. This IAM role was created as part of the CloudFormation stack you deployed in the prerequisites. It provides least privilege permission for the Lambda function to make API calls to DynamoDB, AWS WA Tool, and Systems Manager. To check the permissions this role provides, choose View the wa-risk-tracking-lambda-role role on the IAM console.
  4. Choose Create function. Lambda provisions a new function that uses the IAM role.

The Author from scratch option is selected. In Function name, wa-risk-tracking is entered. Under Runtime, Python 3.8 selected from the drop-down list. Under Existing role, wa-risk-tracking-lambda-role is selected from the drop-down list.

Figure 2: Create a Lambda function

  5. Download this Lambda function package and save it locally. On the function details page, in Code source, choose Upload from and .zip file, and then upload the Lambda function package.
  6. In Runtime settings, choose Edit.
  7. Replace the value for Handler with risk_tracking.lambda_handler and then choose Save.

Configure a Lambda trigger

Create an EventBridge trigger for the Lambda function so that the function is invoked periodically to check the latest state of workloads in the AWS WA Tool and update OpsCenter.

  1. Under Function overview, choose Add trigger. Under Trigger configuration, choose EventBridge (CloudWatch Events).
  2. Under Rule, choose Create a new rule and then enter wa-risk-tracking-schedule for the rule name.
  3. For Rule type, choose Schedule expression, enter rate(12 hours), and then choose Add. To adjust the rate as appropriate for your use case, check Schedule Expressions for Rules.
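
If you generate the schedule in a script instead of typing it in the console, note that a rate expression uses the singular unit for a value of 1 and the plural otherwise, for example rate(1 hour) and rate(12 hours). The helper below is a small sketch of that rule; the function name is hypothetical.

```python
def rate_expression(value, unit):
    """Build an EventBridge rate expression such as 'rate(12 hours)'.

    unit is given in the singular: 'minute', 'hour', or 'day'.
    """
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be 'minute', 'hour', or 'day'")
    if not isinstance(value, int) or value < 1:
        raise ValueError("value must be a positive integer")
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"
```

For the schedule used in this post, rate_expression(12, "hour") produces the string entered in step 3.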

The fields in the Add trigger section are completed as described in the post.

Figure 3: Add trigger

Add environment variables

Pass the SNS topic ARN used in the solution as an environment variable to the Lambda function.

  1. Choose the Configuration tab of the Lambda function and then choose Environment variables. Choose Edit and then choose Add environment variable.
  2. Under Key, enter sns_topic_arn. Under Value, use the ARN of the SNS topic from the CloudFormation stack output. Choose Save.

The fields in Edit environment variables are completed as described in the post.

Figure 4: Edit environment variables
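
Inside the function, the variable can be read from the process environment. The sketch below shows one way to do that with a basic sanity check so that a missing or mistyped value fails early. The helper name and the check are illustrative and are not taken from the downloadable package.

```python
import os


def get_topic_arn(environ=os.environ):
    """Read the SNS topic ARN from the sns_topic_arn environment variable."""
    arn = environ.get("sns_topic_arn", "")
    parts = arn.split(":")
    # An SNS topic ARN looks like arn:<partition>:sns:<region>:<account-id>:<topic>.
    if len(parts) < 6 or parts[0] != "arn" or parts[2] != "sns":
        raise RuntimeError("sns_topic_arn is missing or is not an SNS topic ARN")
    return arn
```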

Configure the Lambda function timeout

  1. On the Configuration tab of the Lambda function, choose General configuration and then choose Edit.
  2. During testing, we configured the function with 128 MB of memory and a timeout of one minute. It created 54 OpsItems in approximately 19 seconds. Adjust the function timeout based on the number of risks (HRIs and MRIs), missing best practices, and workloads defined in the AWS WA Tool in your account, and then choose Save.

On Basic settings, for Memory, 128 MB is entered. For Timeout, 1 min is entered. Under execution role, Use an existing role radio is selected. Under Existing role, wa-risk-tracking-lambda-role is selected from the drop-down list.

Figure 5: Edit basic settings
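
As a rough starting point for choosing a timeout, you can scale the test measurement above (54 OpsItems in approximately 19 seconds) linearly and add headroom. This is only a back-of-the-envelope sketch: it assumes run time grows in proportion to the number of OpsItems created, which might not hold in your account, and the default safety factor of 3 is an arbitrary choice.

```python
MEASURED_ITEMS = 54       # OpsItems created in the test run described above
MEASURED_SECONDS = 19     # approximate duration of that run
LAMBDA_MAX_TIMEOUT = 900  # Lambda's maximum timeout, in seconds


def estimate_timeout_seconds(expected_items, safety_factor=3):
    """Linearly extrapolate a timeout from the measured run (integer inputs)."""
    if expected_items < 0 or safety_factor < 1:
        raise ValueError("expected_items must be >= 0 and safety_factor >= 1")
    scaled = expected_items * MEASURED_SECONDS * safety_factor
    seconds = -(-scaled // MEASURED_ITEMS)  # ceiling division
    return max(1, min(seconds, LAMBDA_MAX_TIMEOUT))
```

Treat the result as an initial setting and adjust it after observing actual run times.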

Test the solution

Because the Lambda function is set to run on a schedule, we can test what has been implemented so far by manually invoking the function.

  1. In the Lambda console, choose the wa-risk-tracking function.
  2. Choose the Test tab to create a test event.
  3. Under Template, choose hello-world and then choose Test. This will invoke the Lambda function and run the function code. Make sure that the function runs successfully before you continue.

On the Test tab, the New event option is selected. Under Template, hello-world is selected.

Figure 6: Invoke your function with a test event
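
The same manual test can be run from a script. The following sketch invokes the function synchronously and raises an error if the function reports a failure. The helper name is hypothetical, lambda_client stands for a client such as boto3.client("lambda"), and an empty event is sent on the assumption that the function does not read its input.

```python
import json


def invoke_and_check(lambda_client, function_name="wa-risk-tracking", event=None):
    """Invoke a Lambda function synchronously and return its response body."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=json.dumps(event or {}).encode("utf-8"),
    )
    body = response["Payload"].read().decode("utf-8")
    # FunctionError is present only when the function itself failed.
    if response.get("FunctionError"):
        raise RuntimeError(f"{function_name} failed: {body}")
    return body
```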

  4. From the left navigation pane of the Systems Manager console, choose OpsCenter, and then choose OpsItems.

You should see a list of OpsItems with the title <HRI/MRI> – <Your workload name from the AWS WA Tool> – <Best practice missing> with the source Well-Architected. Due to eventual consistency, these OpsItems might not appear immediately. Wait a few minutes and refresh the page.

  5. Choose one of the OpsItems to view its details.

The OpsItems created by the solution are displayed in a table organized by ID, title, status (in this example, Open), source (in this example, Well-Architected), created, and updated.

Figure 7: OpsItems

On the Overview tab of the OpsItem, you will get the ARN of the workload that has the HRI or MRI risk.

The ARN of the workload is displayed in the Related resources section.

Figure 8: Workload ARN

  6. Expand Operational data. You will find information such as the best practice missing from your workload, the pillar where the risk exists, and the link to the improvement plan that provides guidance for implementing this best practice.

Operational data for an OpsItem includes best practices missing (in this example, Processes and procedures have identified owners), risk level (HIGH), workload name, and more.

Figure 9: Operational data
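
To show how an OpsItem like the one above could be assembled, here is a sketch that builds the keyword arguments for the Systems Manager CreateOpsItem action. The operational data key names (WorkloadName, Pillar, and so on), the input dictionaries, and the plain hyphen used as the title separator are assumptions for this example. The /aws/resources key is the one OpsCenter uses to populate Related resources.

```python
import json


def build_ops_item_request(workload, finding, topic_arn):
    """Build keyword arguments for ssm_client.create_ops_item(**request)."""
    label = {"HIGH": "HRI", "MEDIUM": "MRI"}[finding["risk"]]
    title = f'{label} - {workload["name"]} - {finding["best_practice"]}'

    def searchable(value):
        # SearchableString values can be used to filter OpsItems later.
        return {"Value": value, "Type": "SearchableString"}

    return {
        "Title": title,
        "Description": f'Best practice missing: {finding["best_practice"]}',
        "Source": "Well-Architected",
        # OpsCenter publishes to this topic when the OpsItem changes.
        "Notifications": [{"Arn": topic_arn}],
        "OperationalData": {
            "/aws/resources": searchable(json.dumps([{"arn": workload["arn"]}])),
            "WorkloadName": searchable(workload["name"]),
            "Pillar": searchable(finding["pillar"]),
            "RiskLevel": searchable(finding["risk"]),
            "BestPracticeMissing": searchable(finding["best_practice"]),
            "ImprovementPlan": {"Value": finding["improvement_plan"], "Type": "String"},
        },
    }
```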

This solution gives you a central place to track and manage workload risks. You can filter OpsItems based on the operational data by choosing Operational data and entering a key-value pair in JSON in the search box: {"key":"key_name","value":"a_value"}. When teams are working to implement best practices, they can set the status of the OpsItem to In progress. This way, the whole team has visibility into what is being worked on and the status of the work. It prevents duplication of effort.
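
The same operational data filter can be applied through the DescribeOpsItems API action. The sketch below is illustrative: the helper name is hypothetical, ssm_client stands for a client such as boto3.client("ssm"), and the key you pass must match whatever operational data keys your OpsItems actually contain.

```python
import json


def find_ops_items(ssm_client, key, value, source="Well-Architected"):
    """Return IDs of OpsItems from a source whose operational data matches key=value."""
    filters = [
        {"Key": "Source", "Values": [source], "Operator": "Equal"},
        {
            "Key": "OperationalData",
            # Same JSON format as the console search box.
            "Values": [json.dumps({"key": key, "value": value})],
            "Operator": "Equal",
        },
    ]
    ids, token = [], None
    while True:
        kwargs = {"OpsItemFilters": filters}
        if token:
            kwargs["NextToken"] = token
        page = ssm_client.describe_ops_items(**kwargs)
        ids.extend(item["OpsItemId"] for item in page.get("OpsItemSummaries", []))
        token = page.get("NextToken")
        if not token:
            return ids
```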

Automatically update workload

When you’re reviewing workloads to find which best practices have been implemented, the AWS WA Tool should be the single source of truth. Keep the workload state updated so you have an accurate understanding of risks to your workload and business. If you’ve implemented best practices for a workload but haven’t updated the tool, your workload stakeholders won’t have the information they need to make decisions. To prevent this, we automate the process of updating the workload in the AWS WA Tool when a best practice is implemented and an OpsItem is resolved.

We use an SNS topic to trigger a Lambda function. When you resolve an OpsItem that was created as part of the solution, a notification is sent to the SNS topic. This topic invokes a Lambda function that makes an API call to the AWS WA Tool to update the workload and reflect the implementation of the best practice in the OpsItem. The Lambda function also updates state in the DynamoDB table to reflect that this best practice is no longer missing for the workload.
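
The core of this update step can be sketched as follows. This is an illustration under assumptions rather than the code in the downloadable package: the function names are hypothetical, wa_client stands for a client such as boto3.client("wellarchitected"), and the way the workload, question, and choice IDs are recovered from the OpsItem notification is not shown because the message format is not described in this post. The sketch also does not handle a question whose current answer is "None of these".

```python
def sns_messages(event):
    """Extract the raw message strings from an SNS-triggered Lambda event."""
    return [record["Sns"]["Message"] for record in event.get("Records", [])]


def mark_best_practice_implemented(
    wa_client, workload_id, question_id, choice_id, lens_alias="wellarchitected"
):
    """Add choice_id to the selected choices of a question, if it is not there yet.

    Returns the resulting list of selected choice IDs.
    """
    answer = wa_client.get_answer(
        WorkloadId=workload_id, LensAlias=lens_alias, QuestionId=question_id
    )["Answer"]
    selected = list(answer.get("SelectedChoices", []))
    if choice_id in selected:
        return selected  # Already recorded; nothing to update.
    selected.append(choice_id)
    # UpdateAnswer replaces the whole selection, so send the merged list.
    wa_client.update_answer(
        WorkloadId=workload_id,
        LensAlias=lens_alias,
        QuestionId=question_id,
        SelectedChoices=selected,
    )
    return selected
```

Reading the current answer first matters because sending only the new choice would clear the best practices that were already selected.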

Create the Lambda function

  1. In the AWS Lambda console, choose Create function.
  2. Choose Author from scratch. For the function name, enter wa-update-workload. Choose Python 3.8 for the runtime.
  3. Under Permissions, expand Change default execution role. Choose Use an existing role and then choose wa-risk-tracking-lambda-role.
  4. Choose Create function. Lambda provisions a new function that uses the IAM role.
  5. Download this Lambda function package and save it locally. On the function details page, in Code source, choose Upload from and .zip file, and then upload the Lambda function package.
  6. In Runtime settings, choose Edit.
  7. Replace the value for Handler with update_workload.lambda_handler and then choose Save.

Configure the Lambda trigger

  1. Under Function overview, choose Add trigger.
  2. Under Trigger configuration, choose SNS.
  3. Under SNS topic, choose wa-risk-tracking, and then choose Add.

In Add trigger, the fields are selected as described in the post.

Figure 10: Trigger configuration

Test the implementation

  1. In the left navigation pane of the Systems Manager console, choose OpsCenter, and then choose one of the OpsItems with a source of Well-Architected.
  2. Expand Operational data and make a note of the workload name, pillar, question title, and missing best practice.
  3. Open a new browser tab and navigate to the AWS WA Tool console and then open the workload that was listed in the OpsItem.
  4. On the Overview page, in Lenses, choose AWS Well-Architected Framework.
  5. In Pillars, choose the pillar that was listed in the OpsItem.
  6. In Questions, expand Answer details for the question in the OpsItem. The best practice listed in the OpsItem does not appear under Selected choice(s).

The OPS 2 question is How do you structure your organization to support your business outcomes? Answer details section is expanded to show a bulleted list.

Figure 11: Answer details

  7. Let’s assume that you have implemented the best practice listed in this OpsItem and there is no remaining work to do. On the OpsItem details page in the Systems Manager console, choose Set status, and then choose Resolved.

Under the Set status menu, Resolved is selected. The OpsItem ARN is displayed.

Figure 12: Resolving the OpsItem

  8. Go back to the AWS WA Tool console and refresh the page.
  9. In Questions, expand Answer details for the question in the OpsItem. The best practice listed in the OpsItem should appear under Selected choice(s).

For question OPS 2, the new best practice appears under Selected choice(s).

Figure 13: Processes and procedures have identified owners

When you resolve the OpsItem, a notification is sent to the wa-risk-tracking SNS topic, which invokes the wa-update-workload Lambda function. The function updates the workload in the AWS WA Tool to reflect the best practice in the OpsItem that was implemented. With this approach, the tool remains the single source of truth for information about your workloads and workload risks.

Cleanup

  • Open the CloudFormation console and delete the stack you created as part of the prerequisites.
  • Open the Lambda console and delete the wa-risk-tracking and wa-update-workload functions.
  • Open the EventBridge console and delete the wa-risk-tracking-schedule rule.
  • In the Systems Manager console, set the status of OpsItems with a source of Well-Architected to Resolved. Because the solution might have generated a large number of OpsItems, you can use a script to call the UpdateOpsItem API action and mark the status of OpsItems to Resolved.
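
A script for the last cleanup step could look like the following sketch. The function name is hypothetical and ssm_client stands for a client such as boto3.client("ssm"). It collects the matching OpsItem IDs first and updates them afterward, so that changing the status does not interfere with pagination.

```python
def resolve_ops_items(ssm_client, source="Well-Architected"):
    """Set every Open or InProgress OpsItem from the given source to Resolved."""
    ids = []
    for status in ("Open", "InProgress"):
        filters = [
            {"Key": "Source", "Values": [source], "Operator": "Equal"},
            {"Key": "Status", "Values": [status], "Operator": "Equal"},
        ]
        token = None
        while True:
            kwargs = {"OpsItemFilters": filters}
            if token:
                kwargs["NextToken"] = token
            page = ssm_client.describe_ops_items(**kwargs)
            ids.extend(s["OpsItemId"] for s in page.get("OpsItemSummaries", []))
            token = page.get("NextToken")
            if not token:
                break
    for ops_item_id in ids:
        ssm_client.update_ops_item(OpsItemId=ops_item_id, Status="Resolved")
    return ids
```

Run it after deleting the wa-update-workload function if you do not want the resolved OpsItems to update your workloads in the AWS WA Tool.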

Conclusion

In this post, we shared a solution you can use to create OpsItems for best practices missing from your workloads. This solution offers a centralized view of workload risks, which helps you make informed decisions about prioritization and prevent duplication of effort.

Implement this solution in your environment for better tracking of workload risks. You can modify it for use with other ITSM solutions, such as Atlassian Jira, ServiceNow, or Zendesk.

For more information on identifying and managing workload risks, check out:

Hands-on lab

Try out the AWS Well-Architected lab on managing workload risks for a hands-on experience.

About the authors

Mahanth Jayadeva

Mahanth Jayadeva is a Solutions Architect at Amazon Web Services (AWS) on the AWS Well-Architected team. He works with customers and AWS Partner Network partners of all sizes to help them build secure, high-performing, resilient, and efficient infrastructure for their applications. He spends his free time playing with his pup, Cosmo, and learning more about astronomy. He is an avid gamer.

Andrew Robinson

Andrew is a UK-based Senior Solutions Architect at AWS. As part of the Well-Architected team, he is the Geo lead for Well-Architected across EMEA. He works with AWS Partner Network partners and customers to help them build secure, high-performing, resilient, and efficient infrastructure for their applications. Andrew has more than 14 years of experience in the tech industry working in support, consultancy, engineering, operations, and architecture in the manufacturing, retail, and public sectors.