AWS Cloud Operations Blog

Troubleshooting AWS Systems Manager patching made easy with Amazon Bedrock’s automated recommendations

Keeping your AWS infrastructure up-to-date and secure is a critical part of maintaining a robust and reliable cloud environment. AWS Systems Manager’s patching capabilities are a powerful tool in this effort, allowing you to automatically apply the latest security updates and bug fixes to your managed nodes, including Amazon Elastic Compute Cloud (EC2) instances, on-premises servers and virtual machines (VMs), edge devices, and other cloud VMs.

However, the patching process isn’t always straightforward. Failures can arise during the installation or rollout of patches, leading to failed deployments, unexpected downtime, or other operational challenges. Troubleshooting these problems can be time-consuming and complex, requiring deep expertise in AWS services, Linux/Windows systems administration, and patching best practices.

In this post, we’ll explore how Amazon Bedrock can simplify the troubleshooting process for Systems Manager patching failures. Bedrock’s automated analysis and recommendation capabilities can help you quickly identify the root causes of patching problems and implement the right solutions, saving you valuable time and effort.

Architecture overviewArchitecture for automating the process of generating recommendations related to patch operation failures using Amazon Bedrock.

Figure 1. Architecture for automating the process of generating recommendations related to patch operation failures using Amazon Bedrock.

Here is how the process works:

  1. In a member AWS account and Region, Amazon EventBridge monitors for failed Patch Manager operations.
  2. EventBridge initiates an AWS Step Function to receive and process the event details of the failed patch operation.
  3. The Step Function initiates a Run Command task on the managed node that failed the patch operation.
  4. The Run Command task gathers relevant operating system logs for patch failures and puts the log files in a central Amazon Simple Storage Service (S3) bucket.
  5. In the central AWS account, an EventBridge rule runs periodically based on a cron expression specified which invokes an AWS Lambda function.
  6. The Lambda function gathers the logs from the S3 bucket, complies the log files based on account ID, Region, and Command ID.
  7. The Lambda function then runs an inference in Amazon Bedrock for the troubleshooting logs.
  8. Bedrock generates recommendations to troubleshoot the failed patch operations. The recommendations are stored in the central S3 bucket and an email report is sent via Amazon Simple Notification Service (SNS) or Amazon Simple Email Service (SES) to the operators to review.

Prerequisites

Amazon Elastic Compute Cloud (EC2) instances, AWS Internet of Things (IoT) Greengrass core devices, on-premises servers, edge devices, and VMs must be Systems Manager managed nodes to be patched and for logs to be gathered. This means your nodes must meet certain prerequisites and be configured with the AWS Systems Manager Agent (SSM Agent). For more information, see Setting up managed nodes for AWS Systems Manager.

Additionally, your managed nodes must have access to the S3 bucket created by the CloudFormation template referenced in the walkthrough. The managed nodes need s3:PutObject and s3:PutObjectAcl to send troubleshooting logs to the S3 bucket. For an example IAM policy, see the following GitHub link:

https://github.com/aws-samples/Bedrock_Recommendations_for_Patch_Manager/blob/main/example-s3-permissions.json

{
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "VisualEditor0",
        "Effect": "Allow",
        "Action": [
            "s3:PutObject",
            "s3:PutObjectAcl"
        ],
        "Resource": "arn:aws:s3:::patch-reporting-123456789012-us-east-1/*"
    }]
}

Note: You must replace the S3 bucket in the example policy with the Amazon Resource Name (ARN) of the S3 bucket created by CloudFormation.

Patch operation failure recommendations are supported on Windows managed nodes. For Linux managed nodes, patch operation failure recommendations can be generated for Linux operating system (OS) platforms supported by the Automation runbook AWSSupport-TroubleshootPatchManagerLinux.

In this walkthrough, you can receive patch operation failure recommendations generated by Amazon Bedrock using SNS or SES. If you use SES, you must enable and configure SES with at least one identity to send email recommendations to the recipient email address. For more information, see Getting started with Amazon Simple Email Service.

For Amazon Bedrock to generate patch operation recommendations, you must enable access to Anthropic Claude 3 Sonnet. For more information, see Access Amazon Bedrock foundation models.

You must identify your AWS Organization ID, for example o-abcdefg123. You can find this value by navigating to the AWS Organization console. You will need to pass this value to the CloudFormation stack when creating resources in the central account to create a S3 bucket policy that allows managed nodes in member accounts to upload troubleshooting logs.

Walkthrough

In this post, we will walkthrough deploying one CloudFormation stack in the central account to create resources required to generate recommendations using Amazon Bedrock and a CloudFormation StackSet to create resources in member accounts and Regions to retrieve and send logs related to patching failures on managed nodes.

Deploy central account resources

First, choose one account as your central account, this can be any AWS account within your AWS Organization. From this central account, a scheduled EventBridge rule will periodically run to invoke the Step Functions workflow to generate patch troubleshooting recommendations using Amazon Bedrock. This central account will host the S3 bucket which contains the troubleshooting logs and recommendations.

Open the following GitHub page and download the patchLogExtractor-CFN.yaml file.

https://github.com/aws-samples/Bedrock_Recommendations_for_Patch_Manager/blob/main/patchLogExtractor-CFN.yaml

  1. In the central account, navigate to the AWS CloudFormation console.
  2. In the navigation pane, choose Stacks.
  3. Choose Create stack and choose With new resources (standard).
  4. On the Create stack page, for Specify template, choose Upload a template file, select Choose file, choose the patchLogExtractor-CFN.yaml file, and then choose Next.
  5. On the Specify stack details page, perform the following steps:
    1. For Stack name, enter patchLogExtractor-centralAccount.
    2. For parameters in Central Account:
      1. For Central AWS Account, choose true.
      2. For Organization ID, enter the AWS Organization ID you retrieved in the prerequisites.
      3. For EventBridgeRuleSchedule, optionally modify the EventBridge rule schedule. The default schedule is to run once a day at 12:00 UTC.
      4. For EmailService, choose SNS or SES.
        1. If you selected SNS, perform the following steps:
          1. For RecipientEmail, enter the email address to send the report using SNS.
          2. Leave SenderEmail empty.
        2. If you selected SES, perform the following steps:
          1. For RecipientEmail, enter the email address to send the report to using SES.
          2. For SenderEmail, enter the email address configured in SES to use when sending the patch troubleshooting recommendations.
      5. For S3BucketName, optionally enter a name for the S3 bucket. If no name is provided, the bucket is named similarly to patch-reporting-${AWS::AccountId}-${AWS::Region}.
    3. For parameters in Member Account(s):
      1. Leave the default value for Member AWS account, false.
      2. Leave Central S3 bucket as an empty value.
  6. Choose Next.
  7. On the Configure stack options page, leave the defaults and choose Next.
  8. On the Review and create page, select I acknowledge that AWS CloudFormation might create IAM resources, and choose Submit.

After the page is refreshed, the status of your stack should be CREATE_IN_PROGRESS. When the status changes to CREATE_COMPLETE, proceed to the next section.

Deploy member account resources

In this post, we will use service-managed permissions to create a CloudFormation StackSet from a delegated administrator account for CloudFormation. If you choose to use self-service permissions, the deployment steps may vary slightly.

  1. In the delegated administrator account for CloudFormation, navigate to the AWS CloudFormation console.
  2. In the navigation pane, choose StackSets.
  3. Choose Create StackSet.
  4. For Specify template, choose Upload a template file, select Choose file, choose the patchLogExtractor-CFN.yaml file, and then choose Next.
  5. On the Specify StackSet details page, perform the following steps:
    1. For StackSet name, enter patchLogExtractor-memberAccount.
    2. For StackSet description, optionally enter a description such as, StackSet to create resources in member account to gather patch failure troubleshooting logs.
    3. For parameters in Central account:
      1. For Central AWS Account, ensure false is selected.
      2. Leave the defaults for Organization ID, EventBridgeRuleSchedule, EmailService, RecipientEmail, SenderEmail, and S3BucketName.
    4. For parameters in Member Account(s):
      1. For Member AWS Account, choose true.
      2. For Central S3 Bucket, enter the name of the S3 bucket created in the central account. Note: You can find the name of the S3 bucket created in the central account in the Outputs tab of the CloudFormation stack.
  6. On the Configure StackSet options page, leave the defaults and choose Next.
  7. On the Set deployment options page, perform the following steps:
    1. For Deployment targets, choose Deploy to organization units (OUs), and enter the AWS OU ID(s) to deploy resources into.
    2. For Specify regions, select the Regions to deploy resources into.
    3. Leave the defaults for the other options and choose Next.
  8. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources, and choose Submit.

After the page is refreshed, the operations status of your StackSet should be CREATE. When the status changes to SUCCEEDED, proceed to the next section.

Review patch troubleshooting recommendations

Now that we have deployed resources to monitor for failed patching operations in our member AWS accounts and Regions, a summarization of troubleshooting recommendations will be generated by Amazon Bedrock and sent to the recipient email address based on the EventBridge cron expression schedule.

In the consolidated patch recommendations report email, you can find a detailed report of the patch operations that have failed across your AWS accounts and Regions with a reference to the Run Command ID. The email also provides a direct link to the recommendations file hosted in the S3 bucket created by CloudFormation.

Example email report sent to operators including troubleshooting recommendations generated by Amazon Bedrock for failed patch operations.

Figure 2. Example email report sent to operators including troubleshooting recommendations generated by Amazon Bedrock for failed patch operations.

The recommendations text file for each AWS account and Region will contain recommendations on how to resolve the patch operation failures and group the recommendations based on instances that experienced similar errors.

Example patch operations recommendations text file containing steps to resolve patch operation failures.

Figure 3. Example patch operations recommendations text file containing steps to resolve patch operation failures.

Clean-up

To delete the CloudFormation stack created in the central account, you must first delete all files in the S3 bucket created by CloudFormation. Navigate to the S3 bucket created in the central account and empty the bucket. After you emptied the S3 bucket, you can delete the stack in the central AWS account.

To delete the CloudFormation StackSet created in this post, you must first delete the stack instances and afterwards you can delete the StackSet.

Conclusion

In this post, we showed you how you can use Bedrock to generate troubleshooting recommendation steps for failed patching operations. We started by deploying a CloudFormation stack in the central account to create a Step Functions workflow to integrate a Lambda function with Bedrock to generate recommendations based on aggregated patching logs gathered from managed nodes in member accounts and Regions.

By using Bedrock, you quickly identify the root causes of patching problems and implement the right solutions, saving you valuable time and effort. The architecture discussed in this post is extensible and you can utilize a similar workflow for other errors that may occur within your environment and utilize Bedrock’s recommendations to more easily identify resolutions for failures taking place.

About the authors

Erik Weber

Erik Weber is a Sr. World-wide Specialist Solutions Architect for AWS Cloud Operations services. He specializes in AWS Systems Manager, AWS Config, AWS CloudTrail, and AWS Audit Manager. Outside of work, Erik has a passion for hiking, cooking, and biking.

Raviteja Sunkavalli

Raviteja Sunkavalli is a Senior Specialist Solutions Architect at Amazon Web Services, specializing in AWS Systems Manager and Amazon CloudWatch. He supports global customers in implementing observability solutions and streamlining their cloud operations. Outside of work, Ravi enjoys playing cricket and exploring new cooking recipes.

Ali Alzand

Ali is a Microsoft Specialist Solutions Architect at Amazon Web Services who helps global customers unlock the power of the cloud by migrating, modernizing, and optimizing their Microsoft workloads. He specializes in cloud operations – leveraging AWS services like Systems Manager, Amazon EC2 Windows, and EC2 Image Builder to drive cloud transformation. Outside of work, Ali enjoys exploring the outdoors, firing up the grill on weekends for barbecue with friends, and sampling all the eclectic food has to offer.

Ravindra Kori

Ravindra Kori is a Solutions Architect and GenAI ambassador at AWS based in Arlington, specializing in Cloud Operations and Serverless technologies. He works extensively with Enterprise and Startup segments, architecting solutions and facilitating AWS modernization and migrations. Outside of work, he finds joy in playing drums and spending quality time with family.