AWS Cloud Operations & Migrations Blog

Automate the creation of AWS Support cases using Amazon CloudWatch alarms and Amazon Bedrock

For production applications, Mean Time To Recovery (MTTR) is critical. AWS offers Business, Enterprise On-Ramp, and Enterprise Support plans, which give customers shorter response times for cases related to production and business-critical workloads. However, without an automated way to notify AWS Support, creating a case is a manual process that depends on an engineer being available, with access to the AWS account, at the time of the event. This delays AWS Support intervention, and that delay can be crucial to the recovery process.

In this post, I will share a solution that automatically creates a support case when triggered by an Amazon CloudWatch alarm action. The solution uses Amazon Bedrock and CloudWatch alarms to build an AI-powered automated workflow. Bedrock uses a dataset containing AWS Support CTI information and processes metadata from CloudWatch alarm events to recommend an incident classification.

How soon AWS Support is notified when something goes wrong in the production environment is critical. A mechanism that automatically creates a support case based on a CloudWatch alarm can therefore expedite the attention needed from AWS Support. It lets you focus your resources on addressing the business impact while the support team performs primary checks on the resource reported by the alarm.

Example use cases

  • Troubleshooting an intermittent issue, where you want to notify AWS as soon as the issue re-occurs for live troubleshooting.
  • Responding to Amazon Route 53 health checks you establish to monitor application health.
  • Automating service limit increase requests. You can create a CloudWatch alarm to notify you when you’re close to a quota value threshold.

Overview of solution

This solution is deployed as an AWS CloudFormation stack, which creates the following resources in your AWS account:

  1. An Amazon Simple Notification Service (Amazon SNS) topic that can be used as an action for a CloudWatch alarm to trigger the AWS Lambda function.
  2. An AWS Lambda function that contains Python code.
  3. An Amazon DynamoDB table used to log the support cases generated by this workflow for future status checks.
  4. An SNS topic to receive notifications about AWS Support activities processed by the workflow.
  5. A Lambda execution role used by the Lambda function with all required permissions to execute the solution tasks.

Figure 1: High level architecture for the solution

Prerequisites

The following prerequisites are necessary:

  • An AWS account.
  • A user with the IAM permissions required to create an Amazon DynamoDB table, an AWS Lambda layer, an AWS Lambda function, an AWS Lambda execution role, and Amazon SNS topics.
  • The AWS account must be enrolled in Business Support, Enterprise On-Ramp, or Enterprise Support to access the AWS Support API.
  • Access to the Claude 2 large language model (LLM) from Anthropic in Amazon Bedrock.

The solution works as follows:

  1. An SNS topic is added to the CloudWatch alarm action.
  2. If the alarm action is triggered, the SNS topic sends a JSON object containing details about the alarm state change to the subscribed Lambda function.
  3. The Lambda function runs and performs the following steps:
      1. Parsing the alarm state change details. The function extracts the namespace, metric_name, new_state_reason, alarm_arn, and dimensions values.
      2. Checking the DynamoDB table for cases related to the same alarm_arn. If an unresolved support case exists for the same alarm, the function sends a notification to Amazon SNS subscribers and includes the latest update on the support case. Otherwise, it proceeds to the next step and creates a new support case. This step is essential for preventing noisy alarms from generating duplicate cases.
      3. Performing incident classification. The AWS Support API requires ServiceCode, CategoryCode and SeverityCode values, which vary depending on the incident. To classify the incident properly, the function uses the Anthropic Claude v2 model via Bedrock to perform AI-powered semantic search. Bedrock determines the ServiceCode and CategoryCode values based on the data parsed from the alarm event. Later in Step 5 we discuss how SeverityCode is determined.
      4. Extracting alarm resource tags. The Lambda function extracts the alarm’s tags and uses them as additional context in the support case details. This helps AWS Support better understand how to provide the necessary assistance.
      5. Creating the support case using the incident classification values returned from Bedrock combined with the user-defined alarm’s tags.
      6. Recording the new support case in the DynamoDB table for future reference.
      7. Sending a notification to all endpoints subscribed to an SNS topic.
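
The parsing step (3.1) can be sketched as follows. This is a minimal illustration, not the sample code shipped with the solution: the function name parse_alarm_event is hypothetical, and the key names assume the standard JSON message that CloudWatch alarms publish to SNS.

```python
import json


def parse_alarm_event(event):
    """Extract the alarm fields the workflow uses from an SNS-delivered event."""
    # SNS wraps the alarm payload as a JSON string in Records[0].Sns.Message.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    trigger = message.get("Trigger", {})
    # Assumption: dimensions arrive as a list of {"name": ..., "value": ...} pairs.
    dimensions = {
        d.get("name"): d.get("value") for d in trigger.get("Dimensions", [])
    }
    return {
        "namespace": trigger.get("Namespace"),
        "metric_name": trigger.get("MetricName"),
        "new_state_reason": message.get("NewStateReason"),
        "alarm_arn": message.get("AlarmArn"),
        "dimensions": dimensions,
    }
```

Using .get() throughout means a custom metric with no dimensions still yields a usable result rather than raising a KeyError.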

Figure 2: A support case created using CloudWatch alarm to request VPC limit increase when VPC usage reached more than 75% of available quota.
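
The duplicate check in step 3.2 might look like the sketch below. It assumes, hypothetically, that the DynamoDB table is keyed on alarm_arn and stores a case_id attribute; the actual table schema in the sample code may differ. The table and Support client are passed in so the logic can be exercised without AWS access.

```python
def find_open_case(table, support_client, alarm_arn):
    """Return the unresolved case logged for this alarm, or None."""
    # Assumption: the table's partition key is alarm_arn and each item
    # records the case_id created for that alarm.
    item = table.get_item(Key={"alarm_arn": alarm_arn}).get("Item")
    if not item:
        return None
    # Ask the AWS Support API for the current state of the logged case,
    # including resolved cases so a closed case is seen as closed.
    response = support_client.describe_cases(
        caseIdList=[item["case_id"]],
        includeResolvedCases=True,
    )
    for case in response.get("cases", []):
        if case.get("status") != "resolved":
            return case
    return None
```

In a Lambda function, table would typically come from boto3.resource("dynamodb").Table(...) and support_client from boto3.client("support"). A None result means the workflow can go on to create a new case.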

Considerations for production use

Take the following considerations into account when using in production:

Production usage: This solution is not intended to be the only means of Incident reporting. Use it in conjunction with your existing Incident Management, Reporting and Notification processes.

Encryption: It’s best practice to use encryption everywhere. In the sample code provided with this post, the SNS topics are not encrypted. When using this solution in production, encrypt these resources using AWS Key Management Service (AWS KMS).

Log retention: The sample code provided with this post has a hard-coded Amazon CloudWatch Logs retention period of 30 days. Consider your organization’s data retention policies when using it in production.

Support case details: The sample code combines the alarm event’s key values with the CloudWatch alarm’s resource tags to provide more context about the situation. Adding tags such as SeverityCode, OwnerEmail, Details, ApplicationName, CallBack, and any additional shareable information helps AWS Support better understand the impact of the event.
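
As an illustration of how alarm values and tags could be combined into a case message, here is a minimal sketch. The function name build_case_body and the layout of the text are assumptions made for this example, not the format the sample code produces.

```python
def build_case_body(alarm, tags):
    """Combine alarm details and alarm tags into one plain-text message."""
    lines = ["CloudWatch alarm details:"]
    # Sort keys so the message is stable across invocations.
    for key in sorted(alarm):
        lines.append(f"  {key}: {alarm[key]}")
    if tags:
        lines.append("Alarm tags:")
        for key in sorted(tags):
            lines.append(f"  {key}: {tags[key]}")
    return "\n".join(lines)
```

The returned string could then be passed as the communicationBody argument of the AWS Support API create_case call.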

Custom metrics: The sample code supports CloudWatch custom metrics as well. To recommend an incident classification, Bedrock requires the CloudWatch metric’s namespace, metric_name, and details parameters. These values must be descriptive enough to correlate the incident with the relevant service support team. You can choose a namespace value from the list of AWS services that publish CloudWatch metrics.
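
To show where the namespace, metric_name, and details values end up, here is a hedged sketch of building a Claude v2 request for the Bedrock InvokeModel API. The prompt wording and the function name build_classification_request are hypothetical, and the sketch leaves out the AWS Support CTI dataset that the solution supplies to the model.

```python
import json


def build_classification_request(namespace, metric_name, details):
    """Build keyword arguments for bedrock-runtime invoke_model (Claude v2)."""
    # Claude v2 text completions expect the Human/Assistant turn markers.
    prompt = (
        "\n\nHuman: Suggest the AWS Support ServiceCode and CategoryCode "
        "for this CloudWatch alarm.\n"
        f"namespace: {namespace}\n"
        f"metric_name: {metric_name}\n"
        f"details: {details}"
        "\n\nAssistant:"
    )
    body = {
        "prompt": prompt,
        "max_tokens_to_sample": 300,  # illustrative limit, not from the post
        "temperature": 0,
    }
    return {
        "modelId": "anthropic.claude-v2",
        "contentType": "application/json",
        "accept": "application/json",
        "body": json.dumps(body),
    }
```

A caller would pass the result to the client as bedrock_runtime.invoke_model(**request). The more descriptive the three input values are, the more context the model has to work with.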

Walkthrough

Step 1: Deploy the solution using CloudFormation console

Follow these steps to deploy the CloudFormation YAML template.

  1. Clone the Git repository or download the YAML file to your local directory.
  2. On the AWS CloudFormation console, choose Create a Stack.
  3. Under Prerequisite – Prepare template, select Template is ready. Under Specify template, select Upload a template file. Choose Choose file, navigate to the YAML file location on your machine, select the YAML file, and choose Open.
  4. Choose Next.

    Figure 3: Step 1 CloudFormation stack YAML file selection console view

  5. On the Create stack page, enter a Stack name.
  6. Under Parameters enter each of:
    • BedrockRegionalEndpoint – at the time of writing this blog, Bedrock has 5 regional API endpoints. To select from the available standard API endpoints, visit this link. The solution utilizes standard endpoints and not FIPS endpoints. The regional API endpoint must be in the following syntax including the HTTPS protocol:
      protocol://bedrock-runtime.region-code.amazonaws.com
    • BedrockRegion – enter the region-code based on the BedrockRegionalEndpoint you selected in the previous step.

    Figure 4: CloudFormation stack parameters requirements console view page 2

  7. Select the defaults for the rest of the pages and choose Next.
  8. On the Review page, acknowledge and select the Capabilities warning. Choose Create stack to proceed.

    Figure 5: Acknowledgment and review page of the stack deployment
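
The relationship between the BedrockRegion and BedrockRegionalEndpoint parameters can be sketched in Python. The helper names and the region-code pattern below are assumptions for illustration. The endpoint format follows the syntax given above, with HTTPS as the protocol.

```python
import re


def bedrock_endpoint(region_code):
    """Return the standard (non-FIPS) Bedrock runtime endpoint for a region."""
    # Loose sanity check on the region code shape, for example us-east-1.
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d+", region_code):
        raise ValueError(f"Unexpected region code: {region_code!r}")
    return f"https://bedrock-runtime.{region_code}.amazonaws.com"


def make_bedrock_client(region_code):
    """Create a Bedrock runtime client pinned to the regional endpoint."""
    import boto3  # imported lazily; Boto3 1.28.57 or later is needed for Bedrock

    return boto3.client(
        "bedrock-runtime",
        region_name=region_code,
        endpoint_url=bedrock_endpoint(region_code),
    )
```

For example, bedrock_endpoint("us-east-1") produces the value you would enter for BedrockRegionalEndpoint when BedrockRegion is us-east-1.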

Step 2: Create a Lambda layer

The AWS SDK for Python (Boto3) version 1.28.57 or later is required for Lambda to make Bedrock API calls. At the time of writing, the Boto3 library bundled with the Lambda runtime does not yet support the Bedrock APIs. You can use a Lambda layer to provide the dependencies the function needs for the Bedrock API.

Follow these steps to create the Lambda layer:

  1. Install the libraries in a package directory with pip.
    mkdir python
    pip3 install -t ./python boto3==1.28.57
  2. Create a ZIP file from the installed libraries under the python directory.
    zip -r python.zip ./python/*
  3. Create a Lambda layer
      1. In the AWS Lambda console, open the Layers page and choose Create layer
      2. Enter a name and optional description for your layer
      3. Select Upload a .zip file, choose your python.zip file, and then choose Open
      4. For Compatible runtimes, select Python 3.11
      5. Choose Create
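
Once the layer is attached, you may want to confirm that the function is picking up a Boto3 version that meets the requirement. The helper below is a hypothetical sketch of such a check. It compares only the numeric major.minor.patch parts of the version string.

```python
def meets_minimum(version, minimum="1.28.57"):
    """Return True if a dotted version string is at least the minimum."""

    def as_tuple(value):
        # Compare numerically so that "1.100.0" sorts above "1.28.57".
        return tuple(int(part) for part in value.split(".")[:3])

    return as_tuple(version) >= as_tuple(minimum)
```

Inside the Lambda function, the check could be called as meets_minimum(boto3.__version__) and the result logged before any Bedrock call is attempted.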

Step 3: Add the Lambda layer to the function

  1. Open the Functions page of the Lambda console.
  2. Choose the function CW-Alarms-Support-Cases to configure.
  3. Under Layers, choose Add a layer.
  4. Under Choose a layer, choose Custom layers.
  5. From the pull-down menu for Custom layers, select the layer you created in Step 2.
  6. Under Version, choose 1.
  7. Choose Add.

Step 4: Subscribe to the SNS topic that provides workflow update notifications

Once the stack deployment is complete, you can subscribe to the SNS topic and receive all notifications for the solution activities. Sign in to Amazon SNS console and select the region where the solution stack is deployed.

To subscribe a new user to the topic

  1. In the navigation pane, choose Topics.
  2. In the list of SNS topics, select the Topic name <Your-stack-name>-AlarmSupportCasesNotifications-xxxxxxxx.
  3. On the Subscriptions pane, choose Create subscription.
  4. On the Create subscription page:
      1. For Topic ARN, the selected ARN is <Your-stack-name>-AlarmSupportCasesNotifications-xxxxxxxx.
      2. For Protocol, choose Email or optionally you can use any of the other available protocols.
      3. For Endpoint, enter an email address that can receive notifications.
      4. Choose Create subscription.
  5. Check your email inbox and choose Confirm subscription in the email from AWS Notifications. The sender ID is usually no-reply@sns.amazonaws.com.
  6. Amazon SNS opens your web browser and displays a subscription confirmation with your subscription ID.

    Figure 6: SNS protocol endpoint selection view in SNS console under the solution SNS topic that users can subscribe to receive notifications.
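
If you prefer to script the subscription instead of using the console, a minimal Boto3 sketch could look like this. The function name is hypothetical, and the SNS client is passed in as a parameter. You would supply the topic ARN from your own stack.

```python
def subscribe_email(sns_client, topic_arn, email):
    """Subscribe an email address to the notifications topic."""
    response = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol="email",
        Endpoint=email,
        # Return the ARN even while the subscription awaits confirmation.
        ReturnSubscriptionArn=True,
    )
    return response["SubscriptionArn"]
```

A typical call would pass boto3.client("sns") created in the region where the stack is deployed. The recipient still has to choose Confirm subscription in the email before notifications are delivered.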

Step 5: Configure CloudWatch alarm action to send alarm events to the SNS Topic

You can add the Amazon SNS topic to existing alarms and new alarms. In the following steps, we will create a new CloudWatch alarm.

  1. Sign in to CloudWatch console. Navigate to the region where the CloudFormation stack is deployed.
  2. In the left navigation pane choose All alarms.
  3. On the alarms page, click the Create alarm button.
  4. On the Specify metric and conditions page, choose Select metric.
  5. Search the CloudWatch Metric you want to configure your alarm with.
  6. Select the check box next to the metric that you want, then choose Select metric. You can choose between any Single Metric and Metric Math.
  7. Define the condition parameters for when the alarm state will be In alarm, then choose Next.
  8. On the Configure actions page, under Notification, configure each:
      1. An alarm state trigger: Choose between In alarm, OK and Insufficient data. This determines which new alarm state triggers the solution
      2. Choose Select an existing SNS topic
      3. For Send a notification, choose the SNS topic <Your-stack-name>-CreateSupportCasesTopic-xxxxxxxx
      4. Choose Next
  9. On the Add name and description page, enter a name for your alarm. The alarm description is optional, but details about the alarm’s impact and the business function it relates to help direct AWS Support to the right actions when providing initial feedback.
  10. Choose Next, then choose Create alarm.
  11. Additionally, the function checks the below case-sensitive tags if configured on the alarm. Consider using these tags where applicable.
    • SeverityCode (optional): You can set this to one of the values mentioned below. Otherwise, all cases will be assigned a default severity of low:
        • low – General guidance
        • normal – System impaired
        • high – Production system impaired
        • urgent – Production system down
        • critical – Business-critical system down. This case severity is available to Enterprise On-Ramp and Enterprise plans customers only.
    • OwnerEmail (optional): By default, AWS Support sends correspondence about case updates to the AWS account’s primary email and the operational alternate email. To include additional contacts on the support cases created for this alarm, list them in this tag. You can add multiple email addresses separated by commas (“,”).
    • The solution reads all additional tags added to the alarm and includes them in the case message body.
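
The tag handling described above can be sketched as follows. The function name resolve_case_options is hypothetical. Falling back to low when SeverityCode holds an unrecognized value is an assumption of this sketch, because the post only states the default for a missing tag.

```python
VALID_SEVERITIES = ("low", "normal", "high", "urgent", "critical")


def resolve_case_options(tags):
    """Derive severity and CC addresses from case-sensitive alarm tags."""
    severity = tags.get("SeverityCode", "low")
    if severity not in VALID_SEVERITIES:
        # Assumption: treat unknown values the same as a missing tag.
        severity = "low"
    # OwnerEmail may hold several comma-separated addresses.
    owners = tags.get("OwnerEmail", "")
    cc_addresses = [addr.strip() for addr in owners.split(",") if addr.strip()]
    return {"severityCode": severity, "ccEmailAddresses": cc_addresses}
```

The keys of the returned dictionary match the severityCode and ccEmailAddresses parameters of the AWS Support API create_case call, so the result can be merged into that request.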

Cleaning up

To delete all resources associated with this solution, you can navigate to CloudFormation in the AWS Console, select the stack you deployed in Step 1, and choose Delete.

Conclusion

The automated workflow introduced in this post provides a way to promptly notify AWS Support in response to CloudWatch alarm events. By adding Amazon Bedrock to the solution, we can use generative AI to meet the Support API’s incident classification requirements at scale.

Incorporating this approach in production environments brings prompt attention and feedback from AWS Support, which can potentially reduce the MTTR.

About the author

Amer Al Okeh

Amer Al Okeh is a Senior Technical Account Manager at Amazon Web Services (AWS). As a TAM, Amer focuses on helping Enterprise Support customers achieve their business goals by navigating technical challenges, exploring cost optimization opportunities, and reaching operational excellence.