AWS Cloud Operations Blog

Respond to CloudWatch Alarms with Amazon Bedrock Insights

Overview

When operating complex, distributed systems in the cloud, quickly identifying the root cause of issues and resolving incidents can be a daunting task. Troubleshooting often involves sifting through metrics, logs, and traces from multiple AWS services, making it challenging to gain a comprehensive understanding of the problem. So how can you streamline this process and reduce the time and effort required for effective incident resolution?

This blog post introduces the Alarm Context Tool (ACT), a solution that enhances Amazon CloudWatch Alarms by providing additional context to aid in troubleshooting and analysis. By leveraging AWS services such as AWS Lambda, Amazon CloudWatch, AWS X-Ray, AWS Health and Amazon Bedrock. This solution aggregates and analyzes metrics, logs, and traces to generate meaningful insights. ACT simplifies the troubleshooting process, reduces operational costs, and enhances the observability of your AWS environment.

Key Benefits

Simplified Troubleshooting

ACT automatically collects and analyzes relevant data from various sources, including CloudWatch, X-Ray, Amazon RDS Performance Insights, and CloudWatch Container Insights. This aggregation helps identify root causes and reduce the time needed for troubleshooting. By consolidating data from different AWS services, ACT provides a comprehensive view of the system’s health and performance, enabling quicker resolution of incidents.

Cost Efficiency

By providing detailed context and insights directly within the alarm notifications, ACT helps reduce the operational overhead and costs associated with manual data collection and analysis. Operators quickly understand the issue without diving deep into multiple AWS services. This reduces the time and effort required to diagnose problems, resulting in lower operational costs, improved resource utilization and reducing meantime to recovery (MTTR).

Enhanced Observability

ACT leverages Amazon Bedrock’s generative AI capabilities to summarize findings, identify potential root causes, and offer relevant documentation links. This enhances the observability of your AWS environment, simplifying maintenance and operational tasks. The integration of AI-powered insights ensures that operators receive actionable information, enabling them to focus on resolving issues instead of sifting through logs and metrics.

Technical Architecture

The solution is built using a combination of AWS Lambda functions, CloudWatch Alarms, X-Ray tracing, and Amazon Bedrock’s AI capabilities. Here’s an overview of the architecture:

Alarm Context Tool Architecture Diagram

Figure 1: Architecture Diagram

  1. CloudWatch Alarms send a notification to an Amazon SNS topic.
  2. Lambda Function subscribes to SNS topic(s).
  3. Lambda Function aggregates data from sources including CloudWatch metrics and logs, X-Ray traces, RDS Performance Insights, Container Insights and AWS Health.
  4. Amazon Bedrock processes the aggregated data to generate summaries, insights and root cause.
  5. Amazon SES sends the processed information to the relevant stakeholders via email.
  6. X-Ray Tracing using tracer from Powertools for AWS Lambda (Python) provides detailed traces of the Lambda function execution, offering deep visibility into the function’s performance and behavior.

Example Scenario: ACT in Action

Scenario Overview

An alarm is triggered due to a failure of a CloudWatch Synthetics canary. This failure is an indication of intermittent errors or high latency for a microservice API. The ACT Lambda function is invoked to gather additional context and provide a detailed analysis of the issue. Here’s how ACT simplifies troubleshooting in this scenario:

Trace Map

Figure 2: Alarm Context Tool Trace Map

Data Collection and Analysis

When the alarm is triggered, the ACT Lambda function performs the following data collection steps:

  1. CloudWatch Metrics: The function gathers relevant metrics such as error rates, latency, and request counts from CloudWatch.
  2. CloudWatch Logs: The function collects relevant logs from CloudWatch Logs, in this case, associated with the canary run.
  3. X-Ray Traces: Detailed traces are obtained to identify the exact point of failure within the API’s execution flow. For example, the trace data shows that the films-prod-APILambdaFunction-LulvbCzjxHFb Lambda function encountered issues while querying the Movies DynamoDB table.
  4. Health Events: The function queries AWS Health for any relevant events that could potentially impact the identified services.
  5. Alarm History: The function examines the history of the alarm and in this case, it determines that this is a recurring issue.
  6. Resource Information: The function retrieves details of the identified resource: the CloudWatch Synthetics canary.
  7. Amazon Bedrock Analysis: Amazon Bedrock analyzes the aggregated data to generate a summary of findings.

Generative AI Insights

Amazon Bedrock analyzes the collected data and generates a summary of findings. In this example, a DynamoDB table is experiencing high read traffic and exceeding its provisioned throughput, which Bedrock identifies as the root cause.

Notification and Reporting

The function sends an email to the relevant stakeholders, summarizing the findings and suggesting potential solutions. The email includes:

  • Root Cause Analysis: Based on the collected data, Bedrock identifies the primary issue, such as a DynamoDB table exceeding its provisioned throughput.
  • Alarm Frequency and Immediacy: The function analyzes the alarm’s historical data to determine the frequency of its triggering, helping to identify if the underlying issue is recurring, intermittent or a one-time occurrence.
  • Potential Solutions: Recommendations such as increasing the provisioned throughput for DynamoDB, optimizing partition key design, or implementing exponential backoff in the application code.
  • Additional Metrics Analysis: Insights from related metrics, such as failed canary runs or server-side errors.
  • AWS Health Events: Upcoming maintenance events or changes that may impact the system.

Example Summary (abridged)

Root Cause Analysis
The issue appears to be related to the DynamoDB table Movies exceeding its provisioned throughput capacity. The films-prod-APILambdaFunction-LulvbCzjxHFb Lambda function encountered a ProvisionedThroughputExceededException while querying the Movies DynamoDB table.

Alarm Frequency and Immediacy
The alarm has been triggered multiple times in the past few hours, indicating a recurring issue. The frequent transitions between OK and ALARM states suggest that the problem is related to traffic spikes.

Additional Metrics Analysis

  • The Failed metric shows a value of 1, indicating a recent canary run failure.
  • The 5xx metric has a value of 1, suggesting a server-side error (502 Bad Gateway).
  • The SuccessPercent metric shows 0% for the failed canary run.

Potential Solutions
To resolve the issue, consider the following steps:

  • Increase the provisioned throughput capacity of the Movies DynamoDB table.
  • Implement partition key design best practices.
  • Implement exponential backoff with jitter in your application code.

Relevant Documentation

Conclusion

In this post, we introduced the Alarm Context Tool (ACT), a solution that enhances Amazon CloudWatch Alarms by providing additional context and insights to aid in troubleshooting and analysis. ACT leverages multiple AWS services and Amazon Bedrock’s generative AI capabilities. By doing so, it simplifies the incident resolution process, reduces operational costs, and enhances the observability of your AWS environment.

To learn more about ACT and begin using it, visit the GitHub repository and follow the setup instructions.

If you have any questions, suggestions for improvements, or encounter any issues while using ACT, please feel free to open an issue on the GitHub repository. We value your feedback and contribution to make ACT even better.

About the author:

Alex Livingstone

Alex is a Principal Solution Architect focused on AWS Observability tools including Amazon CloudWatch, AWS X-Ray, Amazon Managed Service for Prometheus, Amazon Managed Grafana, and AWS Distro for OpenTelemetry. He loves helping customers to operate in the cloud and gain insights into their applications. Find him on LinkedIn: /aelivingstone.