AWS DevOps & Developer Productivity Blog

Anomaly Detection in AWS Lambda using Amazon DevOps Guru’s ML-powered insights

Critical business applications are monitored in order to prevent anomalies from negatively impacting their operational performance and availability. Amazon DevOps Guru is a Machine Learning (ML) powered solution that aids operations by detecting anomalous behavior and providing insights and recommendations for how to address the root cause before it impacts the customer.

This post demonstrates how Amazon DevOps Guru can detect an anomaly following a critical AWS Lambda function deployment and its remediation recommendations to fix such behavior.

Solution Overview

Amazon DevOps Guru lets you monitor resources at the region or AWS CloudFormation level. This post will demonstrate how to deploy an AWS Serverless Application Model (AWS SAM) stack, and then enable Amazon DevOps Guru to monitor the stack.

You will utilize the following services:

  • AWS Lambda
  • Amazon EventBridge
  • Amazon DevOps Guru

The architecture diagram shows an AWS SAM stack containing AWS Lambda and Amazon EventBridge resources, as well as Amazon DevOps Guru monitoring the resources in the AWS SAM stack.

Figure 1: Amazon DevOps Guru monitoring the resources in an AWS SAM stack

The architecture diagram shows an AWS SAM stack containing AWS Lambda and Amazon EventBridge resources, as well as Amazon DevOps Guru monitoring the resources in the AWS SAM stack.

This post simulates a real-world scenario where an anomaly is introduced in the AWS Lambda function in the form of latency. While the AWS Lambda function execution time is within its timeout threshold, it is not at optimal performance. This anomalous execution time can result in larger compute times and costs. Furthermore, this post demonstrates how Amazon DevOps Guru identifies this anomaly and provides recommendations for remediation.

Here is an overview of the steps that we will conduct:

  1. First, we will deploy the AWS SAM stack containing a healthy AWS Lambda function with an Amazon EventBridge rule to invoke it on a regular basis.
  2. We will enable Amazon DevOps Guru to monitor the stack, which will show the AWS Lambda function as healthy.
  3. After waiting for a period of time, we will make changes to the AWS Lambda function in order to introduce an anomaly and redeploy the AWS SAM stack. This anomaly will be identified by Amazon DevOps Guru, which will mark the AWS Lambda function as unhealthy, provide insights into the anomaly, and provide remediation recommendations.
  4. After making the changes recommended by Amazon DevOps Guru, we will redeploy the stack and observe Amazon DevOps Guru marking the AWS Lambda function healthy again.

This post also explores utilizing Provisioned Concurrency for AWS Lambda functions and the best practice approach of utilizing Warm Start for variables reuse.

Pricing

Before beginning, note the costs associated with each resource. The AWS Lambda function will incur a fee based on the number of requests and duration, while Amazon EventBridge is free. With Amazon DevOps Guru, you only pay for the data analyzed. There is no upfront cost or commitment. Learn more about the pricing per resource here.

Prerequisites

To complete this post, you need the following prerequisites:

Getting Started

We will set up an application stack in our AWS account that contains an AWS Lambda and an Amazon EventBridge event. The event will regularly trigger the AWS Lambda function, which simulates a high-traffic application. To get started, please follow the instructions below:

  1. In your local terminal, clone the amazon-devopsguru-samples repository.
git clone https://github.com/aws-samples/amazon-devopsguru-samples.git
Bash
  1.  In your IDE of choice, open the amazon-devopsguru-samples repository.
  2. In your terminal, change directories into the repository’s subfolder amazon-devopsguru-samples/generate-lambda-devopsguru-insights.
cd amazon-devopsguru-samples/generate-lambda-devopsguru-insights
Bash
  1. Utilize the SAM CLI to conduct a guided deployment of lambda-template.yaml.
sam deploy --guided --template lambda-template.yaml
    Stack Name [sam-app]: DevOpsGuru-Sample-AnomalousLambda-Stack
    AWS Region [us-east-1]: us-east-1
    #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
    Confirm changes before deploy [y/N]: y
    #SAM needs permission to be able to create roles to connect to the resources in your template
    Allow SAM CLI IAM role creation [Y/n]: y
    Save arguments to configuration file [Y/n]: y
    SAM configuration file [samconfig.toml]: y
    SAM configuration environment [default]: default
Plain text

You should see a success message in your terminal, such as:

Successfully created/updated stack - DevOpsGuru-Sample-AnomalousLambda-Stack in us-east-1.
Bash

Enabling Amazon DevOps Guru

Now that we have deployed our application stack, we can enable Amazon DevOps Guru.

  1. Log in to your AWS Account.
  2. Navigate to the Amazon DevOps Guru service page.
  3. Click “Get started”.
  4. In the “Amazon DevOps Guru analysis coverage” section, select “Choose later”, then click “Enable”.

Amazon DevOps Guru analysis coverage menu which asks which AWS resources to analyze. The “Choose later” option is selected.

Figure 2.1: Amazon DevOps Guru analysis coverage menu

  1. On the left-hand menu, select “Settings”
  2. In the “DevOps Guru analysis coverage” section, click on “Manage”.
  3. Select the “Analyze all AWS resources in the specified CloudFormation stacks in this Region” radio button.
  4. The stack created in the previous section should appear. Select it, click “Save”, and then “Confirm”.

Amazon DevOps Guru analysis coverage menu which asks which AWS resources to analyze. The “Analyze all AWS resources in the specified CloudFormation stacks in this Region” option is selected and CloudFormation stacks are displayed to choose from.

Figure 2.2: Amazon DevOps Guru analysis coverage resource selection

Before moving on to the next section, we must allow Amazon DevOps Guru to baseline the resources and benchmark the application’s normal behavior. For our serverless stack with two resources, we recommend waiting two hours before carrying out the next steps. When enabled in a production environment, depending upon the number of resources selected for monitoring, it can take up to 24 hours for Amazon DevOps Guru to complete baselining.

Once baselining is complete, the Amazon DevOps Guru dashboard, an overview of the health of your resources, will display the application stack, DevOpsGuru-Sample-AnomalousLambda-Stack, and mark it as healthy, shown below.

Amazon DevOps Guru Dashboard displays the system health summary and system health overview of each CloudFormation stack. The DevOpsGuru-Sample-AnomalousLambda-Stack is marked as healthy with 0 reactive insights and 0 proactive insights.

Figure 2.3: Amazon DevOps Guru Healthy Dashboard

Enabling SNS

If you would like to set up notifications upon the detection of an anomaly by Amazon DevOps Guru, then please follow these additional instructions.

Amazon DevOps Guru Specify an SNS topic menu which enables notifications for important DevOps Guru events. No SNS topics are currently configured.

Figure 3: Amazon DevOps Guru Specify an SNS topic

Invoking an Anomaly

Once Amazon DevOps Guru has identified the stack as healthy, we will update the AWS Lambda function with suboptimal code. This update will simulate an update to critical business applications which are causing the anomalous performance.

  1. Open the amazon-devopsguru-samples repository in your IDE.
  2. Open the file generate-lambda-devopsguru-insights/lambda-code.py
  3. Uncomment lines 7-8 and save the file. These lines of code will produce an anomaly due to the function’s increased runtime.
  4. Deploy these updates to your stack by running:
cd generate-lambda-devopsguru-insights 
sam deploy --template lambda-template.yaml -stack-name DevOpsGuru-Sample-AnomalousLambda-Stack
Bash

Anomaly Overview

Shortly after, Amazon DevOps Guru will generate a reactive insight from the sample stack. This insight contains recommendations, metrics, and events related to anomalous behavior. View the unhealthy stack status in the Dashboard.

Amazon DevOps Guru Dashboard displays the system health summary and system health overview of each CloudFormation stack. The DevOpsGuru-Sample-AnomalousLambda-Stack is marked as unhealthy with 1 reactive insights and 0 proactive insights.

Figure 4.1: Amazon DevOps Guru Unhealthy Dashboard

By clicking on the “Ongoing reactive insight” within the tile, you will be brought to the Insight Details page. This page contains an array of useful information to help you understand and address anomalous behavior.

Insight overview

Utilize this section to get a high-level overview of the insight. You can see that the status of the insight is ongoing, 1 AWS CloudFormation stack is affected, the insight started on Sept-08-2021, it does not have an end time, and it was last updated on Sept-08-2021.

Amazon DevOps Guru Insight Details page has multiple information sections. The Insight overview is the first section which displays the status is ongoing, there is 1 affected stack, the start time and last updated time. The end time is empty as the insight is ongoing.

Figure 4.2: Amazon DevOps Guru Ongoing Reactive Insight Overview

Aggregated metrics

The Aggregated metrics tab displays metrics related to the insight. The table is grouped by AWS CloudFormation stacks and subsequent resources that created the metrics. In this example, the insight was a product of an anomaly in the “duration p50” metric generated by the “DevOpsGuruSample-AnomalousLambda” AWS Lambda function.

AWS Lambda duration metrics derive from a percentile statistic utilized to exclude outlier values that skew average and maximum statistics. The P50 statistic is typically a great middle estimate. It is defined as 50% of estimates exceed the P50 estimate and 50% of estimates are less than the P50 estimate.

The red lines on the timeline indicate spans of time when the “duration p50” metric emitted unusual values. Click the red line in the timeline in order to view detailed information.

  • Choose View in CloudWatch to see how the metric looks in the CloudWatch console. For more information, see Statistics and Dimensions in the Amazon CloudWatch User Guide.
  • Hover over the graph in order to view details about the anomalous metric data and when it occurred.
  • Choose the box with the downward arrow to download a PNG image of the graph.

Amazon DevOps Guru Insight Details page contains aggregated metrics. The Duration p50 metric is selected and displayed in graph form.

Figure 4.3: Amazon DevOps Guru Ongoing Reactive Insight Aggregated Metrics

Graphed anomalies

The Graphed anomalies tab displays detailed graphs for each of the insight’s anomalies. Because our insight was comprised of a single anomaly, there is one tile with details about unusual behavior detected in related metrics.

  • Choose View all statistics and dimensions in order to see details about the anomaly. In the window that opens, you can:
  • Choose View in CloudWatch in order to see how the metric looks in the CloudWatch console.
  • Hover over the graph to view details about the anomalous metric data and when it occurred.
  • Choose Statistics or Dimension in order to customize the graph’s display. For more information, see Statistics and Dimensions in the Amazon CloudWatch User Guide.

Amazon DevOps Guru Insight Details page contains Graphed anomalies. The p50 metric of the AWS/Lambda duration in displayed in graph form.

Figure 4.4: Amazon DevOps Guru Ongoing Reactive Insight Graphed Anomaly

Related events

In Related events, view AWS CloudTrail events related to your insight. These events help understand, diagnose, and address the underlying cause of the anomalous behavior. In this example, the events are:

  1. CreateFunction – when we created and deployed the AWS SAM template containing our AWS Lambda function.
  2. CreateChangeSet – when we pushed updates to our stack via the AWS SAM CLI.
  3. UpdateFunctionCode – when the AWS Lambda function code was updated.

Continuation of figure 4.4

Figure 4.5: Amazon DevOps Guru Ongoing Reactive Insight Related Events

Recommendations

The final section in the Insight Detail page is Recommendations. You can view suggestions that might help you resolve the underlying problem. When Amazon DevOps Guru detects anomalous behavior, it attempts to create recommendations. An insight might contain one, multiple, or zero recommendations.

In this example, the Amazon DevOps Guru recommendation matches the best resolution to our problem-provisioned concurrency.

Amazon DevOps Guru Insight Details page contains Recommendations. The suggested recommendation is to configure provisioned concurrency for the AWS Lambda.

Figure 4.6: Amazon DevOps Guru Ongoing Reactive Insight Recommendations

Understanding what happened

Amazon DevOps Guru recommends enabling Provisioned Concurrency for the AWS Lambda functions in order to help it scale better when responding to concurrent requests. As mentioned earlier, Provisioned Concurrency keeps functions initialized by creating the requested number of execution environments so that they can respond to invocations. This is a suggested best practice when building high-traffic applications, such as the one that this sample is mimicking.

In the anomalous AWS Lambda function, we have sample code that is causing delays. This is analogous to application initialization logic within the handler function. It is a best practice for this logic to live outside of the handler function. Because we are mimicking a high-traffic application, the expectation is to receive a large number of concurrent requests. Therefore, it may be advisable to turn on Provisioned Concurrency for the AWS Lambda function. For Provisioned Concurrency pricing, refer to the AWS Lambda Pricing page.

Resolving the Anomaly

To resolve the sample application’s anomaly, we will update the AWS Lambda function code and enable provisioned concurrency for the AWS Lambda infrastructure.

  1. Opening the sample repository in your IDE.
  2. Open the file generate-lambda-devopsguru-insights/lambda-code.py.
  3. Move lines 7-8, the code forcing the AWS Lambda function to respond slowly, above the lambda_handler function definition.
  4. Save the file.
  5. Open the file generate-lambda-devopsguru-insights/lambda-template.yaml.
  6. Uncomment lines 15-17, the code enabling provisioned concurrency in the sample AWS Lambda function.
  7. Save the file.
  8. Deploy these updates to your stack.
cd generate-lambda-devopsguru-insights 
sam deploy --template lambda-template.yaml --stack-name DevOpsGuru-Sample-AnomalousLambda-Stack       
Bash

After completing these steps, the duration P50 metric will emit more typical results, thereby causing Amazon DevOps Guru to recognize the anomaly as fixed, and then close the reactive insight as shown below.

Amazon DevOps Guru Insight Summary page displays the reactive insight has been closed.

Figure 5: Amazon DevOps Guru Closed Reactive Insight

Clean Up

When you are finished walking through this post, you will have multiple test resources in your AWS account that should be cleaned up or un-provisioned in order to avoid incurring any further charges.

  1. Opening the sample repository in your IDE.
  2. Run the below AWS SAM CLI command to delete the sample stack.
cd generate-lambda-devopsguru-insights 
sam delete --stack-name DevOpsGuru-Sample-AnomalousLambda-Stack 
Bash

Conclusion

As seen in the example above, Amazon DevOps Guru can detect anomalous behavior in an AWS Lambda function, tie it to relevant events that introduced that anomaly, and provide recommendations for remediation by using its pre-trained ML models. All of this was possible by simply enabling Amazon DevOps Guru to monitor the resources with minimal configuration changes and no previous ML expertise. Start using Amazon DevOps Guru today.

About the authors

Harish Vaswani

Harish Vaswani is a Senior Cloud Application Architect at Amazon Web Services. He specializes in architecting and building cloud native applications and enables customers with best practices in their cloud journey. He is a DevOps and Machine Learning enthusiast. Harish lives in New Jersey and enjoys spending time with this family, filmmaking and music production.

Caroline Gluck

Caroline Gluck is a Cloud Application Architect at Amazon Web Services based in New York City, where she helps customer design and build cloud native Data Science applications. Caroline is a builder at heart, with a passion for serverless architecture and Machine Learning. In her spare time, she enjoys traveling, cooking, and spending time with family and friends.