AWS DevOps Blog
Anomaly Detection in AWS Lambda using Amazon DevOps Guru’s ML-powered insights
Critical business applications are monitored in order to prevent anomalies from negatively impacting their operational performance and availability. Amazon DevOps Guru is a Machine Learning (ML) powered solution that aids operations by detecting anomalous behavior and providing insights and recommendations for how to address the root cause before it impacts the customer.
This post demonstrates how Amazon DevOps Guru detects an anomaly introduced by a deployment to a critical AWS Lambda function, and how its recommendations help you remediate that behavior.
Solution Overview
Amazon DevOps Guru lets you monitor resources at the AWS Region level or the AWS CloudFormation stack level. This post will demonstrate how to deploy an AWS Serverless Application Model (AWS SAM) stack, and then enable Amazon DevOps Guru to monitor the stack.
You will utilize the following services:
- AWS Lambda
- Amazon EventBridge
- Amazon DevOps Guru
Figure 1: Amazon DevOps Guru monitoring the resources in an AWS SAM stack
The architecture diagram shows an AWS SAM stack containing AWS Lambda and Amazon EventBridge resources, as well as Amazon DevOps Guru monitoring the resources in the AWS SAM stack.
This post simulates a real-world scenario where an anomaly is introduced in the AWS Lambda function in the form of latency. While the AWS Lambda function execution time is within its timeout threshold, it is not at optimal performance. This anomalous execution time can result in larger compute times and costs. Furthermore, this post demonstrates how Amazon DevOps Guru identifies this anomaly and provides recommendations for remediation.
Here is an overview of the steps that we will conduct:
- First, we will deploy the AWS SAM stack containing a healthy AWS Lambda function with an Amazon EventBridge rule to invoke it on a regular basis.
- We will enable Amazon DevOps Guru to monitor the stack, which will show the AWS Lambda function as healthy.
- After waiting for a period of time, we will make changes to the AWS Lambda function in order to introduce an anomaly and redeploy the AWS SAM stack. This anomaly will be identified by Amazon DevOps Guru, which will mark the AWS Lambda function as unhealthy, provide insights into the anomaly, and provide remediation recommendations.
- After making the changes recommended by Amazon DevOps Guru, we will redeploy the stack and observe Amazon DevOps Guru marking the AWS Lambda function healthy again.
This post also explores Provisioned Concurrency for AWS Lambda functions and the best practice of reusing variables across warm starts.
Pricing
Before beginning, note the costs associated with each resource. The AWS Lambda function will incur a fee based on the number of requests and duration, while Amazon EventBridge is free. With Amazon DevOps Guru, you only pay for the data analyzed. There is no upfront cost or commitment. Learn more about the pricing per resource here.
Prerequisites
To complete this post, you need the following prerequisites:
- An AWS account. For this post, we utilize the account number 111111111111. We will conduct AWS Serverless Application Model (AWS SAM) stack operations and monitoring in this account.
- Access to your local terminal with the AWS SAM command line interface (CLI) installed.
- Access to your local terminal with the git CLI.
- AWS credentials for enabling the AWS SAM CLI to make calls to AWS Services on your behalf. In this post, AWS SAM needs access to AWS CloudFormation.
- An Integrated Development Environment (IDE) of choice installed on your local machine.
Getting Started
We will set up an application stack in our AWS account that contains an AWS Lambda function and an Amazon EventBridge rule. The rule will regularly invoke the AWS Lambda function, which simulates a high-traffic application. To get started, please follow the instructions below:
- In your local terminal, clone the amazon-devopsguru-samples repository.
- In your IDE of choice, open the amazon-devopsguru-samples repository.
- In your terminal, change directories into the repository’s subfolder amazon-devopsguru-samples/generate-lambda-devopsguru-insights.
- Utilize the SAM CLI to conduct a guided deployment of lambda-template.yaml.
When the deployment completes, you should see a success message in your terminal.
Enabling Amazon DevOps Guru
Now that we have deployed our application stack, we can enable Amazon DevOps Guru.
- Log in to your AWS Account.
- Navigate to the Amazon DevOps Guru service page.
- Click “Get started”.
- In the “Amazon DevOps Guru analysis coverage” section, select “Choose later”, then click “Enable”.
Figure 2.1: Amazon DevOps Guru analysis coverage menu
- On the left-hand menu, select “Settings”.
- In the “DevOps Guru analysis coverage” section, click on “Manage”.
- Select the “Analyze all AWS resources in the specified CloudFormation stacks in this Region” radio button.
- The stack created in the previous section should appear. Select it, click “Save”, and then “Confirm”.
Figure 2.2: Amazon DevOps Guru analysis coverage resource selection
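The console steps above can also be expressed programmatically. The following is a minimal sketch, not part of the sample repository, that assumes the boto3 DevOps Guru client and its update_resource_collection operation. The helper names build_coverage_request and enable_stack_coverage are hypothetical and exist only for illustration.

```python
def build_coverage_request(stack_names):
    """Build the arguments for DevOps Guru's UpdateResourceCollection call.

    "ADD" asks DevOps Guru to start analyzing the listed CloudFormation stacks.
    """
    if not stack_names:
        raise ValueError("at least one CloudFormation stack name is required")
    return {
        "Action": "ADD",
        "ResourceCollection": {
            "CloudFormation": {"StackNames": list(stack_names)},
        },
    }


def enable_stack_coverage(client, stack_names):
    """Add the given stacks to DevOps Guru analysis coverage.

    `client` is assumed to be a boto3 "devops-guru" client (or any object
    that exposes the same update_resource_collection method).
    """
    return client.update_resource_collection(**build_coverage_request(stack_names))
```

With boto3 installed and credentials configured, a call such as enable_stack_coverage(boto3.client("devops-guru"), ["DevOpsGuru-Sample-AnomalousLambda-Stack"]) would be the rough equivalent of selecting the stack in the console. Passing the client in as a parameter keeps the request-building logic easy to test without calling AWS.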
Before moving on to the next section, we must allow Amazon DevOps Guru to baseline the resources and benchmark the application’s normal behavior. For our serverless stack with two resources, we recommend waiting two hours before carrying out the next steps. When enabled in a production environment, depending upon the number of resources selected for monitoring, it can take up to 24 hours for Amazon DevOps Guru to complete baselining.
Once baselining is complete, the Amazon DevOps Guru dashboard, an overview of the health of your resources, will display the application stack, DevOpsGuru-Sample-AnomalousLambda-Stack, and mark it as healthy, shown below.
Figure 2.3: Amazon DevOps Guru Healthy Dashboard
Enabling SNS
If you would like to set up notifications upon the detection of an anomaly by Amazon DevOps Guru, then please follow these additional instructions.
Figure 3: Amazon DevOps Guru Specify an SNS topic
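As a rough illustration of what specifying an SNS topic involves, the sketch below registers a topic as a DevOps Guru notification channel. It assumes the boto3 DevOps Guru client and its add_notification_channel operation. The helper names and the topic name in the usage note are hypothetical, and the topic is assumed to exist already.

```python
def build_sns_channel_request(topic_arn):
    """Build the arguments for DevOps Guru's AddNotificationChannel call."""
    if not topic_arn.startswith("arn:"):
        raise ValueError("expected a full SNS topic ARN")
    return {"Config": {"Sns": {"TopicArn": topic_arn}}}


def add_sns_channel(client, topic_arn):
    """Register the SNS topic and return the ID of the new notification channel.

    `client` is assumed to be a boto3 "devops-guru" client.
    """
    response = client.add_notification_channel(**build_sns_channel_request(topic_arn))
    return response["Id"]
```

For the account used in this post, the call might look like add_sns_channel(boto3.client("devops-guru"), "arn:aws:sns:us-east-1:111111111111:devops-guru-insights"), where both the Region and the topic name are placeholders.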
Invoking an Anomaly
Once Amazon DevOps Guru has identified the stack as healthy, we will update the AWS Lambda function with suboptimal code. This change simulates an update to a critical business application that causes anomalous performance.
- Open the amazon-devopsguru-samples repository in your IDE.
- Open the file generate-lambda-devopsguru-insights/lambda-code.py.
- Uncomment lines 7-8 and save the file. These lines of code will produce an anomaly due to the function’s increased runtime.
- Deploy these updates to your stack by redeploying lambda-template.yaml with the AWS SAM CLI.
Anomaly Overview
Shortly after, Amazon DevOps Guru will generate a reactive insight from the sample stack. This insight contains recommendations, metrics, and events related to anomalous behavior. View the unhealthy stack status in the Dashboard.
Figure 4.1: Amazon DevOps Guru Unhealthy Dashboard
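The same status can be checked without the console. The sketch below is an illustration only: it assumes the boto3 DevOps Guru client and its list_insights operation, and the helper name ongoing_reactive_insights is hypothetical. It reads a single page of results and does not follow NextToken.

```python
def ongoing_reactive_insights(client):
    """Return (name, severity) pairs for ongoing reactive insights.

    `client` is assumed to be a boto3 "devops-guru" client. Only the first
    page of results is read; pagination is left out for brevity.
    """
    response = client.list_insights(
        StatusFilter={"Ongoing": {"Type": "REACTIVE"}}
    )
    return [
        (insight.get("Name"), insight.get("Severity"))
        for insight in response.get("ReactiveInsights", [])
    ]
```

An empty list suggests that DevOps Guru has not opened a reactive insight yet, or that it has already closed it.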
By clicking on the “Ongoing reactive insight” within the tile, you will be brought to the Insight Details page. This page contains an array of useful information to help you understand and address anomalous behavior.
Insight overview
Utilize this section to get a high-level overview of the insight. You can see that the status of the insight is ongoing, 1 AWS CloudFormation stack is affected, the insight started on Sept-08-2021, it does not have an end time, and it was last updated on Sept-08-2021.
Figure 4.2: Amazon DevOps Guru Ongoing Reactive Insight Overview
Aggregated metrics
The Aggregated metrics tab displays metrics related to the insight. The table is grouped by AWS CloudFormation stacks and subsequent resources that created the metrics. In this example, the insight was a product of an anomaly in the “duration p50” metric generated by the “DevOpsGuruSample-AnomalousLambda” AWS Lambda function.
AWS Lambda duration metrics support percentile statistics, which exclude the outlier values that skew average and maximum statistics. The P50 statistic is the median, which makes it a useful middle estimate: 50% of the observed durations are above the P50 value and 50% are below it.
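To make the difference between P50 and the average concrete, here is a small sketch using invented invocation durations. The numbers are hypothetical, and Python's statistics.median is used as a stand-in for the percentile calculation, which may differ slightly from how Amazon CloudWatch computes percentiles.

```python
import statistics

# Hypothetical invocation durations in milliseconds: eight typical
# invocations and two slow outliers.
durations_ms = [102, 98, 105, 101, 99, 100, 2900, 103, 97, 3100]


def p50(values):
    """Return the median: half of the values fall below it, half above."""
    return statistics.median(values)


mean_ms = statistics.mean(durations_ms)
p50_ms = p50(durations_ms)

# The two outliers pull the mean far above what a typical invocation takes,
# while the P50 stays close to the typical duration.
print(f"mean = {mean_ms} ms, p50 = {p50_ms} ms")
```

Because P50 is barely moved by a few slow invocations, a sustained shift in "duration p50" indicates that typical invocations have become slower, which is the kind of change the anomalous deployment in this post introduces.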
The red lines on the timeline indicate spans of time when the “duration p50” metric emitted unusual values. Click the red line in the timeline in order to view detailed information.
- Choose View in CloudWatch to see how the metric looks in the CloudWatch console. For more information, see Statistics and Dimensions in the Amazon CloudWatch User Guide.
- Hover over the graph in order to view details about the anomalous metric data and when it occurred.
- Choose the box with the downward arrow to download a PNG image of the graph.
Figure 4.3: Amazon DevOps Guru Ongoing Reactive Insight Aggregated Metrics
Graphed anomalies
The Graphed anomalies tab displays detailed graphs for each of the insight’s anomalies. Because our insight consists of a single anomaly, there is one tile with details about unusual behavior detected in related metrics.
- Choose View all statistics and dimensions in order to see details about the anomaly. In the window that opens, you can:
  - Choose View in CloudWatch in order to see how the metric looks in the CloudWatch console.
  - Hover over the graph to view details about the anomalous metric data and when it occurred.
  - Choose Statistics or Dimension in order to customize the graph’s display. For more information, see Statistics and Dimensions in the Amazon CloudWatch User Guide.
Figure 4.4: Amazon DevOps Guru Ongoing Reactive Insight Graphed Anomaly
Related events
In Related events, view AWS CloudTrail events related to your insight. These events help you understand, diagnose, and address the underlying cause of the anomalous behavior. In this example, the events are:
- CreateFunction – when we created and deployed the AWS SAM template containing our AWS Lambda function.
- CreateChangeSet – when we pushed updates to our stack via the AWS SAM CLI.
- UpdateFunctionCode – when the AWS Lambda function code was updated.
Figure 4.5: Amazon DevOps Guru Ongoing Reactive Insight Related Events
Recommendations
The final section in the Insight Detail page is Recommendations. You can view suggestions that might help you resolve the underlying problem. When Amazon DevOps Guru detects anomalous behavior, it attempts to create recommendations. An insight might contain one, multiple, or zero recommendations.
In this example, the Amazon DevOps Guru recommendation matches the best resolution to our problem: Provisioned Concurrency.
Figure 4.6: Amazon DevOps Guru Ongoing Reactive Insight Recommendations
Understanding what happened
Amazon DevOps Guru recommends enabling Provisioned Concurrency for the AWS Lambda function in order to help it scale better when responding to concurrent requests. As mentioned earlier, Provisioned Concurrency keeps functions initialized by creating the requested number of execution environments so that they can respond to invocations. This is a suggested best practice when building high-traffic applications, such as the one that this sample is mimicking.
In the anomalous AWS Lambda function, we have sample code that is causing delays. This is analogous to application initialization logic within the handler function. It is a best practice for this logic to live outside of the handler function. Because we are mimicking a high-traffic application, the expectation is to receive a large number of concurrent requests. Therefore, it may be advisable to turn on Provisioned Concurrency for the AWS Lambda function. For Provisioned Concurrency pricing, refer to the AWS Lambda Pricing page.
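The difference between running initialization logic inside and outside the handler can be sketched as follows. This is not the code from the sample repository: expensive_setup, slow_handler, fast_handler, and the 50 ms delay are all hypothetical stand-ins chosen for illustration.

```python
import time

INIT_SECONDS = 0.05  # hypothetical cost of the initialization logic


def expensive_setup():
    """Stand-in for slow initialization, such as loading configuration."""
    time.sleep(INIT_SECONDS)
    return {"ready": True}


def slow_handler(event, context):
    """Anti-pattern: the setup cost is paid on every invocation."""
    config = expensive_setup()
    return {"statusCode": 200, "ready": config["ready"]}


# Preferred: run the setup once at module load. Warm invocations of the
# same execution environment reuse CONFIG instead of rebuilding it.
CONFIG = expensive_setup()


def fast_handler(event, context):
    """Best practice: the handler only does per-request work."""
    return {"statusCode": 200, "ready": CONFIG["ready"]}


def timed(handler, invocations=3):
    """Return the total wall-clock seconds spent across several invocations."""
    start = time.perf_counter()
    for _ in range(invocations):
        handler({}, None)
    return time.perf_counter() - start
```

Comparing timed(slow_handler) with timed(fast_handler) shows the per-invocation penalty of keeping the setup inside the handler. With the setup at module level, that cost is paid when an execution environment is initialized, which is the work Provisioned Concurrency performs ahead of time.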
Resolving the Anomaly
To resolve the sample application’s anomaly, we will update the AWS Lambda function code and enable provisioned concurrency for the AWS Lambda infrastructure.
- Open the sample repository in your IDE.
- Open the file generate-lambda-devopsguru-insights/lambda-code.py.
- Move lines 7-8, the code forcing the AWS Lambda function to respond slowly, above the lambda_handler function definition.
- Save the file.
- Open the file generate-lambda-devopsguru-insights/lambda-template.yaml.
- Uncomment lines 15-17, the code enabling provisioned concurrency in the sample AWS Lambda function.
- Save the file.
- Deploy these updates to your stack.
After completing these steps, the duration P50 metric will emit more typical results, thereby causing Amazon DevOps Guru to recognize the anomaly as fixed, and then close the reactive insight as shown below.
Figure 5: Amazon DevOps Guru Closed Reactive Insight
Clean Up
When you are finished walking through this post, you will have multiple test resources in your AWS account that should be cleaned up or un-provisioned in order to avoid incurring any further charges.
- Open the sample repository in your IDE.
- Use the AWS SAM CLI to delete the sample stack.
Conclusion
As seen in the example above, Amazon DevOps Guru can detect anomalous behavior in an AWS Lambda function, tie it to relevant events that introduced that anomaly, and provide recommendations for remediation by using its pre-trained ML models. All of this was possible by simply enabling Amazon DevOps Guru to monitor the resources with minimal configuration changes and no previous ML expertise. Start using Amazon DevOps Guru today.