AWS Partner Network (APN) Blog
How Blue People Detected Application Anomalies Using Insights from Amazon DevOps Guru
By Harish Vaswani, Principal Cloud Application Architect – AWS
By Aldo Lares, CTO – Blue People
Amazon DevOps Guru is a machine learning (ML)-powered service that detects abnormal application behavior and provides insights about the anomalous behavior. These insights are supported with metrics and events related to the anomaly and recommendations to help address and mitigate the anomalous behavior.
This post describes how Blue People used Amazon DevOps Guru’s insights to identify the root cause for a non-responsive application that was otherwise hard to detect.
Blue People is an AWS Advanced Tier Services Partner that provides software development, nearshoring, recruiting, and consulting services.
Problem Overview
Blue People was developing applications for a customer, and those applications were deployed on Amazon Elastic Compute Cloud (Amazon EC2) instances.
The customer started reporting an issue where end users were having difficulty loading some of the pages of an application. However, when the application development team at Blue People tested this locally and in the environment where users had the issue, they were able to get successful responses.
The team also tried to get more information by looking at Amazon CloudWatch logs and attempted to reproduce the issue by creating a replica of the EC2 instance and deploying the application to the replica instance.
After not having much success in detecting the root cause or reproducing the issue, Blue People decided to enable Amazon DevOps Guru in the customer environment.
Detecting the Anomaly Using Amazon DevOps Guru
Amazon DevOps Guru can be enabled by choosing to monitor all Amazon Web Services (AWS) resources in the current account, or specifying a coverage boundary using AWS CloudFormation stacks or tags.
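As an illustration of the stack-based coverage boundary, the following is a minimal sketch that assumes the boto3 SDK and its `devops-guru` client. The stack name used in the usage note is hypothetical, and the post does not say whether Blue People used the console or the API.

```python
from typing import Any, Dict, List


def coverage_request(stack_names: List[str]) -> Dict[str, Any]:
    """Build arguments for the DevOps Guru UpdateResourceCollection call.

    The coverage boundary is the set of CloudFormation stacks named here.
    """
    if not stack_names:
        raise ValueError("at least one CloudFormation stack name is required")
    return {
        "Action": "ADD",
        "ResourceCollection": {"CloudFormation": {"StackNames": list(stack_names)}},
    }


def enable_coverage(client: Any, stack_names: List[str]) -> None:
    # `client` is expected to be boto3.client("devops-guru").
    client.update_resource_collection(**coverage_request(stack_names))
```

With credentials configured, a call such as `enable_coverage(boto3.client("devops-guru"), ["customer-app-stack"])` would add that (hypothetical) stack to the coverage boundary.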
Blue People enabled Amazon DevOps Guru for all supported resources within the customer’s account. After enabling Amazon DevOps Guru, the service generated a reactive insight for the anomalous behavior.
Figure 1 – Reactive insight overview.
The reactive insight's overview summarized the anomalous behavior and listed the relevant timestamps.
It also provided information about the type of the issue in its name: ApplicationELB HTTPCode_ELB_5XX_Count. This gave an indication there were 5XX errors being generated by the Elastic Load Balancing (ELB) service.
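The same insight can also be retrieved programmatically. The sketch below assumes the boto3 `devops-guru` client and its `list_insights` operation. Filtering on the substring "5XX" in the insight name is only an illustrative heuristic based on the name shown above, not a documented naming contract.

```python
from typing import Any, Dict, List


def ongoing_reactive_insights(client: Any) -> List[Dict[str, Any]]:
    """Collect all ongoing reactive insights, following NextToken pagination."""
    kwargs: Dict[str, Any] = {"StatusFilter": {"Ongoing": {"Type": "REACTIVE"}}}
    insights: List[Dict[str, Any]] = []
    while True:
        page = client.list_insights(**kwargs)
        insights.extend(page.get("ReactiveInsights", []))
        token = page.get("NextToken")
        if not token:
            return insights
        kwargs["NextToken"] = token


def insights_mentioning_5xx(insights: List[Dict[str, Any]]) -> List[str]:
    """Return the names of insights that mention 5XX errors (heuristic)."""
    return [i["Name"] for i in insights if "5XX" in i.get("Name", "")]
```

Here `client` would be `boto3.client("devops-guru")`. Each returned insight also carries fields such as `Id`, `Severity`, and `InsightTimeRange`.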
Figure 2 – Aggregated metrics and relevant events.
The aggregated metrics section within the Amazon DevOps Guru console showed the ELB was generating HTTP 504 errors for an associated resource, with the timeline view showing the span of time over which the errors occurred. HTTP 504 errors are gateway timeout errors, which can occur when the backend application server instances fail to respond within the configured ELB idle timeout.
After gaining insight about the error and the specific timespan in which Amazon DevOps Guru reported the anomaly, Blue People discovered the backend EC2 instances were being rebooted manually through the console at recurring intervals, which prevented users from accessing the applications while the instances restarted.
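The 504 count behind this view is also published to Amazon CloudWatch. The following sketch assumes an Application Load Balancer, the boto3 `cloudwatch` client, and the `HTTPCode_ELB_504_Count` metric in the `AWS/ApplicationELB` namespace. The load balancer identifier, the 24-hour lookback, and the 5-minute period are illustrative choices, not values from the post.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List


def elb_504_query(load_balancer: str, end: datetime, hours: int = 24) -> Dict[str, Any]:
    """Build GetMetricStatistics arguments for ELB-generated 504s in 5-minute sums."""
    return {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_ELB_504_Count",
        # Value looks like "app/<name>/<id>", taken from the load balancer ARN.
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,
        "Statistics": ["Sum"],
    }


def periods_with_504s(client: Any, load_balancer: str, end: datetime,
                      hours: int = 24) -> List[datetime]:
    """Return the start times of periods with at least one 504, oldest first."""
    response = client.get_metric_statistics(**elb_504_query(load_balancer, end, hours))
    return sorted(p["Timestamp"] for p in response["Datapoints"] if p["Sum"] > 0)
```

Here `client` would be `boto3.client("cloudwatch")`. Periods in which no 504 occurred typically produce no datapoint at all, so the result is a sparse list of error periods.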
The timeline view in the aggregated metrics section gave Blue People a good visual of the pattern it could investigate further to determine the root cause of the problem.
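The post does not describe how the reboots were matched to the error pattern. One way to check such a hypothesis is to compare reboot timestamps (for example, from AWS CloudTrail `RebootInstances` events) with the start times of error periods. The sketch below is a hypothetical helper, not Blue People's method. The 5-minute bucket length and 10-minute window are assumptions.

```python
from datetime import datetime, timedelta
from typing import List


def reboots_near_errors(reboot_times: List[datetime],
                        error_bucket_starts: List[datetime],
                        bucket: timedelta = timedelta(minutes=5),
                        window: timedelta = timedelta(minutes=10)) -> List[datetime]:
    """Return reboots with an error bucket overlapping [reboot, reboot + window]."""
    matched = []
    for reboot in reboot_times:
        for start in error_bucket_starts:
            # The bucket spans [start, start + bucket); test interval overlap.
            if start + bucket > reboot and start <= reboot + window:
                matched.append(reboot)
                break
    return matched
```

If most reboots have a matching error bucket, and most error buckets sit near a reboot, the reboots are a plausible cause worth investigating further.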
Resolving the Anomaly
Amazon DevOps Guru attempts to create recommendations when it detects anomalous behavior. In this case, Amazon DevOps Guru recommended troubleshooting errors and health check failures in the Elastic Load Balancer.
Figure 3 – Remediation recommendations.
After learning about the root cause of the issue, Blue People made changes to the ELB configuration to ensure the customer’s applications were highly available. The team also enabled Amazon DevOps Guru on all resources within the customer account and set up notifications for any anomaly detected in those resources.
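Amazon DevOps Guru delivers notifications through Amazon Simple Notification Service (Amazon SNS) topics. As a minimal sketch, assuming the boto3 `devops-guru` client and an SNS topic that already exists (the topic ARN in the usage note is hypothetical), a notification channel could be registered like this:

```python
from typing import Any, Dict


def sns_channel_request(topic_arn: str) -> Dict[str, Any]:
    """Build arguments for the DevOps Guru AddNotificationChannel call."""
    # Light sanity check only; the service performs the real validation.
    if not topic_arn.startswith("arn:") or ":sns:" not in topic_arn:
        raise ValueError("expected an Amazon SNS topic ARN")
    return {"Config": {"Sns": {"TopicArn": topic_arn}}}


def add_sns_channel(client: Any, topic_arn: str) -> str:
    """Register the topic and return the new notification channel ID."""
    # `client` is expected to be boto3.client("devops-guru").
    return client.add_notification_channel(**sns_channel_request(topic_arn))["Id"]
```

For example, `add_sns_channel(boto3.client("devops-guru"), "arn:aws:sns:us-east-1:111122223333:devops-guru-alerts")` would subscribe that topic to insight notifications, provided the topic's access policy allows DevOps Guru to publish to it.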
Once the issues creating the anomalies were resolved, Amazon DevOps Guru reported a healthy tile for the ELB service.
Figure 4 – Service health.
Conclusion
This post described how Blue People used Amazon DevOps Guru’s reactive insights to learn about an application’s anomalous behavior in a customer account and identify its root cause.
Blue People benefitted from Amazon DevOps Guru’s full-picture view of the anomaly, including the relevant events and the metrics associated with the issue. The service also provided recommendations on what actions the user should take to mitigate the behavior.
Since enabling Amazon DevOps Guru was quick and did not need any configuration, Blue People was able to identify the root cause and resolve the issue quickly, thereby reducing its mean time to resolve (MTTR) metric.
Blue People – AWS Partner Spotlight
Blue People is an AWS Advanced Tier Services Partner and IT company providing software development, nearshoring, recruiting, and consulting services.