AWS Machine Learning Blog

Delivering operational insights directly to your on-call team by integrating Amazon DevOps Guru with Atlassian Opsgenie

As organizations continue to adopt microservices, the number of disparate services that contribute to delivering applications increases, and the volume of signals that on-call teams must monitor grows with it. It’s becoming more important than ever for these teams to have tools that can quickly and autonomously detect anomalous behaviors across the services they support. Amazon DevOps Guru uses machine learning (ML) to quickly identify when your applications are behaving outside of their normal operating patterns, and may even predict these anomalous behaviors before they become a problem. You can deliver these insights in near-real-time directly to your on-call teams by integrating DevOps Guru with Atlassian Opsgenie, allowing them to immediately react to critical anomalies.

Opsgenie is an alert management solution that ensures critical alerts are delivered to the right person in your on-call team, and includes a preconfigured integration for DevOps Guru. This makes it easy to configure the delivery of notifications from DevOps Guru to Opsgenie via Amazon Simple Notification Service (Amazon SNS) in three simple steps. This post will walk you through configuring the delivery of these notifications.

Configuring DevOps Guru Integration

To start integrating DevOps Guru with Opsgenie, complete the following steps:

  1. On the Opsgenie console, choose Settings.
  2. In the navigation pane, choose Integration list.
  3. Filter the list of built-in integrations by DevOps Guru.
  4. Hover over Amazon DevOps Guru and choose Add.

This integration has been pre-configured with a set of defaults that work for many teams. However, you can also customize the integration settings to meet your needs on the Advanced configuration page.

  5. When you’re ready, assign the integration to a team.
  6. Save a copy of the subscription URL (you will need this later).
  7. Choose Save Integration.

Creating an SNS topic and subscribing Opsgenie

To configure Amazon SNS notifications, complete the following steps:

  1. On the Amazon SNS console, choose Topics.
  2. Choose Create topic.
  3. For Type, select Standard.
  4. For Name, enter a name, such as operational-insights.
  5. Leave the default settings as they are or configure them to suit your needs.
  6. Choose Create topic.
  7. After the topic has been created, scroll down to the Subscriptions section and choose Create subscription.
  8. For Protocol, choose HTTPS.
  9. For Endpoint, enter the subscription URL you saved earlier.
  10. Leave the remaining options as the defaults, or configure them to meet your needs.
  11. Choose Create subscription.
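
The console steps above can also be scripted. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and its SNS client methods create_topic and subscribe. The function name create_topic_and_subscribe is hypothetical, and the client is passed in as a parameter so the sketch can be exercised without AWS credentials.

```python
def create_topic_and_subscribe(sns_client, topic_name, endpoint_url):
    """Create a standard SNS topic and subscribe an HTTPS endpoint to it.

    sns_client is expected to behave like boto3.client("sns").
    Returns a (topic_arn, subscription_arn) tuple.
    """
    if not endpoint_url.startswith("https://"):
        raise ValueError("endpoint must be an HTTPS URL")

    # Topics are of type Standard unless FIFO attributes are supplied.
    topic_arn = sns_client.create_topic(Name=topic_name)["TopicArn"]

    subscription = sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol="https",
        Endpoint=endpoint_url,
        # Ask for the ARN even though the subscription is not yet confirmed.
        ReturnSubscriptionArn=True,
    )
    return topic_arn, subscription["SubscriptionArn"]


# Example (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   create_topic_and_subscribe(
#       boto3.client("sns"), "operational-insights", subscription_url
#   )
```

Here subscription_url stands for the subscription URL you saved from the Opsgenie integration page.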

Upon creating the subscription, Amazon SNS sends a confirmation message to your Opsgenie integration, which Opsgenie automatically acknowledges on your behalf.
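
To illustrate what that confirmation handshake involves, here is a minimal sketch of how a generic HTTPS endpoint might classify the JSON messages that Amazon SNS posts to it. This is not Opsgenie's implementation, which the post does not describe. The function name classify_sns_message and the returned dictionary shape are hypothetical; the Type, SubscribeURL, and Message fields are part of the standard SNS HTTP/S message format.

```python
import json


def classify_sns_message(body: str) -> dict:
    """Decide what to do with a message SNS posted to an HTTPS endpoint."""
    message = json.loads(body)
    message_type = message.get("Type")

    if message_type == "SubscriptionConfirmation":
        # Visiting SubscribeURL confirms the subscription.
        return {"action": "confirm", "url": message["SubscribeURL"]}
    if message_type == "Notification":
        # Message carries the published payload as a string.
        return {"action": "process", "payload": message["Message"]}
    return {"action": "ignore"}
```

A production endpoint would also verify the message signature before acting on it; that step is omitted here for brevity.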

Opsgenie is now ready to receive notifications from DevOps Guru, and there’s just one thing left to do: configure DevOps Guru to monitor your resources and send notifications to your newly created SNS topic.

Setting up Amazon DevOps Guru

The first time you browse to the DevOps Guru console, you will need to enable DevOps Guru to operate on your account.

  1. On the DevOps Guru console, choose Get Started.

If you have already enabled DevOps Guru, you can add your SNS topic by choosing Settings on the DevOps Guru console, and then skip to step 3.

  2. Select the resources you want to monitor (for this post, we chose Analyze all AWS resources in the current AWS account).
  3. For Choose an SNS notification topic, select Select an existing SNS topic.
  4. For Choose a topic in your AWS account, choose the topic you created earlier (operational-insights).
  5. Choose Add SNS topic.
  6. Choose Enable (or Save if you have already enabled the service).

DevOps Guru starts monitoring your resources and learning what’s normal behavior for your applications.
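
If you prefer to attach the SNS topic programmatically, the following sketch shows one way to do it, assuming the boto3 devops-guru client and its add_notification_channel operation. The function name add_sns_channel is hypothetical, and the sketch covers only the notification channel, not the resource selection made in the console.

```python
def add_sns_channel(devops_guru_client, topic_arn):
    """Register an SNS topic as a DevOps Guru notification channel.

    devops_guru_client is expected to behave like
    boto3.client("devops-guru"). Returns the new channel ID.
    """
    parts = topic_arn.split(":")
    if len(parts) < 6 or parts[0] != "arn" or parts[2] != "sns":
        raise ValueError("expected an SNS topic ARN")

    response = devops_guru_client.add_notification_channel(
        Config={"Sns": {"TopicArn": topic_arn}}
    )
    return response["Id"]


# Example (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   add_sns_channel(boto3.client("devops-guru"), topic_arn)
```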

Sample application

For DevOps Guru to have an application to monitor, I use AWS CodeStar to build and deploy a simple web service application using Amazon API Gateway and AWS Lambda. The service simply returns a random number.
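
The post doesn't include the service's source, so the following is only a sketch of what such a Lambda function might look like. It assumes a Python runtime and an API Gateway Lambda proxy integration; the 0 to 100 range and the number key in the response body are illustrative choices.

```python
import json
import random


def handler(event, context):
    """Return a random number in an API Gateway proxy-style response."""
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        # The range and key name are illustrative, not from the original app.
        "body": json.dumps({"number": random.randint(0, 100)}),
    }
```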

After deploying my app, I configure a simple load test to run endlessly, and leave it running for a few hours to allow DevOps Guru to baseline the behavior of my app.
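
The load test tool used here isn't named, so as a rough stand-in, the sketch below sends requests in a loop using only the Python standard library. The function names, the half-second delay, and the max_requests parameter (None means run endlessly) are all assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url):
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx and 5xx responses; keep the status code.
        return error.code
    except urllib.error.URLError:
        return 0


def run_load(url, max_requests=None, delay_seconds=0.5, fetch=fetch_status):
    """Call the URL repeatedly and count responses by status code."""
    counts = {}
    sent = 0
    while max_requests is None or sent < max_requests:
        status = fetch(url)
        counts[status] = counts.get(status, 0) + 1
        sent += 1
        time.sleep(delay_seconds)
    return counts
```

Calling run_load(api_url) with the default max_requests runs until interrupted, where api_url is a placeholder for your API Gateway endpoint.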

Generating Insights

Now that my app has been running for a while, it’s time to change the behavior of my application to generate an insight. To do this, I deployed a small code change that introduces some random latency and HTTP 5xx errors.
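
The exact change isn't shown in this post, but a fault-injecting version of the handler could look like the sketch below. The error rate, the maximum delay, and the rng and sleep parameters (included so the behavior can be controlled in tests) are hypothetical. It assumes a Lambda proxy integration, where the statusCode returned by the function is passed through to the caller.

```python
import json
import random
import time


def handler(event, context, rng=random, sleep=time.sleep,
            error_rate=0.2, max_delay_seconds=2.0):
    """Return a random number, with injected latency and occasional errors."""
    # Random latency: pause for somewhere between 0 and max_delay_seconds.
    sleep(rng.uniform(0, max_delay_seconds))

    # Random failure: respond with HTTP 500 for a fraction of requests.
    if rng.random() < error_rate:
        return {
            "statusCode": 500,
            "body": json.dumps({"error": "injected failure"}),
        }

    return {
        "statusCode": 200,
        "body": json.dumps({"number": rng.randint(0, 100)}),
    }
```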

Soon after, Opsgenie sends an alert to my phone, triggered by an insight from DevOps Guru. The following screenshot shows the alert I received in Opsgenie.

From this alert, I can see that there is an anomaly in the latency of my random number service. Choosing the InsightUrl provided in the alert directs me to the DevOps Guru console, where I can start digging into the events and metrics that led to this insight being generated.

The Relevant events page shows an indicator of the events that occurred in the lead-up to the change in behavior. In my case, the key event was a deployment triggered by the update to the code in the Lambda function.

The DevOps Guru Insights page also provides the pertinent metrics that can be used to further highlight the behavior change. In my case, the duration of my Lambda function and the number of API Gateway 5xx errors had increased.

Resolving the error

Now that I’ve investigated the cause of the anomalous behavior, I resolve it by rolling back the code and redeploying. Shortly after, my application returns to normal behavior. DevOps Guru automatically resolves the insight and sends a notification to Opsgenie, closing the related alert.
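
To make the open and close flow concrete, here is a hedged sketch of how a receiver might map a DevOps Guru notification to an alert action. It is not Opsgenie's actual rule set. Only InsightUrl is taken from the alert described in this post; the MessageType and InsightId field names and the NEW_INSIGHT and CLOSED_INSIGHT values are assumptions that you should check against a real notification payload.

```python
import json

# Assumed MessageType values; verify against an actual notification.
ACTIONS = {"NEW_INSIGHT": "create", "CLOSED_INSIGHT": "close"}


def alert_action(notification_body: str) -> dict:
    """Map a DevOps Guru notification (JSON string) to an alert action."""
    insight = json.loads(notification_body)
    return {
        # Any other message type is treated as an update to an open alert.
        "action": ACTIONS.get(insight.get("MessageType"), "update"),
        # Reusing the insight ID lets the close match the earlier alert.
        "alias": insight.get("InsightId"),
        "url": insight.get("InsightUrl"),
    }
```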

To confirm that the application is behaving normally again, I return to the Insights page and check the pertinent metrics, where I can see that they have indeed returned to normal.

If you plan on testing DevOps Guru in this way, keep in mind that the service learns the behavior of your app over time, and a continual break-and-fix cycle in your app may eventually be considered normal behavior and stop generating new insights.

Conclusion

Amazon DevOps Guru continuously analyzes streams of disparate data and monitors thousands of metrics to establish normal application behavior. It’s available now in preview, and the Atlassian Opsgenie integration for Amazon DevOps Guru is also available to use now. Opsgenie centralizes alerts from monitoring, logging, and ITSM tools so Dev and IT Ops teams can stay aware and in control. Opsgenie’s flexible rules engine ensures critical alerts are never missed, and the right person is notified at the right time via email, phone, SMS, or mobile push notifications.

Sign up for the Amazon DevOps Guru preview today and start delivering insights about your applications directly to your on-call teams for immediate investigation and remediation using Opsgenie.

 


About the Author

Adam Strickland is a Principal Solutions Architect based in Sydney, Australia. He has a strong background in software development across research and commercial organizations, and has built and operated global SaaS applications. Adam is passionate about helping Australian organizations build better software, and scaling their SaaS applications to a global audience.