Optimizing alarm lifecycle with Amazon CloudWatch Metrics Insights alarms

Do you have entire fleets of dynamically changing resources that you struggle to monitor and set alarms on? Do you have a ton of dangling alarms that you are paying for and that are cluttering your view? Are you looking for a simpler way to create alarms that automatically adjust to resources that come and go?

This blog post will walk you through a recommended, cost-efficient approach using Amazon CloudWatch to reduce the risk of maintaining alarms on discontinued AWS resources, as well as the risk of leaving new AWS resources unmonitored.

This approach reduces the risk of paying for alarms on obsolete or discontinued metrics and resources, or for other low-value alarms, and it reduces clutter in your CloudWatch dashboard view. An aggregate alarm created from a Metrics Insights query has lower operational overhead and cost because it is simple and has a single definition. It automatically adjusts to AWS resources that come and go, which reduces the risk of dangling alarms.

Our previous blog post provides an automated solution for identifying and deleting low-value alarms. In this blog post, we will talk about how you can set up dynamic alarms that consistently monitor fast-moving environments and alert you when anomalies are detected.

Amazon CloudWatch Metrics Insights alarms enable customers to alarm on entire fleets of dynamically changing resources with a single alarm using standard SQL queries. CloudWatch Metrics Insights offers fast, flexible, SQL-based queries. By combining CloudWatch alarms with Metrics Insights queries, you can now set up dynamic alarms that consistently monitor fast-moving environments and alert you when anomalies are detected.

Common customer use cases

We are going to walk you through two common use cases where you need your alarms to quickly adapt to resource changes and where you may find it challenging to maintain your alarms manually. Both use cases will show you how alarming on Metrics Insights queries can help in those situations.

In the first use case, the alarm will trigger when requests to any Amazon DynamoDB table in your account exceed the provisioned read capacity units for that table and result in a throttle event. In the second use case, the alarm will trigger when any Amazon ECS cluster in your account generates an HTTP 5XX response code.

Use case 1: Detecting DynamoDB throttling

Let’s consider a common use case where you want to monitor read throttle events across all DynamoDB tables in your account. Throttling can happen when a DynamoDB table receives a higher volume of read requests than you have provisioned for it. This may cause your application to become unresponsive, or block new users or transactions.

One of the common ways to implement this monitoring is to aggregate the individual ‘ReadThrottleEvents’ metrics by using a metric math expression, and alarming on the result of that metric math expression.

The caveat with this approach is that when a new DynamoDB table is added, the math expression is not automatically updated. This leaves a blind spot over the new table, and you may miss errors on newly added resources. You are expected to manually update the math expression and add the metric reported by the new DynamoDB table. Similarly, when DynamoDB tables are removed, the math expression needs to be manually updated. In addition, if you need to aggregate more metrics than a single metric math expression allows, you may need to create two metric math based alarms instead of just one.
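To make the maintenance burden concrete, here is a minimal sketch of the per-table metric math definition described above. The table names are hypothetical, the 60-second period is an assumption, and the dictionaries follow the shape of the `Metrics` parameter of the CloudWatch `PutMetricAlarm` API. It is an illustration, not the definition used by any particular alarm.

```python
from typing import Any


def build_metric_math_queries(table_names: list[str]) -> list[dict[str, Any]]:
    """Build one ReadThrottleEvents entry per table plus a SUM expression over them."""
    queries: list[dict[str, Any]] = []
    for index, table in enumerate(table_names):
        queries.append({
            "Id": f"m{index}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/DynamoDB",
                    "MetricName": "ReadThrottleEvents",
                    "Dimensions": [{"Name": "TableName", "Value": table}],
                },
                "Period": 60,  # assumption: one-minute evaluation period
                "Stat": "Sum",
            },
            "ReturnData": False,  # only the aggregate below is alarmed on
        })
    # Metric math expression that sums every metric listed above.
    queries.append({
        "Id": "total",
        "Expression": "SUM(METRICS())",
        "Label": "TotalReadThrottleEvents",
        "ReturnData": True,
    })
    return queries


# Hypothetical table names, for illustration only.
before = build_metric_math_queries(["orders", "customers"])
after = build_metric_math_queries(["orders", "customers", "invoices"])
print(len(before), len(after))
```

Every table that is added or removed changes the list, so the alarm definition has to be regenerated and pushed again each time the fleet changes.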

With Metrics Insights alarms, you can set alarms using Metrics Insights queries that monitor multiple resources without having to worry about whether new resources are spun up or existing resources are deleted. In the example above, when a new DynamoDB table is added, the Metrics Insights alarm dynamically adapts to the change without requiring any manual intervention from the user.
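As an illustration, a Metrics Insights query for this use case could look like the string built below. The helper function is hypothetical, and the choice of `SUM` and of the `TableName` dimension in `SCHEMA` are assumptions; the query in the actual template may differ.

```python
def metrics_insights_query(function: str, metric: str, namespace: str, *dimensions: str) -> str:
    """Build a basic Metrics Insights query: SELECT FUNCTION(metric) FROM SCHEMA(...)."""
    schema_args = ", ".join([f'"{namespace}"', *dimensions])
    return f"SELECT {function}({metric}) FROM SCHEMA({schema_args})"


# Assumption: aggregate read throttle events over every table in the account and Region.
DDB_QUERY = metrics_insights_query("SUM", "ReadThrottleEvents", "AWS/DynamoDB", "TableName")
print(DDB_QUERY)
# SELECT SUM(ReadThrottleEvents) FROM SCHEMA("AWS/DynamoDB", TableName)
```

The query names no individual table, which is why the alarm definition does not need to change when tables are created or deleted.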

Use case 2: Respond to 5XX errors in ECS clusters

Let’s consider another use case where you want to get alerted when any ECS cluster in your account generates an HTTP 5XX response code.

In general, to do this, you first need to create a metric math expression that sums the individual ‘HTTPCode_Target_5XX_Count’ metrics reported for each ECS cluster, and then set an alarm on the result of this math expression.

The caveat with this approach is that when a new ECS cluster is added, the math expression is not automatically updated. This leaves a blind spot over the new ECS cluster, and you may miss errors on newly added resources. You are expected to manually update the math expression and add the metric reported by the new ECS cluster. Similarly, when ECS clusters are removed, the math expression needs to be manually updated.

With Metrics Insights alarms, you can set alarms using Metrics Insights queries that monitor multiple resources without having to worry about whether new resources are spun up or existing resources are deleted. In the example above, when a new ECS cluster is added, the Metrics Insights alarm dynamically adapts to the change and alerts when the metric breaches the threshold, without requiring any manual intervention from the user.
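A query for this use case might be written as shown below. This is a sketch under assumptions: `HTTPCode_Target_5XX_Count` is published in the `AWS/ApplicationELB` namespace, so the example assumes the ECS services sit behind Application Load Balancers and aggregates over the `LoadBalancer` dimension. The helper name, query ID, and 60-second period are hypothetical, and the template's own query may differ.

```python
from typing import Any

# Assumption: sum target 5XX responses across every load balancer in the account and Region.
ECS_5XX_QUERY = (
    'SELECT SUM(HTTPCode_Target_5XX_Count) '
    'FROM SCHEMA("AWS/ApplicationELB", LoadBalancer)'
)


def as_alarm_metric(query: str, period_seconds: int = 60) -> dict[str, Any]:
    """Wrap a Metrics Insights query as a single entry for an alarm's Metrics list."""
    return {
        "Id": "q1",                # hypothetical query ID
        "Expression": query,       # the Metrics Insights query goes in Expression
        "Period": period_seconds,  # assumption: one-minute granularity
        "ReturnData": True,
    }


print(as_alarm_metric(ECS_5XX_QUERY))
```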

Figure 1: Metrics Insights – query builder

Solution overview

This solution creates Metrics Insights alarms for the two use cases discussed above. It provisions a Metrics Insights alarm, ‘DDBReadThrottleAlarm’, to monitor and alarm on the ‘ReadThrottleEvents’ metric, and a second alarm, ‘ECSTarget5XXAlarm’, to monitor and alarm on the ‘HTTPCode_Target_5XX_Count’ metric. You can configure the alarm threshold values when you launch the AWS CloudFormation template. The solution also provisions an Amazon SNS topic that sends a notification when an alarm triggers, and you can configure the recipient email address as part of the launch process. This solution can be extended to other AWS services or metrics relevant to your use case.
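For readers who prefer to see the alarm definitions outside CloudFormation, the following sketch assembles equivalent arguments for the CloudWatch `PutMetricAlarm` API through boto3. Only the alarm names and metric names come from this post. The queries, comparison operator, evaluation settings, thresholds, and topic ARN are assumptions or placeholders, and the template itself may configure the alarms differently.

```python
from typing import Any


def build_alarm_kwargs(name: str, query: str, threshold: float, topic_arn: str) -> dict[str, Any]:
    """Assemble keyword arguments for put_metric_alarm with a Metrics Insights query."""
    return {
        "AlarmName": name,
        "ComparisonOperator": "GreaterThanThreshold",  # assumption
        "Threshold": threshold,
        "EvaluationPeriods": 1,                        # assumption
        "TreatMissingData": "notBreaching",            # assumption
        "AlarmActions": [topic_arn],                   # notify the SNS topic
        "Metrics": [{
            "Id": "q1",
            "Expression": query,
            "Period": 60,                              # assumption
            "ReturnData": True,
        }],
    }


def create_alarm(kwargs: dict[str, Any]) -> None:
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**kwargs)


# Placeholder ARN and hypothetical thresholds, for illustration only.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:AlarmNotificationTopic"
ddb_alarm = build_alarm_kwargs(
    "DDBReadThrottleAlarm",
    'SELECT SUM(ReadThrottleEvents) FROM SCHEMA("AWS/DynamoDB", TableName)',
    0,
    TOPIC_ARN,
)
ecs_alarm = build_alarm_kwargs(
    "ECSTarget5XXAlarm",
    'SELECT SUM(HTTPCode_Target_5XX_Count) FROM SCHEMA("AWS/ApplicationELB", LoadBalancer)',
    10,
    TOPIC_ARN,
)
```

Calling `create_alarm(ddb_alarm)` would then submit the definition. Keeping the builder separate from the API call makes the definition easy to inspect or test without touching an AWS account.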

Deploying the solution

This solution and associated resources are available for you to deploy into your own AWS account as an AWS CloudFormation template.

Prerequisites

For this walkthrough, you should have the following prerequisites:

An AWS account
Existing Amazon DynamoDB tables and Amazon ECS clusters

What will the CloudFormation template deploy?

The CloudFormation template will deploy the following resources into the AWS account:

Amazon CloudWatch Metrics Insights alarms
- DDBReadThrottleAlarm – Monitors the ReadThrottleEvents metric and alerts when a read throttle event is generated in any DynamoDB table in this account
- ECSTarget5XXAlarm – Monitors the HTTPCode_Target_5XX_Count metric and alerts when any ECS cluster in this account generates an HTTP 5XX response code
- This CloudFormation template can be modified to use any metric of your choice
Amazon SNS Topic
- AlarmNotificationTopic – Sends an email notification when an alarm is triggered

How to deploy the CloudFormation template

Download the yaml file.
Navigate to the CloudFormation console in your AWS Account.
Choose Create stack.
Choose Template is ready, upload a template file, and navigate to the yaml file that you just downloaded.
Choose Next.
Give the stack a name (max. length 30 characters).
For the parameter ‘EmailToNotifyForAlarms’, enter the email address to notify for alarms. For the parameters ‘DDBReadThrottleThreshold’ and ‘ECSTarget5XXThreshold’, enter the respective alarm threshold values based on your use case.
Choose Submit.
Wait for the stack creation to complete.
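If you would rather script the console steps above, a rough boto3 equivalent is sketched below. The parameter names and the 30-character stack name limit come from this post. The stack name, email address, threshold values, and template body are placeholders, and the helper functions are hypothetical.

```python
from typing import Any

MAX_STACK_NAME_LENGTH = 30  # limit stated in the deployment steps


def build_create_stack_kwargs(
    stack_name: str,
    template_body: str,
    email: str,
    ddb_threshold: int,
    ecs_threshold: int,
) -> dict[str, Any]:
    """Assemble keyword arguments for CloudFormation's create_stack call."""
    if not 0 < len(stack_name) <= MAX_STACK_NAME_LENGTH:
        raise ValueError(f"stack name must be 1-{MAX_STACK_NAME_LENGTH} characters")
    parameters = {
        "EmailToNotifyForAlarms": email,
        "DDBReadThrottleThreshold": str(ddb_threshold),
        "ECSTarget5XXThreshold": str(ecs_threshold),
    }
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            {"ParameterKey": key, "ParameterValue": value}
            for key, value in parameters.items()
        ],
    }


def deploy(kwargs: dict[str, Any]) -> None:
    """Create the stack and wait for completion. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without AWS access

    cloudformation = boto3.client("cloudformation")
    cloudformation.create_stack(**kwargs)
    cloudformation.get_waiter("stack_create_complete").wait(StackName=kwargs["StackName"])


# Placeholder values; pass the contents of the downloaded YAML file as the template body.
example = build_create_stack_kwargs(
    "mi-alarms", "<contents of the downloaded YAML file>", "ops@example.com", 0, 10
)
```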

Costs

There is a cost associated with using this solution, based on the number of DynamoDB tables and ECS clusters in your account and Region. Metrics Insights query alarms incur costs for each metric analyzed by the query. Refer to the CloudWatch pricing page for details, and to the Amazon SNS pricing page for notification pricing details.
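Because the charge scales with the number of metrics each query analyzes, a rough estimate is a simple multiplication. The sketch below assumes a flat per-metric price. The function is hypothetical, and the price and metric counts in the example are placeholders, not actual AWS prices, so substitute the current figure from the CloudWatch pricing page.

```python
def estimate_alarm_cost(metrics_analyzed: int, price_per_metric: float) -> float:
    """Estimate the charge for one Metrics Insights alarm under a flat per-metric price."""
    if metrics_analyzed < 0 or price_per_metric < 0:
        raise ValueError("inputs must be non-negative")
    return metrics_analyzed * price_per_metric


# Placeholder numbers: 25 tables and 8 load balancers, one metric analyzed per resource.
PLACEHOLDER_PRICE = 0.5  # NOT an actual AWS price
total = (
    estimate_alarm_cost(25, PLACEHOLDER_PRICE)
    + estimate_alarm_cost(8, PLACEHOLDER_PRICE)
)
print(total)
```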

Cleanup

If you decide that you no longer want to keep the CloudWatch Metrics Insights alarms and associated resources, you can navigate to CloudFormation in the AWS Console, choose the stack (using the name you gave it at deployment), and choose Delete. All of the resources created by that stack will be deleted.

Should you want to add these CloudWatch Metrics Insights alarms back at any point, you can create the stack again from the CloudFormation yaml file.

Conclusion

You can use this solution to create high-value alarms that monitor thousands of resources with a single alarm, which dynamically adjusts to short-lived resources as they come and go. Because a CloudWatch Metrics Insights alarm is simple and has a single definition, customers no longer have to perform operational maintenance to clean up alarms on obsolete or discontinued metrics and resources.

AWS Cloud Operations & Migrations Blog