Elevating Your AWS Observability: Unlocking the Power of Amazon CloudWatch Alarms

Organizations commonly use AWS services to improve the observability and operational excellence of their workloads. However, it is often unclear what actions teams should take when observability metrics reach them, and it can be difficult to tell which metrics require remediation and which are simply noise. For example, if an alarm takes 10 minutes or longer to trigger or notify, it delays the actions your team can take to remediate the underlying issue. The ideal solution is accurate alarming with a faster time to signal, so that the alarm reacts quickly enough to prevent a downstream network outage and minimize application downtime. A further complication is that, because of implementation or architecture limitations, metric data may always be ingested into CloudWatch with a two-minute delay, in which case the alarm never initiates.

Do you use Amazon CloudWatch Alarms to monitor AWS resources and take automated actions when a metric breaches a predefined threshold? Do you alarm on metrics or logs, combine alarms, and take specific actions when alarms trigger? Do you have use cases where you need to create alarms based on a metric math expression, a Metrics Insights query, or a connected data source? If so, this post will help you understand best practices for creating, managing, and using alarms at scale.

This post covers general use cases for alarm recommendations and dives deep into specific use cases, such as missing data scenarios and alarm configurations that give a faster time to signal.

The post will cover the following parts:

Common alarm recommendations that apply to all Amazon CloudWatch alarm configurations
Tuning existing alarms
Recommended alarm configuration when data is missing
Recommended alarm configuration to warn faster
Creating dynamic alarms using Metrics Insights alarms
Cleaning up low-value alarms

Alarm recommendations

If you want to set up CloudWatch alarms quickly while following monitoring best practices, use the alarm recommendations in the CloudWatch console. CloudWatch provides out-of-the-box alarm recommendations: alarms that we recommend you create for metrics published by other AWS services. These recommendations help you identify which metrics to alarm on and suggest the thresholds to set, so that you do not miss important monitoring of your AWS infrastructure. To find them, open the metrics section of the CloudWatch console and select the alarm recommendations filter toggle. You can also use the console to download infrastructure-as-code alarm definitions for the recommended alarms, and then use this code to create the alarms with AWS CloudFormation, the AWS CLI, or Terraform. Figure 1 shows the alarm recommendations toggle in the CloudWatch metrics console.

Figure 1: Alarm recommendations in CloudWatch Metrics console

Tuning your alarm

When you create an alarm, you specify the period, the evaluation periods (N), and the datapoints to alarm (M), which together determine when CloudWatch changes the alarm state. The main benefit of the M/N setting is that the alarm can change state when M datapoints are breaching, rather than requiring all N datapoints to breach. In other words, you decide how many of the N datapoints must breach before the state changes. Note that the alarm always needs N datapoints to calculate its state. If there are fewer than N datapoints in the window, the alarm fills in the remaining datapoints according to the treat missing data setting.

Figure 2: CloudWatch Alarm evaluation

The M/N setting helps prevent false alarm transitions when metrics arrive in CloudWatch with a delay. Delayed metrics can cause the metric repository to hold a temporarily incorrect picture of the metric value, which in turn can cause a false state transition. Most of the time, it is the most recent datapoint that is affected by the delay. This is why we recommend setting M < N, such as 2/3 instead of 3/3: the alarm can then reach a decision without depending on that most recent datapoint.

For example, consider an alarm with settings as:

Metrics: MyMetric
Threshold: >50
Period: 60(sec)
Statistic: SUM
Evaluation Period: 3
M / N: 2 / 3

Below are two possible windows of datapoints returned to the alarm:

1) [12, 13, 40, 50, 60, 90, 10, 20] ==> Although the window contains more datapoints than the configured evaluation periods (3), the alarm makes its state decision on the most recent N datapoints. In this case N is 3, so the alarm looks at the three most recent datapoints: 90, 10, 20. The alarm does not transition to ALARM state, because it needs 2 breaching datapoints and only 1 datapoint (90) is breaching.

2) [12, 13, 40, 50, 60, 90, 100, 20] ==> The state is ALARM, because 2 of the three most recent datapoints (90 and 100) are breaching.

If more than M datapoints are breaching, the alarm still transitions to ALARM state.
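The two windows above can be checked with a minimal Python sketch of the M-out-of-N rule. This is a simplified model for illustration, not CloudWatch's actual evaluation logic: the function name `evaluate_m_of_n` is hypothetical, and the sketch assumes that at least N datapoints are present and that none are missing.

```python
from typing import Sequence


def evaluate_m_of_n(datapoints: Sequence[float], threshold: float, m: int, n: int) -> str:
    """Return "ALARM" if at least m of the n most recent datapoints exceed threshold.

    Simplified illustration only: assumes every datapoint is present.
    """
    if not 1 <= m <= n:
        raise ValueError("require 1 <= m <= n")
    recent = datapoints[-n:]  # only the n most recent datapoints matter
    breaching = sum(1 for value in recent if value > threshold)
    return "ALARM" if breaching >= m else "OK"


# The two windows from the example: threshold > 50, M / N = 2 / 3.
window_1 = [12, 13, 40, 50, 60, 90, 10, 20]
window_2 = [12, 13, 40, 50, 60, 90, 100, 20]

print(evaluate_m_of_n(window_1, threshold=50, m=2, n=3))  # only 90 breaches
print(evaluate_m_of_n(window_2, threshold=50, m=2, n=3))  # 90 and 100 breach
```

The first call reports OK and the second reports ALARM, matching the reasoning in the two cases.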

Alarm configuration when data is missing

The treat missing data setting significantly influences how long an alarm takes to transition into ALARM state when datapoints are delayed, or when your service stops publishing datapoints to CloudWatch because it is down. For each alarm, you can tell CloudWatch to treat missing data (TMD) points as notBreaching, breaching, ignore, or missing. The default behavior is missing. This setting is useful in several scenarios. If missing data indicates a hazard, treat missing data as breaching. If you do not care about missing data, or its absence is a good thing, set treat missing data to notBreaching or ignore. Treat missing data only comes into play when there are not enough datapoints (N) to decide the state of the alarm. If only x datapoints are available (where x is less than N), the alarm applies the treat missing data setting to the remaining N – x datapoints.
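To make the N – x rule concrete, here is a small Python sketch that pads a short window with missing datapoints and then applies each treat missing data option. It is a deliberately simplified model under stated assumptions: the function `evaluate_with_missing` is hypothetical, `None` stands for a missing datapoint, and the handling of ignore and missing is reduced to the case where every datapoint is missing. CloudWatch's real evaluation considers more context than this sketch does.

```python
from typing import Optional, Sequence

TMD_OPTIONS = {"breaching", "notBreaching", "ignore", "missing"}


def evaluate_with_missing(
    datapoints: Sequence[Optional[float]],
    threshold: float,
    m: int,
    n: int,
    treat_missing: str = "missing",
    current_state: str = "OK",
) -> str:
    """Simplified M-out-of-N evaluation where None marks a missing datapoint."""
    if treat_missing not in TMD_OPTIONS:
        raise ValueError(f"unknown treat_missing value: {treat_missing}")
    recent = list(datapoints[-n:])
    # If only x < n datapoints exist, the other n - x count as missing.
    recent = [None] * (n - len(recent)) + recent
    present = [value for value in recent if value is not None]
    missing_count = n - len(present)

    if not present:
        # Assumption: with no data at all, "missing" reports INSUFFICIENT_DATA
        # and "ignore" keeps whatever state the alarm already had.
        if treat_missing == "missing":
            return "INSUFFICIENT_DATA"
        if treat_missing == "ignore":
            return current_state

    breaching = sum(1 for value in present if value > threshold)
    if treat_missing == "breaching":
        breaching += missing_count  # every missing datapoint counts as a breach
    return "ALARM" if breaching >= m else "OK"


# One healthy datapoint, then the service stops publishing (threshold > 80, 2/3).
window = [40.0, None, None]
for option in ("breaching", "notBreaching"):
    print(option, evaluate_with_missing(window, 80, m=2, n=3, treat_missing=option))
```

In this model the same window yields ALARM with breaching and OK with notBreaching, which is why the choice of setting matters most when a resource goes silent.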

Alarm configuration to warn faster

If you are looking for more accurate alarming and a faster time to signal, the most common obstacle is missing data: the metric's datapoints do not reach the alarm in time, either because they are always late or because the emitting service, application, or resource is down. Datapoints can be late because your metric data is always ingested into CloudWatch with a delay. In that case, the datapoints are backfilled after the evaluation for that time period is already complete, so the alarm never flips even though the backfilled datapoints are breaching.

You can use metric math to handle missing data in the alarm itself. The expression FILL(m1, REPEAT) repeats the last known value in place of missing datapoints, which is handy when data arrives with a delay. The expression FILL(m1, <breaching value>) substitutes a value that breaches the threshold, which forces the alarm to react faster when there is downtime.
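The following Python sketch imitates the effect of the two FILL variants on a series with gaps. It is an illustration of the idea only, not the CloudWatch implementation: the function `fill` is hypothetical, `None` stands for a missing datapoint, and the sketch assumes that REPEAT leaves a gap unfilled when no earlier value exists.

```python
from typing import List, Optional, Sequence, Union


def fill(series: Sequence[Optional[float]], filler: Union[float, str]) -> List[Optional[float]]:
    """Replace missing datapoints (None) with a constant or the last seen value."""
    filled: List[Optional[float]] = []
    last_seen: Optional[float] = None
    for value in series:
        if value is not None:
            last_seen = value
            filled.append(value)
        elif filler == "REPEAT":
            # Assumption: a gap before the first real value stays missing.
            filled.append(last_seen)
        else:
            filled.append(float(filler))
    return filled


cpu = [35.0, 85.0, None, None]  # the last two datapoints have not arrived

print(fill(cpu, "REPEAT"))  # gaps take the last known value, 85.0
print(fill(cpu, 90))        # gaps take the constant breaching value, 90.0
```

With a threshold of 80, both filled series end in breaching values, so an alarm evaluating them no longer has to wait for the missing datapoints.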

Let’s review a few use cases with suggested configurations to address these:

Use case 1:

Your EC2 instance is down which leads to missing data points, and your alarm configuration is as below:

Metric: EC2/CPUUtilization
Threshold: >80
Period: 60(sec)
Statistic: AVG
Evaluation Period: 3
M / N: 2 / 3
Treat Missing Data : Breaching

With the configuration shown, despite TMD set to “Breaching”, it would take the alarm 7 minutes to transition into ALARM state. This may not work for critical workloads as early detection of incidents and recovery is important to business and end customer experience.

Solution: we recommend using metric math (FILL, breaching value) to force a faster transition into ALARM state when there are missing datapoints. For example, the math expression FILL(m1,90), fills the missing values of the CPUUtilization metric with value 90. With this configuration the alarm transitions to ALARM state in 2 minutes compared to the above TMD option which takes 7 minutes.

Metric: EC2/CPUUtilization ## FILL(m1,90)
Threshold: >80
Period: 60(sec)
Statistic: AVG
Evaluation Period: 3
M / N: 2 / 3
Treat Missing Data: Breaching
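As a hedged sketch, the configuration above could be expressed as parameters for the boto3 `put_metric_alarm` call. The helper names, the alarm name, and the instance ID are placeholders invented for illustration. The metric `m1` is marked with `ReturnData: False` so that the alarm evaluates the FILL expression rather than the raw metric.

```python
from typing import Any, Dict


def build_fill_alarm_params(instance_id: str, fill_value: float = 90) -> Dict[str, Any]:
    """Build put_metric_alarm parameters for an alarm on FILL(m1, fill_value)."""
    return {
        "AlarmName": f"cpu-high-or-missing-{instance_id}",  # placeholder name
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 80.0,
        "EvaluationPeriods": 3,   # N
        "DatapointsToAlarm": 2,   # M
        "TreatMissingData": "breaching",
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/EC2",
                        "MetricName": "CPUUtilization",
                        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                    },
                    "Period": 60,
                    "Stat": "Average",
                },
                "ReturnData": False,  # input to the expression only
            },
            {
                "Id": "e1",
                "Expression": f"FILL(m1, {fill_value})",
                "Label": "CPUUtilization with gaps filled",
                "ReturnData": True,   # the alarm evaluates this series
            },
        ],
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Create the alarm; requires boto3 and AWS credentials."""
    import boto3  # imported here so building parameters needs no AWS setup

    boto3.client("cloudwatch").put_metric_alarm(**params)


params = build_fill_alarm_params("i-0123456789abcdef0")  # placeholder instance ID
print(params["Metrics"][1]["Expression"])
```

Calling `create_alarm(params)` would then create the alarm in the account and Region configured for boto3.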

Use case 2:

Your EC2 instance has breaching datapoints, but takes too long to notify and go into ALARM state.

Metric: EC2/CPUUtilization ## FILL(m1, REPEAT)
Threshold: >80
Period: 60(sec)
Statistic: AVG
Evaluation Period: 3
M / N : 2 / 3
Treat Missing Data: Breaching

Without the metric math expression, and even with TMD set to “Breaching”, it would take the alarm 7 minutes to transition into ALARM state. This may not work for critical workloads.

Solution: we recommend using metric math (FILL, REPEAT) to force a faster transition into ALARM state when there are breaching datapoints. For example, the math expression FILL(m1, REPEAT) fills missing values of the CPUUtilization metric by repeating the last known value, so a breaching value is carried forward. With this configuration the alarm transitions to ALARM state in 2 minutes, compared to the TMD-only option, which takes 7 minutes.

Use case 3:

Your metric data is always ingested into CloudWatch with a 2-minute delay, so the alarm never flips.

Solution: Widening the evaluation window (a larger N for the same M) helps in this situation. For example, setting M/N to 3/7 instead of 3/5 helps account for the delayed datapoints that are backfilled after 2 minutes.

All of the above solutions can be implemented at scale using AWS CloudFormation, the AWS Cloud Development Kit (CDK), or Terraform, which let you automate both the metric math expressions and the alarm creation process.

Dynamic alarms using Metrics Insights alarms

You can create an Amazon CloudWatch Metrics Insights alarm that covers entire fleets of dynamically changing resources across accounts with a single alarm, using standard SQL queries. By combining CloudWatch alarms with Metrics Insights queries, you can set up dynamic alarms that consistently monitor fast-moving environments and alert you when anomalies are detected. Two common use cases for Metrics Insights alarms are triggering an alarm when requests to any Amazon DynamoDB table in your account exceed the provisioned read capacity units for that table and result in a throttled event, and triggering an alarm when any Amazon ECS cluster in your account generates an HTTP 5XX response code. These use cases have been automated to optimize the alarm lifecycle.
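As an illustration of the DynamoDB use case, the sketch below builds boto3 `put_metric_alarm` parameters in which the Metrics Insights query is supplied as the expression. The query text, the alarm name, the threshold, and the M/N values are assumptions chosen for the example, not settings taken from this post. Verify the query in the Metrics Insights query editor before relying on it.

```python
from typing import Any, Dict

# Assumed query: total read throttle events across all tables in the account.
THROTTLE_QUERY = 'SELECT SUM(ReadThrottleEvents) FROM SCHEMA("AWS/DynamoDB", TableName)'


def build_metrics_insights_alarm_params(threshold: float = 0.0) -> Dict[str, Any]:
    """Build put_metric_alarm parameters for an alarm on a Metrics Insights query."""
    return {
        "AlarmName": "dynamodb-read-throttles-any-table",  # placeholder name
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold,
        "EvaluationPeriods": 3,   # N (assumed)
        "DatapointsToAlarm": 2,   # M (assumed)
        # Assumption: no throttle datapoints means nothing was throttled.
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {
                "Id": "q1",
                "Expression": THROTTLE_QUERY,
                "Period": 60,
                "ReturnData": True,
            }
        ],
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Create the alarm; requires boto3 and AWS credentials."""
    import boto3  # imported here so building parameters needs no AWS setup

    boto3.client("cloudwatch").put_metric_alarm(**params)


insights_params = build_metrics_insights_alarm_params()
print(insights_params["Metrics"][0]["Expression"])
```

Because the query is re-evaluated on each period, tables created after the alarm are picked up without editing the alarm, which is the property that makes this kind of alarm dynamic.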

Clean up low-value alarms

Cleaning up low-value or misconfigured alarms helps reduce your CloudWatch alarms spend. If you have thousands of Amazon CloudWatch alarms across AWS Regions, you may want to quickly identify which of them are low-value or misconfigured. You can automate this cleanup at scale and save costs by removing those alarms.
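One way to start is to flag candidates for review before deleting anything. The sketch below applies two heuristics that are assumptions of this example rather than an official definition of a low-value alarm: an alarm with no usable actions, and an alarm that has sat in INSUFFICIENT_DATA for longer than a chosen number of days. The function names and the 30-day default are hypothetical. The dictionaries follow the shape returned by the boto3 `describe_alarms` call.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable, List

ACTION_KEYS = ("AlarmActions", "OKActions", "InsufficientDataActions")


def find_low_value_alarms(
    alarms: Iterable[Dict[str, Any]], now: datetime, stale_after_days: int = 30
) -> Dict[str, List[str]]:
    """Map alarm names to the reasons they look like cleanup candidates."""
    cutoff = now - timedelta(days=stale_after_days)
    flagged: Dict[str, List[str]] = {}
    for alarm in alarms:
        reasons = []
        has_actions = any(alarm.get(key) for key in ACTION_KEYS)
        if not has_actions or not alarm.get("ActionsEnabled", True):
            reasons.append("no usable actions")
        if (
            alarm.get("StateValue") == "INSUFFICIENT_DATA"
            and alarm["StateUpdatedTimestamp"] <= cutoff
        ):
            reasons.append("stuck in INSUFFICIENT_DATA")
        if reasons:
            flagged[alarm["AlarmName"]] = reasons
    return flagged


def list_metric_alarms(region_name: str) -> List[Dict[str, Any]]:
    """Fetch all metric alarms in one Region; requires boto3 and AWS credentials."""
    import boto3  # imported here so the heuristic can be tested offline

    client = boto3.client("cloudwatch", region_name=region_name)
    alarms: List[Dict[str, Any]] = []
    for page in client.get_paginator("describe_alarms").paginate(AlarmTypes=["MetricAlarm"]):
        alarms.extend(page["MetricAlarms"])
    return alarms


# Offline example with hand-written alarm records.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
sample = [
    {"AlarmName": "healthy", "StateValue": "OK", "ActionsEnabled": True,
     "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:ops"],
     "StateUpdatedTimestamp": now - timedelta(days=2)},
    {"AlarmName": "orphaned", "StateValue": "INSUFFICIENT_DATA", "ActionsEnabled": True,
     "AlarmActions": [], "StateUpdatedTimestamp": now - timedelta(days=90)},
]
print(find_low_value_alarms(sample, now))
```

To cover several Regions, you could call `list_metric_alarms` once per Region and pass each result to `find_low_value_alarms`, reviewing the flagged names before removing any alarm.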

Summary

In this post, you learned essential tips and strategies for reliable monitoring using CloudWatch alarms. We covered general use cases for alarm recommendations and dove deep into specific use cases, such as missing data scenarios and alarm configuration to warn faster.

To learn more, review the AWS Observability best practices, the AWS One Observability workshop, and the AWS re:Invent Observability videos.

AWS Cloud Operations Blog