Why did my CloudWatch alarm trigger when its metric doesn't have any breaching data points?

Last updated: 2021-03-23

My Amazon CloudWatch alarm changed to the ALARM state. When I check the alarm metric, I don't see any breaching data points. However, the event history for the alarm shows the breaching data point. Why did my CloudWatch alarm trigger when its metric doesn't have any breaching data points?

Short description

CloudWatch alarms evaluate metrics based on data points available at a specific moment. Each subsequent alarm evaluation might use different aggregated data points because new values continue to flow into the CloudWatch metric. You might be unable to see the breaching data point that triggered your alarm because values that arrived after the evaluation changed the aggregated data point. The alarm's event history records the aggregated value as it was at evaluation time, while the metric's graph reflects the complete set of values that have since flowed into the metric.

Resolution

Find a breaching data point

To find a breaching data point in your CloudWatch alarm metric's graph, change the Statistic to Maximum (or to Minimum, if your alarm triggers when data falls below the threshold).

Example alarm configuration:

  • Standard resolution alarm (evaluates the metric every minute)
  • Metric is CPUUtilization
  • Threshold is 65%
  • Statistic is Average
  • Period is 60 seconds
  • Evaluation Period is 1
  • Detailed Monitoring is enabled for the monitored Amazon Elastic Compute Cloud (Amazon EC2) instance

When the example alarm first evaluates the period 12:00:00 - 12:01:00 UTC, the metric has received the following values:

Sample-1: 12:00:07 UTC, numeric value: 89.76470588235294
Sample-2: 12:00:11 UTC, numeric value: 27.926666666666664
Sample-3: 12:00:19 UTC, numeric value: 54.57142857142857
Sample-4: 12:00:35 UTC, numeric value: 95.473333333333336

The average of those values is 66.934, which breaches the threshold of 65%. This breach triggers a change to the ALARM state. The alarm's event history lists the aggregated values exceeding the threshold as the reason for the state change.

When the alarm is evaluated again later, additional values have flowed in for the minute 12:00:00 - 12:01:00 UTC. For example:

Sample-1: 12:00:07 UTC, numeric value: 89.76470588235294
Sample-2: 12:00:11 UTC, numeric value: 27.926666666666664
Sample-3: 12:00:19 UTC, numeric value: 54.57142857142857
Sample-4: 12:00:35 UTC, numeric value: 95.473333333333336
Sample-5: 12:00:37 UTC, numeric value: 15.18181818181819
Sample-6: 12:00:41 UTC, numeric value: 10.26490

The average including the new values is 48.864, which doesn't breach the threshold of 65%. The alarm now changes to the OK state. The alarm's event history lists the aggregated values being below the threshold as the reason for the state change.
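The arithmetic in the two evaluations above can be reproduced with a minimal Python sketch. The variable and function names are illustrative, and the greater-than comparison is an assumption about how the example alarm is configured; the sample values and the threshold come from the example.

```python
from statistics import fmean

THRESHOLD = 65.0  # percent, from the example alarm configuration

# Values available at the first evaluation of 12:00:00 - 12:01:00 UTC
early_samples = [
    89.76470588235294,
    27.926666666666664,
    54.57142857142857,
    95.473333333333336,
]
# Values that flowed in later for the same minute
late_samples = [15.18181818181819, 10.26490]


def evaluate_average(samples, threshold=THRESHOLD):
    """Return (average, breaching), assuming a greater-than comparison."""
    average = fmean(samples)
    return average, average > threshold


first_avg, first_breach = evaluate_average(early_samples)
final_avg, final_breach = evaluate_average(early_samples + late_samples)

print(f"first evaluation: {first_avg:.3f} breaching={first_breach}")
print(f"later evaluation: {final_avg:.3f} breaching={final_breach}")
```

The same one-minute period produces a breaching average with four values and a non-breaching average with six, which is why the graph and the event history can disagree.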

You might not see the breaching data point in your CloudWatch metric's graph now, even though the alarm triggered. If you view the CPUUtilization metric's graph, the Average is listed as 48.864 (not 66.934). All relevant samples for evaluation have now flowed into the metric.

If you change the CloudWatch metric graph's Statistic to Maximum, you can see the breaching data point 95.473 at 12:00:00 UTC.

Note: If your alarm is configured to trigger when data falls below the threshold, change the CloudWatch metric graph's Statistic to Minimum.
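To illustrate why switching the Statistic helps, the following sketch computes Average, Maximum, and Minimum over the six example values for the minute that starts at 12:00:00 UTC. It is only an approximation of what the graph shows; the dictionary layout and names are illustrative assumptions.

```python
THRESHOLD = 65.0  # percent, from the example alarm configuration

# All six example values for the minute that starts at 12:00:00 UTC
samples = [
    89.76470588235294,
    27.926666666666664,
    54.57142857142857,
    95.473333333333336,
    15.18181818181819,
    10.26490,
]

# One aggregated value per statistic for the same 60-second period
period_statistics = {
    "Average": sum(samples) / len(samples),
    "Maximum": max(samples),
    "Minimum": min(samples),
}

for name, value in period_statistics.items():
    relation = "above" if value > THRESHOLD else "not above"
    print(f"{name}: {value:.3f} ({relation} the threshold)")
```

Only the Maximum stays above 65% after all values have arrived, so it is the statistic that still reveals the high values behind the original breach.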

Configure an "M out of N" alarm

To prevent an alarm from changing to the ALARM state because of a single breaching data point, configure an "M out of N" alarm where Evaluation Period and Datapoints to Alarm have different values. This configuration makes the alarm evaluate more aggregated data points and changes the alarm state only if at least a certain number of data points (M) are breaching in a given set of data points (N). For more information, see Create a CloudWatch alarm based on a static threshold and Configuring how CloudWatch alarms treat missing data.

Example alarm configuration:

  • Standard resolution alarm (evaluates the metric every minute)
  • Metric is CPUUtilization
  • Threshold is 65%
  • Statistic is Average
  • Period is 60 seconds
  • Datapoints to Alarm is 2 out of 3 (Evaluation Periods is 3)
  • Detailed Monitoring is enabled for the monitored Amazon EC2 instance

Note that the example alarm configuration is similar to the previous example. However, the alarm checks whether 2 out of the 3 most recent data points breach the threshold before it changes to the ALARM state. The period stays at 60 seconds, but the evaluation window now spans three periods instead of one.
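As a rough sketch, an alarm like this one could be expressed as parameters for the CloudWatch PutMetricAlarm API, here built as a plain dictionary for boto3's put_metric_alarm. The alarm name, the instance ID, and the GreaterThanThreshold operator are assumptions for illustration. Period is set to 60 seconds to match the one-minute data points used in this example.

```python
def build_alarm_params(instance_id: str) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm (2 out of 3)."""
    return {
        "AlarmName": f"cpu-high-{instance_id}",  # hypothetical alarm name
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 60,              # seconds per aggregated data point
        "EvaluationPeriods": 3,    # N: data points considered
        "DatapointsToAlarm": 2,    # M: breaching data points required
        "Threshold": 65.0,
        "ComparisonOperator": "GreaterThanThreshold",  # assumed operator
    }


# Hypothetical instance ID, used only for illustration
params = build_alarm_params("i-0123456789abcdef0")

# With boto3 installed and AWS credentials configured, the call would be:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["DatapointsToAlarm"], "out of", params["EvaluationPeriods"])
```

Keeping the parameters in a dictionary makes it easy to review the M and N values before anything is sent to CloudWatch.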

When the alarm evaluates the period that starts at 12:00:00 UTC, the metric has received the following values:

Sample-1: 12:00:07 UTC, numeric value: 89.76470588235294
Sample-2: 12:00:11 UTC, numeric value: 27.926666666666664
Sample-3: 12:00:19 UTC, numeric value: 54.57142857142857
Sample-4: 12:00:35 UTC, numeric value: 95.473333333333336

CloudWatch looks for data points that are older than 12:00:00 UTC because of the increased evaluation period:

11:58:00 UTC, Average=41.874304539920
11:59:00 UTC, Average=5.230773650991253
12:00:00 UTC, Average=66.93403361344538

The aggregated data point at 12:00:00 UTC breaches the threshold. However, the alarm remains in the OK state and doesn't change to the ALARM state. This behavior happens because only one out of three data points breaches the threshold, whereas two out of three are required to trigger the alarm.
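The "2 out of 3" decision can be sketched in Python as follows. This is a simplified model, not CloudWatch's actual evaluation logic: it assumes a greater-than comparison and ignores missing data and the INSUFFICIENT_DATA state. The function name is hypothetical.

```python
def m_out_of_n_state(datapoints, threshold, datapoints_to_alarm):
    """Return "ALARM" if at least M of the N data points exceed the threshold."""
    breaching = sum(1 for value in datapoints if value > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


# Aggregated averages for 11:58:00, 11:59:00, and 12:00:00 UTC
window = [41.874304539920, 5.230773650991253, 66.93403361344538]

state = m_out_of_n_state(window, threshold=65.0, datapoints_to_alarm=2)
print(state)  # OK: only one of the three data points breaches
```

With datapoints_to_alarm set to 1, the same window would produce ALARM, which corresponds to the behavior of the first example.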

