Why is my CloudWatch alarm in INSUFFICIENT_DATA state?

Last updated: 2020-03-26

My Amazon CloudWatch alarm is in INSUFFICIENT_DATA state. How can I find out what's causing this?

Resolution

The INSUFFICIENT_DATA state can indicate any of the following:

  • An Amazon CloudWatch alarm just started
  • The metric is unavailable
  • There's not enough data for the metric to determine the alarm state

When your alarm is unexpectedly in the INSUFFICIENT_DATA state, review the troubleshooting steps below for some of the most common causes.

Normal metric behavior

An alarm in INSUFFICIENT_DATA state might simply reflect the normal behavior of a metric. There are two types of metrics, based on how they are pushed to CloudWatch: period-driven and event-driven. Some services send data points to their metrics at regular intervals, but certain metrics can have periods without any data points. For example, the CPUUtilization metric of an EC2 instance has a data point in every period while the instance is running. However, if you stop the instance, the service doesn't push any data points to it. Another example is the HTTPCode_ELB_5XX_Count metric for an Application Load Balancer. The service sends a data point when there's an error (or an event). If there are no errors during a period, the result is an empty dataset rather than a zero value.

If an alarm monitors a metric that, by design, has no data points during certain periods, the alarm's state is INSUFFICIENT_DATA during those periods. To force the alarm into the ALARM or OK state instead, configure how the alarm treats these periods without data points, as in the sketch that follows.
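For example, the following Python (boto3) sketch creates a placeholder alarm on the HTTPCode_ELB_5XX_Count metric and sets TreatMissingData to notBreaching, so that periods without data points count as not breaching the threshold instead of leaving the alarm in INSUFFICIENT_DATA. The alarm name, load balancer dimension, and threshold are example values only.

   import boto3

   cloudwatch = boto3.client("cloudwatch")

   # Placeholder alarm on an event-driven metric. TreatMissingData="notBreaching"
   # treats periods without data points as within the threshold instead of
   # leaving the alarm in INSUFFICIENT_DATA.
   cloudwatch.put_metric_alarm(
       AlarmName="alb-5xx-errors",
       Namespace="AWS/ApplicationELB",
       MetricName="HTTPCode_ELB_5XX_Count",
       Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-load-balancer/1234567890abcdef"}],
       Statistic="Sum",
       Period=60,
       EvaluationPeriods=5,
       Threshold=10,
       ComparisonOperator="GreaterThanThreshold",
       TreatMissingData="notBreaching",  # other options: "breaching", "ignore", "missing"
   )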

Incorrect CloudWatch alarm settings

Each metric is defined by a namespace, a name, and up to ten dimensions, and each data point has a timestamp and, optionally, a unit of measure. If you provide an incorrect value for any of these parameters when you create the alarm, CloudWatch attempts to retrieve a metric that doesn't exist. The result is an empty dataset.

Note: Data points are usually pushed to a metric with a single unit, and you aren't required to specify a unit when creating an alarm. If you don't specify a unit, the alarm isn't affected by an incorrect unit configuration. However, if the data points in your metric have multiple units, it's a best practice to specify the correct unit in the alarm.

Use the DescribeAlarms API to get the complete list of parameters for the metrics that your alarms monitor, and compare it with the ListMetrics output (see the sketch after this list). Check the parameters for:

  • Misspellings and improper use of uppercase and lowercase letters (metrics are case sensitive)
  • Incorrectly specified dimensions or units
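Assuming the alarm monitors a single metric, a minimal boto3 sketch of this comparison might look like the following; the alarm name is a placeholder.

   import boto3

   cloudwatch = boto3.client("cloudwatch")

   # Retrieve the metric parameters that a placeholder alarm monitors.
   alarm = cloudwatch.describe_alarms(AlarmNames=["my-alarm"])["MetricAlarms"][0]
   print("Alarm monitors:", alarm["Namespace"], alarm["MetricName"], alarm["Dimensions"])

   # Check whether a metric with exactly these parameters exists.
   # Namespace, name, and dimensions are matched case sensitively.
   # Note: ListMetrics only returns metrics that received data recently.
   matches = cloudwatch.list_metrics(
       Namespace=alarm["Namespace"],
       MetricName=alarm["MetricName"],
       Dimensions=[{"Name": d["Name"], "Value": d["Value"]} for d in alarm["Dimensions"]],
   )["Metrics"]

   if not matches:
       print("No matching metric found - check spelling, case, and dimensions.")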

Incorrectly configured alarm periods

You can configure an alarm to evaluate data points at your desired frequency. However, you might get undesired states if the alarm's period is shorter than the interval at which the service (or source) sends data points to the metric. To avoid unwanted INSUFFICIENT_DATA states, it's a best practice to set the alarm's period to be at least as long as the interval at which the metric's data points are pushed. You can also use the alarm's "M out of N" settings, as shown below.
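As a rough sketch of both recommendations, the following boto3 call (with placeholder names and thresholds) sets the alarm period to match a metric that receives one data point every 5 minutes, and uses DatapointsToAlarm together with EvaluationPeriods for an "M out of N" evaluation.

   import boto3

   cloudwatch = boto3.client("cloudwatch")

   # Placeholder alarm on a metric that is pushed every 5 minutes.
   # Period matches the metric's publication interval, and
   # DatapointsToAlarm / EvaluationPeriods implement "M out of N".
   cloudwatch.put_metric_alarm(
       AlarmName="cpu-high",
       Namespace="AWS/EC2",
       MetricName="CPUUtilization",
       Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
       Statistic="Average",
       Period=300,           # metric receives a data point every 5 minutes
       EvaluationPeriods=3,  # N: evaluate the last 3 periods
       DatapointsToAlarm=2,  # M: go to ALARM if 2 of those 3 breach
       Threshold=80,
       ComparisonOperator="GreaterThanThreshold",
   )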

Delayed delivery of data points

Depending on when data points arrive in CloudWatch, you might experience unexpected INSUFFICIENT_DATA states on an alarm monitoring a metric.

For example, suppose you have a custom application that sends data points from software deployed on an EC2 instance to a custom metric. To avoid losing data, you configure the application to retry failed API calls. Due to an external factor (for example, a change to the VPC settings), the instance loses connectivity with CloudWatch. In this scenario, your environment still generates data, but the API calls that send the data points keep failing.
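As an illustration only, a publisher like the one described might look like the following sketch; the namespace, metric name, and retry logic are hypothetical.

   import time
   from datetime import datetime, timezone

   import boto3
   from botocore.exceptions import ClientError, EndpointConnectionError

   cloudwatch = boto3.client("cloudwatch")

   def publish(value, retries=5):
       # Keep the original timestamp so delayed (retried) data points are
       # backfilled into the correct period once connectivity returns.
       timestamp = datetime.now(timezone.utc)
       for attempt in range(retries):
           try:
               cloudwatch.put_metric_data(
                   Namespace="MyApp",  # hypothetical custom namespace
                   MetricData=[{"MetricName": "QueueDepth", "Value": value, "Timestamp": timestamp}],
               )
               return
           except (ClientError, EndpointConnectionError):
               # While the instance has no connectivity to CloudWatch, these
               # calls keep failing and the metric shows a gap that the alarm
               # evaluates as missing data.
               time.sleep(2 ** attempt)
       raise RuntimeError("Could not deliver data point to CloudWatch")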

If you have set up a standard alarm, the alarm evaluates the metric every minute. During each evaluation, the alarm retrieves the latest data points from the configured metric. While connectivity is down, the alarm keeps evaluating the metric. Because the data points aren't being delivered, the alarm can't retrieve any data points for those evaluation periods, which triggers the INSUFFICIENT_DATA state.

After connectivity is restored, the application sends the backlog of data points, each one with its own timestamp. Because the data points arrive after this delay, the alarm can again retrieve recent data points based on the period and evaluation periods that you specified, and it behaves as expected. At this point you no longer see gaps in the metric, because the delayed data points are now stored in CloudWatch. However, because the alarm has already evaluated that time frame, the alarm history still shows a message similar to the following:

   [...] 
   "stateValue": "INSUFFICIENT_DATA",
   "stateReason": "Insufficient Data: 2 datapoints were unknown.",
   [...]
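To review these history entries yourself, you can call the DescribeAlarmHistory API, as in this boto3 sketch (the alarm name is a placeholder). The HistoryData field is a JSON document that includes the stateValue and stateReason shown above.

   import boto3

   cloudwatch = boto3.client("cloudwatch")

   # Print recent state transitions for a placeholder alarm.
   history = cloudwatch.describe_alarm_history(
       AlarmName="my-alarm",
       HistoryItemType="StateUpdate",
       MaxRecords=20,
   )
   for item in history["AlarmHistoryItems"]:
       print(item["Timestamp"], item["HistoryData"])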

If you don't want the alarm to be in INSUFFICIENT_DATA state, you can change how the alarm treats missing data.

