Extending and exploring alarm history in Amazon CloudWatch – part 1

Alarm history data can be invaluable in diagnosing trends, impacts, and root causes for issues in your application. In this two-part blog series, we will demonstrate how to move beyond the standard 14-day alarm history and turn your Amazon CloudWatch alarm state changes into logs and metrics that you can graph on your CloudWatch dashboards and see alongside your other observability data.

In this first post, we focus on creating the log and metric data, and in part 2 we will look at some useful ways to include this data on CloudWatch dashboards.

Alarm History through the AWS Management Console

CloudWatch alarms come with a 14-day history. This is available in the CloudWatch console under Alarms by selecting any of your alarms. Figure 1 shows an example history with the initial alarm creation and alarm state changes (e.g., from “In alarm” to “OK”). This history also shows any configuration updates (e.g., changing the evaluation period) and actions (e.g., an SNS notification).

CloudWatch alarm history information showing the initial alarm creation and State updates as the alarm changes between In alarm and OK.

Figure 1: CloudWatch alarm history widget showing recent events

You may want to see the alarm history trends, or see this data alongside other CloudWatch data, in order to be able to diagnose trends, impacts, and root causes. Typically only those who must take immediate action receive alarm data, but it is also useful for those troubleshooting, planning future developments, or understanding user experience. Including alarm data on your dashboards opens up the data to a wider audience and broader utilization.

What are we going to do?

We will create the following:

A CloudWatch alarm.
An Amazon EventBridge rule to capture the alarm state changes and create a CloudWatch log event.
A metric filter on the CloudWatch log group in order to create numerical metric data.
A CloudWatch dashboard with your desired data displays (in part 2).

Create the CloudWatch alarm

We will create an alarm that triggers on low CPU utilization on an Amazon Elastic Compute Cloud (Amazon EC2) instance. This is an artificial example, but it is used because it is easily set up and triggered for exploration purposes.

Alarms are created from metrics, so we must first generate some metrics. Create an Amazon EC2 instance and start the instance. Note the instance ID, as you will utilize it in the next step.

For guidance on how to do this, see Step 1: Launch an instance in Tutorial: Get started with Amazon EC2 Linux instances. You don’t need to be able to connect to your instance. You only need to be able to stop and start it in order to trigger the CloudWatch alarm.

In the CloudWatch console, choose Alarms > “In alarm” or “All alarms” from the CloudWatch menu. From here, choose Create alarm.
Choose Select Metric > EC2 > Per-Instance Metrics. Select CPUUtilization for your instance (with the instance ID noted above), and choose Select Metric. If you don’t see any Metrics under EC2, or don’t see your instance, return to the Amazon EC2 console and check that the instance is running. The first metric data point may take a few minutes to generate.
Complete the fields as detailed below and shown in Figure 2.

Metric
- Namespace, Metric name, and InstanceId are prefilled based on your metric selection
- Statistic: Average
- Period: 5 minutes
  - By default, CPUUtilization will report a value every five minutes
Conditions
- Threshold type: Static
- Whenever CPUUtilization is… Lower
- than… 50
Additional configuration
- Datapoints to alarm: 1 out of 1
- Missing data treatment: Treat missing data as good (not breaching threshold)
  - This allows the alarm to show as OK when the instance is stopped.

CloudWatch alarm configuration for Average CPUUtilization

Conditions set to a Static threshold of lower than 50%.

Figure 2: CloudWatch alarm with metrics and conditions defined
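
If you would rather script the alarm than click through the console, the settings above map onto the parameters of the CloudWatch PutMetricAlarm API. The following is a minimal sketch that only builds the parameter dictionary. The instance ID is a hypothetical placeholder, and applying the result through boto3 is an assumption about your tooling rather than part of this walkthrough.

```python
def build_low_cpu_alarm_params(instance_id: str) -> dict:
    """Return PutMetricAlarm parameters mirroring the console settings above."""
    return {
        "AlarmName": "EC2-low_CPU",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                              # 5 minutes, in seconds
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,                     # "1 out of 1"
        "Threshold": 50.0,
        "ComparisonOperator": "LessThanThreshold",  # "Lower than 50"
        "TreatMissingData": "notBreaching",         # treat missing data as good
    }


# Hypothetical instance ID; replace it with the ID you noted earlier.
params = build_low_cpu_alarm_params("i-0123456789abcdef0")

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

No alarm actions are listed, which matches removing the notification in the next step.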

Choose Next in order to move to the Configure actions step.

You can set notification actions, such as sending you an email through an SNS topic, but for this example choose Remove so that the alarm has no notifications. Leave all other defaults, and choose Next.

Name your alarm EC2-low_CPU, and choose Review and Create alarm.

When your instance is stopped, the alarm receives no data. Because missing data is treated as good, the alarm will be in the OK state. When the instance is running, the low CPUUtilization will trigger the In alarm state. We will test this after creating the EventBridge rule.
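
The expected behavior can be modeled in a few lines. This is a deliberately simplified sketch of a 1-out-of-1 alarm with missing data treated as not breaching. It ignores timing and how CloudWatch actually selects data points for evaluation, and the function name is hypothetical.

```python
from typing import Optional

THRESHOLD = 50.0  # percent, as configured for the EC2-low_CPU alarm


def expected_alarm_state(avg_cpu: Optional[float]) -> str:
    """Return the state we expect for one 5-minute average (None = no data)."""
    if avg_cpu is None:
        # Instance stopped: no data point, treated as not breaching.
        return "OK"
    return "ALARM" if avg_cpu < THRESHOLD else "OK"


print(expected_alarm_state(None))   # stopped instance
print(expected_alarm_state(3.2))    # running, mostly idle instance
print(expected_alarm_state(75.0))   # running, busy instance
```

An idle running instance sits well below the 50% threshold, which is why simply starting the instance is enough to put this alarm into the ALARM state.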

For more details on alarm states, and the different properties you set when creating an alarm, look at the documentation on Using CloudWatch alarms and Create an alarm based on a static threshold.

Create the EventBridge rule

The EventBridge rule will capture alarm state changes for all CloudWatch alarms. This rule has an action that will create a corresponding CloudWatch log event.

In the EventBridge console, choose Create rule, and give it the name Alarm_History-multiple_alarms.
Complete the fields as detailed below and shown in Figure 3.
- Define pattern
  - Select Event pattern
  - Event matching pattern: Pre-defined pattern by service
  - Service provider: AWS
  - Service name: CloudWatch
  - Event type: CloudWatch Alarm State Change
- Select targets
  - Target: CloudWatch log group
  - Log group name: /aws/events/alarms/
    - Specify any log group name you wish
Leave all other settings as default, and Create the EventBridge rule.

EventBridge rule configuration for a predefined pattern for CloudWatch Alarm State Change

A target of CloudWatch log group /aws/events/alarms/.

Figure 3: EventBridge rule with pattern and targets defined
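
The pre-defined pattern selected above corresponds to a small JSON event pattern. The sketch below shows that pattern and checks it against a trimmed sample event. The matcher is a simplification that handles only top-level exact matches (real EventBridge matching supports much more), and the sample event contains only the fields used later in this post, with illustrative values.

```python
import json

# Event pattern equivalent to the pre-defined CloudWatch Alarm State Change pattern.
EVENT_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher: every top-level pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


# Trimmed, illustrative event; real events carry additional fields.
sample_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "EC2-low_CPU",
        "state": {"value": "ALARM"},
        "previousState": {"value": "OK"},
    },
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, sample_event))
```

Because the pattern does not filter on alarm name, every alarm in the account and Region is captured by this one rule.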

Create the metric filters

How you define your metrics from your logs will depend on what you want to achieve. In this case, we want a single metric per alarm that contains the state of the alarm – a 1 as it goes to the ALARM state, and a 0 as it goes into the OK state.

This requires two metric filters – one to capture each state. Both metric filters will write data to the same metric.

In the CloudWatch console, navigate to Log groups and select /aws/events/alarms/.
Choose the Metric filters tab, and choose Create metric filter for the ALARM state with the values detailed below and shown in Figure 4.

- Filter pattern: {$.detail.alarmName=* && $.detail.state.value="ALARM"}
- Filter name: Alarm-history-state-alarm
- Metric namespace: CWAlarms
- Metric name: state
- Metric value: 1
- Dimensions:
  - Dimension Name: alarmName
  - Dimension Value: $.detail.alarmName

Choose Next and Create metric filter.

For more details and examples, see the AWS documentation on Creating metric filters.

Specify the filter pattern

Metric filter configuration for a filter detecting ALARM state, for metric named state in a namespace of CWAlarms. The metric value is 1 and has a dimension for alarmName.

Define dimensions, enter Dimension name and Dimension value.

Figure 4: Metric filter with pattern, metric details, and dimensions defined

This filter pattern matches every ALARM state event in this log group. Because the pattern uses the JSON syntax, we can also define dimensions: the value of the alarmName dimension ($.detail.alarmName) is extracted from each log event. For more information on filter patterns, see the AWS documentation on Filter and pattern syntax.
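
For reference, the console values for the ALARM filter correspond to the parameters of the CloudWatch Logs PutMetricFilter API. The helper below is a hypothetical sketch that only builds the parameter dictionary. Sending it through boto3 is an assumption about your tooling.

```python
LOG_GROUP = "/aws/events/alarms/"


def build_state_filter_params(state: str, metric_value: int, filter_name: str) -> dict:
    """Return PutMetricFilter parameters for one alarm state."""
    return {
        "logGroupName": LOG_GROUP,
        "filterName": filter_name,
        "filterPattern": '{$.detail.alarmName=* && $.detail.state.value="%s"}' % state,
        "metricTransformations": [
            {
                "metricNamespace": "CWAlarms",
                "metricName": "state",
                "metricValue": str(metric_value),  # the API expects a string
                # Dimension value is a JSON selector resolved per log event.
                "dimensions": {"alarmName": "$.detail.alarmName"},
            }
        ],
    }


alarm_filter = build_state_filter_params("ALARM", 1, "Alarm-history-state-alarm")

# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("logs").put_metric_filter(**alarm_filter)
```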

Create a second metric filter for the OK state in the same log group (/aws/events/alarms/). This time, the metric filter is for the OK state with a metric value of 0.

Choose the Metric filters tab and Create metric filter with the following values:

- Filter pattern: {$.detail.alarmName=* && $.detail.state.value="OK"}
- Filter name: Alarm-history-state-ok
- Metric namespace: CWAlarms
- Metric name: state
- Metric value: 0
- Dimensions:
  - Dimension Name: alarmName
  - Dimension Value: $.detail.alarmName
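
To illustrate how the two filters feed one metric, the following sketch mimics their combined effect on a single log event. It is a simplified local model, not the CloudWatch Logs filter engine, and the function name is hypothetical.

```python
import json
from typing import Optional

# One entry per metric filter: matched state -> emitted metric value.
STATE_TO_VALUE = {"ALARM": 1, "OK": 0}


def metric_from_log_event(message: str) -> Optional[dict]:
    """Return the data point the filters would emit, or None if neither matches."""
    try:
        event = json.loads(message)
    except ValueError:
        return None  # not JSON, so a JSON filter pattern cannot match
    detail = event.get("detail") if isinstance(event, dict) else None
    if not isinstance(detail, dict) or "alarmName" not in detail:
        return None
    state = detail.get("state")
    value = STATE_TO_VALUE.get(state.get("value")) if isinstance(state, dict) else None
    if value is None:
        return None  # e.g. INSUFFICIENT_DATA: no filter defined for it
    return {
        "namespace": "CWAlarms",
        "metric": "state",
        "value": value,
        "dimensions": {"alarmName": detail["alarmName"]},
    }


event = {"detail": {"alarmName": "EC2-low_CPU", "state": {"value": "ALARM"}}}
print(metric_from_log_event(json.dumps(event)))
```

Events for states other than ALARM and OK produce no data point, so the metric only changes when one of the two filters matches.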

Trigger the CloudWatch alarm and EventBridge rule

Change the state of your CloudWatch alarm by stopping and starting your instance. The alarm state should be OK when the instance is stopped, and ALARM when the instance is running.

Note: Amazon EC2 metric data reports every five minutes by default, so you may have to wait up to five minutes until a data point is generated by your instance.

Check the status of your CloudWatch alarm in the CloudWatch console by choosing Alarms > All alarms and selecting EC2-low_CPU. Figure 5 shows the alarm overview screen with a time graph of the alarm metric and threshold. A red/green colored bar along the bottom shows the alarm state over time. The current state is displayed in the top right of the graph. Refresh this screen until you see the alarm change state.

CloudWatch alarm showing a timechart of the CPUUtilization metric value, along with the change of state as a colored bar.

Figure 5: CloudWatch alarm showing the time graph of the alarm state

Check your logs

Each time the CloudWatch alarm changes state, the EventBridge rule will capture this and create a CloudWatch log event under the /aws/events/alarms/ log group.

Check that log events are being created in the CloudWatch console by choosing Logs > Log groups > /aws/events/alarms/. Select the latest log stream to view the event.

You will see multiple log streams created as your alarm changes state. One way to see the data from these streams together is to utilize CloudWatch Logs Insights. From the CloudWatch console, select Logs Insights. In the Select log group(s) dropdown, select the log group /aws/events/alarms/, and replace the query with the following:

fields @timestamp, detail.alarmName, detail.state.value, @message

Then choose Run query.

Figure 6 shows the query and results returned. If your search returns no results, then utilize the time picker to change the time period that you are searching over, and choose Run query again.
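
The same query can also be submitted programmatically through the CloudWatch Logs StartQuery API, which takes the time range as epoch seconds. The sketch below only builds the request parameters. The helper name and the three-hour window are illustrative assumptions.

```python
import time
from typing import Optional

QUERY = "fields @timestamp, detail.alarmName, detail.state.value, @message"


def build_start_query_params(hours_back: int, now: Optional[int] = None) -> dict:
    """Return StartQuery parameters covering the last `hours_back` hours."""
    end = now if now is not None else int(time.time())
    return {
        "logGroupName": "/aws/events/alarms/",
        "startTime": end - hours_back * 3600,  # epoch seconds
        "endTime": end,
        "queryString": QUERY,
    }


query_params = build_start_query_params(hours_back=3)

# With boto3 installed and credentials configured, this could be run as:
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**query_params)["queryId"]
#   results = logs.get_query_results(queryId=query_id)
```

Widening `hours_back` plays the same role as changing the time picker in the console.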

Figure 6: Results from a Logs Insights query

Check your metrics

In the CloudWatch console, navigate to Metrics > All Metrics, and select your metric Namespace (CWAlarms) and dimension (alarmName). You will see a metric for each alarmName. Select the checkbox beside each metric in order to plot the results.

CloudWatch metrics showing a timechart of the metric against two different alarms. The timechart shows the metric value for the EC2-low_CPU alarm, which changes between values of 1 and 0.

Figure 7: CloudWatch metrics console showing the time graph created from a metric filter
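
If you want to retrieve the same series outside the console, the metric can be requested through the CloudWatch GetMetricData API. The following sketch builds the request for one alarm. The helper name, the three-hour window, and the choice of the Maximum statistic over 5-minute periods are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_state_metric_query(alarm_name: str, hours_back: int = 3,
                             now: Optional[datetime] = None) -> dict:
    """Return GetMetricData parameters for the CWAlarms 'state' metric of one alarm."""
    end = now if now is not None else datetime.now(timezone.utc)
    return {
        "MetricDataQueries": [
            {
                "Id": "state",  # query IDs must start with a lowercase letter
                "MetricStat": {
                    "Metric": {
                        "Namespace": "CWAlarms",
                        "MetricName": "state",
                        "Dimensions": [{"Name": "alarmName", "Value": alarm_name}],
                    },
                    "Period": 300,
                    "Stat": "Maximum",  # assumption: 1 if any ALARM event fell in the period
                },
                "ReturnData": True,
            }
        ],
        "StartTime": end - timedelta(hours=hours_back),
        "EndTime": end,
    }


metric_query = build_state_metric_query("EC2-low_CPU")

# With boto3 installed and credentials configured, this could be run as:
#   boto3.client("cloudwatch").get_metric_data(**metric_query)
```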

Pricing

This example utilizes Amazon EC2, EventBridge, and CloudWatch logs and metrics resources.

Utilizing EventBridge is free in this example, as all events are between AWS services. A t2.micro Amazon EC2 instance may fall within your free tier, along with the CloudWatch use (logs, metrics, and alarms). See the AWS Free Tier page: https://aws.amazon.com/free/

For more information, see the pricing pages for each service:

Amazon EC2 on demand: https://aws.amazon.com/ec2/pricing/on-demand/
CloudWatch: https://aws.amazon.com/cloudwatch/pricing/
EventBridge: https://aws.amazon.com/eventbridge/pricing/

Cleanup

To avoid charges in your account, delete the resources that you created.

Amazon EC2 instance: see the documentation for Terminate your instance.
EventBridge rule: see the documentation on Disabling or deleting an Amazon EventBridge rule.

In part 2, we will utilize the data from the CloudWatch logs and metrics in order to create dashboards, so you may wish to keep this data to use then.

To delete your CloudWatch log groups, navigate to Logs > Log groups and select the appropriate group. From the Actions dropdown, select Delete log group(s).

Metrics cannot be deleted, but they will expire based on the retention schedule, as explained in the FAQ What is the retention period of all metrics.

Conclusion

This post has demonstrated how to utilize an EventBridge rule in order to create CloudWatch Logs and metrics from a change in state of your CloudWatch alarms.

In part 2, we will utilize this data and show you some ways to include it in your CloudWatch dashboards by using alarm widgets, Logs Insights queries, and metric math expressions. We will explore widgets that support understanding the impact of different alarms and their correlation with other CloudWatch data.

AWS Cloud Operations & Migrations Blog