Extending and exploring alarm history in Amazon CloudWatch – part 2
To diagnose impacts and root causes, you may want to see trends in alarm history or visualize this data alongside other CloudWatch data. Typically, only those who must take immediate action receive alarm data, but it is also useful to those who are troubleshooting, planning future developments, or understanding user experience. Including alarm data on your dashboards opens it up to a wider audience and broader utilization.
This post will utilize the logs and metrics created in part 1, and show you ways to include this data in your CloudWatch dashboards. Figure 1 shows the dashboard we will create in order to provide an alarm overview. You may want to utilize individual widgets alongside other data in dashboards that focus on specific applications or workflows.
The example dashboard here shows data from additional alarms in order to demonstrate what a more populated dashboard may look like. If you want to utilize multiple alarms with the widgets described in this post, then you should send all log events to the same log group (/aws/events/alarms/), and all alarm metrics to the same namespace (CWAlarms) with a dimension of alarmName. For details on how to do this, see part 1 of this series.
Figure 1: Example CloudWatch dashboard for alarms.
This post will continue to work with the EC2-low_CPU alarm data from part 1. If you have not generated any data, then turn on your Amazon Elastic Compute Cloud (Amazon EC2) instance, leave it running for 10-15 minutes, and turn off the Amazon EC2 instance. Do this a few times in order to generate some data.
Create a dashboard
- In the CloudWatch console, choose Dashboards. Choose Create dashboard, type the name alarm-history, and select Create Dashboard.
- At this point, you will see a pop-up to choose the widget type that you want to add. Choose Cancel in order to get an empty dashboard.
For more details, see the documentation on Creating a CloudWatch dashboard.
Current alarm state widgets
Figure 2 shows the CloudWatch Alarm status widget, including the current state of the alarm.
Figure 2: Alarm state dashboard widget.
- From your dashboard, choose Add widget, select the Alarm status type, and click Next.
- Select the checkboxes beside the alarms that you want to include, and click Create widget.
- In Select a dashboard, select the alarm-history dashboard from the dropdown.
- In Customize the widget title, add a widget title of Current alarm state.
- Choose Add to dashboard.
You can have more than one alarm per widget, and more than one alarm status widget per dashboard. This lets you group alarms as they make sense to your users and the dashboard context.
For more details, see the documentation on Add an alarm widget to a CloudWatch dashboard.
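If you prefer to define the widget in code rather than in the console, the same alarm status widget can be described in a dashboard body. The following is a minimal sketch, assuming the dashboard body JSON format with a widget of type alarm. The ARN, Region, account ID, and widget size are hypothetical placeholders, and the function name is ours, not part of any AWS library.

```python
import json

# Hypothetical ARN: replace the Region and account ID with your own values.
ALARM_ARN = "arn:aws:cloudwatch:us-east-1:123456789012:alarm:EC2-low_CPU"


def build_alarm_status_dashboard(alarm_arns, title="Current alarm state"):
    """Return a dashboard body (JSON string) holding one alarm status widget."""
    widget = {
        "type": "alarm",
        # Position and size on the dashboard grid (illustrative values).
        "x": 0,
        "y": 0,
        "width": 12,
        "height": 3,
        "properties": {"title": title, "alarms": list(alarm_arns)},
    }
    return json.dumps({"widgets": [widget]})


dashboard_body = build_alarm_status_dashboard([ALARM_ARN])
print(dashboard_body)
```

A body like this could then be passed to the CloudWatch PutDashboard API, for example with boto3's put_dashboard(DashboardName="alarm-history", DashboardBody=dashboard_body). That call needs AWS credentials, so it is left out of the sketch.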
The next two widgets show summaries of log data in a table format. To create widgets using the log data, we must have an understanding of the log format, as well as what fields are present.
- In the CloudWatch console, select Logs Insights. In the Select log group(s) dropdown, select /aws/events/alarms/. Replace the query with the following, and then select Run query.
If your search returns no results, then utilize the time picker in order to change the time period you are searching over, and then choose Run query again.
- Select the arrow on the left side of a row in order to see the full event.
Figure 3 shows an example event. The event is in the JSON format, and you can see an expanded view of the event with field names on the left, as well as their associated values on the right.
Figure 3: Log Insights search result showing full JSON log event including field names and values.
When was the last alarm? (Last ALARM state widget)
Here we want to show the last occurrence of the ALARM state within our search period. Figure 4 shows the last time our EC2-low_CPU alarm state changed to In Alarm.
Figure 4: Latest time each alarm went into the alarm state.
- In the CloudWatch console, choose Logs Insights. In the Select log group(s) dropdown, select /aws/events/alarms/. Choose an appropriate time period. Replace the query with the following, and then select Run query.
Logs Insights queries contain one or more query commands separated by a pipe character (|). The first command in this search (filter) finds all events with the ALARM state. The next command finds the most recent (latest) of these events for each alarm, and then displays the timestamp and alarm name. See the documentation for more information on CloudWatch Logs Insights query syntax.
- Select Run query in order to see the results.
- Choose Add to dashboard. In Select a dashboard, select the alarm-history dashboard from the dropdown. In Customize widget title, add a title of Last ALARM state. Choose Add to dashboard.
Now you can choose the position and size of your widget on your dashboard. For more information, see the documentation on Move or resize a graph on a CloudWatch dashboard.
- Choose Save dashboard in order to keep this widget on the dashboard.
As you add widgets to your dashboard, note that the search time does not come from the widget you added, but from the time picker on the top right of the dashboard. If you do not see results in your widgets, then modify this accordingly. For more details see Change the time range or time zone format of a CloudWatch dashboard.
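To make the logic of the Last ALARM state query concrete, here is a small Python sketch of the same two steps: filter to events in the ALARM state, then keep the latest timestamp per alarm. It is an illustration only, not the Logs Insights query itself. The sample events and the flat field names (timestamp, alarmName, state) are hypothetical, and real events from part 1 are nested JSON with more fields.

```python
from datetime import datetime

# Hypothetical, simplified events for illustration.
events = [
    {"timestamp": "2022-03-01T11:25:00+00:00", "alarmName": "EC2-low_CPU", "state": "ALARM"},
    {"timestamp": "2022-03-01T11:35:00+00:00", "alarmName": "EC2-low_CPU", "state": "OK"},
    {"timestamp": "2022-03-01T12:10:00+00:00", "alarmName": "EC2-low_CPU", "state": "ALARM"},
    {"timestamp": "2022-03-01T11:50:00+00:00", "alarmName": "other-alarm", "state": "ALARM"},
]


def last_alarm_times(events):
    """Return the most recent ALARM timestamp for each alarm name."""
    latest = {}
    for event in events:
        # Step 1 (filter): ignore every event that is not an ALARM state.
        if event["state"] != "ALARM":
            continue
        # Step 2 (latest): keep the newest timestamp seen for this alarm.
        ts = datetime.fromisoformat(event["timestamp"])
        name = event["alarmName"]
        if name not in latest or ts > latest[name]:
            latest[name] = ts
    return latest


last_seen = last_alarm_times(events)
for name, ts in last_seen.items():
    print(name, ts.isoformat())
```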
How often are the alarms occurring? (Top 5 alarms activated widget)
In this widget, we want to show the top five alarms and how often they have fired. Figure 5 shows an example widget. These high-level statistics can help us decide which alarms to prioritize when investigating root causes and resolutions.
Figure 5: Insights Query results for the top five alarms.
- In the CloudWatch console, choose Logs Insights. In the Select log group(s) dropdown, select /aws/events/alarms/. Choose an appropriate time period. Replace the query with the following, and select Run query.
The filter command in this search finds every event with ALARM state. The stats command counts how many occurrences there are for each alarm name. The sort command orders the results from largest to smallest (desc). Finally, the limit command returns the first five results.
- Select Run query in order to see the results.
As with the previous example, add this widget to your dashboard, position it as desired, and then save the dashboard.
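The filter, stats, sort, and limit steps of the top five query can be mirrored in a few lines of Python. This is a sketch for intuition only, assuming each event has already been reduced to a hypothetical (alarm name, state) pair. It does not reproduce the Logs Insights query or the real event format.

```python
from collections import Counter

# Hypothetical (alarm name, state) pairs extracted from log events.
events = [
    ("EC2-low_CPU", "ALARM"),
    ("EC2-low_CPU", "OK"),
    ("EC2-low_CPU", "ALARM"),
    ("alarm-b", "ALARM"),
    ("EC2-low_CPU", "ALARM"),
    ("alarm-b", "OK"),
    ("alarm-b", "ALARM"),
    ("alarm-c", "ALARM"),
]


def top_alarms(events, n=5):
    """Count ALARM events per alarm name and return the n most frequent."""
    # filter: keep ALARM events; stats: count by alarm name.
    counts = Counter(name for name, state in events if state == "ALARM")
    # sort (largest first) and limit to n results.
    return counts.most_common(n)


top_five = top_alarms(events)
print(top_five)
```

With more than five alarm names in the input, only the five with the highest counts would be returned, matching the limit step.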
How long were the alarms active? (Alarm duration widget)
In this example, we have a single metric with a value of 1 when the state changes to alarm, and a 0 when it changes to OK.
By finding the time difference between these data points, we can find the alarm state duration. Figure 6 shows an example of the completed widget, including the number of minutes in the alarm state.
Figure 6: Alarm duration.
First, we must select the metric to use for our alarm state.
- In the CloudWatch console, navigate to Metrics > All Metrics. Select your metric Namespace (CWAlarms) and dimension (alarmName) and you will see a metric for each alarmName. Choose one of your metrics by selecting the appropriate checkbox on the left. It will show in the time chart.
- Select the Graphed metrics tab.
Utilizing metric math, we first find the time difference between the current data point and the previous data point, regardless of the point values.
- Choose Math expression > Start with empty expression. In the Details column, enter the following expression and click the tick to the right of the expression.
DIFF_TIME(m1)
The data on the time chart will show the time difference in seconds. The example in Figure 7 shows a first data point of ~600 s, or 10 minutes. This is the time between the data points at 11:25 and 11:35.
Note: The m1 in this expression refers to the Id given to the original metric that you added. If your metric has a different value in the Id column, then use this in place of m1 in the expression.
Figure 7: DIFF_TIME applied to our metric data.
- The next step is to keep only the data when the metric is in the alarm state, that is, the time between a 0 value point (OK state) and the previous point (value 1, ALARM state). This means keeping the values when the data point is 0, and discarding the values for data points of value 1. This can be done with the following math expression:
(1-m1)*DIFF_TIME(m1)
Modify the existing DIFF_TIME expression, or create a new one with this expression. Charting this expression will show only the values that are between an ALARM (1) state and an OK (0) state (Figure 8).
Figure 8: (1-m1)*DIFF_TIME(m1) applied to our metric data.
- Finally, we must get a total of the time in the alarm state, so we modify the expression to:
RUNNING_SUM((1-m1)*DIFF_TIME(m1))
You could also modify the results from seconds into minutes, or any appropriate time unit, as shown in Figure 9.
Note: These expressions only calculate time between state changes from In Alarm to OK during the dashboard search time. If the alarm has reported data, but not changed state, then a value of 0 will show. If the alarm has reported no data, then two dashes will show.
Figure 9: Result of applying running_sum and converting to minutes for our metric data.
To learn more about metric math, see the Using metric math documentation.
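To check the arithmetic behind the duration expressions, the following Python sketch reproduces the three steps on a handful of sample data points: the time difference between consecutive points, keeping that difference only where the current value is 0 (OK), and a running total. The data points and function names are hypothetical, and the code only imitates what the metric math functions are described as doing in this post.

```python
from itertools import accumulate

# Hypothetical data points: (time in seconds, value), 1 = ALARM, 0 = OK.
points = [(0, 1), (600, 0), (1500, 1), (1800, 0)]


def diff_time(points):
    """Seconds between each data point and the one before it."""
    return [(t, t - prev_t) for (prev_t, _), (t, _) in zip(points, points[1:])]


def alarm_seconds(points):
    """Keep the time difference only where the current value is 0 (OK)."""
    values = dict(points)
    return [(t, (1 - values[t]) * dt) for t, dt in diff_time(points)]


def running_total(series):
    """Cumulative sum of the series values, one total per data point."""
    totals = accumulate(value for _, value in series)
    return [(t, total) for (t, _), total in zip(series, totals)]


durations = alarm_seconds(points)
totals = running_total(durations)
minutes_in_alarm = totals[-1][1] / 60
print(durations, totals, minutes_in_alarm)
```

In this sample the alarm is active for 600 seconds and then for 300 seconds, so the final running total is 900 seconds, or 15 minutes.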
- Here we want to display how long an alarm is active, so we want to see a single value for each alarm. Clear all checkboxes apart from the one beside the final expression, and choose the Graph options tab. Select a Widget type of Number.
You can add the expressions for multiple alarms in the same numeric widget.
- Choose the Actions dropdown and Add to dashboard as before.
When are the alarms occurring? (Alarm state widget)
Figure 10 shows a time chart dashboard widget displaying a value of 1 when the state is ALARM, and a value of 0 when the state is OK. This timechart display can be useful for seeing patterns against other alarms, or against other metric or log time series data.
Figure 10: Timechart showing alarm state.
- As before, start by selecting the metric to use for the alarm state. In the CloudWatch console, navigate to Metrics > All Metrics. Select your metric Namespace (CWAlarms) and dimension (alarmName), and you will see a metric for each alarmName. Choose one of your metrics by selecting the appropriate checkbox on the left. It will show in the time chart.
- Select the Graphed metrics tab.
As the data stands, it shows data points for 1 and 0. Depending on when your data points occur, they may or may not be connected with a line. Therefore, we will utilize a math expression to ensure that we see connected data points, and thereby a continuous line showing the alarm state at any point in time.
- Choose Math expression > Start with empty expression. In the Details column, enter the following expression, and then click the small tick on the right of the expression.
- Check the box on the far left of this metric in order to display it on the chart. Select the label entry and change it in order to show something more meaningful, such as the alarm name.
- Uncheck the box beside your original series (labelled state) in order to remove it from the chart.
Figure 11 shows the console view after these steps.
Figure 11: A timeseries showing the alarm state.
- Click the pencil icon beside the chart title on the top left in order to change the chart title to Alarm state.
- Choose the Actions dropdown and Add to dashboard as before.
You can modify this widget to include other metrics, or you can position it alongside other widgets in order to correlate alarm occurrences with other data. For more information on how to edit existing metrics widgets, see the documentation on Edit a graph on a CloudWatch dashboard.
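The continuous line described in this section depends on carrying the last reported state forward across periods that have no data point. The Python sketch below shows that idea on hypothetical sample data. It assumes the math expression repeats the most recent value (CloudWatch metric math offers FILL with the REPEAT option for this kind of gap filling, but whether that is the expression used here is an assumption).

```python
# Hypothetical sparse data points: (time in seconds, value), 1 = ALARM, 0 = OK.
points = [(0, 1), (600, 0), (1500, 1)]


def fill_repeat(points, start, end, period):
    """Return one value per period, repeating the last reported value."""
    reported = dict(points)
    filled = []
    last = None  # stays None until the first data point is seen
    for t in range(start, end + 1, period):
        if t in reported:
            last = reported[t]
        filled.append((t, last))
    return filled


series = fill_repeat(points, start=0, end=1800, period=300)
print(series)
```

Every period now has a value, so a chart of the filled series shows an unbroken step line between state changes.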
This example utilized CloudWatch logs, metrics, and dashboard resources. For details, see Amazon CloudWatch pricing.
To avoid charges to your account, delete the resources that you created.
- CloudWatch dashboard: From the alarm-history dashboard, choose Actions > Delete dashboard, and select Delete.
- CloudWatch log groups: In the CloudWatch console, navigate to Logs > Log groups and select the appropriate group. From the Actions dropdown, select Delete log group(s).
- CloudWatch metrics: Metrics cannot be deleted, but they expire based on the retention schedule explained in the FAQ, What is the retention period of all metrics.
If you created an Amazon EC2 instance or an EventBridge rule in order to follow along with part 1, then remember to delete these as well.
- Amazon EC2 instance: see the documentation for Terminate your instance.
- EventBridge rule: see the documentation for Disabling or deleting an Amazon EventBridge rule.
This post demonstrated how to create dashboard widgets from alarm history logs and metrics. We created an alarm status widget, widgets from Logs Insights queries, and widgets from metric data utilizing metric math expressions.