Monitoring your IoT fleet using CloudWatch

Introduction

In this blog post, we show you how to monitor your IoT (Internet of Things) fleet and raise alerts when conditions reach or exceed the thresholds you consider normal operational limits. We walk through the steps to set up Amazon CloudWatch dashboards based on AWS IoT metrics, create alarms from metrics, and query log data to gain additional insight into your fleet activity or assist in troubleshooting.

As per the AWS Well-Architected Framework, after you implement your workload, you must monitor its performance so that you can remediate any issues before they impact your customers. Monitoring metrics should be used to raise alarms when thresholds are breached.

Amazon CloudWatch is a monitoring and observability service that provides you with data and actionable insights to monitor your workload, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events from workloads that run on AWS and on-premises servers.

The steps we will follow to set up our IoT monitoring are:

Prerequisites
Enable logging for AWS IoT Core
Create a CloudWatch dashboard
Create custom AWS IoT metrics
View custom metrics and create alarms
Troubleshoot an issue that created an alert

Prerequisites

The steps in this blog post use the AWS Command Line Interface (AWS CLI) for demonstration purposes. To configure basic settings, please refer to the AWS CLI Quick setup.

Enable logging for AWS IoT Core

You can configure AWS IoT logging by using the AWS Management Console, AWS Command Line Interface (AWS CLI), or the AWS IoT Core API. In order to configure logging for specific thing groups or specific devices you will need to use the AWS CLI or the IoT Core API. For additional information, please refer to IoT Device Logging.

Use the set-v2-logging-options command to set the logging options for your account.

aws iot set-v2-logging-options \
--role-arn <logging-role-arn> \
--default-log-level <log-level>

Log-levels:
ERROR – Any error that causes an operation to fail. Logs include ERROR information only.
WARN – Anything that can potentially cause inconsistencies in the system, but might not cause the operation to fail. Logs include ERROR and WARN information.
INFO – High-level information about the flow of things. Logs include INFO, ERROR, and WARN information.
DEBUG – Information that might be helpful when debugging a problem. Logs include DEBUG, INFO, ERROR, and WARN information.
DISABLED – All logging is disabled.
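As a concrete illustration, the sketch below sets the account default to WARN and then reads the setting back with get-v2-logging-options. The role ARN and the chosen level are placeholders. The `run` helper is an assumption of this example, not part of the AWS CLI: it only prints each command unless RUN=1 is set, so you can review the commands before executing them with real credentials.

```shell
# Dry-run helper (illustrative): prints the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

# Placeholder values; replace with your own logging role and preferred level.
LOGGING_ROLE_ARN="arn:aws:iam::123456789012:role/IoTLoggingRole"
DEFAULT_LOG_LEVEL="WARN"

# Set the account-wide default log level.
run aws iot set-v2-logging-options \
    --role-arn "$LOGGING_ROLE_ARN" \
    --default-log-level "$DEFAULT_LOG_LEVEL"

# Read the current settings back to confirm the change.
run aws iot get-v2-logging-options
```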

Resource-specific logging lets you set a log level for a particular target. Besides thing groups, you can target an individual device by its client ID, source IP, or principal ID.

Thing group
Device level (Client ID, Source IP, principal ID)

Use the set-v2-logging-level command to configure resource-specific logging.

aws iot set-v2-logging-level \
--log-target targetType=THING_GROUP,targetName=<thing_group_name> \
--log-level <log-level>

In the following example, the logging level for a single device with the client ID beta_device123 is set to DEBUG:

aws iot set-v2-logging-level \
--log-target "{\"targetType\":\"CLIENT_ID\",\"targetName\":\"beta_device123\"}" \
--log-level DEBUG

Raising the log level to INFO or DEBUG for specific thing groups is useful when deploying new features or firmware. After a successful deployment, you can drop the log level back to your standard logging level.
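One way to handle such a temporary change is sketched below: raise the level for a thing group, list the resource-specific levels to confirm, and delete the override once the deployment is done, which should leave the group on the account default. The thing group name is hypothetical. The `run` helper is an assumption of this example and only prints the commands unless RUN=1 is set.

```shell
# Dry-run helper (illustrative): prints the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

THING_GROUP="firmware-canary"   # hypothetical thing group name

# Raise logging to DEBUG for the group while the deployment is in progress.
run aws iot set-v2-logging-level \
    --log-target "targetType=THING_GROUP,targetName=$THING_GROUP" \
    --log-level DEBUG

# Review the resource-specific levels currently configured for thing groups.
run aws iot list-v2-logging-levels --target-type THING_GROUP

# After the deployment, remove the override for this group.
run aws iot delete-v2-logging-level \
    --target-type THING_GROUP \
    --target-name "$THING_GROUP"
```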

AWS IoT Monitor dashboard

The Monitor dashboard allows you to view CloudWatch metrics for AWS IoT from all devices registered in your AWS account. In the dashboard below, you can see the number of messages published and received by your devices aggregated by protocol, type, and direction, the number of messages published by your devices over time, and other metrics. It is available in the Monitor section of the AWS IoT Core page in the AWS console.

Image of default IoT monitor

Create a CloudWatch dashboard

Amazon CloudWatch (CloudWatch) dashboards are customizable home pages in the CloudWatch console that you can use to monitor your resources in a single view, even those resources that are spread across different Regions.

When you interact with AWS IoT, the service sends the following metrics to CloudWatch every minute.

Rule metrics
Message broker metrics
Device provisioning metrics
Fleet indexing metrics

Consider setting up a dashboard that includes metrics that are important to your deployment. For example, if you have a fleet with a steady messaging rate along with an expected number of IoT Rule executions you could have them as part of your dashboard. Another dashboard could display the status of IoT jobs deployed to your fleet showing the number of queued jobs, successful jobs, and failed jobs. As your fleet changes over time, the dashboards may need to be updated to reflect the new status or metrics for your fleet.

In the CloudWatch metrics picture below, you can see examples of AWS IoT metrics. For this specific Region, CloudWatch is retrieving metrics for the number of devices successfully registered, certificate creation, and the number of times devices failed to provision due to a client-side or server-side error.

CloudWatch metrics showing claim certificates and other IoT metrics

Below, you can see how to browse the available metrics and add them to a dashboard. In this case, select the specific certificate ID, then choose Add to dashboard from the Actions drop-down menu at the top right.

This displays how many things were registered successfully during the selected one-hour time window. You can change the window to a different interval.

How to add metrics to the dashboard by selecting the metric in lower pane then clicking add to dashboard using drop-down action menu at the top right.

Below is a custom made dashboard created in Amazon CloudWatch with the basic metrics for your fleet of IoT devices. This can be used by the operations teams to understand the status of the fleet.

Custom CloudWatch dashboard with metrics to monitor your fleet.
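A dashboard can also be created from the AWS CLI with put-dashboard. The following is a minimal sketch, not the dashboard pictured above: it writes a one-widget dashboard body that graphs two message broker metrics from the AWS/IoT namespace. The Region, dashboard name, file name, and widget layout are assumptions chosen for illustration, and the `run` helper only prints the command unless RUN=1 is set.

```shell
# Dry-run helper (illustrative): prints the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

REGION="us-east-1"                 # assumed Region
DASHBOARD_NAME="IoTFleetOverview"  # hypothetical dashboard name
BODY_FILE="iot-dashboard.json"     # hypothetical file name

# Dashboard body: a single metric widget summing MQTT connects and publishes.
cat > "$BODY_FILE" <<EOF
{
  "widgets": [
    {
      "type": "metric",
      "x": 0, "y": 0, "width": 12, "height": 6,
      "properties": {
        "title": "MQTT connections and inbound publishes",
        "region": "$REGION",
        "stat": "Sum",
        "period": 300,
        "metrics": [
          ["AWS/IoT", "Connect.Success", "Protocol", "MQTT"],
          ["AWS/IoT", "PublishIn.Success", "Protocol", "MQTT"]
        ]
      }
    }
  ]
}
EOF

# Create the dashboard, or overwrite it if the name already exists.
run aws cloudwatch put-dashboard \
    --dashboard-name "$DASHBOARD_NAME" \
    --dashboard-body "file://$BODY_FILE"
```

Keeping the body in a file makes it easy to version the dashboard alongside the rest of your fleet configuration and to add widgets as your metrics change.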

Create custom AWS IoT metrics

AWS IoT sends many standard metrics and dimensions to CloudWatch. You can also search and filter the log data coming into CloudWatch Logs by creating one or more metric filters. Metric filters define the terms and patterns to look for in log data as it is sent to CloudWatch Logs. CloudWatch Logs uses these metric filters to turn log data into numerical CloudWatch metrics that you can graph or set an alarm on.

In the Amazon CloudWatch Console, in the left side navigation pane, expand Logs and select Log Groups. Click on the AWSIotLogsV2 log group.

In the picture below, you can see the AWSIotLogsV2 log group detail.

CloudWatch Log groups selecting AWSIotLogsV2

Click the Search Log Group button and type the search string below into the filter events dialog box and choose 30 minutes as the time period. For more information on how to query log events in CloudWatch refer to Search log data using filter patterns – Amazon CloudWatch Logs.

Type the following into the search box.
{ $.eventType = RuleExecution && $.status = Failure }

Image showing where to add the search query

Below we have an example of several failed rule executions over the past 30 minutes.

Image showing the results of the log group query

Click the Create Metric Filter button to create a metric based on this search pattern. In the picture below, we give the filter a descriptive name and create a new metric namespace called MyIotApplication.

As you add more metrics, you can group them by application, workload, project, or any grouping that your organization has chosen. The metric value is set to one, so a value of one is published to the metric each time the filter pattern matches a log event. Other fields are left at their defaults. Click the Create button to continue. Once this metric filter has been created, you can use it just as you would other CloudWatch metrics by adding it to a graph or dashboard, or by creating an alarm.

Create metric filter dialog box
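The same metric filter can be created from the AWS CLI with put-metric-filter. The sketch below assumes the AWSIotLogsV2 log group and the MyIotApplication namespace used in this post. The filter name and metric name are hypothetical, and the string values in the pattern are quoted here, following the form used in the CloudWatch Logs filter pattern documentation. The `run` helper only prints the command unless RUN=1 is set.

```shell
# Dry-run helper (illustrative): prints the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

LOG_GROUP="AWSIotLogsV2"
NAMESPACE="MyIotApplication"
FILTER_NAME="IoTRuleExecutionFailures"   # hypothetical filter name
METRIC_NAME="RuleExecutionFailures"      # hypothetical metric name

# Single quotes keep the shell from expanding the $ in the JSON selectors.
FILTER_PATTERN='{ $.eventType = "RuleExecution" && $.status = "Failure" }'

# Publish a value of 1 to the custom metric for every matching log event.
run aws logs put-metric-filter \
    --log-group-name "$LOG_GROUP" \
    --filter-name "$FILTER_NAME" \
    --filter-pattern "$FILTER_PATTERN" \
    --metric-transformations \
        "metricName=$METRIC_NAME,metricNamespace=$NAMESPACE,metricValue=1"
```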

View custom metrics and create alarms

Now navigate to the CloudWatch console and select All metrics from the left side menu. In the main window, select the Browse tab and you will see the custom namespace you just created. In this example it is called MyIotApplication. Click the MyIotApplication link, then Metrics with no dimensions. The output below shows the number of failed rule executions over the selected time period.

Graph showing failed rule executions over time.

Click the Graphed metrics tab to see more details about the metric and to change the graph type, time period, and more. You can add this metric to an existing or new dashboard using the Actions drop-down button at the top right of the screen.

You can also click on the bell icon to create an alarm. Let’s do that now.

Showing the alarm icon used to create an alarm from this metric

After clicking the bell icon, you will see the following form which will allow you to make adjustments to the alarm configuration.

Form that allows you to adjust the metric alarm configuration

Here you set the threshold for your alarm. We set a threshold of two or greater within five minutes. Leave the settings under Additional configuration at their defaults. With these settings, the alarm returns to the OK state once the metric stays below the threshold for one period, in this case five minutes.

Click Next and select an Amazon Simple Notification Service (Amazon SNS) topic that will receive the notification when the alarm enters the ALARM state. This assumes you have already configured an Amazon SNS topic. For additional information, see Creating an Amazon SNS Topic.

Form showing configuration options for the alarm notifications

Select a name for your alarm and click Next.

Form to set the name and description of the alarm

On the next screen, click Create alarm if everything looks correct. Your alarm is now created.
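For reference, an equivalent alarm can be sketched from the AWS CLI with put-metric-alarm: a sum of two or more failures within one five-minute period triggers the alarm and notifies an SNS topic. The metric name and the SNS topic ARN are placeholders, and the Sum statistic is an assumption that suits a metric filter publishing a value of one per match. The `run` helper only prints the command unless RUN=1 is set.

```shell
# Dry-run helper (illustrative): prints the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

ALARM_NAME="IoTRuleExecutionFailures"
NAMESPACE="MyIotApplication"
METRIC_NAME="RuleExecutionFailures"   # hypothetical metric name
SNS_TOPIC_ARN="arn:aws:sns:us-east-1:123456789012:iot-alerts"   # placeholder

PERIOD_SECONDS=300   # five-minute evaluation period
THRESHOLD=2          # alarm at two or more failures per period

run aws cloudwatch put-metric-alarm \
    --alarm-name "$ALARM_NAME" \
    --namespace "$NAMESPACE" \
    --metric-name "$METRIC_NAME" \
    --statistic Sum \
    --period "$PERIOD_SECONDS" \
    --evaluation-periods 1 \
    --threshold "$THRESHOLD" \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions "$SNS_TOPIC_ARN"
```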

You can test your alarm and notifications by setting the alarm to ALARM using the AWS CLI. Below we show the commands to set the alarm to ALARM and back to OK.

aws cloudwatch set-alarm-state --alarm-name IoTRuleExecutionFailures --state-reason "testing alarm" --state-value ALARM

aws cloudwatch set-alarm-state --alarm-name IoTRuleExecutionFailures --state-reason "testing alarm" --state-value OK

Troubleshoot an issue that created an alert

Next, we show the steps you could perform when you have an alarm or device behavior that is outside of expected thresholds. CloudWatch Logs and CloudWatch Logs Insights provide powerful search functions to assist you in identifying possible root causes of the alarms or unexpected behaviors.

AWS IoT logs are stored in the AWSIotLogsV2 log group. If you navigate to this log group, you will see the events stored in various log streams. By clicking the Search log group button, you can filter these logs by query and time frame.

Search log group button

If you are expecting rules to be evaluated and actions to be performed, you can verify this by filtering for the event type RuleMatch as follows:

{ $.eventType = RuleMatch }

If you want to search for any rules that have failed in the past 30 minutes, add the following to the filter field. This example filters for RuleExecution events, but only those with a status of Failure:

{ $.eventType = RuleExecution && $.status = Failure }

CloudWatch Logs Insights enables you to interactively search and analyze your log data in CloudWatch Logs. You can perform queries to help you respond to operational issues more efficiently and effectively. If an issue occurs, you can use CloudWatch Logs Insights to identify potential causes and validate deployed fixes.

CloudWatch Logs Insights can also help you understand patterns of usage. Let’s take an example where you want to list the top 10 topics being published to.

filter eventType="Publish-In" and status="Success"
| stats count(*) as numPublishIn by topicName
| sort numPublishIn desc
| limit 10

Similarly, to list the top 50 publishers by client ID:

filter eventType="Publish-In"
| stats count(*) as numPublishIn by clientId
| sort numPublishIn desc
| limit 50

CloudWatch Logs Insights lets you build queries against the logs. If you received a connection throttle alert, you may want to run a query to show the clients with the highest number of connections during the past 30 minutes:

filter eventType="Connect"
| stats count(*) as NumConnections by clientId
| sort NumConnections desc
| limit 20

Log Insights showing the number of connections by clientID over time
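The same query can be run from the AWS CLI with start-query and get-query-results. This is a minimal sketch that computes a 30-minute window in epoch seconds and submits the connection query shown above. The query ID is a placeholder that you replace with the value start-query returns, and the `run` helper only prints the commands unless RUN=1 is set.

```shell
# Dry-run helper (illustrative): prints the command unless RUN=1 is set.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi; }

# Query window: the last 30 minutes, expressed in epoch seconds.
END_TIME=$(date +%s)
START_TIME=$((END_TIME - 1800))

QUERY='filter eventType="Connect" | stats count(*) as NumConnections by clientId | sort NumConnections desc | limit 20'

# Submit the query; the response contains a queryId.
run aws logs start-query \
    --log-group-name "AWSIotLogsV2" \
    --start-time "$START_TIME" \
    --end-time "$END_TIME" \
    --query-string "$QUERY"

# Fetch the results once the query completes (placeholder ID shown here).
QUERY_ID="<query-id-from-start-query>"
run aws logs get-query-results --query-id "$QUERY_ID"
```

Queries run asynchronously, so get-query-results may need to be called more than once until the reported status is Complete.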

Conclusion

Monitoring is a continual process. As your fleet and features evolve, you will need to update your monitoring goals and thresholds. By monitoring your fleet and setting alarms, you can be proactive instead of reactive. In this blog post, you learned how AWS provides different options to monitor your fleet of devices. We covered both automated and manual monitoring tools: enabling logging, creating a dashboard with the built-in metrics, and creating custom metrics and alarms. Finally, we showed how to troubleshoot when device behavior changes.

For further reading, see Designing and implementing logging and monitoring with Amazon CloudWatch and the additional IoT best practices blogs.

Authors

Steve Krems is a Specialist Solution Architect for IoT at Amazon Web Services (AWS). Prior to this role, Steve spent 18 years in the semiconductor industry in Information Technology management roles with a focus on cloud migration and modernization.

Sunitha Eswaraiah is a Solutions Architect at AWS supporting customers in the Nordics. She has 10+ years of experience as a data engineer and back-end developer, and now specializes in building IoT solutions. Prior to AWS, she worked for Gartner and Verizon in India.

The Internet of Things on AWS – Official Blog