Detecting Bugs in Deployed IoT Devices Using Percepio DevAlert
By Johan Kraft, PhD, CEO and Founder at Percepio
When developing Internet of Things (IoT) device software, it’s hard to know if all bugs have been found. As in all software development, testing can show the presence of bugs but not prove their absence.
It’s not uncommon that bugs elude detection and remain in the deployed device software. This may cause problems for end users related to data integrity, device availability, battery life, and the general user experience.
The average cost of fixing bugs in an embedded device’s first year of service alone can run in the hundreds of thousands of dollars. Percepio DevAlert works with AWS IoT Core to alert developers when an error is first detected, and provides visual trace diagnostics to identify the root cause.
In this post, I will walk you through the DevAlert process and how it works with AWS IoT Core. Percepio is an AWS Partner Network (APN) Advanced Technology Partner specializing in visual trace diagnostics for embedded and IoT device software developers.
The Cost of Bugs in Embedded Devices
Like all software, firmware in IoT devices contains bugs. Sometimes more of them, sometimes fewer; but they’re always there—even in production firmware and deployed devices.
It’s estimated that embedded software developers typically introduce 50-100 bugs per 1,000 lines of code during development, of which about 3-5 are missed during testing and remain in the production firmware.
This happens because it’s practically impossible to test every execution scenario and code path in multi-threaded device firmware. Furthermore, the effort needed to find the last remaining bug tends to increase exponentially.
Although you can never know if all bugs have been found, at some point you need to ship your product.
With many thousands or even millions of devices running your firmware, you can run up against the laws of probability. Some of your customers will stumble upon the remaining bugs and suffer from unexpected errors in the device, no matter how deep the bugs were hidden during testing.
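To put rough numbers on this, assume each device independently has a small probability p of triggering a given latent bug within some period. The chance that at least one of N devices triggers it is then 1 - (1 - p)^N. The probability and fleet sizes below are illustrative assumptions, not measurements.

```python
def prob_any_device_hits_bug(p_per_device: float, num_devices: int) -> float:
    """Probability that at least one of num_devices independent devices hits the bug."""
    return 1.0 - (1.0 - p_per_device) ** num_devices


# Hypothetical bug that a single device triggers with probability 1 in 100,000.
P_PER_DEVICE = 1e-5

for fleet_size in (100, 10_000, 1_000_000):
    chance = prob_any_device_hits_bug(P_PER_DEVICE, fleet_size)
    print(f"{fleet_size:>9} devices: {chance:.4f}")
```

A bug that is almost never seen on a test bench with a handful of devices becomes close to certain to appear somewhere in a fleet of a million.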
You can only hope that customers report those problems as soon as possible. If they provide sufficient information, you can reproduce and fix the bugs.
All too often, however, customers do not report those errors. Instead, they restart the device and hope it doesn’t happen again. When you do get an error report from a customer, you’ll often have to settle for a vague description like “the screen went blank,” leaving you with no idea of where to begin searching for the bug.
High Service Costs in the First Year Alone
A recent survey from market research firm VDC Research found that embedded development projects needed 79 patches on average during their first year of deployment.
The average cost to debug, correct, and deploy was more than $5,000 per patch. That’s a total of several hundred thousand dollars per project in support costs, just in the first year.
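The arithmetic behind that total is simple to check. The sketch below uses the two survey figures quoted above and treats $5,000 as a lower bound, since the survey says "more than."

```python
PATCHES_FIRST_YEAR = 79      # average number of patches reported in the survey
COST_PER_PATCH_USD = 5_000   # lower bound: the survey reports "more than $5,000"

min_first_year_cost = PATCHES_FIRST_YEAR * COST_PER_PATCH_USD
print(f"At least ${min_first_year_cost:,} per project in the first year")
```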
More importantly, many of these patches are fixes to bugs that have been negatively affecting customers for a long time. The cost per patch mentioned in the VDC report does not account for damaged customer relations, poor product reviews, and lost sales, which can have an even bigger impact on the business than the direct cost of fixing the bugs.
Percepio DevAlert Reports Errors When They Occur
The introduction of cloud-based IoT platforms such as AWS IoT Core offers embedded developers a new way to deal with missed bugs in production firmware.
AWS IoT Core lets connected devices interact with cloud applications and other devices. It supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely.
With AWS IoT Core, your applications can keep track of and communicate with all your devices, all the time, even when they aren’t connected.
Percepio has developed a cloud service called Percepio DevAlert that immediately informs developers of errors detected in their firmware.
Hosted by AWS IoT Core, Percepio DevAlert collects error reports from devices in deployment or field testing and notifies the developers the first time a new error is reported. The reports include detailed software traces that show what happened in the firmware just before the error occurred.
Once developers correct the error, they can automatically distribute the patch via the cloud as an over-the-air (OTA) update, which is both faster and more reliable than asking customers to manually download and update the device firmware.
A manual firmware update involves a process that many end users are unfamiliar with and are therefore unlikely to carry out.
This process involves three Percepio components:
- Percepio DevAlert Firmware Agent
- Percepio DevAlert Classification Engine
- Percepio Tracealyzer
Percepio DevAlert Firmware Agent
The eyes and ears of Percepio DevAlert are provided by the DevAlert Firmware Agent, a compact software library that device developers embed in their RTOS-based IoT application.
This agent acts somewhat like a flight recorder, responsible for two important firmware monitoring tasks—keeping a trace of recent software events, and providing a way for error-handling code in the application to report any detected errors, whether related to software or hardware.
When the application detects and reports an error, such as a failed assert condition, the report bundles a few pieces of information:
- Symptoms describing the error, including an error code from the device and other diagnostic data relevant to the application developers, such as important state variables and register values selected by the device developers.
- A trace showing the most recent software events before the reported error, which provides context and makes it easier for developers to analyze and fix the problem.
This error report, called an Alert, is then uploaded to the developer’s AWS account through the existing AWS IoT Core connection.
In case the connection isn’t available at the moment (due to the reported error, for example), the Alert can instead be saved to flash memory and uploaded once the connection is restored.
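The report-then-upload behavior described above can be modeled in a few lines. This is a minimal sketch in Python rather than firmware code, and it is not the DevAlert Firmware Agent API: the `Alert`, `AlertUploader`, and `FakeLink` names are hypothetical, and an in-memory queue stands in for flash storage.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List


@dataclass
class Alert:
    """Hypothetical error report: numeric symptoms plus a recent-event trace."""
    symptoms: Dict[int, int]   # numeric key-value pairs chosen by the developers
    trace: List[str]           # most recent software events before the error


@dataclass
class AlertUploader:
    """Uploads alerts, keeping them in a local queue while the link is down."""
    send: Callable[[Alert], bool]                         # True if the upload succeeded
    pending: Deque[Alert] = field(default_factory=deque)  # stands in for flash storage

    def report(self, alert: Alert) -> None:
        # Queue first so ordering is preserved, then try to drain the queue.
        self.pending.append(alert)
        self.flush()

    def flush(self) -> None:
        # Upload oldest-first; stop at the first failure and retry later.
        while self.pending:
            if not self.send(self.pending[0]):
                return
            self.pending.popleft()


class FakeLink:
    """Test double for the cloud connection (an assumption for this sketch)."""
    def __init__(self) -> None:
        self.online = False
        self.sent: List[Alert] = []

    def send(self, alert: Alert) -> bool:
        if not self.online:
            return False
        self.sent.append(alert)
        return True


link = FakeLink()
uploader = AlertUploader(send=link.send)
uploader.report(Alert(symptoms={1: 42}, trace=["task A start", "assert failed"]))
stored_while_offline = len(uploader.pending)   # alert is retained while offline

link.online = True
uploader.flush()                               # connection restored: alert is uploaded
```

Calling `flush()` whenever the connection comes back is one simple way to make sure stored Alerts are not lost.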
Percepio DevAlert Classification Engine
Alerts are first stored in an Amazon Simple Storage Service (Amazon S3) bucket in the developer’s account. The symptoms are then forwarded to the DevAlert Classification Engine, a fully managed service hosted by Percepio, running on AWS. All trace data remains in the device developer’s account.
The DevAlert Classification Engine looks at error codes and any other symptoms provided by the Firmware Agent and notifies the developers in case of a new unique issue; i.e., a new combination of symptoms.
Duplicate Alerts will not generate any notification, by default. This approach avoids flooding developers with notifications even if many devices report the same problem.
Developer notifications are provided via the Amazon Simple Notification Service (SNS), typically via email. You can configure the notification rules in the DevAlert console.
The DevAlert Classification Engine keeps statistics on all reported issues and stores them in a searchable database, making it possible to view the number of affected devices and which firmware versions are affected.
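The core idea of grouping Alerts by their combination of symptoms can be sketched as follows. This is an illustration of the deduplication principle only, not Percepio's implementation. The `Classifier` class and its fields are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Set, Tuple

# An issue is identified by its exact set of (symptom key, value) pairs.
IssueKey = FrozenSet[Tuple[int, int]]


class Classifier:
    """Groups alerts into issues and tracks simple per-issue statistics."""

    def __init__(self) -> None:
        self.alert_count: Dict[IssueKey, int] = defaultdict(int)
        self.devices: Dict[IssueKey, Set[str]] = defaultdict(set)
        self.versions: Dict[IssueKey, Set[str]] = defaultdict(set)

    def classify(self, symptoms: Dict[int, int], device_id: str, fw_version: str) -> bool:
        """Record an alert; return True only if it represents a new issue."""
        key: IssueKey = frozenset(symptoms.items())
        is_new = key not in self.alert_count   # membership test does not create the key
        self.alert_count[key] += 1
        self.devices[key].add(device_id)
        self.versions[key].add(fw_version)
        return is_new


engine = Classifier()
first = engine.classify({1: 7}, device_id="dev-a", fw_version="1.0")      # new issue: notify
duplicate = engine.classify({1: 7}, device_id="dev-b", fw_version="1.0")  # same symptoms: no notification
other = engine.classify({1: 8}, device_id="dev-a", fw_version="1.1")      # different symptoms: notify
```

Only `first` and `other` would trigger a notification, while the per-issue sets still record that two devices were affected by the first issue.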
How the Percepio Components Work Together
The illustration below shows the flow of information in Percepio’s DevAlert solution and how it works with DevAlert Classification Engine.
Figure 1 – Percepio DevAlert architecture.
The numbers in the illustration correspond to the numbers in the sequence:
1. DevAlert Firmware Agent sends an Alert to the developer’s account in AWS IoT Core. It typically uses MQTT over Transport Layer Security (TLS) rather than unencrypted MQTT over plain TCP. The Alert is sent via AWS IoT Basic Ingest, which lets devices send data to AWS services through rule actions without incurring messaging costs.
2. An AWS IoT rule is triggered and performs the actions described in steps 3 and 4.
3. The Alert is stored in the developer’s cloud account in an Amazon S3 bucket.
4. The AWS Lambda function “Alert Sender” is activated and retrieves the symptom data.
5. Symptom data (but not the trace data) is submitted to the Percepio DevAlert service, using a unique endpoint URL provided by Amazon API Gateway and a Cross-Account Role.
6. Amazon API Gateway authenticates the request and invokes the DevAlert Classification Engine, which groups the raw Alerts into unique Issues.
7. All Issues are stored in an Amazon DynamoDB (NoSQL) database for further processing and analytics.
8. If a new Issue is detected, or if a custom notification rule is triggered, a notification is created using SNS.
9. All Alerts are also stored in an Amazon S3 bucket for archival.
10. When a developer receives a notification about a new Issue, they can use Percepio Tracealyzer to view the reported symptoms (steps 11, 12, and 13 in the illustration) and retrieve the trace (step 14) for detailed analysis.
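The key property of the “Alert Sender” step is that only symptom data is forwarded while the trace stays in the developer’s account. A minimal sketch of that filtering is shown below. The field names (`device_id`, `firmware_version`, `symptoms`, `trace`) are assumptions made for illustration, and the surrounding work of reading the Alert from Amazon S3 and calling the API Gateway endpoint is deliberately left out.

```python
import json
from typing import Any, Dict

# Hypothetical allow-list: only these fields may leave the developer's account.
FORWARDED_FIELDS = ("device_id", "firmware_version", "symptoms")


def build_symptom_payload(alert: Dict[str, Any]) -> str:
    """Return the JSON body to forward: allow-listed fields only, never the trace."""
    forwarded = {name: alert[name] for name in FORWARDED_FIELDS if name in alert}
    return json.dumps(forwarded, sort_keys=True)


stored_alert = {
    "device_id": "dev-a",
    "firmware_version": "1.0",
    "symptoms": {"1": 42, "2": 7},           # numeric codes, as strings for JSON keys
    "trace": "<binary trace data omitted>",  # stays in the developer's S3 bucket
}
payload = build_symptom_payload(stored_alert)
```

Using an allow-list rather than removing the `trace` field means that any field added to the Alert later is withheld by default.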
Percepio Tracealyzer: Visual Trace Diagnostics to Fix the Bug Fast
When developers receive a notification from the DevAlert Classification Engine, they can access the corresponding trace in Tracealyzer, Percepio’s desktop software for visual trace diagnostics.
Tracealyzer provides detailed views and overviews explaining what happened in the firmware just before the error occurred.
Tracealyzer is also a stand-alone product that you can use during development over debug connections such as Segger J-Link and ST-LINK from STMicroelectronics. A physical debug connection lets you run Tracealyzer in streaming mode and record long traces, several minutes or hours if needed.
DevAlert uses the event recording in “snapshot mode,” meaning the trace data is continuously written to a ring-buffer in device RAM but only uploaded when an Alert is generated.
The encoding of the trace data is very compact in snapshot mode, allowing the DevAlert Firmware Agent to keep a history of 500 to 1,000 events using only about 5 KB of RAM.
You can configure the trace buffer size to suit the target system requirements. If RAM is scarce, you may reduce the size of the trace buffer to fit the DevAlert Firmware Agent into an existing application, but you will have to make do with shorter traces. You may also increase the buffer size to obtain longer traces, if you have RAM to spare.
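The ring-buffer behavior and the sizing trade-off can be illustrated with a small model. This is a Python sketch of the concept, not the actual recorder. The `TraceRingBuffer` class and the event layout are hypothetical, and the bytes-per-event figure is only inferred from the numbers above (about 5 KB for 500 to 1,000 events suggests roughly 5 to 10 bytes per event).

```python
from collections import deque
from typing import Deque, List, Tuple

Event = Tuple[int, int]  # (timestamp, event code), a stand-in for the compact encoding


class TraceRingBuffer:
    """Keeps only the most recent events; older events are overwritten."""

    def __init__(self, capacity_events: int) -> None:
        self._events: Deque[Event] = deque(maxlen=capacity_events)

    def record(self, timestamp: int, event_code: int) -> None:
        # When the deque is full, appending discards the oldest event.
        self._events.append((timestamp, event_code))

    def snapshot(self) -> List[Event]:
        # Copy of the current contents, oldest first, as uploaded with an alert.
        return list(self._events)


def approx_capacity(buffer_bytes: int, bytes_per_event: int) -> int:
    """Rough number of events that fit in a buffer of the given size."""
    return buffer_bytes // bytes_per_event


# Assumed average of 8 bytes per event in a 5 KB buffer.
capacity = approx_capacity(5 * 1024, bytes_per_event=8)
buffer = TraceRingBuffer(capacity)
for tick in range(1000):
    buffer.record(tick, tick % 16)

trace = buffer.snapshot()   # only the newest `capacity` events remain
```

Halving the buffer halves the history available in each Alert, which is the trade-off to weigh when RAM is scarce.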
Tracealyzer can display many different views of the firmware, from high-level architecture overviews generated from the trace data, to detailed timelines and information about task switches, interrupts, and semaphores.
These views are connected so that clicking on a data point or time interval in one of the views highlights the corresponding location in other open views, allowing the developer to study a sequence of events from several angles simultaneously. This allows for explorative, top-down analysis which is essential for system-level debugging.
The overviews help developers spot anomalies, and connected views simplify drill-down to the corresponding events.
Figure 2 – Visual trace diagnostics in Percepio Tracealyzer.
Since traces reported by DevAlert only include the most recent events, the offending code can be located quickly, often near the end of the trace.
If the error is not obvious from this information alone, you will at least know roughly what was going on in the device when the error occurred. This makes it easier to reproduce the issue for more detailed debugging, either by inserting additional Tracealyzer logging or by using a traditional debugger.
Once the bug is found and fixed, developers can use the OTA update capabilities in AWS IoT Core to deploy the updated firmware via the cloud. This is both faster and more reliable than asking customers to manually download and update the device firmware.
DevAlert’s Built-in Security
Percepio DevAlert protects customer data by using AWS best practices for authentication and encryption. It does not expose any additional attack surfaces, as it relies on already available and secure communications protocols, such as MQTT over TLS.
Developers may use the built-in software tracing to log any information of relevance in the firmware. Since this may include sensitive information, the DevAlert Classification Engine has been designed to operate without needing to analyze the trace data. This trace data never leaves the developer’s AWS account.
To do its job, the DevAlert Classification Engine only needs the symptom data that is explicitly reported by the developer’s code via the DevAlert Firmware Agent. This data consists only of numeric key-value pairs. On its own, without the original source code or DevAlert configuration, it is meaningless.
Developers can use the DevAlert console to define alert types and symptoms with descriptive labels, generate unique numeric codes for them, and download the definitions as a C header file for inclusion in the DevAlert Firmware Agent.
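To make the idea of labeled definitions and numeric codes concrete, here is a small sketch of how such a header could be generated. It does not reproduce the DevAlert console's actual output format. The labels, the include guard name, and both helper functions are hypothetical.

```python
from typing import Dict, Iterable


def assign_codes(labels: Iterable[str], start: int = 1) -> Dict[str, int]:
    """Give each descriptive label a unique numeric code, in order."""
    return {label: code for code, label in enumerate(labels, start=start)}


def render_c_header(codes: Dict[str, int], guard: str = "ALERT_DEFS_H") -> str:
    """Render the label-to-code mapping as a C header with an include guard."""
    lines = [f"#ifndef {guard}", f"#define {guard}", ""]
    lines += [f"#define {label} {code}u" for label, code in codes.items()]
    lines += ["", f"#endif /* {guard} */", ""]
    return "\n".join(lines)


# Hypothetical alert type and symptom labels.
codes = assign_codes(["ALERT_ASSERT_FAILED", "SYMPTOM_LINE_NUMBER", "SYMPTOM_STACK_POINTER"])
header_text = render_c_header(codes)
print(header_text)
```

Because only the numeric codes travel with an Alert, someone who intercepts the symptom data without this mapping sees numbers with no labels attached.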
With Percepio DevAlert and AWS IoT Core, device developers can deploy complex IoT firmware with confidence, get an accurate view of the real-world behavior of their production firmware, and provide OTA updates when needed.
Together, these products can reduce support costs for IoT devices, minimize the number of users affected by missed bugs, and improve the overall customer experience.
To learn more about Percepio DevAlert, visit percepio.com/devalert.
The content and opinions in this blog are those of the third party author and AWS is not responsible for the content or accuracy of this post.