How to Solve Tricky Embedded IoT Challenges with Insightful Analysis
Editor’s note: This is the second of a two-part series on the APN Blog. Read Part 1 >>
By Johan Kraft, CEO at Percepio
In this post, we will dig deeper into Amazon FreeRTOS and take a closer look at the communication stack for AWS IoT Core.
You will also learn how this communication can be analyzed using Percepio Tracealyzer, ensuring your system doesn’t generate unnecessary data traffic. We’ll present an example where two minor bugs sometimes coincided, causing the system to generate 10 times more data than expected, and how the cause of such issues can be identified.
AWS IoT Core Connectivity in Amazon FreeRTOS
The Amazon FreeRTOS communication stack for AWS IoT Core has four layers, where the top three layers (Shadow, MQTT, and Secure Sockets) are hardware independent.
Figure 1 – Amazon FreeRTOS communication stack, when used with the NXP OM40007 IoT kit.
Layer 4 – Device Shadow
At the highest abstraction level, you will find the Shadow API. This allows for updating the AWS IoT Device Shadow, which is the cloud representation of an individual device including its last reported values.
For example, in the Smart Home demo described in the previous article, the Device Shadow includes the latest temperature reading from a sensor and the state of indoor light sources (on or off). The Device Shadow can be accessed even if the device is not connected, and any changes to the Device Shadow will be applied as soon as the device connects again.
To access and update the Device Shadow from your Amazon FreeRTOS application, you first need to ensure that the shadow service has been initialized using SHADOW_ClientCreate() and SHADOW_ClientConnect(). You can then update the Device Shadow using SHADOW_Update(). Documentation is found in aws_shadow.h and in the Amazon FreeRTOS Developer Guide.
The Device Shadow contains both the reported and desired values, and Amazon FreeRTOS allows for receiving notifications from the Amazon Web Services (AWS) Cloud in case they differ. This occurs if the desired value is changed, like if a user turns the bedroom lights on via a mobile app connected to AWS. This can also occur if the reported data deviates from the desired value, such as the indoor temperature decreasing during a cold night.
To receive such notifications, the application should register a callback function for the delta event using the SHADOW_RegisterCallbacks() function. This event is generated by the cloud-side shadow service after a shadow update has been accepted and there’s a difference between the reported and desired values. There are also a few other callbacks that can be used, as described in aws_shadow.h.
Note that the shadow callback functions run in the scope of a system task and therefore need to be pretty fast to avoid disturbing the shadow service. You should not do any real processing in the callback function, but instead alert an application task (via a message queue, for example) that does the actual processing of the event.
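The hand-off pattern described above can be sketched in plain C. This is a minimal illustration, not code from the demo: the names on_shadow_delta, process_pending_events, and the small ring buffer are hypothetical, and the ring buffer only stands in for a real FreeRTOS message queue (where you would typically use xQueueSend() and xQueueReceive() instead). The point is that the callback only copies and enqueues the payload, while the slow work happens later in the application task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define EVENT_QUEUE_LEN   8
#define EVENT_PAYLOAD_MAX 64

/* Hypothetical event record passed from the callback to the application task. */
typedef struct {
    char   payload[EVENT_PAYLOAD_MAX];
    size_t length;
} shadow_event_t;

/* Small ring buffer standing in for an RTOS message queue.
 * Assumption: single-threaded sketch only, so no locking is shown. */
typedef struct {
    shadow_event_t items[EVENT_QUEUE_LEN];
    size_t head;
    size_t tail;
    size_t count;
} event_queue_t;

static void queue_init(event_queue_t *q)
{
    memset(q, 0, sizeof *q);
}

static bool queue_send(event_queue_t *q, const shadow_event_t *ev)
{
    if (q->count == EVENT_QUEUE_LEN) {
        return false;                       /* queue full, event dropped */
    }
    q->items[q->tail] = *ev;
    q->tail = (q->tail + 1) % EVENT_QUEUE_LEN;
    q->count++;
    return true;
}

static bool queue_receive(event_queue_t *q, shadow_event_t *out)
{
    if (q->count == 0) {
        return false;                       /* nothing pending */
    }
    *out = q->items[q->head];
    q->head = (q->head + 1) % EVENT_QUEUE_LEN;
    q->count--;
    return true;
}

/* Callback side: copy the payload and enqueue it, nothing more. */
static bool on_shadow_delta(event_queue_t *q, const char *json, size_t len)
{
    shadow_event_t ev;

    assert(q != NULL && json != NULL);
    if (len >= EVENT_PAYLOAD_MAX) {
        return false;                       /* too large for this sketch */
    }
    memset(&ev, 0, sizeof ev);
    memcpy(ev.payload, json, len);
    ev.length = len;
    return queue_send(q, &ev);
}

/* Application task side: drain the queue and do the slow work here.
 * Returns the number of events handled. */
static size_t process_pending_events(event_queue_t *q)
{
    shadow_event_t ev;
    size_t handled = 0;

    while (queue_receive(q, &ev)) {
        (void)ev.length;                    /* parse JSON, update state, ... */
        handled++;
    }
    return handled;
}
```

In a real system the callback would return as soon as the event has been queued, and the application task would block on the queue until an event arrives.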
The Device Shadow is expressed in JSON format, and an example of a Device Shadow is shown in Figure 2. JSON is not very common within embedded software, perhaps because it may not seem efficient to embedded software developers. However, JSON is very common within general web development and AWS Cloud services, and has many benefits in that domain.
Figure 2 – A Device Shadow in JSON format.
A simple way to generate JSON in a C project is to use the standard library function snprintf() and put all the static JSON formatting within the format string, as demonstrated in Figure 3.
Note that the whole Device Shadow does not need to be sent on every update. Perhaps you have 20 fields in your Device Shadow, but if only one or a few of them have been updated, it’s sufficient to send just the affected fields. This reduces the amount of data that needs to be sent.
Figure 3 – Generating JSON with snprintf().
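As a text companion to the snprintf() approach, here is a minimal sketch of a partial update that reports a single field. The function name build_temperature_update and the field name "temperature" are hypothetical and chosen only for illustration. The sketch also checks the return value of snprintf() so that a truncated (and therefore invalid) JSON document is never sent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes a partial shadow update that reports only the temperature.
 * Returns the JSON length in bytes, or -1 if the buffer is too small.
 * Assumption: "temperature" is a hypothetical field name. */
static int build_temperature_update(char *buf, size_t size, int temperature_c)
{
    int n;

    assert(buf != NULL);
    /* All static JSON formatting lives in the format string. */
    n = snprintf(buf, size,
                 "{\"state\":{\"reported\":{\"temperature\":%d}}}",
                 temperature_c);
    if (n < 0 || (size_t)n >= size) {
        return -1;                          /* error or truncated output */
    }
    return n;
}
```

The resulting string would then be passed as the update document, for example via SHADOW_Update().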
You don’t need to define the Device Shadow JSON structure beforehand in the AWS IoT Core cloud services, since this is created on the first call to SHADOW_Update().
Layer 3 – MQTT
The Shadow API uses Message Queuing Telemetry Transport (MQTT) to communicate with AWS IoT shadow service. MQTT is an ISO standard publish-subscribe protocol, designed for small devices and low bandwidth, and therefore quite suitable for IoT applications.
Amazon FreeRTOS contains an MQTT client that communicates with an AWS IoT cloud service, the Message Broker for AWS IoT. Whenever the client has new data to publish, it sends an update message to the broker. The broker then notifies all other clients that subscribe to this topic, such as AWS Cloud services or other Amazon FreeRTOS devices.
The messages are identified by the MQTT topic. For Device Shadows, the topic follows the structure $aws/things/<ThingName>/shadow/<Event>, and thus identifies both the device (ThingName) and the shadow event that has occurred. Examples of such events include update (i.e. publish), update/accepted (response on accepted update), get, and delete.
There are about a dozen such topics described in the AWS IoT Developer Guide, Shadow MQTT Topics.
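To make the topic structure concrete, the following sketch builds a shadow topic string from a thing name and an event. The helper name build_shadow_topic and the thing name used in the usage note are hypothetical. The Shadow API normally handles topics for you, so this is only meant to illustrate the format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Builds "$aws/things/<ThingName>/shadow/<Event>" into buf.
 * Returns the topic length, or -1 if the buffer is too small. */
static int build_shadow_topic(char *buf, size_t size,
                              const char *thing_name, const char *event)
{
    int n;

    assert(buf != NULL && thing_name != NULL && event != NULL);
    n = snprintf(buf, size, "$aws/things/%s/shadow/%s", thing_name, event);
    if (n < 0 || (size_t)n >= size) {
        return -1;                          /* error or truncated output */
    }
    return n;
}
```

For example, calling build_shadow_topic(buf, sizeof buf, "MyThing", "update/accepted") with a sufficiently large buffer produces $aws/things/MyThing/shadow/update/accepted.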
Layer 2 – Secure Sockets
Below the Amazon FreeRTOS MQTT client, you will find the Secure Sockets API, a TLS implementation that provides secure socket-based communication. This is implemented for all boards supported by Amazon FreeRTOS and provides a common, hardware-independent API.
This API is defined in aws_secure_sockets.h and provides functions like SOCKETS_Connect, SOCKETS_Send, and SOCKETS_Recv (receive). It’s pretty straightforward to use, but the underlying TLS implementation is complex. This means that a Secure Socket connection is quite costly in terms of RAM and processing time.
Fortunately, since an Amazon FreeRTOS system only needs to communicate with the MQTT broker, a single Secure Sockets connection is sufficient and this saves a lot of precious RAM.
If you’re not familiar with embedded software, you may find it strange that a few TLS connections would have a noticeable impact on RAM usage. However, Amazon FreeRTOS typically runs on microcontrollers with as little as 128 KB of RAM, and a single TLS connection can easily consume tens of kilobytes.
Layer 1 – Wifi Stack
Finally, below the Secure Sockets layer you will find a board-specific driver that communicates with the Wifi interface. In the NXP OM40007, this is a separate Qualcomm QCA4004 device that provides stack offloading, meaning that the Wifi and TCP/IP stacks run on the QCA4004 device instead of in Amazon FreeRTOS.
This way, there’s more processor time and memory available for the application.
Analyzing AWS Communication with Percepio Tracealyzer
Percepio Tracealyzer allows you to look inside your Amazon FreeRTOS software at runtime. It’s a complement to a traditional debugger that provides better means for debugging, validation, and profiling of multi-threaded embedded software, like Amazon FreeRTOS, and without stopping the execution.
This is all implemented in software using code instrumentation, so no special tracing hardware is required to use Tracealyzer. Our secret sauce is the advanced visualization, offering more than 30 views of the runtime behavior, showing the execution of tasks, interrupt handlers, API calls, and custom logging calls added in the code.
In the latest version of Tracealyzer, we have added support for tracing the Secure Sockets layer, so you can see when data is sent and received in the trace. To learn how to enable this kind of tracing in Amazon FreeRTOS, see Percepio Application Note PA-027.
This can be studied in several ways. For instance, in the more detailed views like the trace view shown in Figure 4 and the object history view in Figure 5, you can see the individual Secure Sockets API calls.
Figure 4 – Secure Sockets API calls shown in the vertical trace view.
If you double-click on a label, the Object History view (Figure 5) shows you all the operations on this particular socket, using a list view with additional information. Here, you can also see the total accumulated number of bytes sent and received.
Figure 5 – Secure Sockets API calls shown in the Object History view.
The Secure Socket events can be abstracted into high-level overviews, such as the Communication Flow graph in Figure 6, and the I/O Channel Graph. The Communication Flow shows the interactions between tasks and shared objects, such as sockets. As expected, we can see there’s only one socket in this system and that the MQTT task is using it.
Figure 6 – The Communication Flow graph, zoomed in on a Socket object.
The I/O Channel Graph shows the amount of data sent or received on a communication interface, like a Secure Socket connection. In Figure 7, we can see two instances of this graph in the right part of the screen—one showing data sent, and the other showing data received.
This shows not only the amount of data, but also a profile of the AWS IoT communication over time. This can be used to find relevant sections in the trace, like the initial spike when the connection was established.
Figure 7 – Tracealyzer showing the vertical trace view (left) and two I/O Channel graphs (right).
Example: Identifying Excessive Data Traffic
In Figure 7, each bar of the I/O Channel graphs represents a 360 ms window at this zoom level, so this shows the system is sending about 300 bytes/second and receiving about 1000 bytes/s on average, during the latter part of the trace.
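The conversion from a bar in the graph to a data rate is simple arithmetic, sketched below with a hypothetical helper named bytes_per_second. The per-bar byte counts in the usage note are derived from the rates quoted above, not read from the figure.

```c
#include <assert.h>

/* Converts the number of bytes observed in one time window
 * into an average rate in bytes per second (integer arithmetic). */
static unsigned long bytes_per_second(unsigned long bytes_in_window,
                                      unsigned long window_ms)
{
    assert(window_ms > 0UL);
    return (bytes_in_window * 1000UL) / window_ms;
}
```

With 360 ms bars, an average of 108 bytes per bar corresponds to 300 bytes/second, and 360 bytes per bar corresponds to 1000 bytes/second.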
This may not sound like much, but the system is only supposed to send a temperature reading every 10 seconds, so this can’t be right. The system appears to work, but normally it doesn’t send and receive this much data. What has happened here?
There is a lot to gain by answering that question before deploying this to a large customer base. Remember that you pay for the AWS IoT data traffic. The cost is minuscule during development, but when multiplied by thousands or millions of devices, such issues can become rather costly. Moreover, if your device is battery-powered, issues like this will have a big impact on battery life.
We need to figure out why the communication is so intense in this case, and Tracealyzer can point us in the right direction. In the trace view, we see that the MQTT task runs intensively, about 4 times per second, and every time it sends and receives a few hundred bytes. This seems like way too much, compared to the 10 second updates we had intended.
When looking closer in the trace, we can see the MQTT task receives queue messages that seem to trigger the transmissions. To see what task sent these messages, we can select a relevant part of the trace and generate a Communication Flow graph, showing the runtime interactions during this period, as shown in Figure 8.
Each rectangle represents a task while the other shapes are different kinds of kernel objects, such as message queues. The arrows indicate the data flow.
Figure 8 – Communication Flow graph showing a selected part of the trace.
There’s a lot going on here, so let’s filter the graph to get a better view of the MQTT task. By clicking on the MQTT task, we highlight its direct dependencies: all message queues, semaphores, and event groups that the MQTT task uses directly. Moreover, by right-clicking on the MQTT node, we can select different options for pruning the graph.
Figure 9 – Right-click on the MQTT task in the Communication Flow to prune the graph.
We then select “Show Connected Only (2 steps)”, which reduces the graph so that only nodes within two steps from the MQTT task are shown. The result can be seen in Figure 10.
Figure 10 – The pruned Communication Flow graph, focusing on the MQTT task.
We can now see the MQTT task only receives input from a message queue (the one on the left), where the AWS-LED task is the only sender. Despite the name, this is an application task and seems to be responsible for this behavior.
When looking at the source code of the AWS-LED task, we see it uses the Shadow API to update the AWS Device Shadow. And like most tasks, this contains an event processing loop that runs on every queue message. We also see it receives messages from the MQTT task, so there’s a circular dependency.
With some help from the MCUXpresso debugger and the console log shown in Tracealyzer (as described in the previous article), the problem turned out to be caused by two bugs that coincided to produce this behavior.
The first problem was that the SHADOW_Update function was accidentally called whenever an update/delta event was received, that is, whenever the AWS shadow service notified the device about a difference between the desired and reported value.
Normally, this only resulted in a few extra updates, which were hardly noticeable. However, a second bug caused the application to report an incorrect value to the device shadow under certain circumstances. This caused the AWS shadow service to respond with an update/delta event after each update event (since the value was still not the desired one), which generated a new update event (due to the first bug).
As a result, the application got stuck in a feedback loop, much like a never-ending ping-pong game between the device and the AWS shadow service, until the device was eventually restarted. The device, however, seemed to work as normal.
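The interaction between the two bugs can be illustrated with a small simulation. This is a simplified model under stated assumptions, not the actual application code: the type device_config_t and the function count_updates are hypothetical, the shadow holds a single integer value, and the cloud side is reduced to comparing the reported value with the desired one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical switches for the two bugs described in the text. */
typedef struct {
    bool update_on_delta;      /* bug 1: send an update on every delta event */
    bool reports_wrong_value;  /* bug 2: reported value never matches desired */
} device_config_t;

/* Simulates one reporting cycle that starts with a regular update.
 * Returns the number of updates sent before the traffic settles,
 * or max_rounds if it never settles. */
static int count_updates(device_config_t cfg, int desired, int max_rounds)
{
    int updates = 0;
    bool pending = true;       /* the regular update that starts the cycle */

    assert(max_rounds > 0);
    while (pending && updates < max_rounds) {
        int reported = cfg.reports_wrong_value ? desired + 1 : desired;
        bool delta;

        updates++;                          /* device -> cloud: update */
        delta = (reported != desired);      /* cloud -> device: update/delta */
        pending = delta && cfg.update_on_delta;
    }
    return updates;
}
```

In this model, either bug on its own leads to a single update per cycle, while both bugs together keep the loop running until the round limit is reached, which mirrors the ping-pong behavior seen in the trace.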
After fixing these bugs, Tracealyzer now shows the expected behavior, as seen in Figure 11. The AWS communication is now less than 10 percent of what we had before, and the profile seems to match our intended 10 second updates. This would make a big difference on both the traffic costs and device battery life.
Figure 11 – After fixing the bugs, the AWS IoT communication is as expected.
In this post, we have presented the Amazon FreeRTOS communication stack for AWS IoT Core in more detail and demonstrated how this communication can be analyzed using Percepio Tracealyzer.
We presented an example where bugs in the application code sometimes caused the application to send and receive over 10 times more data than required. Since the problem only occurred under special circumstances and didn’t cause the device to fail, the problem might elude system testing.
If this was deployed to a large fleet of customer devices, you can be sure that many devices would experience this problem sooner or later, and then get stuck in this feedback loop.
Detecting abnormal data traffic can be done from the cloud side using AWS services, but on the cloud side you can’t see why the firmware behaved in this manner. With Tracealyzer, you can see both the AWS communication and application behavior in the same trace and find the cause.
Percepio – APN Partner Spotlight
Percepio is an APN Standard Technology Partner. Tracealyzer provides unprecedented insight into the run-time world of FreeRTOS systems. Solve problems in a fraction of the time otherwise needed, develop more robust designs to prevent future problems, and find new ways to improve software performance.