AWS Startups Blog

IoT Primer: Well Behaved Small Things

By Brett Francis, Solutions Architect, AWS

I have searched all over the world but failed to find a virtue better than being well behaved.
– Rumi

Welcome to the second post in the series of primers designed to help you achieve great solutions by bringing together a collection of “small things” within the Internet of Things (IoT) domain.

As mentioned in the first post in this series, the Pragma Architecture has four layers. But since telemetry is most frequently the first purpose of a small thing, it is often the first challenge in creating IoT solutions (beyond hacking on a device directly). Therefore, this post and the next dive into key considerations of the Pragma Architecture’s Speed Layer, and specifically into some challenges that, when left unaddressed, turn telemetry into a flood.

Telemetry — Remotely Determine what a Device Senses

Many people logically summarize the requirement for gathering telemetry simply as “just gather the sensor data and send it in for use.” At first glance the requirement seems small and unimposing. As a result, we’ve seen many customers start aligning their first telemetry solution with the following high-level diagram.

[Figure: incorrect telemetry data diagram]

In this diagram, the “persistent stream” frequently starts as a relational database, a NoSQL store, or an actual stream-processing solution. The Small Thing in the diagram might use HTTPS to send telemetry data to that stream on a schedule.

In the real world, such an approach struggles because it doesn’t address the following questions:

  • What should the device do with data if the device can’t send it to the stream?
  • What happens if all the devices report at the same time and create massive peaks in reporting? What about a hundred, a thousand, or even ten thousand devices?
  • How does one gather data when HTTPS is too heavy for our resource-constrained small thing?
  • How can a solution simultaneously provide more local network connections for devices in a global fleet?
  • How can the solution scale globally but retain a centralized management and analysis capability spanning all the devices in the fleet?

These real-world struggles are useful since they form key considerations that must be addressed by any small thing telemetry architecture.

To start addressing these challenges and creating a solution that moves past them all, it is beneficial to categorize the questions into “device-oriented” and “cloud-oriented”, with the first two being device-oriented and the last three being cloud-oriented. Before we get to the cloud-oriented portion of our telemetry solution, we need to dive into the first two questions and discuss what it means to have a fleet of devices that are well behaved.

Logging Algorithms Matter

Network communication is always intermittent but, due to resource or environmental constraints, network communication for small things is likely to be even more intermittent. It’s like the space shuttle versus tractor comparison in the first article: The space shuttle has a globally spanning, dedicated, communications network with built-in redundancy. But the tractor is a small thing because the environment in which it operates provides a sparse or less-capable network, forcing the solution shape to change as a result.

This example of intermittent and less-capable network communication also raises a key solution consideration that has nothing to do with the cloud platform but instead has to do with what the small thing does when the network is absent.

When the device can’t just “send it in,” your device’s logging behavior becomes very important. Logging algorithms observed across many solutions generally fall into three categories:

Three categories of logging algorithms

  • FIFO = Straightforward to implement. This algorithm’s data arrives from stage left and exits stage right when the allocated local storage is full. Examples include operations measurements and general-purpose telemetry.
  • Culling = Good for retaining absolute point values at a loss of curve smoothness. This algorithm’s data arrives stage left, and once local storage has been filled beyond a “culling level,” some sweeper logic then removes every other (or every Nth) sample.
  • Aggregate = Good for ever-increasing counters where the detailed shape of the curve is not as important as the minimum, maximum, and average values over a period of time. This algorithm’s data conceptually arrives from stage left, and once records have filled the storage past an “aggregation point,” the algorithm aggregates the stored values. Examples include kWh, insolation, flow, CPU, temperature, wind speed, etc.
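
The three categories above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the class names, the `capacity`, `culling_level`, and `aggregation_point` parameters, and the choice to summarize as (minimum, maximum, average, count) are all hypothetical, and samples are plain numbers rather than full telemetry records.

```python
from collections import deque


class FifoLog:
    """Keep the newest `capacity` samples; the oldest sample is dropped first."""

    def __init__(self, capacity):
        self.samples = deque(maxlen=capacity)

    def add(self, sample):
        self.samples.append(sample)  # deque discards the oldest entry when full


class CullingLog:
    """Once storage passes `culling_level`, keep every other stored sample."""

    def __init__(self, culling_level):
        self.culling_level = culling_level
        self.samples = []

    def add(self, sample):
        self.samples.append(sample)
        if len(self.samples) > self.culling_level:
            # Sweeper logic: surviving samples keep their exact values,
            # but the curve between them gets coarser.
            self.samples = self.samples[::2]


class AggregateLog:
    """Once storage passes `aggregation_point`, fold raw samples into a summary."""

    def __init__(self, aggregation_point):
        self.aggregation_point = aggregation_point
        self.raw = []
        self.summaries = []  # (minimum, maximum, average, count) tuples

    def add(self, value):
        self.raw.append(value)
        if len(self.raw) > self.aggregation_point:
            count = len(self.raw)
            self.summaries.append(
                (min(self.raw), max(self.raw), sum(self.raw) / count, count)
            )
            self.raw = []  # raw detail is traded for storage headroom
```

On a real device the same trade-offs would apply to flash or RAM budgets rather than Python lists, but the behavior at the "full" threshold is the part worth deciding early.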

All of these algorithms help meet the challenges of intermittent networks through the use of local storage. Using these algorithms in your solution can improve resilience. But perhaps most importantly, thinking about your logging approach and how it affects the sensed data when the network is unavailable can help avoid future network issues. Even then, although important, these algorithms are just the first steps toward ensuring devices are well behaved in your telemetry solution.

Two Important Telemetry Attributes

Two more key characteristics further define what it means to have well behaved devices. These characteristics are device-specific unique IDs and early time-stamping. By instilling these into your device thinking, you’ll gain improved flexibility of your telemetry solution from proof-of-concept through achieving global scale.

Device-specific Unique ID
A critical early piece of every telemetry solution is that each small thing should have a truly unique ID within the solution.

This unique ID might be a serial number read from a chip on the device, the MAC address of the device when a device’s MAC is static, or some other relatively static, retrievable identifier. In this author’s experience, granular IDs, associated with device data collected and aggregated on the cloud, provide more overall solution flexibility as well as better horizontal scaling capabilities.

Put another way, by requiring a device to have a more granular unique ID, one can leverage descriptive information stored in a cloud-based device registry to layer additional aggregate behaviors across a fleet of devices. This distinct addressability of a single device also benefits global fleet analytics, centralized management of a fleet, pairing a small thing with other domains of control, and the next topic in this series: commands. Additionally, unique IDs at the point of a single device align well with the hash keys used in NoSQL and stream-processing services; these highly scalable services have much value in the IoT space.
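
To illustrate why per-device IDs pair well with hash-keyed services, here is a minimal sketch that maps a device ID to a shard index. The function name `shard_for`, the `device-N` ID format, and the use of MD5 with a modulo are assumptions for illustration only; managed NoSQL and stream-processing services apply their own partitioning schemes.

```python
import hashlib


def shard_for(device_id: str, shard_count: int) -> int:
    """Map a device ID to a shard index in the range [0, shard_count)."""
    digest = hashlib.md5(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % shard_count


# Hypothetical fleet: the same device always lands on the same shard,
# while the fleet as a whole spreads across all of them.
shards_in_use = {shard_for(f"device-{n}", 8) for n in range(1000)}
```

Because the key is as granular as a single device, no one shard has to absorb a whole site, customer, or region, which is what makes horizontal scaling straightforward.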

Early “Time-stamping”
In addition to the unique ID, the creation and delivery of a timestamp early in any telemetry solution is also greatly beneficial. In this author’s experience, having the small thing be the creator of the timestamp provides the most benefit for customer solutions. However, since the small thing might not even have a clock, the next best place to inject a timestamp is at the absolute earliest arrival of data from the small thing. This might be a cloud-based protocol gateway or the server-side solution itself. Also, don’t bother with time zones in your timestamp; just use Coordinated Universal Time (UTC). Time zones are a server-side rendering problem that a unique ID and a device registry cover. (We’ll talk about a device registry in a later post.)
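
A minimal sketch of this "stamp as early as possible" rule follows. The record layout and the names `device_record`, `stamp_on_arrival`, `ts`, and `ts_source` are hypothetical; the only points carried over from the text are that the device stamps the record when it can, the first server-side hop stamps it otherwise, and the timestamp is always UTC.

```python
from datetime import datetime, timezone


def utc_now_iso() -> str:
    """Current time as an ISO 8601 string in UTC."""
    return datetime.now(timezone.utc).isoformat()


def device_record(device_id, reading, has_clock=True):
    """Build a record on the device; stamp it only if the device has a clock."""
    record = {"device_id": device_id, "reading": reading}
    if has_clock:
        record["ts"] = utc_now_iso()
    return record


def stamp_on_arrival(record):
    """At the earliest server-side hop, add a timestamp only if one is missing."""
    if "ts" not in record:
        record = {**record, "ts": utc_now_iso(), "ts_source": "gateway"}
    return record
```

Marking where the timestamp came from is an optional extra in this sketch; it lets later analysis distinguish sensed time from arrival time.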

When Waves Align a Solution Struggles

When a small thing uses a local logging algorithm to gather up data from sensors, it has started producing consistent drops or even a trickle of data into the larger solution.

When that small thing exists alongside 100,000 of its buddies, any flocking behavior will turn those 100,000 trickles into constructive interference that the entire solution must process.

The simplest and actually most powerful strategy to keep the trickles from overwhelming the solution is to ensure nonsynchronous behavior of the devices. Even though there are many challenges with scaling a device fleet, the most common mistake with the biggest impact in sensing solutions is that all the devices “report in every 15 minutes.” This often results in spikes every quarter hour on the clock. The number is not always “15,” but anything consistent and statically configured on the device often results in the following:

  • At 100 devices no one will care and just be happy things work.
  • At 1,000 devices the solution starts experiencing issues that usually get fixed, and …
  • At 5,000–10,000 devices the solution might have such untenable peaks that a device or solution redesign is required. But by now there is already a significant fleet and customer base in existence that needs to remain operational.

Simply by randomizing the start times of each device’s reporting interval, the trickles produce a smoother stream, not a turbulent one. A simple approach to getting this behavior is that the device starts its reporting interval only after it awakes and a random duration has passed. Please keep this in mind when designing your small thing because with this simple shift in behavior you will have avoided many a late night for your hardware engineers, firmware engineers, solution developers, and administrators.
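
The effect of a randomized start can be sketched with a small simulation. Everything here is an assumption chosen for illustration: the function names, the one-second bucket used to measure load, the 10,000-device fleet, and the fixed random seed (used only so the sketch is repeatable).

```python
import random
from collections import Counter


def first_report_delay(interval_s: float, rng: random.Random) -> float:
    """Random wait after wake-up before the device starts its reporting interval."""
    return rng.uniform(0, interval_s)


def peak_reports_per_second(start_offsets, interval_s, horizon_s):
    """Largest number of reports that land in any one-second bucket."""
    buckets = Counter()
    for offset in start_offsets:
        t = offset
        while t < horizon_s:
            buckets[int(t)] += 1
            t += interval_s
    return max(buckets.values())


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
interval = 900           # the "every 15 minutes" case, in seconds
fleet = 10_000

synchronized = [0.0] * fleet  # every device reports on the quarter hour
randomized = [first_report_delay(interval, rng) for _ in range(fleet)]

sync_peak = peak_reports_per_second(synchronized, interval, 3600)
random_peak = peak_reports_per_second(randomized, interval, 3600)
print(sync_peak, random_peak)  # the whole fleet at once vs. a small fraction of it
```

Each device still reports every 15 minutes; only the phase differs, so nothing about data freshness changes while the peak the back end must absorb drops sharply.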

Heading to Telemetry in the Cloud

Now that we have some well behaved small things, in the next post we leave the realm of specific device behaviors and head solidly back to the cloud-oriented challenges of building a telemetry solution.

May your success be swift and your solution scalable.