Build resilient IoT device applications that remain active using the AWS IoT Device SDKs

Introduction

In this blog post, we provide recommendations on how you can build resilient Internet of Things (IoT) device applications using AWS IoT Core, the AWS IoT Device SDKs, and the MQTT protocol. These recommendations cover: managing your MQTT client, publishing and receiving messages, managing the device application process, setting up the network connection, performing software updates, and integrating hardware features for resilience.

Arguably, all IoT device applications will experience scenarios that can lead to a loss of service. Examples include lost or unstable network connectivity, loss of power, faults in your own software, device hardware faults, server-side disconnects, and authentication errors.

As an IoT device application builder, it is your responsibility to build your applications to be resilient to failure scenarios, so that you can avoid or mitigate any loss of service. When you deploy your device applications at the edge, on-site intervention can be impractical or impossible.

The goal of resilience is to make sure your IoT device application remains active and performs to specification. If the application is not active, it cannot mitigate failure. A resilient device application can restore service quickly, with little or no disruption.

To help illustrate the recommendations, we first describe a basic IoT device application built on AWS IoT. Then we describe how you can incrementally apply the recommendations to the device application. When building your own device application, you can decide which recommendations to adopt, and when. You can achieve resilience early and increase resilience over time.

Time to read	8 minutes
Learning level	Advanced (300)
Services used	AWS IoT Core, AWS IoT Device Management, AWS IoT Device SDKs

Building a basic IoT device application

You can build a basic MQTT-based IoT device application using AWS IoT technologies. At a minimum, your application will need to support:

An approach for provisioning with AWS IoT Core.
Configuration with your AWS IoT Core endpoint address.
Configuration of credentials to connect to that endpoint address.
Integration with an MQTT client that matches your chosen protocol, programming language, and runtime environment.
Connection to AWS IoT Core using the MQTT client and the correct protocol (MQTT or MQTT over WebSocket).
Subscription to MQTT topics, and publishing and receiving of messages.

We recommend that you integrate your device application with an AWS IoT Device SDK and use the MQTT client from your chosen SDK. The AWS IoT Device SDKs have resilience features built-in and closely integrate with AWS IoT Core resilience functionality (see later).

See the tutorial Connecting a device to AWS IoT Core by using the AWS IoT Device SDK for a full guide on building a basic IoT device application with the AWS IoT Device SDK.

After you have built your IoT device application, you can upload it to an edge device and run it. If you have correctly configured the application with your endpoint and credentials, it will connect to AWS IoT Core and be able to publish and receive messages.

So far, so good. You have built a basic IoT device application and it is working. However, what if something bad happens? What if the network connection is lost? Or if the MQTT broker refuses the connection because of an authentication error? What if your application crashes?

If your device application does not specifically handle negative scenarios, it is likely to exit, leading to loss of service. This is where the following recommendations help.

Recommendations:

1) Manage your MQTT connection

AWS IoT Core, the AWS IoT Device SDKs, and the MQTT protocol were built with resilience in mind. After your MQTT client has established a connection with AWS IoT Core, your device application can publish and receive MQTT messages, despite transient connectivity interruptions.

To fine-tune the configuration of the MQTT client, you can set the Quality of Service (QoS) for message delivery, or configure the MQTT keep-alive interval. However, you will need to do additional development work to achieve full resilience to negative scenarios.

Here are some techniques for managing the MQTT connection for your IoT device application:

Technique	Description
Take advantage of AWS IoT Core and MQTT resilience features	Carefully read the documentation for your MQTT client (for example, the AWS IoT Device SDK) and for AWS IoT Core MQTT protocol connections. The following AWS IoT Core and MQTT features may help your device application achieve greater resilience. Persistent sessions – When your client reconnects after being temporarily disconnected, AWS IoT Core persistent sessions will restore topic subscriptions, and deliver stored messages that were published with QoS 1 while your client was offline. Retained messages – AWS IoT Core retained messages can deliver messages published to your client when it comes online, even after a significant period offline. Last Will and Testament (LWT) – AWS IoT Core LWT can deliver a message if your client disconnects abruptly, and your cloud application can act on this message. QoS – If your device application publishes messages with QoS 1, you will be able to check for success or failure of message delivery, and your application can react accordingly.
Encapsulate the MQTT client	In your device application software, encapsulate the MQTT client and fully control the life-cycle of the client, along with anything else required to create, configure, and start the client. After the client is fully encapsulated, you can create, configure, use, and ultimately destroy the client, multiple times, whilst your application is active.
Handle MQTT client events	Configure your device application to listen to MQTT client events, and act on them (see later). Useful events include: connect, disconnect, error, interrupt, and resume.
Track the MQTT connection state	Maintain a flag that tracks the state of the MQTT connection. Use the connect, disconnect, interrupt, and resume events to update it. Adapt how your device application manages subscriptions and messages when there is no connection (see the next recommendation).
Recover from server-side disconnects	An MQTT broker might decide to disconnect your MQTT connection, and you should expect this to happen. This includes the AWS IoT Core Message Broker. Your device application should be ready to handle disconnects whenever and as often as they happen. However, in practice, MQTT connections should remain open for many days or weeks.
Recover from authentication failure	Do not assume that an authentication failure is fatal to your device application. Some authentication failures could be temporary, such as when the server-side policy is not yet active. Be sure that your application recovers if an authentication failure prevents connection (see technique on connection health checks).
Handle MQTT client errors / exceptions	Catch all MQTT client errors and exceptions. Observe which are fatal, and which are warnings or transient, and adapt accordingly. If the connection becomes unusable, disconnect the connection.
Perform connection health checks at regular intervals	At a regular interval, check the health of your MQTT connection, and remediate any problem found. For example: If the credentials are missing, check again later. If there is no MQTT client, try to create one. If there is no MQTT connection, try to create one. If the MQTT connection is not connected, try to connect it.
Define strategy for connection retries	When retrying connection attempts, use an exponential backoff strategy. This can protect against excessive connection attempts when multiple clients are affected by the same underlying issue.
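
The health check and retry techniques above can be sketched as follows. This is a minimal illustration, not part of any AWS IoT Device SDK: the names backoff_delay and ConnectionSupervisor, the load_credentials and create_client callables, and the client's connect() method are all hypothetical stand-ins for whatever your chosen MQTT client provides. The random jitter on the delay is an assumption added to spread out retries from many devices.

```python
import random


def backoff_delay(attempt, base=1.0, cap=120.0, rng=random.random):
    """Return a retry delay in seconds for a 0-based attempt number.

    The delay ceiling doubles with each attempt up to `cap`; the returned
    value is a random fraction of that ceiling (jitter).
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling


class ConnectionSupervisor:
    """Owns the MQTT client life-cycle and tracks the connection state."""

    def __init__(self, load_credentials, create_client):
        # Both callables are hypothetical and supplied by your application.
        self._load_credentials = load_credentials
        self._create_client = create_client
        self.client = None
        self.connected = False
        self.failed_attempts = 0

    # Wire these two methods to your MQTT client's connection events.
    def on_connected(self):
        self.connected = True
        self.failed_attempts = 0

    def on_disconnected(self):
        self.connected = False

    def health_check(self):
        """Run one check and return a short description of the action taken."""
        if self.connected:
            return "healthy"
        credentials = self._load_credentials()
        if credentials is None:
            return "credentials missing, will check again later"
        try:
            if self.client is None:
                self.client = self._create_client(credentials)
            self.client.connect()
        except Exception as error:  # for example, a temporary authentication failure
            self.failed_attempts += 1
            self.client = None  # discard the client so the next check builds a fresh one
            return f"connect failed: {error}"
        return "connect attempted"

    def next_check_delay(self, interval=30.0):
        """Use the normal interval when healthy, and back off after failures."""
        if self.failed_attempts == 0:
            return interval
        return backoff_delay(self.failed_attempts)
```

In this sketch, your application would call health_check() in a loop, sleeping for next_check_delay() seconds between calls, so that a failed connection attempt is never treated as fatal.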

2) Manage MQTT subscriptions and message flow

When your main device application logic wants to publish a message, or is expecting to receive a message, the low-level resilience of the MQTT connection should not be a concern. By adopting a modular approach to your application design, you can treat your main application logic and the MQTT client as separate, loosely coupled concerns.

To enable this separation of concerns, you can introduce a software layer between the main device application logic, and the logic which manages the MQTT connection. This layer can buffer outbound messages until the connection is available, and it can verify that subscriptions for inbound messages are configured correctly, regardless of the state of the underlying MQTT client or connection.

If you decide to buffer outbound messages in your device application, you should consider how this will work when publishing messages using the AWS IoT Device SDK. Your application should track the success or failure of each message publish attempt, and use this to update the message buffer in your application. If your application is publishing messages with QoS 1, then you can expect the SDK to buffer those messages when the connection is momentarily offline. To help guide your implementation, refer to the documentation for your chosen AWS IoT Device SDK. Check how to use the SDK to publish messages with QoS 1, and how to receive the associated PUBACK response.
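
As an illustration of the outbound side of such a layer, here is a minimal sketch of a bounded message buffer. It covers buffering only, not subscription management. The OutboundBuffer class is hypothetical, and so is the publish callable: it is assumed to wrap your MQTT client and to return True only once delivery is confirmed (for QoS 1, once the PUBACK has been received). Dropping the oldest message when the buffer is full is one possible policy, chosen here for simplicity.

```python
from collections import deque


class OutboundBuffer:
    """Holds outbound messages until delivery is confirmed."""

    def __init__(self, publish, max_size=1000):
        # `publish(topic, payload)` is a hypothetical callable that returns
        # True only when the broker has confirmed delivery.
        self._publish = publish
        # A bounded deque silently drops the oldest message when full.
        self._pending = deque(maxlen=max_size)
        # Update this flag from your MQTT client's connection events.
        self.connected = False

    def send(self, topic, payload):
        """Queue a message, then try to deliver everything that is queued."""
        self._pending.append((topic, payload))
        self.flush()

    def flush(self):
        """Publish buffered messages in order; stop at the first failure."""
        while self.connected and self._pending:
            topic, payload = self._pending[0]
            try:
                delivered = self._publish(topic, payload)
            except Exception:
                delivered = False
            if not delivered:
                break  # keep the message and retry on the next flush
            self._pending.popleft()

    def __len__(self):
        return len(self._pending)
```

The main application logic only ever calls send(). The connection-management code sets connected and calls flush() when the connection resumes, which keeps the two concerns loosely coupled.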

3) Manage your IoT device application process

Now that your IoT device application is internally resilient, you can shift focus to the environment your application runs in.

The specific runtime environment your IoT device application will run in might vary according to your requirements, but the following resilience techniques remain important for all types of runtime environment.

Technique	Description
Process management (PM)	Instead of managing your application process yourself, try to use well-known process management software. Examples include PM2 or Docker.
Graceful start up and shut down	All operating systems have mechanisms for starting up and shutting down applications. Your application should integrate with these mechanisms, in a way that is idiomatic to the operating system your application is deployed to. In particular, choose the correct runlevel for your application, so that any resources your application depends on are available, and so that your application starts and stops at the appropriate moment.
Operating system signals	Operating systems can signal your application. Your application should respect these signals and react accordingly. For instance, if the operating system signals that your application should exit, then the application can tidy up resources before exiting. Examples of tidying up include gracefully ending the MQTT connection and flushing any buffered messages to local storage.
Application logging and metrics	Your application should log useful operational information. If there are negative scenarios to which your application should react, then logging the details of these can be helpful to verify that your application is resilient. Logging can also help you to learn of scenarios that you have not yet mitigated against.
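
The operating system signals technique can be sketched with the Python standard library as follows. The run function, its work_once and disconnect callables, and the JSON-lines file format are hypothetical choices made for this illustration, and payloads are assumed to be JSON-serializable strings.

```python
import json
import signal
import threading

shutdown_requested = threading.Event()


def handle_exit_signal(signum, frame):
    """Signal handler: only record the request; tidy up in the main loop."""
    shutdown_requested.set()


def install_signal_handlers():
    # SIGTERM is commonly sent by process managers; SIGINT is sent by Ctrl+C.
    # Call this once from the main thread at start up.
    signal.signal(signal.SIGTERM, handle_exit_signal)
    signal.signal(signal.SIGINT, handle_exit_signal)


def save_unsent_messages(messages, path):
    """Write buffered messages to local storage, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for topic, payload in messages:
            handle.write(json.dumps({"topic": topic, "payload": payload}) + "\n")


def run(work_once, disconnect, unsent_messages, path, poll_seconds=1.0):
    """Work until asked to exit, then flush messages and disconnect."""
    while not shutdown_requested.is_set():
        work_once()
        # Wait on the event instead of sleeping, so an exit request wakes the loop.
        shutdown_requested.wait(poll_seconds)
    save_unsent_messages(unsent_messages, path)
    disconnect()
```

A typical start up would call install_signal_handlers() and then run(...). Keeping the handler itself trivial means the tidy-up work happens in normal application code rather than inside the signal handler.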

4) Manage your network connection

If there is no network connectivity on the device, your IoT device application cannot establish an MQTT connection. Carefully configuring and managing the network connection, to achieve maximum connection uptime, is an important part of making your device application resilient to negative scenarios.

We recommend that you do not try to implement network connectivity resilience yourself, because this requires significant implementation, testing, and ongoing maintenance effort. You can instead use existing solutions that are known to work. As an example, many Linux distributions come with the NetworkManager and ModemManager packages pre-installed. These packages work together to keep devices connected to networks and will mitigate against negative scenarios. You can configure connection failure fallback strategies to select an alternative network.

If you are using cellular networks for your network connectivity, you might be able to take advantage of advanced features offered by your provider, such as roaming between networks. On the cloud side, you might be able to inspect and analyze the connectivity status of your device fleet, and adjust device connectivity options for maximum resilience. Some vendors give you the capability to signal your devices, which you can use to perform recovery if your device application is stuck (such as initiating a remote reboot).

5) Manage your software updates

The ability to remotely update your IoT device application and device software is an important factor to support resilience in your IoT application.

An IoT device application is rarely finished when you deploy it to devices for the first time. You will need to deploy new features and bug fixes to your application with a software update. Similarly, the operating system on your devices will likely need updates, and it is especially important that you can rapidly deploy security fixes.

You can build a software update capability using AWS IoT Device Management Jobs. You can use Jobs to define remote operations that are sent to your devices and run there by an agent device application that you create. This agent application is likely to run separately from your main device application, and it also needs to be designed for resilience, similar to your main application.
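
To illustrate the core of such an agent, the following minimal sketch processes one job notification and builds the status update to publish in response. It assumes the agent subscribes to the AWS IoT Jobs topic $aws/things/<thingName>/jobs/notify-next and publishes status updates to $aws/things/<thingName>/jobs/<jobId>/update. The function name, the handlers mapping, and the "operation" field in the job document are hypothetical: the job document format is defined by you, not by AWS IoT.

```python
import json


def process_job_notification(thing_name, notification_payload, handlers):
    """Handle one notify-next message and return (topic, payload) to publish.

    `handlers` maps the hypothetical "operation" field of your job document
    to a callable that performs the update. Returns None when the
    notification carries no pending job execution.
    """
    message = json.loads(notification_payload)
    execution = message.get("execution")
    if not execution:
        return None

    job_id = execution["jobId"]
    document = execution.get("jobDocument", {})
    handler = handlers.get(document.get("operation"))

    if handler is None:
        status = "REJECTED"  # this agent does not know how to run the job
    else:
        try:
            handler(document)
            status = "SUCCEEDED"
        except Exception:
            status = "FAILED"  # report the failure instead of crashing the agent

    topic = f"$aws/things/{thing_name}/jobs/{job_id}/update"
    return topic, json.dumps({"status": status})
```

A fuller agent would probably also report IN_PROGRESS before starting work and persist its progress locally, so that an update interrupted by a reboot can be resumed or reported as failed.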

6) Enable device hardware resilience features

Check if your IoT device integrates technology that may assist with resilience, such as a watchdog timer or a UPS device.

If your device has a watchdog timer, then you can configure the watchdog to take action if your device becomes unresponsive or develops a fault, such as rebooting the device.
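
As a minimal sketch of this idea, the following function keeps feeding a watchdog only while the application reports itself healthy. It assumes a watchdog that is reset by writing to a device file, as with the Linux watchdog device (commonly /dev/watchdog). The function name, the is_healthy callable, and the interval are hypothetical, and the interval must be shorter than your watchdog's timeout.

```python
import time


def feed_watchdog(device, is_healthy, interval_seconds=10.0, max_feeds=None):
    """Feed a watchdog for as long as the application is healthy.

    `device` is a writable binary file object, for example the result of
    open("/dev/watchdog", "wb", buffering=0) on Linux (an assumed path).
    `is_healthy` is a hypothetical callable supplied by your application.
    Returns the number of feeds written.
    """
    feeds = 0
    while max_feeds is None or feeds < max_feeds:
        if not is_healthy():
            # Stop feeding so that the watchdog times out and takes action.
            break
        device.write(b"\0")  # a write resets the watchdog countdown
        device.flush()
        feeds += 1
        time.sleep(interval_seconds)
    return feeds
```

Tying the feed to an application-level health check, rather than feeding unconditionally from a timer, means the watchdog can also recover a device whose application is running but stuck.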

If your device is powered via an uninterruptible power supply (UPS) device, you might be able to configure it to signal your device application when the power supply will be lost. Your device application can initiate an ordered shutdown, or notify your cloud application of the situation.

7) Adopt a strategy for Disaster Recovery and High Availability

Our final recommendation is that you adopt a strategy for Disaster Recovery (DR) and High Availability (HA) for your IoT device application. A good starting point is the Disaster Recovery for AWS IoT Implementation Guide and the Disaster Recovery for AWS IoT solution. To understand how AWS IoT Core approaches resilience, you can read Resilience in AWS IoT Core.

Conclusion

In this blog post, we presented several recommendations, along with detailed techniques, to help you build resilient IoT device applications using AWS IoT Core and the AWS IoT Device SDKs. Your device application will experience negative scenarios, and it is your responsibility to mitigate against these. By following these recommendations, your device application can become more resilient and remain active, even under negative scenarios.

As further reading, we recommend the IoT Lens from the AWS Well-Architected Framework. In particular, the Design for offline behavior design principle is relevant to resilience.

About the author

Diggory Briercliffe is a Senior IoT Architect at Amazon Web Services supporting customers in the IoT area.

The Internet of Things on AWS – Official Blog