The Internet of Things on AWS – Official Blog

Best practices for ingesting data from devices using AWS IoT Core and/or Amazon Kinesis

Internet of Things (IoT) devices generate data that can be used to identify trends and drive decisions in the cloud.
Designing a scalable ingestion technique is a complex task, and the first step is to understand the behavior expected from the device: how the device sends data and how much, what pattern the data follows, in which direction it flows, what information it carries, and what purpose it serves. These are some of the questions you need to answer to define the ingestion process. This blog post explores use-case-specific best practices for ingesting data at scale with AWS IoT Core and/or Amazon Kinesis.

To ingest IoT data into AWS, this post covers two main service families:
AWS IoT offers a suite of fully managed services that enable the connection, management, and secure communication between billions of IoT devices and the cloud. It offers a set of capabilities that help organizations build, deploy, and scale IoT applications. AWS IoT Core supports connectivity for billions of devices and processes trillions of messages. Using AWS IoT Core, you can securely route messages to AWS endpoints and other devices, and establish a management and control layer for your IoT solution.
Amazon Kinesis cost-effectively processes and analyzes streaming data at any scale. With Amazon Kinesis, you can ingest real-time data, such as video, audio, application logs, website clickstreams, and IoT telemetry data, for machine learning (ML), analytics, and other applications. Amazon Kinesis Data Streams is a scalable and affordable streaming data service. It captures data from diverse sources in real-time, enabling instant analytics for applications like dashboards, anomaly detection, and dynamic pricing.
When operating IoT devices, you need to be aware of the environment, activity, and situation in which they perform to select the best data ingestion stack. This blog will guide you through the different aspects and tradeoffs that define the most appropriate ingestion strategy.

What is your environment?
The environment refers to the type of devices in use, the software stack provisioned in them, the operational goal, and the connectivity expected from the devices.

How many devices are you operating? Where are those devices operating? What is their function? What operational control do you need over the devices?
The first factor to consider is the size of the fleet you are operating and the location and goal of the devices. Working with remote devices in uncontrolled environments requires built-in control of the device lifecycle and remote visibility into their current status. To manage and maintain large quantities of remote and constrained devices that operate in the field, you can use AWS IoT Core: it supports encrypted information exchange with devices to retrieve their current status and can perform remote actions on them. We refer to multi-purpose or edge devices that have a management connection path to them as controlled devices. Controlled devices that need to send frequent or large amounts of data, but do not need to receive information, benefit from ingesting data through Amazon Kinesis. You can use the Amazon Kinesis Producer Library to build your data ingestion clients as a separate component, or use the Kinesis Agent to collect and send data to Amazon Kinesis Data Streams (see the producer sketch below).
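The Kinesis Producer Library and Kinesis Agent handle batching, retries, and monitoring for you. As a minimal illustration of the producer side only, the following sketch uses the AWS SDK for Python (boto3); the stream name device-telemetry and the payload fields are assumptions for this example.

import json
import time

import boto3

# Hypothetical stream name; replace with the stream provisioned for your fleet.
STREAM_NAME = "device-telemetry"

kinesis = boto3.client("kinesis")

def send_reading(device_id: str, temperature_c: float) -> None:
    """Write one telemetry record; the partition key groups a device's records on a shard."""
    record = {
        "device_id": device_id,
        "temperature_c": temperature_c,
        "timestamp": int(time.time()),
    }
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=device_id,
    )

send_reading("sensor-001", 21.7)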

What is the software stack you are working with?
Your choice of device and its development tools, along with your programming language experience or preference, define the software you use to build your data ingestion layer. Devices with limited resources, like microcontrollers (MCUs), benefit from purpose-built operating systems like FreeRTOS and lightweight messaging protocols like MQTT, which AWS IoT Core supports for building applications that send data.
For multi-purpose devices (MPUs), where there is a broad choice of operating systems and tooling to integrate data ingestion clients into your existing applications or ecosystems, you can use the Amazon Kinesis Producer Library and Kinesis Client Library to build your data ingestion producer and consumer components, as illustrated below.
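On the consuming side, the Kinesis Client Library handles shard discovery, load balancing, and checkpointing for you. The boto3 sketch below only illustrates the underlying read model (a shard iterator plus get_records) against an assumed stream name, reading a single shard.

import time

import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "device-telemetry"  # assumed stream name

# Read from the first shard only; the Kinesis Client Library would fan out across all shards.
shard_id = kinesis.describe_stream(StreamName=STREAM_NAME)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM_NAME,
    ShardId=shard_id,
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    response = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in response["Records"]:
        print(record["Data"])  # process the payload here
    iterator = response["NextShardIterator"]
    time.sleep(1)  # stay under the per-shard read limits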

What activity do you plan to accomplish?
Understanding the source of data, volume, and flow will determine the best ingestion approach.

What is the volume and rate of data to be ingested? What flow does the data follow?
In situations where devices generate high-throughput data (greater than 512 KB/s), you need to be aware of the throughput per connection. Kinesis Data Streams can help you collect and process unidirectional data in real time and can scale thanks to its underlying serverless architecture.
Messaging with payload sizes up to 128 KB can use MQTT, a lightweight publish/subscribe messaging protocol supported by AWS IoT Core to send and receive data. It supports a wide range of communication patterns, from unidirectional communication to bidirectional, command-and-control approaches for remotely managing devices. Payload sizes up to 1 MB can use Kinesis Data Streams to ingest data into AWS, and you can scale the required read and write throughput as necessary by adding or removing shards. A shard is a uniquely identified sequence of data records in a stream, and a stream is composed of one or more shards.
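Each shard provides a fixed amount of capacity (1 MB/s or 1,000 records/s for writes, 2 MB/s for reads), so in provisioned mode you scale a stream by changing its shard count; streams in on-demand capacity mode scale automatically. As a minimal sketch, assuming a stream named device-telemetry currently running 4 shards:

import boto3

kinesis = boto3.client("kinesis")

# Hypothetical values: double the capacity of a provisioned-mode stream.
kinesis.update_shard_count(
    StreamName="device-telemetry",
    TargetShardCount=8,
    ScalingType="UNIFORM_SCALING",  # the only scaling type supported by this API
)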

What ingestion protocol is required?
The choice of communication protocol is influenced by the flow and nature of the data. For bidirectional data, especially when you work with intermittent connections or offline modes, AWS IoT Core provides support for MQTT, which reduces the protocol overhead compared to HTTPS. In data-intensive IoT applications, you can consider MQTT over WebSockets in AWS IoT Core, which further reduces the overhead by reusing a TCP session to share data. For unidirectional communication, both AWS IoT Core and Kinesis Data Streams support HTTPS, so the choice depends on the application goal.
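As a simple illustration of the unidirectional HTTPS path, a device or gateway with AWS credentials can publish to an AWS IoT Core topic through the data plane API; the topic name below is an assumption. For bidirectional flows, a persistent MQTT connection (shown later in this post) avoids the per-message request overhead.

import json

import boto3

# The iot-data client calls the AWS IoT Core data plane over HTTPS.
iot_data = boto3.client("iot-data")

iot_data.publish(
    topic="devices/sensor-001/telemetry",  # hypothetical topic
    qos=1,
    payload=json.dumps({"temperature_c": 21.7}),
)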

What is the main purpose of the ingested data?
Data generated by IoT devices serves two primary purposes: metrics and processing. Metrics refer to statistical data generated by the device or a related component for the purpose of analyzing its behavior. Processing refers to data generated by the device or a connected application that needs to be ingested, transformed, and loaded into the cloud. A device fleet might need to exchange metrics among devices to drive actions; in such cases, you can use MQTT support in AWS IoT Core to establish communication channels. Data that is meant to analyze device behavior and extract analytics can use AWS IoT Core and AWS IoT Analytics to transform, aggregate, and query time-based data. Data that needs to be processed and connected to other data solutions that are disconnected from the producing entity, such as a data warehouse or data lake, can use Kinesis Data Streams to persist and connect data for processing.

What is your situation?
Managing a fleet of devices requires you to define a security posture to control access to resources and data.
The degree of access and visibility can be enforced on the devices, but you should define how they will be deployed and operated.

What is the security posture required? How do devices need to communicate with AWS?
In hostile or uncontrolled environments where you cannot guarantee physical control of the device, you can define an authentication and authorization strategy based on unique device certificates and roles. AWS IoT Core supports X.509 certificates to authenticate and uniquely authorize each device. AWS IoT Core has a managed certificate authority (CA) and also provides the option to import your own CA.
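A minimal connection sketch with the AWS IoT Device SDK v2 for Python, assuming the device certificate, private key, and Amazon root CA files have already been provisioned on the device; the endpoint, file paths, client ID, and topic are placeholders.

from awscrt import mqtt
from awsiot import mqtt_connection_builder

# Placeholder endpoint and credential paths; each device gets its own certificate.
connection = mqtt_connection_builder.mtls_from_path(
    endpoint="example-ats.iot.eu-west-1.amazonaws.com",
    cert_filepath="/secure/device.pem.crt",
    pri_key_filepath="/secure/private.pem.key",
    ca_filepath="/secure/AmazonRootCA1.pem",
    client_id="sensor-001",
    clean_session=False,
    keep_alive_secs=30,
)

connection.connect().result()
connection.publish(
    topic="devices/sensor-001/telemetry",
    payload='{"temperature_c": 21.7}',
    qos=mqtt.QoS.AT_LEAST_ONCE,
)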
In controlled environments where all devices perform the same activity and you have direct access to the underlying platform, you can implement an authentication and authorization strategy based on AWS credentials. Kinesis Data Streams works with AWS credentials, and you can increase the security control by using temporary access credentials rather than exposing long-term credentials.
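One way to avoid long-term credentials on a device or gateway is to exchange them for temporary ones, for example by assuming a narrowly scoped IAM role through AWS STS. The role ARN, session name, and stream name below are placeholders for this sketch.

import boto3

sts = boto3.client("sts")

# Hypothetical role limited to kinesis:PutRecord on a single stream.
credentials = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/DeviceIngestRole",
    RoleSessionName="sensor-001",
    DurationSeconds=3600,
)["Credentials"]

kinesis = boto3.client(
    "kinesis",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)

kinesis.put_record(
    StreamName="device-telemetry",
    Data=b'{"temperature_c": 21.7}',
    PartitionKey="sensor-001",
)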

What level of access do devices need?
Devices might need to interact with a subset of the data generated by the cloud or by other devices. Using AWS IoT Core brings fine-grained control to restrict access to specific MQTT topics and provides the identity of devices for decision-making processes. For one-way data flows, where the entity that generates the data is not relevant and only needs to send data at scale, Amazon Kinesis provides a single stream to which multiple producers can write data.
In such a situation, any producer can write to the same stream, and the data can be read by any consumer.
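To illustrate the fine-grained control on the AWS IoT Core side, an AWS IoT policy can restrict a device to connecting as itself and publishing only on its own topic. This is only a sketch: the account ID, Region, and topic structure are assumptions, and the policy variable iot:Connection.Thing.ThingName resolves to the connected device's thing name.

import json

import boto3

iot = boto3.client("iot")

# Hypothetical account ID, Region, and topic layout.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iot:Connect",
            "Resource": "arn:aws:iot:eu-west-1:123456789012:client/${iot:Connection.Thing.ThingName}",
        },
        {
            "Effect": "Allow",
            "Action": "iot:Publish",
            "Resource": "arn:aws:iot:eu-west-1:123456789012:topic/devices/${iot:Connection.Thing.ThingName}/telemetry",
        },
    ],
}

iot.create_policy(
    policyName="PerDeviceTelemetryPolicy",
    policyDocument=json.dumps(policy_document),
)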

Working together
Some use cases require both approaches: ingesting high-frequency data while keeping fine-grained visibility and control of the devices.

Use case 1: Processing and visualizing aggregated data from multiple devices

Imagine that you have thousands of devices spread across a field. Every device reports its operational metrics and generates a small amount of data. To gain an overall view of operational status, drive anomaly detection, perform predictive maintenance, or analyze historical data, you need to control all devices and aggregate all data to get real-time or batch insights. AWS IoT Core provides the communication, management, authorization, and authentication of the devices and Kinesis Data Streams provides ingestion of high-frequency data.
You start by publishing data to AWS IoT Core, which integrates with Amazon Kinesis, allowing you to collect, process, and analyze large volumes of data in real time.
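The integration is typically an AWS IoT rule that forwards matching MQTT messages to a Kinesis data stream. The sketch below assumes hypothetical topic filter, stream, and role names; the role must allow kinesis:PutRecord on the target stream.

import boto3

iot = boto3.client("iot")

# Hypothetical names; ${topic(2)} uses the second topic segment (the device ID) as partition key.
iot.create_topic_rule(
    ruleName="ForwardTelemetryToKinesis",
    topicRulePayload={
        "sql": "SELECT * FROM 'devices/+/telemetry'",
        "awsIotSqlVersion": "2016-03-23",
        "actions": [
            {
                "kinesis": {
                    "streamName": "device-telemetry",
                    "partitionKey": "${topic(2)}",
                    "roleArn": "arn:aws:iam::123456789012:role/IoTKinesisAccessRole",
                }
            }
        ],
    },
)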
With Amazon Kinesis Data Analytics for Apache Flink, you can use Java, Scala, or SQL to process and analyze streaming data. The service enables you to author and run code against your IoT data to perform time-series analytics, feed real-time dashboards, and create real-time metrics.
For reporting, you can use Amazon QuickSight for batch and scheduled dashboards. If the use case demands more real-time dashboard capability, you can use Amazon OpenSearch Service with OpenSearch Dashboards.

Use case 2: Controlling and streaming high-throughput data from IoT devices

Another use case for combining both AWS IoT and Amazon Kinesis services is for high-throughput requirements with fine-grained control of devices.
To control devices generating large amounts of data that need to be processed in the cloud, such as turbines or LIDAR sensors, you can use AWS IoT Core to provide the communication, management, authorization, and authentication of the devices, and Amazon Kinesis Video Streams to ingest that high-throughput data.
In this architecture, AWS IoT Core is used to securely provision devices using X.509 certificates instead of hard-coded AWS access key pairs, and Amazon Kinesis Video Streams is used to send video data to the cloud.

Conclusion
To ingest data from IoT devices at scale, you must decide which technologies to use based on your use case, payload size, end goal, and device constraints. The following decision matrix offers guidance for choosing the right AWS service to ingest data at scale. Depending on your specific use case, you may opt for a combination of services.

                                     AWS IoT          Amazon Kinesis
Command & control of the device      Most relevant
Constrained device                   Most relevant
High-throughput data                                  Most relevant
Bi-directional communication         Most relevant
Fine-grained access                  Most relevant


We reviewed the common aspects of an IoT deployment and proposed qualifying questions and best practices to apply in each case. To learn more, visit the Amazon Kinesis Data Streams and AWS IoT Core documentation.

Andreas Calvo Gómez

Andreas is a Senior Solutions Architect at AWS. He works with digital native businesses to help them build their solutions in AWS. He is passionate about cloud technology and IoT.