How to implement a disaster recovery solution for IoT platforms on AWS

This blog post introduces a real-world use case from Internet of Things (IoT) service providers that use Disaster Recovery for AWS IoT to improve the reliability of their IoT platforms.

IoT service providers, especially those running high-reliability businesses, require consistent device connectivity and the seamless transfer of connectivity configurations and workloads to other regions when regional IoT services become unavailable. This blog post describes a customizable solution that enables cross-region transfer for AWS IoT Core and application services that rely on it.

Introduction

Integrating a disaster recovery (DR) solution within an IoT platform has emerged as a critical imperative for companies operating in the IoT domain. The inherent complexity of IoT systems, characterized by numerous interconnected devices and vast data streams, amplifies the risks posed by potential disruptions. Given that IoT platforms often carry critical applications across industries such as healthcare, manufacturing, and autonomous vehicles, even a brief downtime or data loss could lead to severe financial losses, compromised customer trust, and regulatory non-compliance. By incorporating disaster recovery capability into your IoT architecture, you can proactively mitigate these risks, deliver business continuity, and reinforce your IoT platform’s reliability against network outages, application unavailability, and unforeseen events.

Solution overview

Figure 1 shows how the DR solution is adopted and extended into a comprehensive DR implementation for the providers’ IoT platform. The architecture spans multiple AWS accounts because many IoT service providers prefer a multi-account strategy on AWS.

Amazon Route 53, in the shared services account, controls the fail-over according to the results of its health checks. The health checks call the APIs deployed in multiple AWS accounts, and the responses from those API calls determine whether to fail over.
The IoT service providers’ applications built on AWS IoT Core are deployed in the IoT services account, along with the DR solution composed of AWS IoT Core rules engine, Amazon DynamoDB, AWS Lambda, and AWS Step Functions.
The command & control account exposes the APIs to integrate with external administration consoles which are used to issue device management commands, such as for the onboarding or suspension of devices. The AWS Lambda functions behind the APIs assume AWS Identity and Access Management (AWS IAM) roles provided by the IoT services account to run the commands.
The data analytics account uses the event buses provided by Amazon EventBridge to ingest the data from the IoT services account. The data can be consumed by multiple Amazon EventBridge targets, such as Amazon Kinesis Data Streams and AWS Step Functions. Those targets can further process the data on demand and publish data insights to external data visualization dashboards.

Figure 1: The architecture of the reliable IoT solution with DR

Disaster recovery

The solution uses Amazon DynamoDB global tables to synchronize all the operations against AWS IoT Core in the primary region to the secondary region. AWS Step Functions and the AWS Lambda function in the secondary region replicate all those operations into AWS IoT Core in the secondary region. All the data synchronized across the regions for DR is independent of the application and does not need to be maintained by users.
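To make the replication step more concrete, here is a minimal sketch of how a Lambda function in the secondary region might replay operations that arrive through the stream of the replicated table. It is an illustration under stated assumptions, not the code of the DR solution: the key attribute name thingName, the mapping of INSERT and REMOVE events to thing creation and deletion, and the function name replay_stream_records are all hypothetical. The iot_client argument is assumed to be an object exposing create_thing and delete_thing, as the boto3 IoT client does.

```python
from typing import Any, Dict, List


def replay_stream_records(event: Dict[str, Any], iot_client: Any) -> List[str]:
    """Replay DynamoDB stream records against AWS IoT Core in this region.

    Assumption: each item is keyed by a string attribute named "thingName".
    Returns a list of the operations that were applied, for logging.
    """
    applied: List[str] = []
    for record in event.get("Records", []):
        action = record.get("eventName")
        keys = record.get("dynamodb", {}).get("Keys", {})
        thing_name = keys.get("thingName", {}).get("S")
        if not thing_name:
            continue  # not an item this sketch knows how to replay
        if action == "INSERT":
            iot_client.create_thing(thingName=thing_name)
            applied.append(f"create:{thing_name}")
        elif action == "REMOVE":
            iot_client.delete_thing(thingName=thing_name)
            applied.append(f"delete:{thing_name}")
        # MODIFY events are skipped here; a real implementation would need
        # to decide how to reconcile updates.
    return applied
```

Passing the client in as an argument keeps the replay logic easy to test without touching a real AWS account.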

Health checks

The solution uses Amazon Route 53 health checks to decide when to start the fail-over. All the factors below are monitored, and a failure in any one of them can trigger the fail-over process. The factors reflect the health status of:

AWS IoT Core message broker
Application services
Command & control services
Data analytics services

An unhealthy status for any factor in either region is detected by APIs, powered by Amazon API Gateway, that are deployed in both the primary region and the secondary region of the IoT services account, the command & control account, and the data analytics account. Those APIs and the Lambda functions behind them use predefined checkpoints in the code logic to decide whether to return failure or success in the responses. The API in the IoT services account uses the same logic provided by the DR solution to check the health of AWS IoT Core, and it also checks the health of the application services. The APIs in the command & control account and the data analytics account check the health of those services and return failure once an error is detected.
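The checkpoint idea can be sketched as a small Lambda handler behind an Amazon API Gateway proxy integration. This is only an illustration: the checkpoint functions, their names, and the choice of HTTP 503 for failure are assumptions, and the placeholder checks would need to be replaced with real probes.

```python
import json
from typing import Any, Callable, Dict, Optional


def check_message_broker() -> bool:
    """Hypothetical checkpoint; a real one might publish and read back a test message."""
    return True


def check_application_services() -> bool:
    """Hypothetical checkpoint for the application services."""
    return True


CHECKPOINTS: Dict[str, Callable[[], bool]] = {
    "message_broker": check_message_broker,
    "application_services": check_application_services,
}


def run_checkpoints(checkpoints: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every checkpoint; an exception counts as a failed checkpoint."""
    results: Dict[str, bool] = {}
    for name, check in checkpoints.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def lambda_handler(event: Any, context: Any,
                   checkpoints: Optional[Dict[str, Callable[[], bool]]] = None) -> Dict[str, Any]:
    """Return 200 when all checkpoints pass, otherwise 503."""
    results = run_checkpoints(CHECKPOINTS if checkpoints is None else checkpoints)
    healthy = all(results.values())
    return {
        "statusCode": 200 if healthy else 503,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"healthy": healthy, "checks": results}),
    }
```

Returning the per-checkpoint results in the body makes it easier to see which factor caused a failure when reviewing the health check history.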

As shown by the dotted red lines in Figure 1, the AWS Lambda function used in Amazon Route 53 health checks calls the APIs and receives the responses from all the AWS accounts included in the architecture. The VPC endpoint for Amazon API Gateway can help the Lambda function invoke the APIs across accounts. For details, refer to the AWS documentation on using an interface VPC endpoint to access a private API in another AWS account. The Lambda function aggregates the API responses and decides whether to trigger the fail-over process. The decision is passed to Amazon Route 53 via the health check APIs, and Amazon Route 53 performs the fail-over according to the decision.
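The aggregation step might look like the following sketch, which treats any response other than HTTP 200 as a failure and signals fail-over when at least one API is unhealthy, matching the rule that a failure in any one factor can trigger the process. The endpoint URLs, the timeout, and the function names are hypothetical, and how the decision is handed to Amazon Route 53 is left out.

```python
import urllib.error
import urllib.request
from typing import Any, Callable, Dict

# Hypothetical health API URLs, one per account; replace with real endpoints.
HEALTH_ENDPOINTS: Dict[str, str] = {
    "iot-services": "https://iot-services.example.com/health",
    "command-control": "https://command-control.example.com/health",
    "data-analytics": "https://data-analytics.example.com/health",
}


def fetch_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code of a health API, or 0 if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the API answered, but with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, and so on


def should_fail_over(endpoints: Dict[str, str],
                     fetch: Callable[[str], int] = fetch_status) -> Dict[str, Any]:
    """Aggregate the responses; any non-200 response marks that service unhealthy."""
    statuses = {name: fetch(url) for name, url in endpoints.items()}
    unhealthy = sorted(name for name, code in statuses.items() if code != 200)
    return {"fail_over": bool(unhealthy), "unhealthy": unhealthy}
```

Calling should_fail_over(HEALTH_ENDPOINTS) returns both the decision and the list of unhealthy services, which is useful for notifications and troubleshooting.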

Fail-over process

Amazon Route 53 follows the policies defined in the records to enforce the fail-over. As shown in Figure 2, iot.shiyin.people.aws.dev is the IoT data endpoint used on the devices. After a DNS lookup, the devices get the DNS destination, either primaryiot.shiyin.people.aws.dev or failoveriot.shiyin.people.aws.dev, and connect to it. The records can route traffic to the AWS IoT endpoint or to AWS IoT Core configurable endpoints.

Figure 2: The records for fail-over in Amazon Route 53
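As a rough illustration of the fail-over records, the sketch below builds a change batch with a primary and a secondary record for the device endpoint. The host names come from this post; the CNAME record type, the 60-second TTL, the set identifiers, and the health check ID are assumptions and may differ from the configuration in Figure 2.

```python
from typing import Any, Dict, Optional

DEVICE_ENDPOINT = "iot.shiyin.people.aws.dev"
PRIMARY_TARGET = "primaryiot.shiyin.people.aws.dev"
SECONDARY_TARGET = "failoveriot.shiyin.people.aws.dev"


def failover_record(target: str, role: str, ttl: int = 60,
                    health_check_id: Optional[str] = None) -> Dict[str, Any]:
    """Build one UPSERT change for a fail-over record of the device endpoint."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record_set: Dict[str, Any] = {
        "Name": DEVICE_ENDPOINT,
        "Type": "CNAME",  # assumption: the figure may use a different record type
        "SetIdentifier": f"iot-{role.lower()}",
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id is not None:
        record_set["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record_set}


def build_change_batch(health_check_id: str, ttl: int = 60) -> Dict[str, Any]:
    """Primary record guarded by the health check, secondary record as fallback."""
    return {
        "Comment": "Fail-over records for the IoT data endpoint",
        "Changes": [
            failover_record(PRIMARY_TARGET, "PRIMARY", ttl, health_check_id),
            failover_record(SECONDARY_TARGET, "SECONDARY", ttl),
        ],
    }
```

The resulting dictionary has the shape of the ChangeBatch parameter of the Amazon Route 53 ChangeResourceRecordSets API, so it could be submitted together with a hosted zone ID, for example through the boto3 change_resource_record_sets call.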

Once the fail-over starts, the AWS IoT Device SDK running on the devices needs to terminate the connection to AWS IoT Core in the primary region and connect to AWS IoT Core in the secondary region, because the SDK looks up the DNS destination from Amazon Route 53 only when it reconnects. If the fail-over is triggered by AWS IoT Core unavailability, the SDK reconnects automatically, since the unavailability has already cut off the connection between the device and AWS IoT Core. If the fail-over is not triggered by AWS IoT Core unavailability, the connection between the device and AWS IoT Core in the primary region is still active, so it must be terminated to force the SDK to cut over to the secondary region. There are several options to trigger the reconnection.

Send Amazon Simple Notification Service (SNS) notifications from the Amazon Route 53 health checks, as shown in Figure 3. The notifications can be processed and delivered to the devices.
Figure 3: Notification configuration in Amazon Route 53 health check
Terminate the current connections from the IoT services. The IoT services can get notifications from the health check and initiate new connections that interrupt the current connections between the devices and AWS IoT Core, which causes the devices to reconnect.
Look up the DNS destination frequently. The devices compare the destination returned from DNS lookup to the destination currently in use, and actively reconnect to the new destination if they are different.
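The last option, frequent DNS lookups, can be sketched on the device side as follows. This is a minimal illustration, not part of the AWS IoT Device SDK: the class and function names are hypothetical, and it assumes that the canonical name returned by the system resolver reflects the current DNS destination, which depends on the platform and on how the records are configured.

```python
import socket
from typing import Callable, Optional


def resolve_destination(hostname: str) -> str:
    """Return the primary host name that `hostname` currently resolves to."""
    canonical, _aliases, _addresses = socket.gethostbyname_ex(hostname)
    return canonical.rstrip(".").lower()


class DestinationWatcher:
    """Remember the last DNS destination and report when it changes."""

    def __init__(self, hostname: str,
                 resolver: Callable[[str], str] = resolve_destination) -> None:
        self.hostname = hostname
        self.resolver = resolver
        self.current: Optional[str] = None

    def check(self) -> bool:
        """Return True when the destination changed and a reconnect is needed."""
        try:
            latest = self.resolver(self.hostname)
        except OSError:
            return False  # lookup failed: keep the existing connection
        changed = self.current is not None and latest != self.current
        self.current = latest
        return changed
```

A device could call check() on a timer and, whenever it returns True, close its MQTT connection and connect again so that the new destination is used. A shorter polling period reduces the cut-over delay but increases DNS traffic.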

As shown in Figure 1, the application services implement high availability for the fail-over by relying on Lambda functions deployed in both regions, multi-region access points of Amazon Simple Storage Service (Amazon S3), and global table replication of Amazon DynamoDB. As shown by the orange lines in Figure 1, the administration consoles publish messages to the command & control services through Amazon Route 53. Once the health check returns failure, Amazon Route 53 points the API endpoint to the services in the secondary region. As shown by the purple lines in Figure 1, to minimize data loss, the data from the Amazon EventBridge event buses in both regions is ingested into the data visualization dashboards. During the fail-over, the data remaining in the primary region can continue to be processed.

Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

The RTO of the architecture mainly depends on the duration of the fail-over. The duration is determined by four factors:

TTL (time to live) is the period for which DNS resolvers use the Amazon Route 53 records in their cache before they ask Amazon Route 53 for the latest records.
Request interval is the time between each health check receiving a response and sending the next health check request.
Failure threshold is the number of consecutive health checks that must pass or fail to change the current status of the destination from unhealthy to healthy or vice versa.
Processing time of the health checks depends on the performance of the APIs used in the health checks.

The fail-over duration can be shortened by reducing the values of those factors; as a result, Amazon Route 53 sends requests to the health checks more frequently.
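To see how the four factors combine, the following sketch estimates a rough upper bound on the fail-over duration. It is a simplified model with assumed numbers: it treats each health check cycle as the interval between requests plus the processing time, multiplies that by the failure threshold, and adds the TTL for resolvers that still hold the old record. It ignores the time the devices need to reconnect.

```python
def estimate_failover_seconds(ttl: int, request_interval: int,
                              failure_threshold: int, processing_time: int) -> int:
    """Rough upper bound, in seconds, on the time until clients see the new destination."""
    # Time for the health check to be marked unhealthy.
    detection = failure_threshold * (request_interval + processing_time)
    # Resolvers may keep serving the cached record for up to one TTL after that.
    return detection + ttl


# Example values (assumptions): 60 s TTL, threshold of 3, 2 s processing time.
slow = estimate_failover_seconds(ttl=60, request_interval=30,
                                 failure_threshold=3, processing_time=2)
fast = estimate_failover_seconds(ttl=60, request_interval=10,
                                 failure_threshold=3, processing_time=2)
print(slow, fast)  # prints "156 96"
```

In this model, shortening the interval between requests from 30 seconds to 10 seconds reduces the estimate from 156 seconds to 96 seconds, and the TTL becomes the largest remaining contribution.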

The RPO of the architecture can be impacted by the following factors:

When the primary AWS IoT Core runs into an outage, the MQTT messages might not be processed by the rules engine even though they are received by AWS IoT Core.
When the command & control services in the primary region become unavailable, all the API calls from the administration consoles will be forwarded automatically by Amazon Route 53 to the secondary region.
The AWS Lambda function targeted by AWS IoT Core rules engine accesses the Amazon EventBridge event bus via Amazon EventBridge Global Endpoint powered by Amazon Route 53. The global endpoint will transmit the data ingested to the event bus in the secondary region, once the primary event bus becomes unavailable.
When AWS IoT Core remains working but the application services fail in the primary region, the devices stay connected and keep publishing data to the primary AWS IoT Core until Amazon Route 53 completes changing the DNS destination. During the destination change, that data is still processed if the command & control services triggered the fail-over, but it cannot be processed if the data analytics services triggered the fail-over.

Summary

By leveraging the DR architecture introduced in this blog post, IoT service providers can readily implement disaster recovery within their IoT platforms and reap a multitude of benefits. You can help safeguard against potential revenue loss resulting from IoT service interruptions, cultivate customer trust and loyalty, and enhance your IoT platform’s security posture.

Beyond risk mitigation, the adoption of DR bolsters the operational efficiency of IoT businesses by reducing downtime-related costs and minimizing the need for manual interventions during disruptions.

We look forward to seeing how you enable disaster recovery to reinforce the reliability of your IoT platforms built on AWS. Get started with AWS IoT by going to the AWS Management Console.

About the author

Shi Yin is a senior IoT consultant from AWS Professional Services, based in California. Shi has worked with many enterprise customers to leverage AWS IoT services to build IoT solutions and platforms, e.g., Smart Homes, Smart Warehouses, Connected Vehicles, Commercial IoT, Industrial IoT, etc.

The Internet of Things on AWS – Official Blog