Maximizing the value of OT data with DeepIQ DataStudio (Data + AI) on AWS

Operational technology (OT) data refers to the data generated by sensors and other devices used in industrial settings to monitor and control physical processes. Supervisory control and data acquisition (SCADA) systems, data historians, and programmable logic controllers (PLCs) are examples of technologies that generate OT data. SCADA systems are used to monitor and control industrial processes, while data historians are used to collect and store historical data from industrial systems. PLCs are specialized computers that control industrial equipment and generate data on the equipment’s status and performance. Moving this OT data to Amazon Web Services (AWS) facilitates several transformative capabilities, such as cross-functional analytics with IT-OT convergence, near-real-time insights, scalability, and predictive solutions that use state-of-the-art artificial intelligence (AI) models.

With DataStudio, DeepIQ’s no-code, simple-to-use solution, you can achieve these transformative capabilities on AWS from your diverse OT landscape.

Customer challenges

Developing OT data analytics workflows for production-grade deployment presents several challenges. This section examines some of the most significant ones.

Challenge 1: Siloed OT data
Due to the strict network security restrictions on control networks, OT data sources such as SCADA systems, data historians, and PLCs may not be directly accessible from your cloud tenant. The Purdue security model is a widely accepted industrial control system security framework for defining a data transfer strategy. One of its fundamental principles is the strict separation of control system networks from other networks, such as corporate or external networks. This separation is achieved using different security zones, and it prohibits requests that originate outside the control network from pulling data out of it.

Moreover, OT data sources often communicate through proprietary protocols or frameworks that are incompatible with conventional data integration software. As a result, OT data sources are often isolated within the OT network, which restricts data mobility and impedes cross-functional analytical support.

Challenge 2: Time series data transformations
After OT data is moved to the cloud, transformation workflows might be required before it is ready for consumption. The following are examples of possible necessary transformations:

Data may have to be restructured and moved to other databases or storage layers to support integration with other systems or applications.
Data originating from OT systems tends to adopt localized hierarchies and naming conventions that reflect the OT environment. In contrast, many other enterprise data systems are designed to align with an IT-driven data model. Without proper organization and contextualization, it can be challenging to understand the relationships between the time-series data and the physical assets and to integrate multiple data sources into an enterprise-wide asset hierarchy.
Raw time-series data may require significant preprocessing before it is ready for analytics. Raw measurements may not occur at regular intervals. Standard time-series issues include high-frequency noise and outliers because of faulty measurements. Part of the signals might be missing because of connectivity or source issues. Conventional data integration or analytic software has limited or no support for these scenarios, especially at the scale necessary for Internet of Things (IoT) use cases.

Challenge 3: Scalability and low-latency support
The volume and velocity of time-series data are typically greater than those of traditional IT sources. Certain use cases related to safety, quality, or predictive health may require low-latency responses or the application of transformations, machine learning, or statistical models to streaming data. Data ingestion software that can handle high data volumes and velocities and can enrich data in motion is necessary to facilitate these use cases.

DeepIQ’s solution

DeepIQ and the AWS technology stack provide a powerful solution to these challenges.

DataStudio is self-service {Data + AI} software that focuses on structured, time-series, and geospatial data. With DataStudio’s browser-based graphical interface, users can “drag and drop” more than 250 prebuilt data ingestion, transformation, and analytics components to build and deploy compelling data and analytics workflows.

Addressing challenge 1: OT data liberation
DataStudio provides a wide range of native connectors to industry-standard SCADA systems and data historians. Additionally, DeepIQ software is designed to work with any network topology. To meet the stringent security requirements of segmented control networks, such as those described by the Purdue model, DeepIQ has implemented special-purpose edge software that employs a push protocol. The edge software monitors the communication channel for data requests, collects them, and sends them to the local data source (for example, the SCADA system, data historian, or PLC) using the native protocol or framework. Finally, it collects the response generated by the request and forwards it to the configured AWS sink, such as a bucket in Amazon Simple Storage Service (Amazon S3), which offers object storage built to retrieve any amount of data from anywhere, or a stream in Amazon Kinesis Data Streams, a serverless streaming data service that makes it easy to capture, process, and store data streams at any scale.

The edge software is designed to be robust and resilient against connection failures between the edge and the cloud or between the edge and the source, and against intermittent data source failures. In the event of a connection failure between the edge and the cloud, DeepIQ Edge can locally buffer data for several days (or for as long as there is enough disk and memory). When the connection is restored, the query resumes from the point of failure. Figure 1 shows the standard architecture for connecting your edge to the cloud using DeepIQ Edge.

Figure 1. Edge-to-cloud architecture overview
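
The buffering behavior described above can be illustrated with a small store-and-forward sketch. This is not DeepIQ Edge's implementation; the class name, the record fields, the in-memory queue, and the `FlakyLink` stand-in for the cloud connection are all assumptions made for illustration. The sketch shows the general idea: records are pushed outward in arrival order, and a record leaves the local buffer only after it has been delivered.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class StoreAndForwardBuffer:
    """Buffers records locally and forwards them in order when the link is up.

    Hypothetical illustration only; not DeepIQ Edge's actual implementation.
    """

    def __init__(self, send: Callable[[Dict], None], max_records: int = 10_000):
        self._send = send                      # pushes one record to the cloud sink
        self._pending: Deque[Dict] = deque()   # oldest record sits at the left
        self._max_records = max_records        # stand-in for a disk/memory limit

    def submit(self, record: Dict) -> None:
        """Queue a new record, then try to drain the queue."""
        if len(self._pending) >= self._max_records:
            raise OverflowError("local buffer is full")
        self._pending.append(record)
        self.flush()

    def flush(self) -> int:
        """Send pending records oldest-first and stop at the first failure.

        Returns the number of records delivered during this call.
        """
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break                          # keep the record and retry later
            self._pending.popleft()            # drop only after a successful send
            delivered += 1
        return delivered

    @property
    def backlog(self) -> int:
        return len(self._pending)


class FlakyLink:
    """Hypothetical cloud connection that can be switched off and on."""

    def __init__(self) -> None:
        self.up = True
        self.received: List[Dict] = []

    def send(self, record: Dict) -> None:
        if not self.up:
            raise ConnectionError("link down")
        self.received.append(record)


link = FlakyLink()
edge_buffer = StoreAndForwardBuffer(link.send)

edge_buffer.submit({"tag": "P-101.flow", "value": 1.0})  # delivered immediately
link.up = False                                          # simulate an outage
edge_buffer.submit({"tag": "P-101.flow", "value": 2.0})  # buffered locally
edge_buffer.submit({"tag": "P-101.flow", "value": 3.0})  # buffered locally
backlog_during_outage = edge_buffer.backlog
link.up = True                                           # connection restored
delivered_after_outage = edge_buffer.flush()             # resumes at the oldest record
```

A production edge agent would persist the queue to disk so that it survives restarts; the in-memory deque here only keeps the example short.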

Addressing challenge 2: Time series data transformations

With DeepIQ’s prebuilt components, you can easily convert data from various sources to desired sink schemas in both batch and streaming modes. An example workflow is shown in figure 2. The workflow reads data from Amazon Kinesis Data Streams, converts it to the format required by Amazon Timestream—a fast, scalable, and serverless time-series database service—and persists the data in the database.

Figure 2. Streaming data ingestion: Amazon Kinesis Data Streams to Amazon Timestream
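
In DataStudio this conversion is configured with prebuilt components rather than written by hand. For readers who want a feel for the kind of mapping involved, the following is a minimal sketch of turning one JSON-encoded reading into a record shaped like the entries accepted by the Amazon Timestream WriteRecords API. The input field names (`asset`, `tag`, `value`, `ts_ms`) are hypothetical, and the output keys reflect our understanding of the Timestream record format, so verify them against the current AWS documentation before relying on them.

```python
import json
from typing import Any, Dict


def to_timestream_record(payload: bytes) -> Dict[str, Any]:
    """Convert one JSON-encoded sensor reading into a Timestream-style record.

    The input schema ("asset", "tag", "value", "ts_ms") is an assumption
    made for this example, not a DeepIQ or AWS format.
    """
    reading = json.loads(payload)
    return {
        # Dimensions carry the metadata that identifies the series.
        "Dimensions": [{"Name": "asset", "Value": str(reading["asset"])}],
        "MeasureName": str(reading["tag"]),
        # Measure values and timestamps are passed as strings.
        "MeasureValue": str(float(reading["value"])),
        "MeasureValueType": "DOUBLE",
        "Time": str(int(reading["ts_ms"])),
        "TimeUnit": "MILLISECONDS",
    }


# A stream record's data payload arrives as bytes, so the sample is encoded.
sample = json.dumps(
    {
        "asset": "pump-101",
        "tag": "discharge_pressure",
        "value": 42.5,
        "ts_ms": 1700000000000,
    }
).encode("utf-8")

record = to_timestream_record(sample)
```

Keeping the conversion a pure function, with no AWS client calls inside it, makes it straightforward to unit test and to reuse in both batch and streaming paths.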

DeepIQ also offers a comprehensive set of highly scalable, prebuilt time-series components for handling all standard time-series preprocessing steps, such as smoothing, imputation, interpolation, and change point detection. Figure 3 shows a workflow that reads data from the raw storage table and creates a regularly spaced data table after removing outliers. In this workflow, we use a cubic interpolation algorithm and a standard outlier-removal algorithm.

Figure 3. Time-series data cleaning and processing
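
The two steps in this workflow, outlier removal followed by resampling onto a regular grid, can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the algorithm DataStudio uses: outliers are detected with a median-based modified z-score (one common choice), and linear interpolation stands in for the cubic interpolation mentioned above to keep the example short. The sample data is invented, and the input is assumed to be sorted by time with unique timestamps.

```python
from bisect import bisect_left
from statistics import median
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, value)


def drop_outliers(samples: List[Sample], threshold: float = 3.5) -> List[Sample]:
    """Drop points whose modified z-score (median/MAD based) exceeds threshold."""
    values = [v for _, v in samples]
    center = median(values)
    mad = median(abs(v - center) for v in values)  # median absolute deviation
    if mad == 0:
        return list(samples)  # no spread to judge against, so keep everything
    return [
        (t, v)
        for t, v in samples
        if 0.6745 * abs(v - center) / mad <= threshold
    ]


def resample_linear(samples: List[Sample], step: float) -> List[Sample]:
    """Interpolate irregular samples onto a regular grid with spacing `step`.

    Assumes `samples` is sorted by time and has no duplicate timestamps.
    """
    times = [t for t, _ in samples]
    grid: List[Sample] = []
    n = 0
    while times[0] + n * step <= times[-1]:
        t = times[0] + n * step
        i = bisect_left(times, t)          # first sample at or after t
        if times[i] == t:
            grid.append((t, samples[i][1]))
        else:
            (t0, v0), (t1, v1) = samples[i - 1], samples[i]
            grid.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        n += 1
    return grid


# Irregularly spaced readings with one faulty spike at t = 3.0 (invented data).
raw = [(0.0, 10.0), (1.0, 10.2), (2.5, 10.5), (3.0, 95.0), (4.0, 10.8), (6.0, 11.2)]

clean = drop_outliers(raw)
regular = resample_linear(clean, step=1.0)
```

The order of the steps matters: removing the spike first prevents it from being smeared into neighboring grid points during interpolation.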

DeepIQ has extensive components that can be used to build asset hierarchies and map raw measurements. These components can be used to design workflows that create the initial mapping of assets across the different sources and maintain the integrity of this mapping when assets are added or modified. A detailed explanation of DeepIQ’s contextualization capabilities is beyond the scope of this article and can be found in our earlier whitepapers (such as this whitepaper).
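
As a rough illustration of what contextualization means in practice, the sketch below nests flat tag names into a tree of assets. It assumes a hypothetical dot-separated naming convention (site.unit.equipment.measurement) and a reserved `_measurements` key; real OT tag names are rarely this regular, which is why dedicated mapping components are needed.

```python
from typing import Any, Dict, List


def build_hierarchy(tags: List[str], sep: str = ".") -> Dict[str, Any]:
    """Nest tags of the form site.unit.equipment.measurement into a tree.

    The naming convention and the "_measurements" key are assumptions
    made for this example.
    """
    root: Dict[str, Any] = {}
    for tag in tags:
        *path, measurement = tag.split(sep)
        node = root
        for level in path:
            node = node.setdefault(level, {})  # create the asset level if missing
        node.setdefault("_measurements", []).append(measurement)
    return root


# Invented tag names following the assumed convention.
tags = [
    "plantA.unit1.pump101.flow",
    "plantA.unit1.pump101.pressure",
    "plantA.unit2.compressor7.vibration",
]

hierarchy = build_hierarchy(tags)
pump_measurements = hierarchy["plantA"]["unit1"]["pump101"]["_measurements"]
```

Because the function only adds nodes, calling it again with newly commissioned tags extends the tree without disturbing existing mappings.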

DeepIQ offers robust support for developing supervised, unsupervised, and deep learning models. Figures 4 and 5 show DeepIQ’s no-code interface (component) for building and configuring ML and deep learning models. Users can choose from several prebuilt deep learning architectures, select epoch and batch sizes, and perform hyperparameter tuning in this interface. Users can also add stages, such as data normalization and dimensionality reduction, before the ML steps.

Figure 4. DataStudio interface for deep learning models: Build/Train/Upload

Figure 5. DataStudio interface for ML models: Build/Train/Upload with normalization and dimensionality reduction

Addressing challenge 3: Scalability and near-real-time capabilities
DataStudio components are implemented using Apache Spark to run in a parallel and distributed manner. DataStudio components can run natively on AWS, using Databricks on AWS or Amazon EMR (formerly Amazon Elastic MapReduce), the industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics, and ML. As a result, these workflows can scale from kilobytes to petabytes of data, virtually eliminating complexity for the user.

Additionally, DeepIQ’s transformation workflows can operate in batch and streaming modes. Figure 6 shows a DeepIQ workflow that reads from a streaming source (Amazon Kinesis Data Streams), applies schema-defining transformations, and runs an ML model before it persists the results into Amazon Redshift, which uses SQL to analyze structured and semi-structured data.

Figure 6. Predict streaming data with pretrained ML models

DeepIQ on AWS architecture

The following figure shows the reference architecture of DeepIQ on AWS. The reference architecture showcases eight different data flows, various data sources, compute frameworks, and persistence layers.

Figure 7. DeepIQ DataStudio reference architecture on AWS

Architecture walkthrough

The pipelines are described as follows:

The DeepIQ Edge solution can integrate into your OT network as an intermediary between your OT data sources and your cloud storage. It is designed to support a range of industry-standard protocols and standards such as MQTT, OPC UA, OPC HDA, and WITSML as well as native software development kit (SDK)–based frameworks, including Aveva’s OSI PI, Aspentech IP.21, Weatherford’s Cygnet, and Inductive Automation’s Ignition SCADA. With support for OPC protocols, DeepIQ can facilitate connectivity to various data historians, like Honeywell PHD and GE Proficy. For protocols that are not currently supported, DeepIQ employs commercially available middleware to translate data from the source protocol into the supported standards.
DeepIQ Edge provides the ability to transmit data in bulk to DataStudio Connector in near real time, which in turn stores data in Amazon S3.
Once data lands in the streaming engines such as Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK)—a service that makes it easy to ingest and process streaming data in near real time with fully managed Apache Kafka—DataStudio orchestrates structured Apache Spark jobs that can transform the data streams and enrich them with ML or statistical algorithms.
DataStudio provides scalable data exploration, workflow editing, and data management capabilities on Amazon Elastic Kubernetes Service (Amazon EKS), a managed Kubernetes service to run Kubernetes in the AWS Cloud and on-premises data centers.
All workflows developed by DeepIQ use Amazon EMR or Databricks for runtime.
IT sources such as SAP and geospatial data can be integrated with the near-real-time and historical data over a secure VPN connection through DataStudio Connector.
DataStudio can save the results of your data workflows in multiple data solutions or data warehouses, including Snowflake, Amazon Redshift, or Delta Lake data stores.
Once data is persisted into the data solutions, users can visualize it on dashboards or access it through APIs or third-party applications. DeepIQ’s DataStudio also helps users explore data in motion through its web interface. All DeepIQ DataStudio services run on Amazon EKS to provide on-demand scalability and high availability.

Use cases

With DeepIQ Edge and DataStudio, you can build several high-impact solutions:

IT-OT convergence reports: Support descriptive, predictive, and prescriptive reports based on integrated IT-OT data sources.
- Sample solution delivery: An oil and gas company connected over 40 edge systems to a near-real-time data solution with more than 400 reports spanning IT and OT data sources, including near-real-time operations monitoring, crew efficiency, and financial and HR reports.
Predictive maintenance: Develop ML models to monitor industrial equipment and proactively schedule maintenance before catastrophic failures.
- Sample solution delivery: A manufacturing company used DeepIQ’s software to build a prognostic algorithm to detect catastrophic failures 24 hours ahead. Validation studies showed a recall of 92 percent at the required precision (80 percent).
Quality monitoring and control: Use video and sensor data to identify and virtually eliminate defects.
- Sample solution delivery: Due to faulty mechanical equipment, a downstream company experienced frequent regulatory compliance issues because their equipment would often leak during transportation. The company implemented a solution involving a camera sensor to gather near-real-time image data at the edge, which was then processed by DeepIQ’s deep learning workflow. This solution achieved a 97-percent accuracy rate in identifying leakage issues.
Process or product optimization: Use advanced analytics to determine optimal conditions for maximizing yield while adhering to safety and uptime requirements.
- Sample solution delivery: A downstream company uses DeepIQ’s ML to obtain the remaining useful life of a critical piece of equipment and AI-based optimization algorithms to fine-tune the operational profile of the equipment to improve yield without increasing the risk of catastrophic failures. Field studies recorded a 2-percent improvement in production without any new CAPEX or increased downtime.
Environmental, social, and governance (ESG): Measure environmental impact and optimize operational parameters to minimize ESG impact, such as carbon emissions.
- Sample solution delivery: A renewable fuel company uses DeepIQ software to collect emissions data from IoT sensors from its refinery and builds near-real-time dashboards to measure emissions.

Conclusion

DeepIQ, in combination with the AWS technology stack, accelerates your IT-OT convergence use cases. Together, they let customers combine the industry-leading cloud platform with advanced artificial intelligence.