Getting Started with the Industrial Data Platform on AWS

Data is the key enabler of digital transformation and Industry 4.0. Manufacturers can use data to realize a single view of operations and derive operational insights. These insights can be used to improve production quality, make real-time predictions, and generate cost savings. Big Data analytics techniques can provide new capabilities for measuring performance, maintenance, and process improvement.

Manufacturers collect data from many disparate sources, with each source stored in its own database (“silo”) and accessed using ad-hoc reporting or analytics systems. Over time, these silos and analytics systems become more isolated and difficult to support. Traditional data warehouses can quickly and efficiently query, transform, or analyze structured data. However, they are often not the best choice for unstructured or semi-structured data and may not efficiently scale to meet the demands of Industry 4.0.

To gain operational insight from manufacturing data, companies are implementing an Industrial Data Platform. This blog outlines the challenges around current solutions and describes a robust and resilient architecture on AWS.

Challenges surrounding the Industrial Data Platform

There are multiple challenges when collecting and analyzing industrial data:

Unable to link data together – Data comes from disparate sources, and often requires enrichment and cataloging to derive meaningful insights.
Data collected too infrequently – Sensors and systems-of-record may produce data frequently, but aggregation with ad-hoc tools or consolidation into a target datastore may occur only at sparse intervals. This makes it impossible to do real-time analysis or make predictions using Artificial Intelligence or Machine Learning (AI/ML).
Data too difficult to access – Applications can be on different physical networks, require different database engines, or have different data structures. Data from each application may require different transformations using a different tool set in order to make it accessible and ready for consumption.
Scaling and flexibility – Producing more data or adding data sources often increases operational overhead and expense.
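To make the "collected too infrequently" challenge concrete, here is a small illustrative sketch. It assumes hypothetical one-minute temperature readings and an invented alarm limit, and shows how a short excursion that is obvious in the raw data is not visible in a single hourly aggregate.

```python
from statistics import mean

# Hypothetical one-minute temperature readings for one hour, with a
# three-minute excursion starting at minute 30. All values are invented.
readings = [70.0] * 60
readings[30:33] = [95.0, 96.0, 95.0]

ALARM_LIMIT = 90.0  # assumed process limit, for illustration only

# What a sparse, hourly consolidation job would keep.
hourly_mean = mean(readings)

# What the raw, per-minute data still shows.
minutes_over_limit = [m for m, v in enumerate(readings) if v > ALARM_LIMIT]

print(f"hourly mean: {hourly_mean:.2f}")
print(f"minutes over limit: {minutes_over_limit}")
```

The hourly mean stays well under the limit even though three individual readings exceed it, which is why real-time analysis needs the data at the rate it is produced.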

Common solutions

Two common solutions deployed as an Industrial Data Platform are individual data silos and an Enterprise Data Warehouse.

Data silos

Data silos are databases that serve a single purpose – for example, to collect PLC data or to store information for an ERP system. Companies build data silos over time – as organizations add new processes or applications, new data silos are created. Each silo may require different management, security, and authorization approaches, all of which increase operational cost and risk. There is no unifying catalog that outlines the available data, where the data is stored, or how to access the data. Data needed for a given analytics workload may be split across multiple silos and be inaccessible. The platform scales as each individual silo expands or contracts to meet data demand. The individual silos might not meet price performance requirements as they scale.

Figure 1: Data Silos

Data warehouse

An Enterprise Data Warehouse (EDW) is a central data repository, and traditionally contains large amounts of structured data. Data flows into a data warehouse from transactional systems, relational databases, and other sources on a regular cadence. Business intelligence (BI) tools, SQL clients, and other analytics applications are used to query and analyze this data. When an Enterprise Data Warehouse is used as an Industrial Data Platform or manufacturing Data Lake, it attempts to overcome issues encountered with data silos by storing all data in one location. Data producers store the data as structured data with a fixed schema which must be carefully curated and maintained as data needs change. Forcing data into a specific schema may not work for all applications or all types of data, and the resulting relational data set may not be consumable by all analytics platforms. Creating a unifying schema for all industrial data can become technically challenging and administratively burdensome as more data sources are added or modified. To scale the platform, additional compute, storage, and even licensing may need to be added to the Enterprise Data Warehouse, potentially taking it offline while the maintenance occurs – leading to increased cost and downtime.

Figure 2: Enterprise Data Warehouse

Functional gaps

The following functional gaps exist with the solutions described above:

Lack of a centralized, globally accessible platform – Data silos decentralize manufacturing data, making it difficult to locate, access, and analyze. Enterprise Data Warehouses handle structured data well, but may not be the best choice to store and analyze unstructured or semi-structured data that is generated by various applications and sources. Analysts need a single place to access all available corporate and operational data to derive insights.
Lack of flexibility to ingest any type of data – Building a full data picture requires ingestion of all types of data: Historian data, energy consumption, ERP systems, streaming data from IoT sensors, and so on. Scaling for this type of flexibility is also critical – if no storage or compute is available, then no data can be stored.
Limited support for analytics – Data silos and data warehouses may only support SQL-based queries and analytics. To gain the most out of the data, the Industrial Data Platform must support advanced analytics and machine learning for predictive insights.
Analytic insights need to feed seamlessly into everything – Without a central platform to share analytics, data silos may not have access to analytics performed on other data sets. Analytics and insights cannot be ingested into the Enterprise Data Warehouse unless they are first reformatted to the appropriate schema.

Architectural approach

To help overcome the functional gaps of the current solutions, customers spend time, effort, and capital attempting to scale existing data silos or Enterprise Data Warehouses. These solutions are often difficult to support and do not provide for the future flexibility required for growing data sets. Organizations moving to Industry 4.0 require an Industrial Data Platform with data accessible from a single location. The data requires cataloging so that the consumers can easily identify what is available. In addition, the Industrial Data Platform needs to expand or contract over time, accommodating any current or future type of raw data format – structured, unstructured, or semi-structured.

Industrial Data Platform

The following diagram (Figure 3) is a conceptual representation of an Industrial Data Platform on AWS.

Figure 3: Industrial Data Platform

At a high level, data is ingested into the platform from various sources – shop floor applications and production processes, MES, enterprise applications, and so on – through different connectors, edge applications, or IT/OT connectivity blueprints (processes that have architectural patterns unique to each use case or customer implementation). Once the data is collected, it is stored and processed in the unified data backbone. This is where the data is transformed, modeled, and contextualized for consumption by downstream systems. This layer provides flexible data access to the business insights applications – operational dashboards, third-party applications, and advanced analytics services. Security and automation are built in at every layer, from where data is produced to the consumption of data through end user reporting or dashboards.
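As a rough illustration of what "contextualized" can mean in the unified data backbone, the following sketch joins raw telemetry readings with an asset catalog. The `Asset` fields, the asset identifiers, and the `contextualize` helper are all hypothetical names chosen for this example, not part of any AWS service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    """Hypothetical asset-catalog entry; field names are illustrative."""
    asset_id: str
    site: str
    line: str
    unit: str


def contextualize(reading: dict, catalog: dict) -> dict:
    """Attach site, line, and unit context to one raw telemetry reading.

    Readings from unknown assets are kept and flagged rather than dropped,
    so no data is lost while the catalog is still being filled in.
    """
    asset = catalog.get(reading["asset_id"])
    if asset is None:
        return {**reading, "contextualized": False}
    return {
        **reading,
        "site": asset.site,
        "line": asset.line,
        "unit": asset.unit,
        "contextualized": True,
    }


catalog = {"press-07": Asset("press-07", "plant-a", "line-3", "bar")}
raw = [
    {"asset_id": "press-07", "ts": "2024-01-01T00:00:00Z", "value": 181.5},
    {"asset_id": "oven-12", "ts": "2024-01-01T00:00:00Z", "value": 240.0},
]
enriched = [contextualize(r, catalog) for r in raw]
```

Downstream consumers can then filter or group by site and line without knowing how each source system identifies its equipment.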

Modern Data Architecture

At the center of the Industrial Data Platform is a Data Lake (Figure 4). A Data Lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. It can be used to store relational data from line of business applications, and non-relational data from mobile apps, IoT devices, social media, and so on. A Data Lake does not require structured data or a well-defined schema, which allows you to store data as the source generates it, without needing to know how it will be queried in the future. This enables you to easily run different types of analytics on your data – from dashboards and visualizations to big data processing, real-time analytics, and machine learning – to help uncover valuable insights.
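The idea of storing data without a predefined schema is often called schema-on-read. The sketch below is a minimal, self-contained illustration of that idea, assuming newline-delimited JSON as the storage format. The record shapes and field names are invented for the example.

```python
import json

# Raw records stored exactly as each source produced them (one JSON document
# per line). The shapes differ on purpose; all names here are hypothetical.
RAW_LINES = [
    '{"source": "plc", "asset_id": "press-07", "pressure_bar": 181.5}',
    '{"source": "erp", "order": "WO-1001", "qty": 250, "asset_id": "press-07"}',
    '{"source": "historian", "tag": "oven-12.temp", "v": 240.0, "q": "good"}',
]


def read_with_schema(lines, fields):
    """Project each stored record onto the fields a given query cares about.

    The schema is applied at read time: a missing field becomes None instead
    of causing the record to be rejected when it was written.
    """
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


# Two consumers read the same stored data through different schemas.
by_asset = list(read_with_schema(RAW_LINES, ["source", "asset_id"]))
orders = [r for r in read_with_schema(RAW_LINES, ["order", "qty"]) if r["order"]]
```

Neither consumer required the producers to agree on a shared structure up front, which is the property that distinguishes this approach from the fixed schema of an Enterprise Data Warehouse.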

Figure 4: Data Lake

The core component of a Data Lake built on AWS is the Amazon Simple Storage Service (Amazon S3) – the optimal choice for storage because of its unmatched durability, availability, performance, and scalability. Amazon S3 integrates with various types of data ingestion services and analytics tooling and services. To keep data secure, Amazon S3 allows for encryption at rest, comprehensive access control, and auditing. Data retrieved from the Data Lake can be restricted at both the column level and the row level, providing a granular approach to security when accessing the data.
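To show what column-level and row-level restriction means in practice, here is a conceptual sketch in plain Python. It is not how the restriction is implemented on AWS, where a managed service rather than application code would enforce such rules. The `restrict` helper, the sample rows, and the policy are assumptions made up for illustration.

```python
def restrict(rows, allowed_columns, row_predicate):
    """Return only permitted rows, each reduced to the permitted columns."""
    return [
        {col: row[col] for col in allowed_columns if col in row}
        for row in rows
        if row_predicate(row)
    ]


rows = [
    {"site": "plant-a", "line": "line-3", "scrap_rate": 0.021, "unit_cost": 4.10},
    {"site": "plant-b", "line": "line-1", "scrap_rate": 0.034, "unit_cost": 3.85},
]

# Hypothetical policy: a plant-a quality engineer sees plant-a rows only,
# and never the cost column.
visible = restrict(
    rows,
    allowed_columns=["site", "line", "scrap_rate"],
    row_predicate=lambda row: row["site"] == "plant-a",
)
```

The same stored data set can therefore serve different audiences, each seeing only the rows and columns their role permits.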

To gain maximum value from the data inside the Industrial Data Platform, deploy a Modern Data Architecture surrounding the Data Lake (Figure 5). This architecture connects your Data Lake, data warehouse, and all other purpose-built services – data silos – into a coherent whole. The Data Lake allows you to have a single source to run analytics across your data, while the purpose-built analytics services provide the speed required for specific use cases like real-time dashboards and log analytics. In a Modern Data Architecture, data and insights are interconnected to enable further analytics.

Figure 5: Modern Data Architecture surrounding the data lake

Reference Implementation

To help visualize a modern Industrial Data Platform built on AWS, consider the Manufacturing reference architecture (Figures 6 and 7). At the core is a secure Data Lake backed by Amazon S3. Data silos connect and data is ingested into the Data Lake. Analytics are produced by various applications and techniques, and these insights are stored back into the Data Lake for other connected services to consume – enabling the Modern Data Architecture.

Data Ingestion

Figure 6: Data Ingestion

Efficient ingestion into the Industrial Data Platform is essential to gaining valuable, timely insights from your data (Figure 6). Industrial devices are connected to the Cloud using AWS IoT Greengrass running on an edge gateway [1]. Industrial data is streamed into the Data Lake using Amazon Kinesis Data Firehose [2]. AWS IoT SiteWise can be used to model industrial assets, calculate metrics from telemetry data, and visualize data using AWS IoT SiteWise Monitor [3]. Unstructured data can be synchronized into the Data Lake using AWS Storage Gateway [4]. For manufacturing application interfaces, use AWS Transfer Family to transfer files into the Data Lake [5]. Use AWS Database Migration Service to synchronize data from manufacturing databases into Amazon Relational Database Service (Amazon RDS) [6]. Enterprise applications can use Amazon API Gateway and AWS Lambda functions to build interfaces that export data and import it into the Data Lake [7]. Large data sets can be migrated into the Data Lake using AWS Snowball Edge [8].
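For the streaming step [2], a producer might batch telemetry before handing it to a Firehose delivery stream. The sketch below is one possible shape for that, not a reference implementation. It relies on the `put_record_batch` operation of the boto3 Firehose client, which accepts at most 500 records per call. The stream name `idp-telemetry`, the helper names, and the `RecordingClient` stand-in (used so the example runs without AWS credentials) are assumptions for illustration.

```python
import json

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(readings):
    """Serialize readings as newline-delimited JSON, one Firehose record each."""
    return [{"Data": (json.dumps(r) + "\n").encode("utf-8")} for r in readings]


def send_in_batches(client, stream_name, readings):
    """Send readings through put_record_batch and return the failed count.

    This sketch only splits by record count. It does not enforce the
    per-call size limit or retry failed records, which a production
    producer would need to do.
    """
    records = to_firehose_records(readings)
    failed = 0
    for start in range(0, len(records), MAX_RECORDS_PER_BATCH):
        batch = records[start:start + MAX_RECORDS_PER_BATCH]
        response = client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        failed += response.get("FailedPutCount", 0)
    return failed


class RecordingClient:
    """Stand-in for boto3.client("firehose") so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def put_record_batch(self, DeliveryStreamName, Records):
        self.calls.append((DeliveryStreamName, len(Records)))
        return {"FailedPutCount": 0, "RequestResponses": [{} for _ in Records]}


client = RecordingClient()
readings = [{"asset_id": "press-07", "seq": i} for i in range(1200)]
failed = send_in_batches(client, "idp-telemetry", readings)
```

Against a real delivery stream, the stand-in would be replaced by `boto3.client("firehose")`, and any nonzero failed count would call for inspecting the per-record responses and resending the failed records.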

Unified Data Backbone and Business Insights

Figure 7: Unified Data Backbone

To build the Industrial Data Platform’s unified data backbone (Figure 7), AWS Lake Formation can be used to establish a secure Data Lake backed by Amazon S3 [1]. For Industrial IoT and automation equipment data ingested through AWS IoT Core, Amazon Kinesis Data Analytics can perform streaming analytics – such as anomaly detection [2]. For near real-time analytics, use AWS Lambda to run analytical functions [3]. Amazon EMR can be used to process, transform, and analyze data in the Data Lake [4]. Amazon SageMaker can be used to develop, train, and deploy machine learning models [5]. Amazon Forecast can be used for demand forecasting use cases [6]. Structured data sets and analytics results can be stored and efficiently queried in a data warehouse using Amazon Redshift [7]. Create BI reports and visualize data with Amazon QuickSight using data in Amazon Redshift, Amazon S3, and Amazon Athena [8].
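To give a feel for streaming anomaly detection [2], the following sketch flags readings that deviate strongly from a rolling window of recent values. It deliberately substitutes a simple rolling z-score for whatever algorithm a managed streaming analytics service would apply, so it illustrates the concept rather than the service's method. The window size, threshold, and synthetic signal are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(values, window=20, threshold=3.0):
    """Yield (index, value) for points far from the recent rolling mean.

    A point is flagged when it lies more than `threshold` standard
    deviations from the mean of the previous `window` values.
    """
    recent = deque(maxlen=window)
    for index, value in enumerate(values):
        if len(recent) == window:
            mu = mean(recent)
            sigma = pstdev(recent)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield index, value
        recent.append(value)


# Synthetic vibration signal: a small repeating pattern with one injected spike.
signal = [10.0 + 0.1 * (i % 5) for i in range(60)]
signal[45] = 14.0
anomalies = list(detect_anomalies(signal))
```

Because the function consumes values one at a time and keeps only a bounded window in memory, the same logic could run inside a streaming consumer or a Lambda function processing small batches.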

Outcomes

The Industrial Data Platform on AWS provides a centralized, globally accessible platform to manage and store data. It provides a single hub for the ingestion, storage, and modeling of data required to gain insights. Applications can stream or load data from all sources and of all types into the Industrial Data Platform, allowing industrial analysts a single place to access all available corporate and operational data. The data is not forced into a rigid schema as it is ingested. The Industrial Data Platform on AWS easily scales for new inputs, provides for near-infinite storage capabilities, and supports almost any kind of analytics tooling – from standard SQL queries to advanced AI/ML for predictive insights. Analytic insights flow seamlessly back into the platform to further enrich data and help enable operations.

Implementation approach

Implementing an Industrial Data Platform on AWS that integrates existing data silos, data warehouses, and analytics solutions can seem overwhelming. Consider the following general guidelines when beginning an Industrial Data Platform project:

It’s important to note that not everyone requires all the components in the reference architecture. Start by building out the Data Lake, and then add a small number of data sources and analytics workflows.
To show the value of the Industrial Data Platform to key stakeholders, start your Industrial Data Platform journey with use cases or challenges that have a meaningful business impact.
Introduce new data sources into the platform slowly, over time. As the additional data sources are added, look to gain more insights through analytics and reporting.
Retire technical debt by transitioning off legacy industrial platforms and shutting down unused data silos.
Review the available solutions in the AWS Solutions Library and experiment with solutions in the Manufacturing Solutions Portfolio, such as the Machine to Cloud Connectivity Framework, Machine Downtime Monitor, or Amazon Virtual Andon. These pre-packaged solutions may give you a jump start to solving different problems.

Conclusion

An Industrial Data Platform can be used to consolidate data silos and expand capabilities of traditional data warehouses to enable advanced analytics and machine learning, and ultimately drive operational insights. The Industrial Data Platform on AWS provides a centralized, globally accessible platform to manage and store data of all types. It can easily scale, provides for near-infinite storage capabilities, and supports almost any kind of analytics tooling – from standard SQL queries to advanced AI/ML for predictive insights.

For further reading, learn how customers (case study and case study) are using AWS services to gain insight into their industrial data and optimize manufacturing outcomes, or continue learning about AWS for Industrial.

Related resources:
AWS cloud solutions for industrial sector

AWS for Industries