AWS for Industries

Refactor Legacy Manufacturing Execution Systems into Event-Driven Microservices


Attempts to break with traditional methods of factory-floor transformation using emerging cloud technology are often met with resistance. Transforming a manufacturer’s legacy Manufacturing Execution Systems (MES), commonly known as command and control systems, is seen as a high-risk disruption. During refactoring, the possibility of introducing even a second of delay on the factory floor is a risk operators are not willing to take.

This blog post addresses those downtime fears with a new MES refactoring methodology that prescriptively describes how to nearly eliminate the risk of adding IoT-fueled microservices to the factory floor. Some manufacturers have up to ten MES that are a mixed bag of home-grown and customized commercial off-the-shelf (COTS) systems. The methodology is groundbreaking; imagine a virtual fist shattering each existing production workload into hundreds, perhaps thousands, of use case-based microservices. Factory-floor equipment becomes a set of service providers that hold a two-way dialogue over lightweight protocols with other equipment and cloud-native services. Additionally, the methods and solutions in this post bridge the language chasm between operational technology (OT) and enterprise IT by revealing their intersections and showing how to track the KPIs needed to satisfy critical MES use case requirements.

When the COVID-19 pandemic struck, it created unprecedented demand on supply chains and manufacturing production lines. Shortages of toilet paper, hand sanitizer, and cleaning supplies caused fights in grocery store aisles. The global economic shutdown exposed vulnerabilities in the production strategies and supply chains of manufacturers and suppliers everywhere. Flight restrictions increased consumer demand for new vehicles. In the fourth quarter, General Motors reported a 4.8% increase in U.S. sales, while Toyota Motor Corp and Volkswagen AG sales increased 9.4% and 10.8%, respectively. Manufacturers attempting to scale rapidly to meet extraordinary production demands encountered downtime that lasted eight hours or more per week, costing $1,000,000 per hour in losses.

Since the industrial revolution started in 1780, manufacturers have transformed from steam engine machinery to the advanced super-factories that we see today. Factory-floor advances in robotics, data collection, and other innovations would impress most manufacturers’ consumers. Automotive OEMs operate multiple types of globally distributed plants: final assembly plants, engine plants, component plants, stamping plants, completely knocked down (CKD) plants, and so on. The final assembly plant is an active assembly point, where skilled workers and robotic systems bring together all of the necessary parts supplied by the OEM’s plants and suppliers to create a final product on a “just-in-time” (JIT) basis. Automotive OEMs’ factory floors are highly automated and efficient, operating at 90%+ efficiency. Connected factory solutions for collecting performance inputs from the machine line have been in place for at least a decade. Unfortunately, the data collected has not been analyzed effectively, and the machine layer has not been effectively integrated into MES. Prior to COVID-19, most manufacturers were working toward a long-term, 8-to-10-year connected factory vision.

Now, manufacturers are compressing their long-term MES and connected factory plans into a near-term, 1-to-3-year implementation. Factory business leaders want to analyze the right operational data so that production lines can scale quickly, and to use machine learning to predict and remove the factory-floor devices most likely to cause downtime. Factory automation has historically been the responsibility of the control system engineer, the device specialist who manages PLCs, HMIs, DCS, and SCADA systems. IT and OT speak different languages: engineers who work with OT generally deal with closed, proprietary protocols and programmable logic controllers rather than technologies that afford full computer control. Enterprise IT professionals speak a foreign, cloud-native vernacular. This language barrier hinders cross-functional collaboration and perpetuates siloed cultures, both of which are incompatible with mitigating the risks of converging IT and OT systems.

Factory general managers, operations leaders, and other factory innovation masterminds know that what can’t be measured can’t be managed. It’s crucial to understand the relationship between high-level goals and objectives and the actions or methods required to achieve them. Depending on the systems and processes on the factory floor, production managers may face one of two problems: either they are not aware of the KPIs they should track to improve factory performance, or they are unable to collect sufficient data to accurately measure the KPIs they’re tracking.

MES is a complex topic. The use case-driven approach described in this blog post shows how to slice capabilities from a home-grown MES without making changes to the existing production system, allowing manufacturers to consistently leverage the value of industrial data to reduce operating costs.

What is MES?

A manufacturing execution system (MES) monitors, manages, and synchronizes the execution of multiple production manufacturing lifecycle processes. SCADA systems, PLCs, custom software, and ERPs are integrated into the MES to gain end-to-end visibility, control, and line operation optimization. The manufacturing industry is, in fact, an integrated chain of suppliers, manufacturers, and consumers. A finished product, such as a car or another type of vehicle, consists of thousands of elements made by several hundred or even thousands of different manufacturers, often in different countries. Manufacturing execution systems connect live production information from plants, sites, and vendors, and integrate with equipment, controllers, and enterprise business applications.

MES Production Operation Problems

Most factory-floor MES were designed long ago, in nearly all cases before IoT. Most manufacturers built advanced MES workloads that allowed them to collect large volumes of OT data from engineering assets. However, because of the volume and complexity involved, aggregating, analyzing, and making real-time decisions based on integrated OT and IT data has been a challenge for many manufacturing leaders. An MES can also be immature because of sloppy data collection and handling:

  • Data is stored in paper files, spreadsheets, and directories
  • Lack of clarity or absence of coherent policies and processes for data collection
  • Poor configuration, including failure to correctly monitor controls
  • Weak processes and cultural factors
  • Culture does not support openness and reporting
  • The processes are good but execution is poor—frequent errors
  • Insufficient supervision and internal control
  • Audit trail is not available

Manufacturers are unable to begin their transition to the Fourth Industrial Revolution (Industry 4.0) because MES constrain manufacturers’ ability to see a plant’s operation holistically and take action. Invaluable insights are lost because data is siloed and the right data is often not captured. Industry 4.0, sometimes referred to as IIoT or smart or connected manufacturing, marries physical production and operations with smart digital technology, machine learning, and big data to create a more holistic and better-connected ecosystem for companies that focus on manufacturing and supply chain management.

Business Value of an Event-Driven Microservice Factory Floor

Customer demand for highly customized products is increasing while manufacturers are under pressure to produce efficiently, increase throughput, reduce costs, ensure quality, and enable product personalization. An event-driven microservice MES provides real-time visibility, agility, and data-driven decisions. The upshot for business operations is the ability to address problems before they become major incidents, reduce cost, increase profit, and take a major step into IIoT transformation. A refactored MES impacts many areas of the business and benefits a multitude of diverse stakeholders, such as factory-floor digital transformation leaders, solutions architects, quality centers of excellence, project leaders, release managers, process coordinators, and performance managers. Key stakeholders will have a global view of key manufacturing KPIs:

  • Quality: a product’s ability to fulfill its expected functions and behavior, such as engine efficiency, product features, and environmental exhaust standards
  • Average downtime (%) = (downtime hours in a time period) ÷ (total time available to produce vehicles in the same time period) × 100
  • Inventory turns = (cost of goods sold) ÷ (average inventory)
  • Throughput = (units produced) ÷ (time)
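The three formula-based KPIs above can be computed directly from shift data. The following is a minimal sketch rather than a production implementation; the function names and the sample figures in the usage note are illustrative assumptions, not values from a real plant.

```python
def average_downtime_pct(downtime_hours: float, available_hours: float) -> float:
    """Downtime as a percentage of the time available to produce in the same period."""
    if available_hours <= 0:
        raise ValueError("available_hours must be positive")
    return downtime_hours / available_hours * 100


def inventory_turns(cost_of_goods_sold: float, average_inventory: float) -> float:
    """Cost of goods sold divided by average inventory for the same period."""
    if average_inventory <= 0:
        raise ValueError("average_inventory must be positive")
    return cost_of_goods_sold / average_inventory


def throughput(units_produced: float, hours: float) -> float:
    """Units produced per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return units_produced / hours


if __name__ == "__main__":
    # Hypothetical week: 8 hours of downtime out of 160 available hours.
    print(f"Average downtime: {average_downtime_pct(8, 160):.1f}%")
    print(f"Inventory turns: {inventory_turns(1_900_000, 100_000):.1f}")
    print(f"Throughput: {throughput(480, 8):.1f} units/hour")
```

Keeping each KPI as a small pure function makes it straightforward to reuse the same definition in a dashboard, an alerting rule, or a batch report, so that OT and IT teams measure the same thing.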

Industry 4.0 transformation leaders are looking for a prescriptive methodology to refactor a legacy production root cause MES. Fear of a damaging production outage and the inability to scale prevent the transformation of these mission-critical systems.

Refactoring Using a Use Case-Driven Methodology
Step 1: Identify Most Critical Production Use Case

Some manufacturers use a monolith or multiple MES workloads to monitor and manage production systems. The use cases, not the legacy MES architecture, drive the microservice decomposition strategy. Common manufacturing use cases are listed in the table below.

Production Use Cases
1    Root cause and alerting system     Energy Optimization
2    Transportation and Logistics       Facility Management
3    Reactive Quality Management        Digital Twin
4    Predictive Quality Management      Warranty Optimization
5    Predictive Maintenance             Material Handling Optimization

Step 2: Target Design Patterns and Risk Mitigation Approach

Design Pattern

Each new capability will be implemented as a microservice, and MES use cases will be migrated incrementally. Following the strangler pattern, the legacy monolithic MES will be replaced gradually: event-driven microservice use cases run in parallel with the legacy MES while new microservices are built around the legacy root cause MES.


Each use case will have defined boundaries, data cohesion and encapsulation and separate deployments.
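One way to picture this incremental migration is a thin routing facade that sends each use case either to the legacy MES or to its new microservice. The sketch below is illustrative only: `StranglerFacade`, `legacy_mes`, and `alert_service` are hypothetical names, and a real deployment would route network calls rather than in-process functions.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


class StranglerFacade:
    """Routes a use case to its microservice once migrated, otherwise to the legacy MES."""

    def __init__(self, legacy_handler: Handler) -> None:
        self._legacy = legacy_handler
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, use_case: str, handler: Handler) -> None:
        """Cut one use case over to a new microservice."""
        self._migrated[use_case] = handler

    def rollback(self, use_case: str) -> None:
        """Send a use case back to the legacy MES, for example after a failed release."""
        self._migrated.pop(use_case, None)

    def handle(self, use_case: str, request: dict) -> dict:
        handler = self._migrated.get(use_case, self._legacy)
        return handler(request)


def legacy_mes(request: dict) -> dict:
    # Stand-in for a call into the existing monolith.
    return {"handled_by": "legacy-mes", "request": request}


def alert_service(request: dict) -> dict:
    # Stand-in for a call to a new event-driven microservice.
    return {"handled_by": "alert-microservice", "request": request}


if __name__ == "__main__":
    facade = StranglerFacade(legacy_mes)
    facade.migrate("alerting", alert_service)
    print(facade.handle("alerting", {"line": "3"}))
    print(facade.handle("query", {"line": "3"}))
```

Because the cut-over is per use case and reversible, the legacy system keeps serving everything that has not been migrated, which is what keeps production risk low during the transition.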

Step 3: Understand Current State MES Architecture

Thirty years ago, the manufacturer’s enterprise IT team developed a monolithic C++ root cause MES that has issues typical of monoliths:

  • Security: there is no clear consensus on a strong authentication scheme.
  • Throughput: the rate is low and hampers subsequent operations.
  • Availability: downtime occurs with every upgrade or application failure, and the datastore is a single point of failure.
  • Technology adoption: adopting or upgrading a technology stack requires the whole application to be upgraded, tested, and deployed, because modules are interdependent and the entire code base is affected.

Current State Factory-Floor Root Cause MES Architecture

Step 4: Decomposition Process Steps

1. Capabilities Decomposition

2. Data Decomposition

3. Deployment Decomposition


Capabilities Decomposition

1. Start with functionality and capabilities, not the data, and avoid CRUD services. Most importantly, work backwards from the customer to model services, starting with the business capabilities:


a. Business capability: plant microservices are loosely coupled and implement the single responsibility principle. Microservices are decomposed based on business functions: operation capability, operation performance, tracking, resource management, and production definition.

b. Subdomains: the business functions are further decomposed into subdomains that correspond to different parts of operational capability: process standardization and visual aids, process enforcement, configuration verification, in-process inspection, statistical process control, production process verification, personnel qualification, and device history records.

Data Decomposition

Factory Floor Events

The real-time alerts business capability will be a single service that owns its data. The root cause MES will place data in a queue for transmission by shift, hour, and unit, and in near-real time. Units per hour, blocked time, stop time, downtime, pass rate, and other critical root cause data will be transmitted via MQTT. IoT functions will be transformed into IoT services: the IoT platform offers different capabilities or operations to fulfill all functional requirements. These capabilities have been divided into two distinct groups, depending on the level of demand and how vital each capability is.
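To make the shape of such an event concrete, here is a small sketch of how a root cause metric could be packaged for MQTT. The topic layout (`factory/<plant>/<line>/<metric>`), the field names, and the `LineEvent` class are assumptions made for illustration; the post does not prescribe a schema. Publishing itself is left to whichever MQTT client the plant already uses.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineEvent:
    """One factory-floor measurement (hypothetical schema)."""
    plant: str
    line: str
    shift: str
    metric: str      # for example "downtime", "blocked_time", "pass_rate"
    value: float
    unit: str
    timestamp: str   # ISO 8601, UTC


def topic_for(event: LineEvent) -> str:
    """Build a hierarchical MQTT topic so subscribers can filter by plant, line, or metric."""
    return f"factory/{event.plant}/{event.line}/{event.metric}"


def encode(event: LineEvent) -> bytes:
    """Serialize the event as UTF-8 JSON, a common lightweight MQTT payload format."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    event = LineEvent(
        plant="plant-01", line="line-3", shift="A",
        metric="downtime", value=12.5, unit="min",
        timestamp="2021-01-01T00:00:00Z",
    )
    print(topic_for(event))
    print(encode(event).decode("utf-8"))
```

A hierarchical topic lets an alerting service subscribe to a single metric across all lines with an MQTT wildcard (for example `factory/+/+/downtime`) without changing the publisher.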

The legacy root cause MES uses Apache Kafka. Those workloads will be migrated to Amazon MSK, which manages the provisioning, configuration, and maintenance of Apache Kafka clusters and Apache ZooKeeper nodes. Amazon MSK also shows key Apache Kafka performance metrics in the AWS Management Console. Factory-floor central IT will no longer need experts to operate Apache Kafka clusters. Amazon MSK will continuously monitor cluster health, automatically replace unhealthy nodes with no downtime, and secure Apache Kafka clusters by encrypting data at rest. In addition to enterprise-grade security features out of the box, Amazon MSK has built-in AWS integrations that accelerate development of streaming data applications.

Call out: Don’t rip and replace parts of the legacy system that work well. Look for ways to optimize them, such as replatforming to a managed service as described above.

Data Locality

The microservices will be supported by purpose-built serverless databases. Similar to the fulfillment center, AWS DMS and AWS SCT will be used to migrate the logistics application’s Oracle databases to Aurora PostgreSQL to manage microservices related to inbound and outbound shipments, inventory control, inventory distribution, and so on. Aurora will provide the performance and availability of Oracle at approximately one-tenth the cost while handling complex write transactions at speeds comparable to the Oracle system.

The factory-floor requirements for scalability and high availability will be met, with scaling effort reduced by 95%, provisioning happening in minutes, and data protection ensured. Amazon Timestream will be used as the datastore for Telit DeviceWise time series data, providing a scalable store for IoT and operational applications. Amazon Timestream enables manufacturers to store and analyze trillions of events per day up to 1,000 times faster than relational databases. Using Timestream will save time and cost by managing the lifecycle of time series data: recent data is kept in memory, and historical data is moved to a cost-optimized storage tier based on user-defined policies. Its built-in time series analytics functions will help identify trends and patterns for the shop-floor root cause, quality, and alerting microservices.


Deployment Decomposition

Separate the application into distinct deployable services, expose APIs, and decide on integration methods. Root cause features that are identified as most critical, such as ingestion, will follow the microservices pattern and use APIs to connect with other layers. Root cause will be divided into four microservices: Ingestion, Query, Alert, and Device update. In this use case, the production leaders were satisfied with an existing IoT platform, Telit DeviceWise. Telit DeviceWise will be transformed into microservices that are wrapped in APIs to enable communication between components, whether macroservices or microservices. The microservices will exchange telegrams with one another, passing through custom middleware if needed. The functionalities of each microservice are detailed below.

  • Ingestion: connects to the broker to collect data originating from the IoT devices and sends this data to the persistence module.
  • Query: offers different operations to retrieve historical data provided by the IoT devices.
  • Alert: collects alerts and propagates them to the final clients.
  • Devices: connects to IoT devices to send them over-the-air (OTA) updates and ensure the integrity of the IoT device network.
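The division of responsibilities among these four services can be sketched in a few dozen lines. Everything below is a simplified, in-process illustration under stated assumptions: `InMemoryBroker` stands in for the real message broker, a plain list stands in for the persistence module, and the downtime threshold used for alerting is a made-up rule.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional

Callback = Callable[[dict], None]


class InMemoryBroker:
    """Stand-in for the message broker: delivers each message to the topic's subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: dict) -> None:
        for callback in list(self._subscribers[topic]):
            callback(message)


class IngestionService:
    """Collects device data from the broker and hands it to persistence."""

    def __init__(self, broker: InMemoryBroker, store: List[dict]) -> None:
        self._store = store
        broker.subscribe("telemetry", self._store.append)


class QueryService:
    """Retrieves historical data for a device from persistence."""

    def __init__(self, store: List[dict]) -> None:
        self._store = store

    def history(self, device_id: str) -> List[dict]:
        return [m for m in self._store if m["device_id"] == device_id]


class AlertService:
    """Propagates an alert to clients when downtime exceeds a threshold (assumed rule)."""

    def __init__(self, broker: InMemoryBroker, max_downtime_min: float, notify: Callback) -> None:
        self._max = max_downtime_min
        self._notify = notify
        broker.subscribe("telemetry", self._on_message)

    def _on_message(self, message: dict) -> None:
        if message.get("downtime_min", 0.0) > self._max:
            self._notify(message)


class DeviceService:
    """Tracks the firmware version pushed to each device (OTA updates are simulated)."""

    def __init__(self) -> None:
        self._versions: Dict[str, str] = {}

    def push_update(self, device_id: str, version: str) -> None:
        self._versions[device_id] = version

    def version_of(self, device_id: str) -> Optional[str]:
        return self._versions.get(device_id)


if __name__ == "__main__":
    broker, store, alerts = InMemoryBroker(), [], []
    IngestionService(broker, store)
    AlertService(broker, max_downtime_min=5.0, notify=alerts.append)
    broker.publish("telemetry", {"device_id": "press-1", "downtime_min": 9.0})
    print(QueryService(store).history("press-1"), alerts)
```

Each class depends only on the broker or the store, never on another service, which is what allows the four services to be deployed and scaled separately.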

If the factory floor prefers AWS IoT Greengrass, then messages, instead of telegrams, will be sent between services. The primary goal is not to rip and replace components that are meeting the shop-floor requirements. Instead, the primary goal is to focus on the components and modules that don’t meet the target state requirements.

Decompose components and develop modern serverless applications to increase agility throughout the application landscape. Serverless services require no server management. Applications built on AWS serverless services can also scale automatically; this is done by adjusting the units of consumption (such as memory) rather than changing any server settings. Services run on the AWS serverless platform have built-in fault tolerance and availability. There is no idle capacity and no charge when code is not running, which eliminates the need to over-provision capacity for capabilities such as storage.

Step 5: Hybrid Cloud Platform

To meet the low-latency production requirements of most shop floors, containerized microservices will run in a hybrid model: in the AWS Cloud and on premises on AWS Outposts and AWS IoT Greengrass. These fully managed AWS services bring AWS infrastructure, AWS services, APIs, and tools to virtually any data center, co-location space, or on-premises facility for a truly consistent hybrid experience. Meet shop-floor KPIs using Outposts for local data processing, data residency, security, and migration of applications with local system interdependencies. Integrate on-premises applications with services running in the AWS Region for centralized operations.


Hybrid Cloud Blueprint


This MES blueprint outlines the production outcomes of the manufacturing execution systems. It elucidates the high-level technologies used, the use cases, and the third-party IoT systems and technology. The biggest challenge on the plant floor for a majority of manufacturers is visibility into operational technology data from machines, programmable logic controllers (PLCs), and supervisory control and data acquisition (SCADA) systems. That visibility is needed to perform root cause analysis (RCA) when a line or a machine goes down, to improve throughput without compromising quality, and to understand micro-stoppages of machinery in real time.

A plant floor has a disparate set of PLCs and multiple industrial protocols (300+) which make it challenging to talk to the plant floor “Things” and access data which can be leveraged for diagnostics and predictive analytics. In addition, most customers and partners spend excessive time building connectivity to these plant floor systems. This approach is unscalable due to the multiple integration points. Customers and partners have to build additional technology components to scale this for multi-plant roll-out, to address security, and to generate multiple views of the data.

This blueprint takes a use case-driven approach to unlocking data from equipment, such as PLCs and historians, to optimize operations and improve productivity and availability. Leveraging it allows manufacturers to reduce risk and complexity while refactoring from legacy MES to microservices.

Target Design – Root Cause Event-Driven Microservice Architecture

This diagram shows how to unlock data from equipment, such as PLCs and historians, to optimize operations and improve productivity and availability. Data will stream or batch in from third-party devices. When anomalies are detected, the appropriate stakeholders will be notified. Business-friendly operational dashboards will provide KPIs to business and operational line leaders. The industrial data lake will be used for predictive analytics and other machine learning use cases.
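The anomaly detection step can take many forms, and the post does not specify one. As a hedged illustration only, the sketch below flags a reading when it deviates from the trailing-window mean by more than a chosen number of standard deviations. The function name, window size, and threshold are assumptions; a production system would more likely rely on a managed analytics or machine learning service.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def find_anomalies(values: Sequence[float], window: int = 5, threshold: float = 3.0) -> List[int]:
    """Return indexes whose value is more than `threshold` standard deviations
    away from the mean of the preceding `window` readings."""
    anomalies: List[int] = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as an anomaly.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical units-per-hour readings with one sudden drop.
    units_per_hour = [60, 61, 59, 60, 61, 60, 20, 60]
    for index in find_anomalies(units_per_hour):
        print(f"Notify line leader: reading {index} = {units_per_hour[index]} units/hour")
```

A trailing window keeps the check cheap enough to run on every incoming event, which suits near-real-time notification; the trade-off is that a single outlier inflates the window's spread for the next few readings.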

Legacy MES prevent key stakeholders from accessing data for visualization, alarming, and alerting because data is locked into proprietary systems. In the target state, downtime notification will be near-real time, outages will shorten, and the system will eventually self-heal as the data lake grows. Data enrichment will happen as the manufacturer continues to merge third-party data from disparate authoritative devices and IT systems. This will allow business and IT leaders, as well as line operators, to make more informed decisions.


Target State KPI Outcomes

Business Outcomes

Connected factories enabled by event-driven microservices may see a 5 to 10% improvement in productivity when the critical MES use cases are live in production. Early adopters are seeing unplanned downtime fall from 11% to 5.8%, defect rate fall from 4.9% to 2.5%, average OEE improve from 74% to 86%, and inventory turns increase from 14 to 19. Global transparency of the factory floor improves trust, accelerates innovation, and improves collaboration among all key stakeholders in OT and IT. Other valuable outcomes for the business:

  • Near live view of plant availability across the world
  • OEE across all plants, across part/plant portfolio
  • Fully transparent track and trace of parts across the plants/suppliers (connected supply chain)
  • Product genealogy
  • Plant/line utilization, smart sourcing
  • SCM Risk mitigation on the component level

Technical outcomes

  • Blueprint framework for ingesting OT data from any type of PLC, SCADA system, or historian (with support for all major protocols) that can be reused to transform the manufacturer’s global locations
  • Manufacturing Data Lake infrastructure
  • Certified Edge HW with security / device update infrastructure
  • Downtime alerts and notifications
  • Operational KPIs and Business Intelligence (BI) dashboards
  • Condition monitoring of assets in near-real time
  • OEE (performance, quality, availability) in near-real time
  • Historical trending and self-service root cause analysis (RCA)
  • Framework for predictive analytics / IT-OT alignment
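OEE, listed above with its three components, is conventionally calculated as availability × performance × quality. The sketch below uses those standard definitions; the function names and the shift figures in the usage note are illustrative assumptions, not measurements from the early adopters mentioned earlier.

```python
def availability(planned_minutes: float, downtime_minutes: float) -> float:
    """Share of planned production time that the line was actually running."""
    if planned_minutes <= 0 or not 0 <= downtime_minutes <= planned_minutes:
        raise ValueError("need planned_minutes > 0 and 0 <= downtime_minutes <= planned_minutes")
    return (planned_minutes - downtime_minutes) / planned_minutes


def performance(ideal_cycle_minutes: float, total_units: int, run_minutes: float) -> float:
    """Actual output compared with the output possible at the ideal cycle time."""
    if run_minutes <= 0:
        raise ValueError("run_minutes must be positive")
    return ideal_cycle_minutes * total_units / run_minutes


def quality(good_units: int, total_units: int) -> float:
    """Share of produced units that passed inspection."""
    if total_units <= 0 or not 0 <= good_units <= total_units:
        raise ValueError("need total_units > 0 and 0 <= good_units <= total_units")
    return good_units / total_units


def oee(availability_ratio: float, performance_ratio: float, quality_ratio: float) -> float:
    """Overall equipment effectiveness as the product of its three components."""
    return availability_ratio * performance_ratio * quality_ratio


if __name__ == "__main__":
    # Hypothetical shift: 480 planned minutes, 48 minutes down,
    # 400 units at an ideal cycle time of 1 minute, 390 good units.
    a = availability(480, 48)
    p = performance(1.0, 400, 480 - 48)
    q = quality(390, 400)
    print(f"Availability {a:.1%}, performance {p:.1%}, quality {q:.1%}, OEE {oee(a, p, q):.1%}")
```

Reporting the three components alongside the combined figure shows whether a low OEE comes from stoppages, slow cycles, or defects, which is the starting point for root cause analysis.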


Refactoring legacy critical production systems into use case-driven microservices reduces risk, improves production line throughput, and accelerates smart factory transformation. Contact your AWS account representative to schedule an MES/critical production workload review.

Get started with AWS

Tolla Cherwenka

Tolla Cherwenka is an AWS Global Solutions Architect who is certified in data and analytics. She uses an art-of-the-possible approach, working backwards from business goals to develop transformative event-driven data architectures that enable data-driven decisions. She is passionate about creating prescriptive solutions for refactoring mission-critical monolithic workloads into microservices, and for supply chains and connected factories that leverage IoT, machine learning, big data, and analytics services.