AWS Partner Network (APN) Blog
Orchestrating a Predictive Maintenance Data Pipeline on AWS and Control-M
An update was made to this post in November 2023 to reflect current information from AWS and BMC.
By Sunil Bemarkar and Scott Kellish, Partner Solutions Architects – AWS
By Vij Balakrishna, Sr. Partner Development Manager – AWS
By Basil Faruqui, Principal Solutions Marketing Manager – BMC Digital Business Automation
By Joe Goldberg, Innovation Evangelist – BMC
Predictive maintenance is a game-changer for the transportation industry, allowing it to evolve from just-in-case to just-in-time maintenance that prevents failures during operation.
By using artificial intelligence (AI) and the Internet of Things (IoT), anomaly detection can identify rare items, events, or observations that differ significantly from the majority of the data and point to some kind of problem or rare event. Building and training machine learning (ML) models requires data science expertise, and so does running those models in production at scale.
The ability to access a Hadoop cluster or data warehouse as a service is a major advantage of Amazon Web Services (AWS), but managing dependencies between the different services can be a challenge. This is especially true when trying to integrate new data-focused applications with traditional business services.
To address this, Control-M by BMC Software offers a workflow orchestration solution that simplifies these complex steps.
In this post, we will walk through the architecture of a predictive maintenance system that we developed to simplify the complex orchestration in an ML pipeline used to reduce downtime and costs for a trucking company.
BMC Software is an AWS Specialization Partner and AWS Marketplace Seller with Competencies in DevOps as well as Migration and Modernization. BMC offers a full suite of solutions to help you migrate to and maximize your investment in AWS.
Five Orchestration Challenges
Five challenges stand out in simplifying the orchestration of a machine learning data pipeline:
- Defining the intended workflow: This is a complex process involving the understanding of dependencies and decision tree branching of the pipeline.
- Managing multiple teams: Multiple teams may be involved in creating the pipeline, and it’s important to have a way of defining their specific workflows using tools they are comfortable with, while ensuring these workflows can be merged with those of other teams.
- Establishing standards: It’s important to establish standards for naming conventions, job descriptions, and other aspects of the pipeline to ensure consistency and clarity.
- Providing a holistic view of the pipeline: The orchestrator must provide a holistic view of the entire pipeline and its dependencies, while minimizing the number of tools used.
- Handling errors and failures: The orchestrator must have built-in error handling capabilities to automatically analyze the impact of delays and failures on downstream jobs in a workflow, and determine whether to automatically restart a failed job or page a human for investigation.
Overall, orchestrating a machine learning data pipeline is a complex and challenging process, but carefully considering these challenges up front makes it much easier to mitigate them.
Real-World Example
We’ll be using a real-world example where a truck manufacturer set out to improve the utilization of trucks for customers subscribed to its connected vehicle program.
A recent study by the American Transportation Research Institute showed the cost per hour for trucking companies is $7.89, a 33% increase from 2013. The ability to reduce this cost is critical to improving the bottom line for trucking companies, which generally operate on low margins.
One way to do this is to predict which trucks will need service and perform maintenance before a failure takes the vehicle out of operation. This can be done by using data from connected vehicles.
We selected Control-M for this task because it helps integrate and visualize the newest and most demanding data technologies into an end-to-end business service, and because it gave us the freedom to choose the infrastructure we needed. You can choose Control-M in a self-hosted deployment or BMC Helix Control-M as a software-as-a-service (SaaS) model.
By implementing this solution, we were able to reduce the overall dwell time by about 40%, resulting in significant cost savings and increased productivity. Our customers were also delighted with the improved service.
Use Our Demo System
We created a demonstration version of our solution so you can experiment with it on AWS. You can find all of our code, plus sample data, in GitHub.
We also developed three programs to simulate the actual fleet IoT data the pipeline will process:
- Create a logistic regression training model.
- Create sample data using actual sensor data and adding noise scaled by the standard deviation.
- Feed the data into a model to show which vehicles require maintenance.
The code to prepare the simulated data is also available in GitHub.
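To get a feel for how the three steps fit together before plugging them into the pipeline, you can chain them with a short shell script. The script and file names below are illustrative placeholders rather than the actual names used in the GitHub repo:

```bash
#!/usr/bin/env bash
# Illustrative driver -- script and file names are placeholders, not the repo's actual names.
set -euo pipefail

# 1. Train a logistic regression model on historical sensor data.
python train_model.py --input historical_sensor_data.csv --model-out pm_model.pkl

# 2. Generate simulated fleet data: real sensor readings plus noise scaled by the standard deviation.
python simulate_data.py --input historical_sensor_data.csv --output simulated_fleet_data.csv

# 3. Score the simulated data to flag vehicles that require maintenance.
python score_fleet.py --model pm_model.pkl --input simulated_fleet_data.csv --output maintenance_flags.csv
```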
How it Works
The following architecture depicts the predictive maintenance solution we implemented. It begins by using telematics providers to collect IoT sensor data from vehicle-mounted moisture, temperature, and pressure sensors.
Using Amazon Kinesis Data Firehose, the solution uploads that data to an Amazon Simple Storage Service (Amazon S3) bucket, which serves as the first storage layer.
Like most IoT data, telematics data is terse, containing mainly sensor readings along with a vehicle identifier. To link that data to a customer and warranty plan, we aggregate it with data from business systems. Once a potential failure is detected, the vehicle is scheduled for maintenance at the optimum time.
Vehicle history is updated with all detected incidents along with the recommended actions to remediate the issues.
Figure 1 – Workflow details.
AWS Services in the Pipeline
Next, we’ll show you the different AWS services that go into the pipeline that handles our ML process, and explore the value of each.
Figure 2 – AWS services that are part of the pipeline for the ML process.
For data processing, we selected Amazon EMR Serverless because it gives us access to a wide variety of tools in the Apache Hadoop ecosystem for big data processing and analytics.
We leverage Apache Spark on Amazon EMR Serverless to execute the machine learning code.
Amazon Simple Notification Service (Amazon SNS) alerts drivers and maintenance staff of a potential equipment failure.
For instance, the ML model could determine these actions should be automatically carried out:
- Truck driver is notified through Amazon SNS.
- Fleet maintenance schedules the next available appointment after the current delivery run is completed.
- Maintenance inventory is checked, and any missing items needed to repair or maintain the truck are ordered.
- A replacement truck and driver are placed into service while the first truck is being serviced.
Our solution applies the algorithms once a day. Therefore, the Amazon EMR Serverless application is created once a day and remains available until all of the jobs in the pipeline have finished executing.
Once we have processed the data through the ML algorithms, but before we delete the Amazon EMR Serverless application, we move the data to persistent storage. For this we selected Amazon Redshift because it's a fully managed, petabyte-scale data warehouse, allowing us to optimize cost and run Amazon EMR Serverless only while we are processing data.
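For example, the load step at the end of the pipeline could copy the scored results from Amazon S3 into Redshift through the Redshift Data API. The cluster, database, table, bucket, and role identifiers below are placeholders:

```bash
# Illustrative load step -- identifiers, paths, and the IAM role are placeholders.
aws redshift-data execute-statement \
  --cluster-identifier pm-redshift-cluster \
  --database dev \
  --db-user awsuser \
  --sql "COPY maintenance_results FROM 's3://pm-demo-bucket/output/maintenance_flags/' IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' FORMAT AS PARQUET;"
```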
Two Ways to Define Pipeline Workflows
Control-M allows you to define workflows in two different ways: with a graphical editor or through JSON using a jobs-as-code approach. The graphical editor lets you drag and drop the different steps of a workflow into a workspace and connect them, while the jobs-as-code approach allows you to integrate validation, testing, and deployment of workflow artifacts into your CI/CD toolchain via a REST API.
AWS provides a variety of developer tools for CI/CD, including AWS CodePipeline, AWS CodeBuild, and AWS CodeDeploy.
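For example, a CI/CD stage (such as an AWS CodeBuild step) could validate and deploy the workflow JSON with the Control-M Automation API CLI. The file name below is an assumption, not the actual artifact from this project:

```bash
# Illustrative CI/CD step -- the workflow file name is an assumption.
# Validate the workflow definition (including site standards) before deployment.
ctm build predictive-maintenance-workflow.json

# Deploy the validated workflow to the target Control-M environment.
ctm deploy predictive-maintenance-workflow.json
```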
Defining Jobs in the Pipeline
Jobs are the basic execution units of Control-M, and each job has attributes such as type, location, runner, schedule, and dependencies. Jobs are contained in folders, which can also have instructions that apply to all jobs in the folder. These entities are defined in JSON format.
Figure 3 shows a snippet of the JSON code describing the workflow for our predictive maintenance architecture. The full JSON is available in this GitHub file, which is part of the larger Control-M Automation API Community Solutions GitHub repo. That repository contains solutions, code samples, and how-to for the Control-M Automation API.
Figure 3 – JSON components of the workflow.
The jobs defined in the workflow map to the steps in the architecture diagram shown in Figure 2.
The workflow has three sections:
- Defaults that contain functions applying to the entire workflow, such as who should be notified in case a job fails in this flow.
- Individual job definitions.
- Flows that define how the jobs are connected, along with any site standards the jobs in the flow must follow.
Running Pipeline Workflows with Control-M
The orchestration is performed by BMC Helix Control-M, a SaaS offering that’s hosted on AWS. It was architected and developed with support from AWS SaaS Factory and manages workflow scheduling and submission to agents, which are lightweight workers.
In the implementation described, all the agents (workers) are running on Amazon Elastic Compute Cloud (Amazon EC2) instances. This solution provides a modern, flexible, and scalable approach to application and data pipeline orchestration, while also addressing the needs of traditional applications that are not API-enabled.
With Amazon EMR Serverless, the setup and configuration required for a Hadoop cluster is handled automatically. Teams can create and manage serverless applications separately, depending on their requirements.
Here is the command to create an Amazon EMR Serverless application instance (the application name and release label shown are example values):
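```bash
# The application name and release label below are placeholder values.
aws emr-serverless create-application \
  --name pm-spark-app \
  --type SPARK \
  --release-label emr-6.9.0
```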
The above command uses the AWS Command Line Interface (AWS CLI) to create an application. Successful creation returns an application ID, which is required for job execution. This ID is provided as a parameter to the workflow.
Next, here is the command to delete an instance of an Amazon EMR Serverless application (the application ID shown is a placeholder):
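```bash
# The application must be in a CREATED or STOPPED state before it can be deleted.
aws emr-serverless delete-application --application-id 00example123456789
```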
The application can be deleted if no longer required.
Jobs in the Workflow
This section describes the job flow with an emphasis on interesting details. The first few jobs in the flow collect and prepare data, while the job dataops-pm-MasterData-synch provides customer, parts, and similar corporate data extracts that come from core business systems the company uses for daily operations.
A Control-M plugin for file transfers saves the time and effort usually required to script file transfers. Notice in the JSON code that you don't need to supply any user credentials, because the job calls a connection profile for login credentials.
Connection profiles contain the environment information and credentials required by Control-M to connect to applications where the jobs will execute.
The job dataops-pm-databrew-clean-telematics-data uses the AWS Glue DataBrew service to cleanse the IoT data delivered by telematics providers.
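Once the cleansing recipe job has been defined in DataBrew, starting it is a single AWS CLI call; the job name below is a placeholder:

```bash
# Illustrative call -- the DataBrew job name is a placeholder.
aws databrew start-job-run --name clean-telematics-data
```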
Processing the Data
The snippet in Figure 4 defines the job dataops-pm-analyze, which reads pressure, moisture, and temperature readings that were ingested from the telematics providers. This job executes the ML model against IoT data and identifies anomalies.
This is the analytics component of the pipeline: it runs Spark to identify potential problem vehicles that should be serviced. A bash script invokes Amazon EMR Serverless using the AWS CLI; the invocation is highlighted below, and you can find the script on GitHub.
Figure 4 – Details of Amazon EMR Serverless Spark job.
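As a rough sketch of the kind of AWS CLI call such a script makes, the job run is submitted to the EMR Serverless application with the Spark entry point and its arguments. The application ID, role ARN, and S3 paths below are placeholders, not the values used in the GitHub script:

```bash
# Illustrative invocation -- IDs, ARNs, and S3 paths are placeholders.
aws emr-serverless start-job-run \
  --application-id 00example123456789 \
  --execution-role-arn arn:aws:iam::123456789012:role/EMRServerlessJobRole \
  --name dataops-pm-analyze \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://pm-demo-bucket/scripts/analyze_sensor_data.py",
      "entryPointArguments": ["s3://pm-demo-bucket/input/", "s3://pm-demo-bucket/output/"]
    }
  }'
```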
The next job, dataops-pm-Notify, uses Amazon Simple Notification Service (Amazon SNS) to notify truck drivers via text message of the problems that were detected and advise them to schedule maintenance for the specific issue.
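A notification step like this can be as simple as publishing a message to an SNS topic that drivers are subscribed to via SMS. The topic ARN and message text below are examples:

```bash
# Illustrative notification -- the topic ARN and message text are examples.
aws sns publish \
  --topic-arn arn:aws:sns:us-east-1:123456789012:pm-driver-alerts \
  --message "Truck 1234: elevated brake temperature detected. Please schedule maintenance at the next available service window."
```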
Downstream Optimization
After identifying problem vehicles and notifying the drivers, it’s important we capture the failure event and its lifecycle for further analysis in a broader context. We want to track how early we correctly detect errors, for which parts and from which suppliers. A significant part of the vehicle construction supply chain and vehicle servicing procedures can be optimized if this data is analyzed effectively.
Examining the State of the Pipeline
Now, let’s look at the pipeline when it’s running. Control-M provides a single pane of glass for monitoring jobs running on a diverse technology stack. It allows you to view the output and logs of jobs without having to connect to consoles of each application.
As shown in Figure 5, the first few jobs have completed successfully, as indicated by their green color. The next job is executing and is depicted in yellow. Jobs that are waiting to run are shown in grey.
Figure 5 – Control-M workflow monitoring, output, and logs.
Control-M also allows you to perform operational actions on the jobs in the pipeline, such as hold, rerun, and kill. You sometimes need to perform these actions when troubleshooting a failure or skipping a job, for example.
All of the functions discussed here are also available from a REST-based API or a CLI.
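For instance, with the Control-M Automation API CLI you can retrieve a job's output or log and act on the job without leaving the command line; the job ID below is a placeholder, and exact command syntax may vary across Automation API versions:

```bash
# Illustrative commands -- the job ID is a placeholder.
ctm run job:output::get "<jobId>"   # retrieve a job's output
ctm run job:log::get "<jobId>"      # retrieve a job's log
ctm run job::rerun "<jobId>"        # rerun the job after addressing the failure
```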
Conclusion
Despite the rich set of machine learning tools AWS provides, coordinating and monitoring workflows across an ML pipeline remains a complex task.
Any time you need to orchestrate a business process that combines file transfers, applications, data sources, or infrastructure, Control-M can simplify your workflow orchestration. It integrates, automates, and orchestrates application workflows whether on-premises, on the AWS Cloud, or in a hybrid environment.
In this post, we have given you access to the code and several data samples to help you become familiar with how Control-M simplifies the orchestration of workflows in a data pipeline used by a trucking company.
Control-M is available in AWS Marketplace.
BMC Software – AWS Partner Spotlight
BMC Software is an AWS Partner that offers a comprehensive set of solutions for the management of dynamic, cloud-based applications and the monitoring and provisioning of AWS infrastructure.