How to Orchestrate a Data Pipeline on AWS with Control-M from BMC Software
By Scott Kellish, Partner Solutions Architect at AWS
By Basil Faruqui, Principal Solutions Marketing Manager at BMC Digital Business Automation
By Joe Goldberg, Innovation Evangelist at BMC
Predictive maintenance, which analyzes sensor data to predict equipment failures, has emerged as one of the most common business use cases of machine learning (ML) and the Internet of Things (IoT).
To build and train ML models, you need data science expertise. If you’re going to run those ML models in production at scale, you need data engineering expertise to build a pipeline for data ingestion, storage, processing, and analytics.
Amazon Web Services (AWS) offers a diverse collection of services for data scientists and data engineers. If, for instance, you need a Hadoop cluster or data warehouse, you can deploy it in a few hours using AWS services.
However, coordinating and monitoring the actions across the data pipeline in a way that consistently delivers results in the expected timeframe remains a complex task. You need a way to orchestrate the steps in the pipeline and manage the dependencies between them.
Control-M, a workflow orchestration solution by BMC Software, Inc., simplifies complex application, data, and file transfer workflows, whether on-premises, on the AWS Cloud, or across a hybrid cloud model. BMC is an AWS Partner Network (APN) Advanced Technology Partner with AWS Competencies in DevOps and Migration.
In this post, we will walk through the architecture of a predictive maintenance system that we developed to simplify the complex orchestration steps in an ML pipeline used to reduce downtime and costs for a trucking company.
Five Orchestration Challenges
Five challenges stand out in simplifying the orchestration of a machine learning data pipeline.
The first challenge is understanding the intended workflow through the pipeline, including any dependencies and required decision tree branching. For example, if data ingestion succeeds, then proceed down path A; otherwise, proceed with path B. And so on.
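The branching described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Control-M syntax, and the step names path_a_process and path_b_alert are hypothetical.

```python
from typing import Callable, List


def run_with_branch(ingest: Callable[[], bool]) -> List[str]:
    """Run ingestion, then follow path A on success or path B on failure."""
    executed = ["ingest"]
    if ingest():
        # Path A: ingestion succeeded, so continue with processing.
        executed.append("path_a_process")
    else:
        # Path B: ingestion failed, so take the alternative branch.
        executed.append("path_b_alert")
    return executed


print(run_with_branch(lambda: True))
print(run_with_branch(lambda: False))
```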
Multiple teams may be involved in creating the flow. They all must have a way of defining their specific aspect of the workflow from a standard interface, and then have the ability to merge their respective workflows that will make up the pipeline.
It’s vital that teams follow standards when building such workflows. For example, having naming conventions for jobs in a workflow is essential. We wouldn’t want multiple jobs with the same name. It’s important that meaningful descriptions are provided for each step so when there is a failure, it’s easy to identify what the step performs.
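One way to picture merging team workflows while enforcing standards is the sketch below. It assumes each team's workflow is a plain dictionary of job definitions, and that the site standard is an IOT_ name prefix plus a non-empty Description field; both conventions are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List

# Hypothetical site standard: job names start with a project prefix.
NAME_PATTERN = re.compile(r"^IOT_[A-Za-z0-9_]+$")


def merge_workflows(*workflows: Dict[str, dict]) -> Dict[str, dict]:
    """Merge per-team job definitions, rejecting duplicate job names."""
    merged: Dict[str, dict] = {}
    for workflow in workflows:
        for name, job in workflow.items():
            if name in merged:
                raise ValueError(f"duplicate job name: {name}")
            merged[name] = job
    return merged


def check_standards(jobs: Dict[str, dict]) -> List[str]:
    """Return a list of human-readable standards violations."""
    problems = []
    for name, job in jobs.items():
        if not NAME_PATTERN.match(name):
            problems.append(f"{name}: name does not follow the naming convention")
        if not job.get("Description", "").strip():
            problems.append(f"{name}: missing description")
    return problems


team_a = {"IOT_Create_Cluster": {"Description": "Start the EMR cluster"}}
team_b = {"IOT_ReadCSV": {"Description": ""}}
jobs = merge_workflows(team_a, team_b)
print(check_standards(jobs))
```

Running a check like this before deployment catches a duplicate name or an empty description at definition time rather than during a failure.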
Minimize the number of tools required, and provide a single tool to visualize and interact with the pipeline and its dependencies. It’s difficult to manage what you can’t see, so visualization is key at the definition stage and even more so when the pipeline is running.
The orchestration engine must have built-in error handling capabilities. For example, if a file transfer is taking longer than usual there must be a way to automatically analyze the impact of this delay on the downstream jobs in a workflow. You also need to know how this delay affects the business SLA. In other words, does it affect the ability to schedule maintenance for trucks or not?
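To make the impact analysis concrete, here is a small sketch that propagates a delay through dependent jobs and compares the result with an SLA. The job names, durations, and 60-minute SLA are hypothetical, and the code assumes the dependency graph has no cycles; it illustrates the reasoning, not how Control-M computes it.

```python
from typing import Dict, List


def finish_times(durations: Dict[str, int], deps: Dict[str, List[str]]) -> Dict[str, int]:
    """Earliest finish time (minutes from pipeline start) for every job."""
    done: Dict[str, int] = {}

    def finish(job: str) -> int:
        if job not in done:
            # A job starts once all of its upstream jobs have finished.
            start = max((finish(dep) for dep in deps.get(job, [])), default=0)
            done[job] = start + durations[job]
        return done[job]

    for job in durations:
        finish(job)
    return done


# Hypothetical durations in minutes and a simple three-step chain.
durations = {"transfer": 10, "spark": 30, "notify": 5}
deps = {"spark": ["transfer"], "notify": ["spark"]}
SLA_MINUTES = 60

on_time = finish_times(durations, deps)["notify"]
delayed = finish_times({**durations, "transfer": 30}, deps)["notify"]
print(on_time <= SLA_MINUTES, delayed <= SLA_MINUTES)
```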
Similarly, a job failure shouldn’t stop the pipeline in its tracks and require immediate human intervention. Instead, the workflow designer may deem it safe for the orchestrator to automatically restart a failed job if a network glitch caused it. Likewise, they could determine a human should be paged to investigate the issue only if it fails a certain number of times with the same error.
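A retry policy of this kind can be sketched as follows. TransientError, the page callback, and the thresholds are all hypothetical names introduced for illustration; in Control-M the equivalent behavior would be configured on the job rather than hand-coded.

```python
from typing import Callable, Dict, List


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry, such as a network glitch."""


def run_with_retries(job: Callable[[], str], max_same_error: int,
                     page: Callable[[str], None], max_attempts: int = 10) -> str:
    """Rerun a job on transient errors; page a human once one error repeats too often."""
    counts: Dict[str, int] = {}
    for _ in range(max_attempts):
        try:
            return job()
        except TransientError as exc:
            key = str(exc)
            counts[key] = counts.get(key, 0) + 1
            if counts[key] >= max_same_error:
                page(f"{key!r} occurred {counts[key]} times")
                raise
    raise RuntimeError("retry budget exhausted")


attempts = {"n": 0}


def flaky() -> str:
    """Fails twice with the same error, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientError("connection reset")
    return "ok"


pages: List[str] = []
result = run_with_retries(flaky, max_same_error=3, page=pages.append)
print(result, pages)
```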
We’ll be using a real-world example where a truck manufacturer set out to improve the utilization of trucks for its customers who subscribed to its connected vehicle program.
The trucking industry deals with daily breakdowns that extract a high economic cost. When a freight-carrying truck is in the shop for emergency repairs, neither the truck driver nor the freight haulers are paid, and customers don’t receive their shipments on time.
A recent study by the American Transportation Research Institute reported the marginal cost per mile for repair and maintenance for trucking companies reached 17.1 cents in 2018, a 38 percent increase from 2010.
The ability to reduce the marginal cost per mile by harnessing the power of data is critical to improving the bottom line for trucking companies, which generally operate on low margins to begin with.
Our goal was to predict which trucks need to be removed from service and schedule their maintenance at the optimum time. We selected Control-M for this task because it helps us gain business-critical insights from the newest and most demanding data technologies, and because it gave us the freedom to choose the infrastructure we needed.
We expected this approach to decrease the overall “dwell” time; that is, the time a vehicle is unproductive while awaiting servicing. The eventual reduction was about 40 percent, which resulted in significant cost savings and increased productivity, as well as delighted customers.
Use Our Demo System
We created a demonstration version of our solution so you can experiment with it on AWS. You can find all of our code, plus sample data, in GitHub.
We also developed three programs to simulate the actual fleet IoT data the pipeline will process:
- Create a logistic regression training model.
- Create sample data using actual sensor data and adding noise scaled by the standard deviation.
- Feed the data into a model to show which vehicles require maintenance.
The code to prepare the simulated data is also available in GitHub.
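As a rough idea of what "adding noise scaled by the standard deviation" can look like, the sketch below perturbs a list of readings with Gaussian noise. The Gaussian distribution, the 0.1 scale factor, and the sample pressure values are assumptions made for illustration; the demo code in GitHub may do this differently.

```python
import random
import statistics
from typing import List


def add_scaled_noise(readings: List[float], scale: float, rng: random.Random) -> List[float]:
    """Return readings plus Gaussian noise whose sigma is scale * stdev(readings)."""
    sigma = scale * statistics.stdev(readings)
    return [value + rng.gauss(0.0, sigma) for value in readings]


# Hypothetical pressure readings standing in for real sensor data.
pressure = [101.2, 99.8, 100.5, 102.1, 100.9]
rng = random.Random(42)  # fixed seed so the simulated data is reproducible
simulated = add_scaled_noise(pressure, scale=0.1, rng=rng)
print(len(simulated))
```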
How it Works
The following architecture depicts the predictive maintenance solution we implemented. It begins by using telematics providers to collect IoT sensor data from vehicle-mounted moisture, temperature, and pressure sensors.
We can use data stored in Amazon S3 for several things, but we focused on using it to identify signs of impending failure so a truck can be sent for scheduled maintenance at the optimum time and avoid much more costly emergency repairs.
Figure 1 – Predictive maintenance pipeline architecture.
AWS Services in the Pipeline
Next, we’ll show you the different AWS services that go into the pipeline that handles our ML process, and explore the value of each.
Figure 2 – AWS services that are part of the pipeline for ML process.
We leverage Apache Spark on Amazon EMR to execute the machine learning code.
Amazon Simple Notification Service (Amazon SNS) alerts drivers and maintenance staff of a potential equipment failure.
For instance, the ML model could determine these actions should be automatically carried out:
- Truck driver is notified through Amazon SNS.
- Fleet maintenance schedules the next available appointment after the current delivery run is completed.
- Maintenance inventory is checked, and any missing items needed to repair or maintain the truck are ordered.
- A replacement truck and driver are placed into service while the first truck is being serviced.
Our solution applies the algorithms once a day. Therefore, the Amazon EMR clusters are instantiated once a day and remain running until all of the jobs in the pipeline have finished executing.
Once we have processed the data through the ML algorithms, but before we terminate the Amazon EMR cluster, we move the data to persistent storage. For this we selected Amazon Redshift, a fully managed, petabyte-scale data warehouse. Keeping the results there lets us optimize cost by running Amazon EMR clusters only while we are processing data.
Two Ways to Define Pipeline Workflows
Control-M allows you to define a workflow in two different ways. You can use a graphical editor that allows you to drag and drop different steps of a workflow into a workspace and connect them.
Alternatively, you can define workflows using RESTful APIs in a jobs-as-code approach, and JSON to integrate with your continuous integration/continuous delivery (CI/CD) toolchain. This approach enhances workflow management by allowing jobs to flow through an automated build, test, and release pipeline.
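In a CI/CD step, the jobs-as-code approach might look like the following sketch. It assumes the Control-M Automation API command line tool (ctm) with its build and deploy commands, and a hypothetical definitions file named iot_pipeline.json. The Python only assembles the commands; it does not run them.

```python
import json
from typing import List


def ci_steps(definitions_file: str, definitions_text: str) -> List[List[str]]:
    """Check that the definitions parse as JSON, then list the CLI steps to run."""
    json.loads(definitions_text)  # raises ValueError on malformed JSON
    return [
        ["ctm", "build", definitions_file],   # validate the definitions
        ["ctm", "deploy", definitions_file],  # deploy them to Control-M
    ]


steps = ci_steps("iot_pipeline.json", '{"IOT_Pipeline": {"Type": "Folder"}}')
for step in steps:
    print(" ".join(step))
```

A build server could pass each list to subprocess.run and stop the release if the build step returns a nonzero exit code.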
Defining Jobs in the Pipeline
A job is a basic execution unit of Control-M. Each job has several attributes:
- What is the job type: script, command, Hadoop/Spark, or file transfer?
- Where does the job run (on what host, for example)?
- Who runs the job (connection profile or run-as user)?
- When should it run, and what are the scheduling criteria?
- What are the job dependencies? For example, a specific job must complete or a file must arrive.
Jobs are contained in folders, which can also have scheduling criteria, dependencies, and other instructions that apply to all jobs in the folder. You define these entities in JSON format.
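To show how those attributes map onto a definition, here is a sketch that builds one folder with one job as a Python dictionary and prints it as JSON. The property names follow the Control-M Automation API as we recall it and should be checked against its documentation. The folder name, host, user, path, and start time are hypothetical values, not the ones used in the demo.

```python
import json

job = {
    "Type": "Job:Script",           # what: the job type
    "FileName": "launchEMR.bat",    # what: the script to run
    "FilePath": "C:\\scripts",      # hypothetical script location
    "Host": "ctm-agent-host",       # where: hypothetical agent host
    "RunAs": "ctmuser",             # who: hypothetical run-as user
    "When": {"FromTime": "0200"},   # when: hypothetical daily start time
}

# A folder groups jobs; the folder name here is hypothetical.
folder = {"IOT_Pipeline": {"Type": "Folder", "IOT_Create_Cluster": job}}
print(json.dumps(folder, indent=2))
```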
Figure 3 shows a snippet of the JSON code describing the workflow for our predictive maintenance architecture. The full JSON is available in this GitHub file, which is part of the larger Control-M Automation API Community Solutions GitHub repo. That repository contains solutions, code samples, and how-to guides for the Control-M Automation API.
Figure 3 – High-level view of Control-M predictive maintenance workflow automation.
The jobs defined in the workflow map to the steps in the architecture diagram shown in Figure 2.
The workflow has three sections:
- Defaults that contain functions applying to the entire workflow, such as who should be notified in case a job fails in this flow.
- Individual job definitions.
- Flows that define how the jobs are connected and what site standards should be followed in determining the jobs in the flow.
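The three sections can be pictured with the sketch below, which assembles a skeleton definition and checks that every job named in the flow is actually defined. The job names come from this post, listed in the order they are introduced; the real flow may not be strictly linear. The job bodies are placeholders, and the folder name, flow name, and Defaults content are hypothetical.

```python
from typing import Any, Dict, List

JOB_ORDER = [
    "IOT_Create_Cluster", "IOT_JAR_Setup", "IOT_ReadCSV",
    "IOT_Notify", "IOT_Redshift_SQL_Load", "IOT_Terminate_Cluster",
]


def build_definitions(job_order: List[str]) -> Dict[str, Any]:
    """Assemble defaults, job definitions, and a flow into one structure."""
    folder: Dict[str, Any] = {"Type": "Folder"}
    for name in job_order:
        # Placeholder bodies; the demo's jobs are script, file transfer, and Spark jobs.
        folder[name] = {"Type": "Job:Command", "Command": f"echo {name}"}
    folder["IOT_Flow"] = {"Type": "Flow", "Sequence": list(job_order)}
    return {
        "Defaults": {"Application": "IOT_Demo"},  # hypothetical default
        "IOT_Pipeline": folder,
    }


def undefined_in_flow(folder: Dict[str, Any]) -> List[str]:
    """Return job names that a flow references but the folder does not define."""
    missing: List[str] = []
    for entry in folder.values():
        if isinstance(entry, dict) and entry.get("Type") == "Flow":
            missing += [job for job in entry["Sequence"] if job not in folder]
    return missing


definitions = build_definitions(JOB_ORDER)
print(undefined_in_flow(definitions["IOT_Pipeline"]))
```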
The first job in the flow is called IOT_Create_Cluster. As shown in Figure 4, this job, of type Job:Script, calls a batch script, launchEMR.bat, which in turn executes a Perl script. The Perl script instantiates an Amazon EMR cluster by running the AWS Command Line Interface (CLI) command aws emr create-cluster.
You can use multiple options to add utilities to the Amazon EMR cluster, such as Hive, Pig, or Spark. We are adding Spark as part of this Amazon EMR cluster to run the ML code. Full details of all the Amazon EMR options available with the CLI are available in the AWS documentation. Here is the complete script used in our demo.
Figure 4 – Example IOT_Create_Cluster job.
Scheduling Pipeline Workflows
Control-M uses a server-and-agent model. The server is the central engine that manages workflow scheduling and submission to agents, which are lightweight workers. In our demo, the Control-M server is running on an Amazon Elastic Compute Cloud (Amazon EC2) instance.
As part of the Amazon EMR instantiation process, we bootstrap a Control-M agent, which serves as the worker to execute workloads on Amazon EMR. The agent submits the workload to the underlying Hadoop environment and monitors the progress of the jobs.
As highlighted at the bottom of Figure 4, one of the options passed to the emr create-cluster command is the path to a bootstrap script, EMRBootStrap.sh. This script, contained in an Amazon S3 bucket, is run during the creation of the cluster.
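For readers who want to see the shape of that call, the sketch below assembles an aws emr create-cluster command that installs Spark and points at the bootstrap script. The demo does this from a Perl script; this Python version is only an illustration. The cluster name, bucket name, release label, instance type, and instance count are hypothetical values, and the command is printed rather than executed.

```python
from typing import List


def create_cluster_command(name: str, bootstrap_path: str) -> List[str]:
    """Build the AWS CLI arguments for an EMR cluster with Spark and a bootstrap action."""
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", "emr-5.30.0",     # hypothetical release
        "--applications", "Name=Spark",      # add Spark to run the ML code
        "--instance-type", "m5.xlarge",      # hypothetical instance type
        "--instance-count", "3",             # hypothetical cluster size
        "--use-default-roles",
        "--bootstrap-actions", f"Path={bootstrap_path}",
    ]


# Hypothetical bucket; the script name comes from the post.
command = create_cluster_command("iot-demo", "s3://example-bucket/EMRBootStrap.sh")
print(" ".join(command))
```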
After the bootstrap process, the Amazon EMR cluster has a functioning Control-M agent that manages all the data movement to and from the cluster, and the Hadoop/HDFS/Spark jobs that are required to accomplish our business process.
The following snippet defines IOT_JAR_Setup, the next job in the flow.
Figure 5 – File transfer job.
In our demo, IOT_JAR_Setup is a file transfer job using Secure File Transfer Protocol (SFTP) that moves:
- The IoT maintenance data from Amazon S3 to the Amazon EMR cluster.
- The jar file that is executed using Spark.
Control-M comes with a plugin for file transfers, saving time and effort usually required with scripting file transfers. Notice in the JSON code that you don’t need any user credentials because the job is calling a connection profile for login credentials.
Connection profiles contain the environment information and credentials required by Control-M to connect to the applications where the jobs will execute. The credentials are securely stored in Control-M’s database. An external credentials vault could also be used by providing the appropriate API call specified by the vault provider.
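The sketch below illustrates the idea that a file transfer definition refers to connection profiles by name and carries no credentials of its own. The property names (Job:FileTransfer, ConnectionProfileSrc, ConnectionProfileDest, FileTransfers) are our recollection of the Automation API and should be verified against its documentation. The profile names and paths are hypothetical, and the checker function is an illustration rather than a Control-M feature.

```python
from typing import Any, List

transfer_job = {
    "Type": "Job:FileTransfer",
    "ConnectionProfileSrc": "S3_PROFILE",          # hypothetical profile name
    "ConnectionProfileDest": "EMR_SFTP_PROFILE",   # hypothetical profile name
    "FileTransfers": [
        {"Src": "/iot/maintenance.csv", "Dest": "/home/hadoop/maintenance.csv"},
    ],
}

SUSPECT_WORDS = ("password", "secret", "token")


def credential_keys(node: Any) -> List[str]:
    """Recursively collect keys that look like embedded credentials."""
    found: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if any(word in key.lower() for word in SUSPECT_WORDS):
                found.append(key)
            found.extend(credential_keys(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(credential_keys(item))
    return found


print(credential_keys(transfer_job))
```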
Processing the Data
The snippet in Figure 6 defines IOT_ReadCSV, which executes a Spark job that reads the pressure, moisture, and temperature readings that were ingested from the telematics providers. This Spark job executes the ML model against the IoT data and identifies anomalies.
Figure 6 – Spark job to ingest vehicle sensor data.
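To give a feel for the scoring step without a Spark cluster, here is a plain-Python sketch that reads CSV rows and applies a logistic model to flag trucks. It is not the demo's Spark job. The column names, coefficients, bias, threshold, and sample rows are all hypothetical; a real model would learn its coefficients from training data.

```python
import csv
import io
import math
from typing import Dict

# Hypothetical coefficients; a trained model would supply these.
WEIGHTS = {"pressure": 0.08, "moisture": 0.05, "temperature": 0.03}
BIAS = -12.0
THRESHOLD = 0.5


def failure_probability(row: Dict[str, str]) -> float:
    """Logistic function applied to a weighted sum of the sensor readings."""
    z = BIAS + sum(weight * float(row[name]) for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical sample data in CSV form.
SAMPLE = """truck_id,pressure,moisture,temperature
T1,100,20,70
T2,140,60,110
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
flagged = [row["truck_id"] for row in rows if failure_probability(row) >= THRESHOLD]
print(flagged)
```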
The next job is IOT_Notify, which uses Amazon SNS to notify the truck driver via a text message that identifies the detected problem and advises the driver to schedule maintenance for the specific issue.
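A notification step along these lines could be sketched as follows. It assumes a boto3 SNS client, whose publish call accepts PhoneNumber and Message arguments for SMS. The function name, message wording, and phone number are hypothetical, and a small stand-in client is used so the example runs without AWS access.

```python
from typing import Any, Dict, List


def notify_driver(sns_client: Any, phone_number: str, truck_id: str, issue: str) -> Dict[str, Any]:
    """Send a text message through Amazon SNS describing the detected problem."""
    message = f"Truck {truck_id}: {issue} detected. Please schedule maintenance."
    return sns_client.publish(PhoneNumber=phone_number, Message=message)


class FakeSNS:
    """Stand-in for boto3.client('sns') so the sketch runs without AWS credentials."""

    def __init__(self) -> None:
        self.sent: List[Dict[str, str]] = []

    def publish(self, **kwargs: str) -> Dict[str, str]:
        self.sent.append(kwargs)
        return {"MessageId": "fake-id"}


fake = FakeSNS()
notify_driver(fake, "+15555550100", "T2", "abnormal tire pressure")
print(fake.sent[0]["Message"])
```

With real credentials, passing boto3.client("sns") in place of FakeSNS() would send the message.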
IOT_Redshift_SQL_Load copies the refined data out of the cluster to an existing Amazon Redshift data warehouse before we terminate the Amazon EMR cluster at the end of the process. You can also use the warehoused data for subsequent downstream analytics.
In the final step, the IOT_Terminate_Cluster job, we terminate the Amazon EMR cluster to avoid being charged for the cluster resources while it is idle.
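The termination step comes down to one AWS CLI call, aws emr terminate-clusters. The sketch below builds that command for a given cluster ID. The ID shown is a made-up placeholder, the prefix check is a simple sanity test we added for illustration, and the command is printed rather than executed.

```python
from typing import List


def terminate_cluster_command(cluster_id: str) -> List[str]:
    """Build the AWS CLI arguments that terminate one EMR cluster."""
    if not cluster_id.startswith("j-"):
        # EMR cluster IDs begin with "j-"; reject anything else early.
        raise ValueError(f"unexpected cluster ID: {cluster_id}")
    return ["aws", "emr", "terminate-clusters", "--cluster-ids", cluster_id]


print(" ".join(terminate_cluster_command("j-EXAMPLE12345")))  # placeholder ID
```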
Examining the State of the Pipeline
Now that you have an idea of how jobs are defined, let’s take a look at what the pipeline looks like when it’s running.
Control-M provides a user interface for monitoring workflows. In the Figure 7 screenshot, the first job is executing and it’s depicted in yellow. Jobs that are waiting to run are shown in grey.
You can access the output and logs of every job from the pane on the right-hand side. This capability is vital during daily operations. To monitor those operations more easily, Control-M provides a single pane of glass to view the output of jobs running on disparate systems without having to connect to the consoles of each application.
Figure 7 – Control-M workflow monitoring, output, and logs.
Control-M also allows you to perform several actions on the jobs in the pipeline, such as hold, rerun, and kill. You sometimes need to perform these actions when troubleshooting a failure or skipping a job, for example.
All of the functions discussed here are also available from a REST-based API or a CLI.
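From the command line, those job actions might be issued roughly as in this sketch. The ctm run job::hold, job::rerun, and job::kill forms are our recollection of the Automation API CLI and should be confirmed in its documentation. The job ID shown is a hypothetical placeholder, and the commands are only assembled, not run.

```python
from typing import List

# Actions named in this post; the CLI offers others as well.
ACTIONS = ("hold", "rerun", "kill")


def job_action_command(action: str, job_id: str) -> List[str]:
    """Build a Control-M CLI command that applies an action to a running job."""
    if action not in ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["ctm", "run", f"job::{action}", job_id]


for action in ACTIONS:
    print(" ".join(job_action_command(action, "controlm:00001")))  # placeholder job ID
```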
In spite of the rich set of machine learning tools AWS provides, coordinating and monitoring workflows across an ML pipeline remains a complex task.
Anytime you need to orchestrate a business process that combines file transfers, applications, data sources, or infrastructure, Control-M can simplify your workflow orchestration. It integrates, automates, and orchestrates application workflows whether on-premises, on the AWS Cloud, or in a hybrid environment.
In this post, we have given you access to the code and several data samples to help you become familiar with how Control-M simplifies the orchestration of workflows in a data pipeline used by a trucking company.
Control-M is available in AWS Marketplace with bring-your-own-license (BYOL) and pay-per-hour usage options. A 14-day trial license is included. At the conclusion of the trial period, you can contact BMC support to purchase a permanent license key.
BMC Software – APN Partner Spotlight
BMC Software is an AWS Competency Partner. BMC offers a comprehensive set of solutions for the management of dynamic, cloud-based applications and the monitoring and provisioning of AWS infrastructure.