AWS Machine Learning Blog

Predict vehicle fleet failure probability using Amazon SageMaker Jumpstart

Predictive maintenance is critical in automotive industries because it can avoid out-of-the-blue mechanical failures and reactive maintenance activities that disrupt operations. By predicting vehicle failures and scheduling maintenance and repairs, you’ll reduce downtime, improve safety, and boost productivity levels.

What if we could apply deep learning techniques to common areas that drive vehicle failures, unplanned downtime, and repair costs?

In this post, we show you how to train and deploy a model to predict vehicle fleet failure probability using Amazon SageMaker JumpStart. SageMaker Jumpstart is the machine learning (ML) hub of Amazon SageMaker, providing pre-trained, publicly available models for a wide range of problem types to help you get started with ML. The solution outlined in the post is available on GitHub.

SageMaker JumpStart solution templates

SageMaker JumpStart provides one-click, end-to-end solutions for many common ML use cases. Explore the following use cases for more information on available solution templates:

The SageMaker JumpStart solution templates cover a variety of use cases, under each of which several different solution templates are offered (the solution in this post, Predictive Maintenance for Vehicle Fleets, is in the Solutions section). Choose the solution template that best fits your use case from the SageMaker JumpStart landing page. For more information on specific solutions under each use case and how to launch a SageMaker JumpStart solution, see Solution Templates.

Solution overview

The AWS predictive maintenance solution for automotive fleets applies deep learning techniques to common areas that drive vehicle failures, unplanned downtime, and repair costs. It serves as an initial building block for you to get to a proof of concept in a short period of time. This solution contains data preparation and visualization functionality within SageMaker and allows you to train and optimize the hyperparameters of deep learning models for your dataset. You can use your own data or try the solution with a synthetic dataset as part of this solution. This version processes vehicle sensor data over time. A subsequent version will process maintenance record data.

The following diagram demonstrates how you can use this solution with SageMaker components. As part of the solution, the following services are used:

  • Amazon S3 – We use Amazon Simple Storage Service (Amazon S3) to store datasets
  • SageMaker notebook – We use a notebook to preprocess and visualize the data, and to train the deep learning model
  • SageMaker endpoint – We use the endpoint to deploy the trained model

Solution overview

The workflow includes the following steps:

  1. An extract of historical data is created from the Fleet Management System containing vehicle data and sensor logs.
  2. After the ML model is trained, the SageMaker model artifact is deployed.
  3. The connected vehicle sends sensor logs to AWS IoT Core (alternatively, via an HTTP interface).
  4. Sensor logs are persisted via Amazon Kinesis Data Firehose.
  5. Sensor logs are sent to AWS Lambda for querying against the model to make predictions.
  6. Lambda sends sensor logs to Sagemaker model inference for predictions.
  7. Predictions are persisted in Amazon Aurora.
  8. Aggregate results are displayed on an Amazon QuickSight dashboard.
  9. Real-time notifications on the predicted probability of failure are sent to Amazon Simple Notification Service (Amazon SNS).
  10. Amazon SNS sends notifications back to the connected vehicle.

The solution consists of six notebooks:

  • 0_demo.ipynb – A quick preview of our solution
  • 1_introduction.ipynb – Introduction and solution overview
  • 2_data_preparation.ipynb – Prepare a sample dataset
  • 3_data_visualization.ipynb – Visualize our sample dataset
  • 4_model_training.ipynb – Train a model on our sample dataset to detect failures
  • 5_results_analysis.ipynb – Analyze the results from the model we trained

Prerequisites

Amazon SageMaker Studio is the integrated development environment (IDE) within SageMaker that provides us with all the ML features that we need in a single pane of glass. Before we can run SageMaker JumpStart, we need to set up SageMaker Studio. You can skip this step if you already have your own version of SageMaker Studio running.

The first thing we need to do before we can use any AWS services is to make sure we have signed up for and created an AWS account. Then we create an administrative user and a group. For instructions on both steps, refer to Set Up Amazon SageMaker Prerequisites.

The next step is to create a SageMaker domain. A domain sets up all the storage and allows you to add users to access SageMaker. For more information, refer to Onboard to Amazon SageMaker Domain. This demo is created in the AWS Region us-east-1.

Finally, you launch SageMaker Studio. For this post, we recommend launching a user profile app. For instructions, refer to Launch Amazon SageMaker Studio.

To run this SageMaker JumpStart solution and have the infrastructure deployed to your AWS account, you need to create an active SageMaker Studio instance (see Onboard to Amazon SageMaker Studio). When your instance is ready, use the instructions in SageMaker JumpStart to launch the solution. The solution artifacts are included in this GitHub repository for reference.

Launch the SageMaker Jumpstart solution

To get started with the solution, complete the following steps:

  1. On the SageMaker Studio console, choose JumpStart.
    choose jumpstart
  2. On the Solutions tab, choose Predictive Maintenance for Vehicle Fleets.
    choose predictive maintenance
  3. Choose Launch.
    launch solution
    It takes a few minutes to deploy the solution.
  4. After the solution is deployed, choose Open Notebook.
    open notebook

If you’re prompted to select a kernel, choose PyTorch 1.8 Python 3.6 for all notebooks in this solution.

Solution preview

We first work on the 0_demo.ipynb notebook. In this notebook, you can get a quick preview of what the outcome will look like when you complete the full notebook for this solution.

Choose Run and Run All Cells to run all cells in SageMaker Studio (or Cell and Run All in a SageMaker notebook instance). You can run all the cells in each notebook one after the other. Ensure all the cells finish processing before moving to the next notebook.

run all cells

This solution relies on a config file to run the provisioned AWS resources. We generate the file as follows:

import boto3
import os
import json

client = boto3.client('servicecatalog')
cwd = os.getcwd().split('/')
i= cwd.index('S3Downloads')
pp_name = cwd[i + 1]
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

keys = [ x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' and 'OutputValue' in x]
values = [ x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' and 'OutputValue' in x]
stack_output = dict(zip(keys, values))

with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
json.dump(stack_output, f)

We have some sample time series input data consisting of a vehicle’s battery voltage and battery current over time. Next, we load and visualize the sample data. As shown in the following screenshots, the voltage and current values are on the Y axis and the readings (19 readings recorded) are on the X axis.

volt

current

volt and current

We have previously trained a model on this voltage and current data that predicts the probability of vehicle failure and have deployed the model as an endpoint in SageMaker. We will call this endpoint with some sample data to determine the probability of failure in the next time period.

Given the sample input data, the predicted probability of failure is 45.73%.

To move to the next stage, choose Click here to continue.

next stage

Introduction and solution overview

The 1_introduction.ipynb notebook provides an overview of the solution and stages, and a look into the configuration file that has content definition, data sampling period, train and test sample count, parameters, location, and column names for generated content.

After you review this notebook, you can move to the next stage.

Prepare a sample dataset

We prepare a sample dataset in the 2_data_preparation.ipynb notebook.

We first generate the configuration file for this solution:

import boto3
import os
import json

client = boto3.client('servicecatalog')
cwd = os.getcwd().split('/')
i= cwd.index('S3Downloads')
pp_name = cwd[i + 1]
pp = client.describe_provisioned_product(Name=pp_name)
record_id = pp['ProvisionedProductDetail']['LastSuccessfulProvisioningRecordId']
record = client.describe_record(Id=record_id)

keys = [ x['OutputKey'] for x in record['RecordOutputs'] if 'OutputKey' and 'OutputValue' in x]
values = [ x['OutputValue'] for x in record['RecordOutputs'] if 'OutputKey' and 'OutputValue' in x]
stack_output = dict(zip(keys, values))

with open(f'/root/S3Downloads/{pp_name}/stack_outputs.json', 'w') as f:
json.dump(stack_output, f)
import os

from source.config import Config
from source.preprocessing import pivot_data, sample_dataset
from source.dataset import DatasetGenerator
config = Config(filename="config/config.yaml", fetch_sensor_headers=False)
config

The config properties are as follows:

fleet_info_fn=data/example_fleet_info.csv
fleet_sensor_logs_fn=data/example_fleet_sensor_logs.csv
vehicle_id_column=vehicle_id
timestamp_column=timestamp
target_column=target
period_ms=30000
dataset_size=25000
window_length=20
chunksize=10000
processing_chunksize=2500
fleet_dataset_fn=data/processed/fleet_dataset.csv
train_dataset_fn=data/processed/train_dataset.csv
test_dataset_fn=data/processed/test_dataset.csv
period_column=period_ms

You can define your own dataset or use our scripts to generate a sample dataset:

if should_generate_data:
    fleet_statistics_fn = "data/generation/fleet_statistics.csv"
    generator = DatasetGenerator(fleet_statistics_fn=fleet_statistics_fn,
                                 fleet_info_fn=config.fleet_info_fn, 
                                 fleet_sensor_logs_fn=config.fleet_sensor_logs_fn, 
                                 period_ms=config.period_ms, 
                                 )
    generator.generate_dataset()

assert os.path.exists(config.fleet_info_fn), "Please copy your data to {}".format(config.fleet_info_fn)
assert os.path.exists(config.fleet_sensor_logs_fn), "Please copy your data to {}".format(config.fleet_sensor_logs_fn)

You can merge the sensor data and fleet vehicle data together:

pivot_data(config)
sample_dataset(config)

We can now move to data visualization.

Visualize our sample dataset

We visualize our sample dataset in 3_data_vizualization.ipynb. This solution relies on a config file to run the provisioned AWS resources. Let’s generate the file similar to the previous notebook.

The following screenshot shows our dataset.

dataset

Next, let’s build the dataset:

train_ds = PMDataset_torch(
    config.train_dataset_fn,
    sensor_headers=config.sensor_headers,
    target_column=config.target_column,
    standardize=True)

properties = train_ds.vehicle_properties_headers.copy()
properties.remove('vehicle_id')
properties.remove('timestamp')
properties.remove('period_ms')

Now that the dataset is ready, let’s visualize the data statistics. The following screenshot shows the data distribution based on vehicle make, engine type, vehicle class, and model.

visualize

Comparing the log data, let’s look at an example of the mean voltage across different years for Make E and C (random).

The mean of voltage and current is on the Y axis and the number of readings is on the X axis.

  • Possible values for log_target: [‘make’, ‘model’, ‘year’, ‘vehicle_class’, ‘engine_type’]
    • Randomly assigned value for log_target: make
  • Possible values for log_target_value1: [‘Make A’, ‘Make B’, ‘Make E’, ‘Make C’, ‘Make D’]
    • Randomly assigned value for log_target_value1: Make B
  • Possible values for log_target_value2: [‘Make A’, ‘Make B’, ‘Make E’, ‘Make C’, ‘Make D’]
    • Randomly assigned value for log_target_value2: Make D

Based on the above, we assume log_target: make, log_target_value1: Make B and log_target_value2: Make D

make b and d

The following graphs break down the mean of the log data.

engine g h e

The following graphs visualize an example of different sensor log values against voltage and current.

volt current 2

Train a model on our sample dataset to detect failures

In the 4_model_training.ipynb notebook, we train a model on our sample dataset to detect failures.

Let’s generate the configuration file similar to the previous notebook, and then proceed with training configuration:

sage_session = sagemaker.session.Session()
s3_bucket = sagemaker_configs["S3Bucket"]  
s3_output_path = 's3://{}/'.format(s3_bucket)
print("S3 bucket path: {}".format(s3_output_path))

# run in local_mode on this machine, or as a SageMaker TrainingJob
local_mode = False

if local_mode:
    instance_type = 'local'
else:
    instance_type = sagemaker_configs["SageMakerTrainingInstanceType"]
    
role = sagemaker.get_execution_role()
print("Using IAM role arn: {}".format(role))
# only run from SageMaker notebook instance
if local_mode:
    !/bin/bash ./setup.sh
cpu_or_gpu = 'gpu' if instance_type.startswith('ml.p') else 'cpu'


We can now define the data and initiate hyperparameter optimization:

%%time

estimator = PyTorch(entry_point="train.py",
                    source_dir='source',                    
                    role=role,
                    dependencies=["source/dl_utils"],
                    instance_type=instance_type,
                    instance_count=1,
                    output_path=s3_output_path,
                    framework_version="1.5.0",
                    py_version='py3',
                    base_job_name=job_name_prefix,
                    metric_definitions=metric_definitions,
                    hyperparameters= {
                        'epoch': 100,  # tune it according to your need
                        'target_column': config.target_column,
                        'sensor_headers': json.dumps(config.sensor_headers),
                        'train_input_filename': os.path.basename(config.train_dataset_fn),
                        'test_input_filename': os.path.basename(config.test_dataset_fn),
                        }
                     )

if local_mode:
    estimator.fit({'train': training_data, 'test': testing_data})
%%time

tuner = HyperparameterTuner(estimator,
                            objective_metric_name='test_auc',
                            objective_type='Maximize',
                            hyperparameter_ranges=hyperparameter_ranges,
                            metric_definitions=metric_definitions,
                            max_jobs=max_jobs,
                            max_parallel_jobs=max_parallel_jobs,
                            base_tuning_job_name=job_name_prefix)
tuner.fit({'train': training_data, 'test': testing_data})

Analyze the results from the model we trained

In the 5_results_analysis.ipynb notebook, we get data from our hyperparameter tuning job, visualize metrics of all the jobs to identify the best job, and build an endpoint for the best training job.

Let’s generate the configuration file similar to the previous notebook and visualize the metrics of all the jobs. The following plot visualizes test accuracy vs. epoch.

test accuracy

The following screenshot shows the hyperparameter tuning jobs we ran.

hyperparameter tuning jobs

You can now visualize data from the best training job (out of the four training jobs) based on the test accuracy (red).

As we can see in the following screenshots, the test loss declines and AUC and accuracy increase with epochs.

auc and accuracy

auc and accuracy 2

Based on the visualizations, we can now build an endpoint for the best training job:

%%time

role = sagemaker.get_execution_role()

model = PyTorchModel(model_data=model_artifact,
                     role=role,
                     entry_point="inference.py",
                     source_dir="source/dl_utils",
                     framework_version='1.5.0',
                     py_version = 'py3',
                     name=sagemaker_configs["SageMakerModelName"],
                     code_location="s3://{}/endpoint".format(s3_bucket)
                    )

endpoint_instance_type = sagemaker_configs["SageMakerInferenceInstanceType"]

predictor = model.deploy(initial_instance_count=1, instance_type=endpoint_instance_type, endpoint_name=sagemaker_configs["SageMakerEndpointName"])

def custom_np_serializer(data):
    return json.dumps(data.tolist())
    
def custom_np_deserializer(np_bytes, content_type='application/x-npy'):
    out = np.array(json.loads(np_bytes.read()))
    return out

predictor.serializer = custom_np_serializer
predictor.deserializer = custom_np_deserializer

After we build the endpoint, we can test the predictor by passing it sample sensor logs:

import botocore

config = botocore.config.Config(read_timeout=200)
runtime = boto3.client('runtime.sagemaker', config=config)

data = np.ones(shape=(1, 20, 2)).tolist()
payload = json.dumps(data)

response = runtime.invoke_endpoint(EndpointName=sagemaker_configs["SageMakerEndpointName"],
ContentType='application/json',
Body=payload)
out = json.loads(response['Body'].read().decode())[0]

print("Given the sample input data, the predicted probability of failure is {:0.2f}%".format(100*(1.0-out[0])))

Given the sample input data, the predicted probability of failure is 34.60%.

Clean up

When you’ve finished with this solution, make sure that you delete all unwanted AWS resources. On the Predictive Maintenance for Vehicle Fleets page, under Delete solution, choose Delete all resources to delete all the resources associated with the solution.

clean up

You need to manually delete any extra resources that you may have created in this notebook. Some examples include the extra S3 buckets (to the solution’s default bucket) and the extra SageMaker endpoints (using a custom name).

Customize the solution

Our solution is simple to customize. To modify the input data visualizations, refer to sagemaker/3_data_visualization.ipynb. To customize the machine learning, refer to sagemaker/source/train.py and sagemaker/source/dl_utils/network.py. To customize the dataset processing, refer to sagemaker/1_introduction.ipynb on how to define the config file.

Additionally, you can change the configuration in the config file. The default configuration is as follows:

fleet_info_fn=data/example_fleet_info.csv
fleet_sensor_logs_fn=data/example_fleet_sensor_logs.csv
vehicle_id_column=vehicle_id
timestamp_column=timestamp
target_column=target
period_ms=30000
dataset_size=10000
window_length=20
chunksize=10000
processing_chunksize=1000
fleet_dataset_fn=data/processed/fleet_dataset.csv
train_dataset_fn=data/processed/train_dataset.csv
test_dataset_fn=data/processed/test_dataset.csv
period_column=period_ms

The config file has the following parameters:

  • fleet_info_fn, fleet_sensor_logs_fn, fleet_dataset_fn, train_dataset_fn, and test_dataset_fn define the location of dataset files
  • vehicle_id_column, timestamp_column, target_column, and period_column define the headers for columns
  • dataset_size, chunksize, processing_chunksize, period_ms, and window_length define the properties of the dataset

Conclusion

In this post, we showed you how to train and deploy a model to predict vehicle fleet failure probability using SageMaker JumpStart. The solution is based on ML and deep learning models and allows a wide variety of input data including any time-varying sensor data. Because every vehicle has different telemetry on it, you can fine-tune the provided model to the frequency and type of data that you have.

To learn more about what you can do with SageMaker JumpStart, refer to the following:

Resources


About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.