Migrate On-Premises Machine Learning Operations to Amazon SageMaker Pipelines for Computer Vision
By Na Yu and Caitlin Berger, Sr. Big Data Engineers – Mission Cloud
By Ryan Ries, Practice Lead – Mission Cloud Services
By Qiong Zhang, Kris Skrinak and Cristian Torres Salamanca, Sr. Partner Solutions Architects – AWS
Amazon SageMaker Pipelines is a workflow orchestration tool for building machine learning (ML) pipelines with continuous integration and continuous delivery (CI/CD) capabilities. SageMaker Pipelines helps machine learning operations (MLOps) teams automate and scale the entire ML lifecycle, including model building, training, deployment, and monitoring, for lower costs and faster time to market.
When migrating on-premises MLOps to Amazon SageMaker Pipelines, customers often find it challenging to monitor metrics in training scripts and add inference scripts for custom ML models.
There are a few examples from Amazon Web Services (AWS) showing how to use SageMaker Pipelines with training and inference scripts for AWS built-in algorithms. However, there is no such example with custom ML models for computer vision. Many customers also want to know best practices for SageMaker Pipelines drawn from production use cases.
Mission Cloud implemented an end-to-end SageMaker pipeline covering the workflow from model development to production, accelerating its customer’s computer vision model production process.
In this post, you will learn how to train and deploy customized training and inference scripts with SageMaker Script Mode for Bring Your Own Model (BYOM) in SageMaker Pipelines. You’ll learn best practices for building an end-to-end MLOps workflow for a computer vision model with SageMaker Pipelines, including:
- How to train with a customized PyTorch script, including custom metrics tracked on Amazon CloudWatch.
- How to reduce training costs by using Amazon EC2 Spot instances and checkpointing with SageMaker Pipelines.
- How to register (with ModelStep) and deploy models with customized inference scripts in SageMaker Pipelines.
Mission Cloud is an AWS Premier Tier Services Partner and Managed Service Provider (MSP) with the Machine Learning Consulting Competency. Also an AWS Marketplace Seller, Mission Cloud delivers a comprehensive suite of services to help businesses migrate, manage, modernize, and optimize their AWS cloud environments.
Detectron2 is an open-source computer vision library that provides object detection and segmentation models. There are many existing examples of how to train and deploy Detectron2 as a PyTorch model, so we’ll focus instead on taking an existing PyTorch training script and building an end-to-end SageMaker pipeline with BYOM.
The following diagram illustrates the architecture for our modeling pipeline, where SageMaker Pipelines is used to automate the ML steps, including training, model evaluation, a metrics condition check, a fail step, model registration, and an AWS Lambda step. The Lambda step invokes a Lambda function that deploys an endpoint, so researchers can do ad-hoc testing on the model.
Figure 1 – Reference architecture on AWS.
The diagram below depicts the workflow of the modeling pipeline generated by SageMaker Pipelines for our Detectron2 object detection model training.
Figure 2 – ML workflow generated by SageMaker Pipelines.
Sample Code for SageMaker Pipelines
This section shows the pipeline development step by step using SageMaker Pipelines with code samples.
Pipeline Parameters Setup
Pipeline parameters allow you to assign values to variables at runtime. This sample code shows the definitions for the seven pipeline parameters used in this modeling pipeline.
We use a PyTorch estimator to bring our own customized training script. It uses Detectron2 for object detection, and the details of this training script can be found in train.py.
Define Customized Metrics Associated with the Training Script
SageMaker training jobs allow you to publish custom metrics from your training script to track the progress of training over time. These metrics can be viewed directly through SageMaker Experiments, the SageMaker console, and Amazon CloudWatch.
The sample code below shows how to define the customized metrics for training and testing average precision in the modeling pipeline notebook. This metric_definitions list is used as an input parameter of the estimator for the training job.
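The original snippet is not available, but the metric definitions could be sketched as follows. The metric names and regex patterns are assumptions and must match whatever train.py actually prints.

```python
# Illustrative metric definitions; the Name and Regex values are
# assumptions and must match what train.py actually prints to its logs.
metric_definitions = [
    {"Name": "training:AP", "Regex": "training average precision: ([0-9\\.]+)"},
    {"Name": "testing:AP", "Regex": "testing average precision: ([0-9\\.]+)"},
]
```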
Each metric definition includes two keys:

- Name, which defines the metric name.
- Regex, which defines a regular expression used to detect the metric. SageMaker searches all logs published to CloudWatch by the training job, so anything the training job writes with a print or logging statement can be matched.

If a log message contains a metric matching the regex above, it will be recorded in the SageMaker Experiments console.
Figure 3 – Screenshot from SageMaker Experiments.
Correspondingly, the training script (train.py) must print log messages containing the metrics consistent with the regex defined in the code sample above.
The following code sample needs to be added to the training script to write logs matching this regex, so the metrics will be recognized and recorded when the training job runs. Note that these two metric examples are not in the training script given above; we assume the average precision values have been calculated and assigned earlier in the training script, and we are simply printing them to the logs in the correct format.
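Since the original snippet is missing, here is a sketch of the logging lines, assuming hypothetical variables training_ap and testing_ap hold the computed average precisions:

```python
# Assumes training_ap and testing_ap were computed earlier in train.py;
# the values here are placeholders for illustration only.
training_ap = 0.91
testing_ap = 0.88

# The printed text must match the Regex entries in metric_definitions.
print(f"training average precision: {training_ap}")
print(f"testing average precision: {testing_ap}")
```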
Enable Training with Spot Instances
Amazon SageMaker makes it easy to take advantage of the cost savings available through Amazon EC2 Spot instances. You can use Spot instances for training jobs and save up to 90% on training costs over on-demand instance pricing. You can see the estimated percentage of cost savings in the SageMaker training job console.
To enable training with Spot instances, we’ll need to set values for use_spot_instances, max_wait, and max_run; make sure you set max_wait larger than max_run. The SageMaker Estimator documentation provides detailed explanations of each of these parameters.

The maximum value AWS allows for max_run is 28 days, although your own AWS account may have a higher limit if requested. The recommended best practice is to set max_run larger than the estimated length of the training job, with additional buffer time. In this example, since the training job takes about 10 minutes to finish, max_run is set to 30 minutes (1,800 seconds), as shown in the example below. Both max_run and max_wait are used in the estimator definition in the section “Define the PyTorch Estimator.”
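A minimal sketch of these settings, with values assumed from the discussion above:

```python
# Assumed values; adjust to your own training job length.
use_spot_instances = True
max_run = 1800    # 30 minutes: upper bound on training time, in seconds
max_wait = 3600   # total time including Spot waits; must be larger than max_run

# These three values are later passed to the PyTorch estimator, e.g.:
# PyTorch(..., use_spot_instances=use_spot_instances,
#         max_run=max_run, max_wait=max_wait, checkpoint_s3_uri=...)
```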
Configure Dynamic Checkpoint Paths Through a Pipeline Parameter
Checkpoints are model snapshots used to save the state of machine learning models during training; checkpoint_s3_uri in the PyTorch estimator is where the checkpoint path is specified.
In our modeling pipeline, after one pipeline run is finished, a checkpoint file is generated and saved to the specified checkpoint path. When you re-execute the pipeline (to change the learning rate, for example), an error occurs in the training job if you do not update the checkpoint path to a new unique value, because the training job tries to continue from the saved checkpoint.
Our solution is to pass the checkpoint path as a pipeline parameter so it can be re-defined each time it’s run with a unique path. This is shown in the “Execute the pipeline” section.
Define the PyTorch Estimator
The following code shows how to define a PyTorch estimator using the parameters defined above. Make sure the train.py script is in the code folder.
Evaluate the Model with a Processing Step
This sample code shows how to evaluate models by using a processing step.
In the ProcessingStep, we use a PyTorch processor with the same framework version as our training step. This determines which container is used to run the provided evaluation script.
Next, provide an output directory using the outputs parameter. This is the local path on the processing job instance; anything you write to this path is saved as an output in Amazon Simple Storage Service (Amazon S3).
Then, set a property file with the property_files parameter. This property file is used to pass values or metrics between this step and the condition step. The property file must be written to the above output directory as part of the processing job.
Finally, select the script to use for your model evaluation with the code parameter. The evaluation script must contain the code below in order to create and write to the specified property file.
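The original evaluation snippet is missing; a sketch of the property-file writing logic might look like this. The metric value, helper name, and JSON layout are assumptions, chosen to match the training_accuracy path read by the condition step later in this post.

```python
import json
import pathlib

def write_evaluation_report(training_accuracy: float, output_dir: str) -> str:
    """Write the metrics property file that the condition step will read."""
    # JSON layout is an assumption; the json_path used by the condition
    # step must match it (metrics.training_accuracy.value).
    report = {"metrics": {"training_accuracy": {"value": training_accuracy}}}
    out = pathlib.Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "evaluation.json"
    path.write_text(json.dumps(report))
    return str(path)

# In the processing job, output_dir would be the directory configured on
# the step's outputs parameter, e.g. "/opt/ml/processing/evaluation".
```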
For this step, we also need to add a dependency on the model training step using the add_depends_on method, to make sure the evaluation step only runs after the training job has finished.
Register the Model with ModelStep
To register our model to SageMaker Model Registry, we use ModelStep, a new feature in the SageMaker software development kit (SDK). ModelStep either packages and creates the model, or packages the model and registers it in SageMaker Model Registry. Packaging the model zips the model artifacts with the code necessary for inference and defines which container should be used to deploy the model.
Any class that extends the base SageMaker Model can be used with ModelStep. Here, we use PyTorchModel. We take the model artifacts resulting from the training step above and package them with our customized inference script, inference.py.
Note that both SageMaker Estimator and Model can be used to create model objects for deployment. However, the Model object is a lower-level class that allows us to repackage the model with a customized inference script for deployment.
In ModelStep, we register the model to the specified model package group and provide content and response types, as well as the allowed instance types for inference and transform jobs. Note that response_types should be configured to correspond correctly with the inference script; otherwise, the deployment pipeline will run into an error.
We set the approval status to be “approved” so that we can later deploy our model without manual intervention.
If the evaluation metric does not pass the user-defined threshold in the condition step, a fail step below is defined to output a customized message.
We use a condition step to direct which steps to take if the condition returns true (the register step) or false (the fail step). The condition reads the training_accuracy metric from the property file produced by the evaluation step and compares training_accuracy with a threshold of 0.90.
Deploy Model Endpoint with the AWS Lambda Step
The sample code below shows how we use the Lambda step to deploy a SageMaker endpoint.
First, create a Lambda function containing the code below.
Note that the AWS Identity and Access Management (IAM) role assigned to the Lambda function should already be configured with the appropriate permissions.
Next, create a Lambda step in the pipeline notebook to invoke the Lambda function. The input parameters of the Lambda step include the Amazon Resource Name (ARN) of the above Lambda function, the registered model ARN output from the model registration step, and the endpoint instance type.
Define and Submit the Pipeline
The following code shows how to define and submit the pipeline by using all of the defined pipeline parameters.
Execute the Pipeline
There are two ways to execute the pipeline after it has been submitted. One option is to use start() to execute the pipeline with specified input parameters, as shown below. This sample code shows how two pipeline parameters, the checkpoint path (s3Path) and the learning rate (LearningRate), can be updated when executing the pipeline.
Another option is to execute the pipeline from the SageMaker Studio console. In the SageMaker Pipelines dashboard, select the pipeline created and then select “Create execution.” This should bring up a window that allows you to fill in the input parameters.
Click “Start” to trigger a new pipeline execution. The values entered from the console override the values set when the input parameters were defined.
Figure 4 – Screenshot to start an execution of a pipeline.
This post demonstrated how to build end-to-end Amazon SageMaker Pipelines with customized training and inference scripts for a computer vision model. We showed how to leverage SageMaker features, including SageMaker PyTorch Estimator, PyTorchProcessor, and PyTorchModel to simplify each pipeline step and to optimize compute cost.
The best practices highlighted in this post can help you explore the capabilities of SageMaker and simplify your MLOps pipeline building process.
Mission Cloud thanks Authentic ID for the partnership that allowed Mission to collaborate, define, and build Amazon SageMaker workflows. Authentic ID is a leading identity verifier utilizing cloud-native ML frameworks to realize its mission of “Identity made simple.”
You can also learn more about Mission Cloud in AWS Marketplace.
Mission Cloud – AWS Partner Spotlight
Mission Cloud is an AWS Premier Tier Services Partner and MSP that delivers a comprehensive suite of services to help businesses migrate, manage, modernize, and optimize their AWS cloud environments.