AWS Machine Learning Blog

Automating model retraining and deployment using the AWS Step Functions Data Science SDK for Amazon SageMaker

As machine learning (ML) becomes a larger part of companies’ core business, there is a greater emphasis on reducing the time from model creation to deployment. In November of 2019, AWS released the AWS Step Functions Data Science SDK for Amazon SageMaker, an open-source SDK that allows developers to create Step Functions-based machine learning workflows in Python. You can now use the SDK to create reusable model deployment workflows with the same tools you use to develop models. You can find the complete notebook for this solution in the “automate_model_retraining_workflow” folder of our GitHub repo.

This post demonstrates the capabilities of the Data Science SDK with a common use case: scheduled model retraining and deployment. You create a serverless workflow that trains an ML model, checks the model's performance against a validation dataset, and deploys the model to production if its accuracy surpasses a set threshold. Finally, the post shows how to trigger the workflow on a periodic schedule.

The following diagram shows this serverless workflow, orchestrated with AWS Step Functions.

This post uses the following AWS services:

  • AWS Step Functions allows you to coordinate several AWS services into a serverless workflow. You can design and run workflows in which the output of one step acts as the input to the next step, and embed error handling into the workflow.
  • Amazon SageMaker is a fully managed service that provides developers and data scientists with the tools to build, train, and deploy different types of ML models.
  • AWS Glue is a fully managed extract, transform, and load (ETL) service. You can point AWS Glue at a supported data store, and it generates the code to extract the data and load it into your target data store. AWS Glue runs on a distributed Apache Spark environment, which allows you to take advantage of Spark without managing the infrastructure.
  • AWS Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda executes your code only when triggered and scales automatically, from a few requests per day to thousands per second.
  • Amazon EventBridge is a serverless event bus that makes it easy to connect applications using data from your own applications, integrated SaaS applications, and AWS services.

Overview of the SDK

The SDK provides a new way to use AWS Step Functions. A Step Functions workflow is a state machine that consists of a series of discrete steps. Each step can perform work, make choices, initiate parallel execution, or manage timeouts. You can develop individual steps and use Step Functions to handle the triggering, coordination, and state of the overall workflow. Before the Data Science SDK, you had to define workflows in the JSON-based Amazon States Language. With the SDK, you can now create, execute, and visualize workflows directly in Python code.

This post provides an overview of the SDK, including how to create Step Function steps, work with parameters, integrate service-specific capabilities, and link these steps together to create and visualize a workflow. You can find several code examples throughout the post; however, we created a detailed Amazon SageMaker notebook of the entire process. For more information, see our GitHub repo.

Steps, parameters, and dynamic workflows

Within a Step Function, each step passes its output to the next. You can use these outputs in the following steps to create dynamic workflows. You can also pass input parameters for each Step Function execution. Parameters allow you to keep your workflow general so it can support other projects.

To use the SDK to define the required input parameters for your workflow, see the following code:

from stepfunctions.inputs import ExecutionInput

# Define the schema of parameters passed in at execution time
execution_input = ExecutionInput(schema={
    'TrainingJobName': str,
    'GlueJobName': str,
    'ModelName': str,
    'EndpointName': str,
    'LambdaFunctionName': str
})

Built-in service integrations

The Data Science SDK integrates with several AWS services. The integrations allow you to directly control the supported services, without needing to write API calls. This post uses the AWS Glue, Lambda, and Amazon SageMaker integrations. For more information, see AWS Step Functions Service Integrations.

For model retraining, you first need to retrieve the latest data. You also need to enrich the raw data and save it in a file format and location that your ML model supports. AWS Glue connects to most data stores, supports custom scripting in Python, and doesn't require you to manage servers. Use AWS Glue to start your workflow by reading data from your production data store and writing the transformed data to Amazon S3.

The Data Science SDK makes it easy to add an AWS Glue job to your workflow. The AWS Glue job itself specifies the data source location, Python code for ETL, and file destination to use. All the SDK requires is the name of the AWS Glue job as a parameter for the GlueStartJobRunStep. For more information, see Getting Started with AWS Glue ETL on YouTube.
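The AWS Glue script itself lives outside the workflow definition. To make the flow concrete, here is a minimal sketch of what such a script might look like; the database, table, and S3 path below are placeholders rather than the values used in the notebook:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Read the raw data from a placeholder Glue Data Catalog table
raw_data = glue_context.create_dynamic_frame.from_catalog(
    database='your_database',      # placeholder
    table_name='your_table'        # placeholder
)

# ...apply any transformations your model needs here...

# Write the transformed data to Amazon S3 as CSV for the training job
glue_context.write_dynamic_frame.from_options(
    frame=raw_data,
    connection_type='s3',
    connection_options={'path': 's3://your-bucket/training-data/'},   # placeholder
    format='csv'
)

job.commit()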

Use an input parameter so you can choose your AWS Glue job at runtime:

from stepfunctions import steps

# The Glue job name is resolved from the workflow's execution input at runtime
etl_step = steps.GlueStartJobRunStep(
    'Extract, Transform, Load',
    parameters={"JobName": execution_input['GlueJobName']}
)

After you extract and save the input data, train a model using the SDK’s TrainingStep. Amazon SageMaker handles the underlying compute resources, but you need to specify the algorithm, hyperparameters, and data sources for training. See the following code:

training_step = steps.TrainingStep(
    'Model Training',
    estimator=xgb,
    data={
        'train': sagemaker.s3_input(train_data, content_type='csv'),
        'validation': sagemaker.s3_input(validation_data, content_type='csv')
    },
    job_name=execution_input['TrainingJobName']
)

The estimator in the preceding code, xgb, encapsulates the XGBoost algorithm and its hyperparameters. For more information about how to define an estimator, see the GitHub repo.
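For context, a rough sketch of such an estimator follows. It assumes the SageMaker Python SDK v1 interface used elsewhere in this post, and the role, output path, and hyperparameters shown are placeholders; see the notebook for the exact configuration:

import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

session = sagemaker.Session()
role = sagemaker.get_execution_role()   # assumes a SageMaker execution role is available

# Built-in XGBoost container image for the current region
container = get_image_uri(session.boto_region_name, 'xgboost')

xgb = sagemaker.estimator.Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    output_path='s3://your-bucket/models/',   # placeholder S3 output location
    sagemaker_session=session
)

# Example hyperparameters only; see the notebook for the values actually used
xgb.set_hyperparameters(objective='binary:logistic', num_round=100)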

The Step Function workflow remains in the training step until training completes. Afterwards, it needs to retrieve the training results so that your workflow can branch based on the accuracy of the new model. Use a Step Functions LambdaStep to call Lambda to run a simple Python function that queries the Amazon SageMaker training job and returns the results. To add a Lambda state with the SDK, specify the function name and payload. This post uses JSON paths to select the TrainingJobName in the Lambda function payload so it knows which training job to query. See the following code:

lambda_step = steps.compute.LambdaStep(
    'Query Training Results',
    parameters={
        "FunctionName": execution_input['LambdaFunctionName'],
        "Payload": {"TrainingJobName.$": "$.TrainingJobName"}
    }
)
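The Lambda function's code isn't part of the workflow definition. A minimal sketch of such a handler, using the boto3 describe_training_job API and returning the final metrics under a trainingMetrics key (the key a later choice rule reads), might look like the following; the real function in the notebook may filter for a specific metric such as validation error:

import boto3

sm_client = boto3.client('sagemaker')

def lambda_handler(event, context):
    # The Step Functions payload passes in the training job to query
    job_name = event['TrainingJobName']
    response = sm_client.describe_training_job(TrainingJobName=job_name)

    # Return only JSON-serializable fields of the final metrics
    metrics = [
        {'MetricName': m['MetricName'], 'Value': m['Value']}
        for m in response['FinalMetricDataList']
    ]
    return {'TrainingJobName': job_name, 'trainingMetrics': metrics}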

To deploy the model after training, you need to create a model object and deployment configuration from the training artifacts using the ModelStep and EndpointConfigStep from the SDK. See the following code:

model_step = steps.ModelStep(
    'Save Model',
    model=training_step.get_expected_model(),
    model_name=execution_input['ModelName'],
    result_path='$.ModelStepResults'
)

endpoint_config_step = steps.EndpointConfigStep(
    "Create Model Endpoint Config",
    endpoint_config_name=execution_input['ModelName'],
    model_name=execution_input['ModelName'],
    initial_instance_count=1,
    instance_type='ml.m4.xlarge'
)

Finally, the workflow can deploy the new model as a managed API endpoint using the EndpointStep. The update parameter, when set to True, updates an existing Amazon SageMaker endpoint instead of creating a new one. Because the endpoint doesn't exist yet, the first execution creates it with update=False; the scheduling section later in this post shows how to switch the deployed state machine to update the endpoint on subsequent runs. See the following code:

endpoint_step = steps.EndpointStep(
    'Update Model Endpoint',
    endpoint_name=execution_input['EndpointName'],
    endpoint_config_name=execution_input['ModelName'],
    update=False
)

Control flow and linking states

The Step Functions SDK’s Choice state supports branching logic based on the outputs from previous steps. You can create dynamic and complex workflows by adding this state.

This post creates a step that branches based on the results of your Amazon SageMaker training step. See the following code:

check_accuracy_step = steps.states.Choice(
    'Accuracy > 90%'
)

Add the branches and branching logic to the step. Choice states support multiple data types and compound Boolean expressions; for this post, however, you compare two numeric values: the model's error on the validation dataset, taken from the training results, and a set threshold corresponding to 90% accuracy. The training results report the model's error, calculated as (number of wrong cases)/(number of all cases), so accuracy is above 90% when the measured error is below 10% (0.10).

For more information, see Choice Rules.

Add the following comparison rule:

threshold_rule = steps.choice_rule.ChoiceRule.NumericLessThan(
    variable=lambda_step.output()['Payload']['trainingMetrics'][0]['Value'],
    value=.10
)

check_accuracy_step.add_choice(rule=threshold_rule, next_step=endpoint_config_step)
check_accuracy_step.default_choice(next_step=fail_step)
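The fail_step referenced in the default choice is a terminal Fail state that ends the workflow when the accuracy check doesn't pass. A minimal sketch, assuming a Fail state similar to the one defined in the notebook:

fail_step = steps.states.Fail(
    'Model Accuracy Too Low',
    comment='Validation accuracy lower than the threshold.'
)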

The choice rule specifies the next step in the workflow if the rule evaluates to true. So far, you have created your steps but haven't linked them into an order of execution. The SDK offers two ways to link steps. First, you can use the next() method to specify the next step for an individual step. See the following code:

endpoint_config_step.next(endpoint_step)

You can also use the Chain class to link multiple steps together at once. See the following code:

workflow_definition = steps.Chain([
    etl_step,
    training_step,
    model_step,
    lambda_step,
    check_accuracy_step
])

Workflow creation

After you define and order all your steps, create the Step Function itself with the following code:

from stepfunctions.workflow import Workflow

workflow = Workflow(
    name='MyInferenceRoutine_{}'.format(id),  # id is a unique suffix generated earlier, for example a UUID
    definition=workflow_definition,
    role=workflow_execution_role,
    execution_input=execution_input
)

workflow.create()

After you create the workflow, workflow.render_graph() returns a diagram of the workflow, similar to what you would see in the Step Functions console, as shown below.

You are now ready to run your new deployment pipeline. You can run the workflow manually using the SDK's execute() method, or you can automate this task.
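As an example, a manual run might look like the following sketch; the Glue job, endpoint, and Lambda function names are placeholders for the resources you created earlier:

import uuid

# Give each execution unique training job and model names
unique_id = uuid.uuid1().hex

execution = workflow.execute(inputs={
    'TrainingJobName': 'retraining-job-{}'.format(unique_id),
    'GlueJobName': 'your-glue-etl-job',
    'ModelName': 'retraining-model-{}'.format(unique_id),
    'EndpointName': 'your-sagemaker-endpoint',
    'LambdaFunctionName': 'your-query-training-results-function'
})

# Optionally track progress from the notebook
execution.render_progress()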

Scheduling a workflow using an EventBridge trigger

You can schedule your workflow using EventBridge triggers. This post shows how to create a rule within EventBridge to invoke the target Step Function on a set schedule. For more information, see Creating an EventBridge Rule that Triggers on an Event from an AWS Resource.

Complete the following console steps. (A scripted alternative using boto3 is sketched after these steps.)

First, modify the existing state machine to update an existing SageMaker endpoint instead of creating a new one.

  1. On the AWS Management Console, under Services, choose Step Functions.
  2. Select the state machine you previously created and choose Edit.
  3. Navigate to the JSON representation of the state machine. Under the “Update Model Endpoint” state, change the Resource value from arn:aws:states:::sagemaker:createEndpoint to arn:aws:states:::sagemaker:updateEndpoint.
  4. Choose Save. You will be prompted with an IAM role warning. If you are following along with the blog, the Step Functions IAM role already has the required permissions, so choose Save anyway.

Next, define the schedule for the Step Function trigger.

  1. On the AWS Management Console, under Services, choose Amazon EventBridge.
  2. Choose Rules.
  3. Choose Create rule.
  4. Under Name and description, for Name, enter the name of your rule. This post uses the name automate-model-retraining-trigger.
  5. As an optional step, for Description, enter a description of your rule.
  6. For Define pattern, select Schedule.
  7. For Fixed rate every, enter 1 and choose Hours.
  8. Under Select event bus, select AWS default event bus.
  9. Select Enable the rule on the selected event bus.
  10. Under Select targets, for Target, choose Step Functions state machine.
  11. For State machine, choose your machine.
  12. Select Configure input, then select Input transformer. The input transformer customizes the event text passed to the target; here, you pass a unique ID that gives each SageMaker training job and model a distinct name.
  13. In the input path text box, enter {"id":"$.id"}. This captures the unique event ID from the scheduled trigger.
  14. In the input template text box, enter the following text, replacing the placeholder values with the names of your AWS Glue job, SageMaker endpoint, and Lambda function.
    {
        "TrainingJobName": <id>,
        "GlueJobName": "Enter the name of your Glue job",
        "ModelName": <id>,
        "EndpointName": "Enter the name of your SageMaker endpoint",
        "LambdaFunctionName": "Enter the name of your Lambda function"
    }
  15. Select Create a new role for this specific resource.
  16. Enter the name of your role. If you have an existing role, select Use existing role instead.
  17. Choose Create.
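If you prefer to script the trigger rather than use the console, a rough boto3 sketch follows. The rule name mirrors the one above, while the state machine and IAM role ARNs are placeholders; the role must allow EventBridge to start executions of your state machine:

import boto3

events = boto3.client('events')

# Create (or update) a rule that fires every hour on the default event bus
events.put_rule(
    Name='automate-model-retraining-trigger',
    ScheduleExpression='rate(1 hour)',
    State='ENABLED'
)

# Point the rule at the state machine and pass the event ID through an input transformer
events.put_targets(
    Rule='automate-model-retraining-trigger',
    Targets=[{
        'Id': 'model-retraining-state-machine',
        'Arn': 'arn:aws:states:us-east-1:111122223333:stateMachine:MyInferenceRoutine',  # placeholder ARN
        'RoleArn': 'arn:aws:iam::111122223333:role/EventBridgeStepFunctionsRole',        # placeholder role
        'InputTransformer': {
            'InputPathsMap': {'id': '$.id'},
            'InputTemplate': (
                '{'
                '"TrainingJobName": <id>,'
                '"GlueJobName": "your-glue-job",'
                '"ModelName": <id>,'
                '"EndpointName": "your-sagemaker-endpoint",'
                '"LambdaFunctionName": "your-lambda-function"'
                '}'
            )
        }
    }]
)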

Summary

This post provided an overview of the AWS Step Functions Data Science SDK for Amazon SageMaker and showed how to create a reusable model deployment workflow using Python. The workflow included an AWS Glue job to extract and transform your data, a training step to train your ML model with new data, a Lambda step to query the training results, a model step to create a model from the training artifacts, an endpoint configuration step to define the deployment parameters, and an endpoint step to deploy the updated model to an existing endpoint. The post also showed how to use EventBridge to trigger the workflow automatically on a given schedule.

For additional technical documentation and example notebooks related to the SDK, please see the AWS Step Functions Data Science SDK for Amazon SageMaker announcement page.

If you have questions or suggestions, please leave a comment.


About the authors

Sean Wilkinson is a Solutions Architect at AWS focusing on serverless and machine learning.

Julia Soscia is a Solutions Architect at Amazon Web Services based out of New York City. Her main focus is to help customers create well-architected environments on the AWS cloud platform. She is an experienced data analyst with a focus in Analytics and Machine Learning.