Deploy a Machine Learning Model to a Serverless Inference Endpoint

TUTORIAL

Overview

In this tutorial, you learn how to deploy a trained machine learning (ML) model to a serverless inference endpoint using Amazon SageMaker Studio. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers while providing high availability, built-in fault tolerance, and automatic scaling.

Amazon SageMaker Studio is a fully integrated development environment (IDE) for ML that provides a fully managed Jupyter notebook interface in which you can perform end-to-end ML lifecycle tasks. You can create and explore data sets, prepare training data, build and train models, and deploy trained models for inference—all from within SageMaker Studio.

SageMaker offers different inference options to support a broad range of use cases: Real-Time Inference for low-latency workloads, Serverless Inference for workloads with intermittent or unpredictable traffic, Asynchronous Inference for large payloads or long processing times, and Batch Transform for offline predictions on entire datasets.

Use Case: Auto Insurance Fraud Detection

In this tutorial, you will use a binary classification XGBoost model that has been trained on a synthetically generated auto insurance claims dataset. The training dataset contained details and extracted features on claims and customers, along with a fraud column indicating whether a claim was fraudulent or not. For inference, the model predicts the probability that a claim is fraudulent. In this tutorial, as the machine learning engineer, you deploy this model to a serverless inference endpoint and run sample inferences from within SageMaker Studio.

What you will accomplish

In this tutorial, you will:

  • Create a SageMaker model from a trained model artifact
  • Configure and deploy a serverless endpoint serving a SageMaker model
  • Invoke the deployed endpoint and run inference on test data

Prerequisites

Before starting this tutorial, you will need:

  • AWS experience: Beginner/Intermediate
  • Minimum time to complete: 18 minutes
  • Cost to complete: See Amazon SageMaker pricing to estimate the cost for this tutorial.
  • Requires: You must be logged in to an AWS account.
  • Services used: Amazon SageMaker Serverless Inference, Amazon SageMaker Studio
  • Last updated: April 25, 2023

Step 1: Set up your Amazon SageMaker Studio domain

With Amazon SageMaker, you can deploy models visually using the console or programmatically using either SageMaker Studio or SageMaker notebooks. In this tutorial, you deploy the models programmatically using a SageMaker Studio notebook, which requires a SageMaker Studio domain.
 
An AWS account can have only one SageMaker Studio domain per AWS Region. If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio setup guide to attach the required AWS IAM policies to your SageMaker Studio account, then skip Step 1 and proceed directly to Step 2.
 
If you don't have an existing SageMaker Studio domain, continue with Step 1 to run an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.
 
Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-catalog and should not be changed. The stack takes about 10 minutes to create all the resources.
 
This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC.

Select I acknowledge that AWS CloudFormation might create IAM resources, and then choose Submit.

In the CloudFormation pane, choose Stacks. When the stack is created, its status changes from CREATE_IN_PROGRESS to CREATE_COMPLETE.
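If you prefer to poll the stack status programmatically instead of refreshing the console, the following is a minimal sketch using Boto3. This is not part of the tutorial's notebook, and it assumes AWS credentials for the account are configured in the environment where you run it.

import boto3

cf_client = boto3.client("cloudformation", region_name="us-east-1")

# Look up the stack created by the template and print its current status
stack = cf_client.describe_stacks(StackName="CFN-SM-IM-Lambda-catalog")["Stacks"][0]
print(stack["StackStatus"])  # CREATE_IN_PROGRESS while creating, CREATE_COMPLETE when done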

Step 2: Set up a SageMaker Studio notebook

In this step, you launch a new SageMaker Studio notebook, install the necessary open-source libraries, and configure the SageMaker variables required to fetch the trained model artifact from Amazon S3. Because a model artifact cannot be deployed for inference directly, you first need to create a SageMaker model from it. The created model contains the training and inference code that SageMaker uses for model deployment.

Enter SageMaker Studio into the console search bar, and then choose SageMaker Studio.

Choose US East (N. Virginia) from the Region dropdown list in the upper right corner of the SageMaker console. Select Studio from the left navigation pane, and then open SageMaker Studio using the studio-user profile.

Open the SageMaker Studio interface. On the navigation bar, choose File, New, Notebook.
In the Set up notebook environment dialog box, under Image, select Data Science. The Python 3 kernel is selected automatically. Choose Select.
The kernel on the top right corner of the notebook should now display Python 3 (Data Science).

Copy and paste the following code snippet into a cell in the notebook, and press Shift+Enter to run it. The command updates the aiobotocore library, which is used to interact with many AWS services. Ignore any warnings to restart the kernel or any dependency conflict errors.

%pip install --upgrade -q aiobotocore 

You also need to instantiate the S3 client object and set the location of the read S3 bucket. The read bucket is a public S3 bucket named sagemaker-sample-files, which contains the trained model artifacts. Copy and paste the following code block and run the cell.

import pandas as pd
import boto3
import sagemaker
import time
import json
import io
from io import StringIO
import base64
import re

from sagemaker.image_uris import retrieve

sess = sagemaker.Session()

region = sess.boto_region_name
s3_client = boto3.client("s3", region_name=region)
sm_client = boto3.client("sagemaker", region_name=region)
sm_runtime_client = boto3.client("sagemaker-runtime")

sagemaker_role = sagemaker.get_execution_role()


# S3 locations used for parameterizing the notebook run
read_bucket = "sagemaker-sample-files"
read_prefix = "datasets/tabular/synthetic_automobile_claims" 
model_prefix = "models/xgb-fraud"

# S3 location of trained model artifact
model_uri = f"s3://{read_bucket}/{model_prefix}/fraud-det-xgb-model.tar.gz"

# S3 location of test data
test_data_uri = f"s3://{read_bucket}/{read_prefix}/test.csv"
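Optionally, you can confirm that the model artifact and test data exist at these locations before moving on. The following is a quick sanity check using the S3 client created above; head_object raises a ClientError if an object is missing.

# Verify that the trained model artifact is present at the expected S3 location
s3_client.head_object(Bucket=read_bucket, Key=f"{model_prefix}/fraud-det-xgb-model.tar.gz")

# Verify that the test data is present as well
s3_client.head_object(Bucket=read_bucket, Key=f"{read_prefix}/test.csv")

print("Model artifact and test data found.")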

Step 3: Create a serverless inference endpoint

As mentioned earlier, SageMaker provides different avenues to deploy a model. For this tutorial, you deploy the model to a serverless inference endpoint. SageMaker Serverless Inference is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts (the extra latency incurred when the endpoint launches compute resources after an idle period). With a pay-per-use model, Serverless Inference is a cost-effective option if you have an infrequent or unpredictable traffic pattern.

In SageMaker, you can deploy a trained model to a serverless inference endpoint in multiple ways, namely the SageMaker Python SDK, the AWS SDK-Boto3, and the SageMaker console (see Deploy Models for Inference in the Amazon SageMaker Developer Guide for more details). The SageMaker SDK offers a higher level of abstraction, while the AWS SDK-Boto3 exposes lower-level APIs that allow more control over model deployment. In this tutorial, you deploy the model using the AWS SDK-Boto3; for comparison, a sketch of the equivalent SageMaker SDK deployment follows the list below. You need to follow three steps in sequence to deploy a model:

  1. Create model—Create a SageMaker model that can be used in SageMaker hosting services for deployment.
  2. Create endpoint configuration—Configure the endpoint to serve the model by specifying properties such as the SageMaker model. This is the step where you specify the serverless type of the endpoint.
  3. Create endpoint—Create the model-serving endpoint using the specified endpoint configuration.
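For comparison, the higher-level SageMaker Python SDK collapses these three steps into a single deploy call. The following is a minimal sketch, not run as part of this tutorial, assuming the model_uri, sagemaker_role, and sess variables from Step 2 and the XGBoost training_image retrieved in step 3.1 below; the endpoint name is hypothetical.

from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

# Wrap the container image and model artifact in a Model object
sdk_model = Model(image_uri=training_image,
                  model_data=model_uri,
                  role=sagemaker_role,
                  sagemaker_session=sess)

# deploy() creates the model, endpoint config, and endpoint in one call
predictor = sdk_model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(memory_size_in_mb=4096,
                                                          max_concurrency=1),
    endpoint_name="fraud-detect-xgb-sdk-serverless-ep"  # hypothetical endpoint name
)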

 

3.1 – In this step, you create a SageMaker model using the trained model artifact stored in Amazon S3. Copy and paste the following code to create a SageMaker model. The create_model method takes in the Docker container containing the training image (for this model, the XGBoost container) and the S3 location of the trained model artifact (you specified this S3 path in the previous step).

# Retrieve the SageMaker managed XGBoost image
training_image = retrieve(framework="xgboost", region=region, version="1.3-1")

# Specify a unique model name that does not exist
model_name = "fraud-detect-xgb"
primary_container = {
                     "Image": training_image,
                     "ModelDataUrl": model_uri
                    }

model_matches = sm_client.list_models(NameContains=model_name)["Models"]
if not model_matches:
    model = sm_client.create_model(ModelName=model_name,
                                   PrimaryContainer=primary_container,
                                   ExecutionRoleArn=sagemaker_role)
else:
    print(f"Model with name {model_name} already exists! Change model name to create new")
You can check the created model in the SageMaker console under the Models section.

3.2 – Once a SageMaker model has been created, you can use the create_endpoint_config method provided by Boto3 to configure the endpoint. Copy and paste the following code to set up the endpoint configuration. For serverless endpoints, you do not need to specify any instance type or count. The main configuration parameters are the endpoint memory size and the maximum concurrency, which defines the maximum number of concurrent invocations the endpoint can process. The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB, and the maximum concurrency can be any integer between 1 and 200.

# Endpoint Config name
endpoint_config_name = f"{model_name}-serverless-epconfig"

# Endpoint config parameters
production_variant_dict = {
                           "VariantName": "Alltraffic",
                           "ModelName": model_name,
                           "ServerlessConfig": {"MemorySizeInMB": 4096, # Endpoint memory in MB
                                                "MaxConcurrency": 1 # Number of concurrent invocations
                                               }
                          }

# Create endpoint config if one with the same name does not exist
endpoint_config_matches = sm_client.list_endpoint_configs(NameContains=endpoint_config_name)["EndpointConfigs"]
if not endpoint_config_matches:
    endpoint_config_response = sm_client.create_endpoint_config(
                                                                EndpointConfigName=endpoint_config_name,
                                                                ProductionVariants=[production_variant_dict]
                                                               )
else:
    print(f"Endpoint config with name {endpoint_config_name} already exists! Change endpoint config name to create new")

You can check the created endpoint configuration in the SageMaker console under the Endpoint configurations section.

3.3 – The final step after setting the endpoint configuration is to create the endpoint itself. Copy and paste the following code to create the endpoint. The create_endpoint method takes in the endpoint configuration and creates the inference endpoint.

# Endpoint name
endpoint_name = f"{model_name}-serverless-ep"

# Create endpoint if one with the same name does not exist
endpoint_matches = sm_client.list_endpoints(NameContains=endpoint_name)["Endpoints"]
if not endpoint_matches:
    endpoint_response = sm_client.create_endpoint(
                                                  EndpointName=endpoint_name,
                                                  EndpointConfigName=endpoint_config_name
                                                 )
else:
    print(f"Endpoint with name {endpoint_name} already exists! Change endpoint name to create new")

resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
while status == "Creating":
    print(f"Endpoint Status: {status}...")
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
print(f"Endpoint Status: {status}")

To check the status of the endpoint, select Deployments from the Home menu in the SageMaker Studio console. Select Endpoints, and choose fraud-detect-xgb-serverless-ep.

Step 4: Invoke the SageMaker Serverless endpoint

4.1 – When the endpoint status is InService, you can invoke the endpoint in multiple ways: by calling its REST API directly (not covered in this tutorial), through the AWS SDK, through a graphical interface in SageMaker Studio, with the AWS CLI, or by using the SageMaker Python SDK. In this tutorial, you use the AWS SDK-Boto3 SageMaker runtime API. Before calling an endpoint, the test data must be formatted in a way the endpoint can consume; this is done through serialization and deserialization. Serialization converts raw data (such as CSV) into the byte stream the endpoint expects, and deserialization converts the returned byte stream back into a human-readable format.

In this tutorial, you invoke the endpoint by sending the first five samples from a test dataset. Copy and paste the following code to invoke the endpoint and get prediction results.

# Fetch test data to run predictions with the endpoint
test_df = pd.read_csv(test_data_uri)

# For content type text/csv, payload should be a string with commas separating the values for each feature
# This is the inference request serialization step
# CSV serialization
csv_file = io.StringIO()
test_sample = test_df.drop(["fraud"], axis=1).iloc[:5]
test_sample.to_csv(csv_file, sep=",", header=False, index=False)
payload = csv_file.getvalue()
response = sm_runtime_client.invoke_endpoint(
                                             EndpointName=endpoint_name,
                                             Body=payload,
                                             ContentType="text/csv"
                                            )

# This is the inference response deserialization step
# This is a bytes object
result = response["Body"].read()
# Decoding bytes to a string with comma separated predictions
result = result.decode("utf-8")
# Converting to list of predictions
result = re.split(",|\n",result)

prediction_df = pd.DataFrame()
prediction_df["Prediction"] = result[:5]
prediction_df["Label"] = test_df["fraud"].iloc[:5].values
prediction_df

Note that since the request to the endpoint (the test data) is in CSV format, a CSV serialization process is used to create the payload, and the response is then deserialized into an array of predictions. Upon execution, the cell returns the model predictions alongside the true labels for the test samples (note that the XGBoost model returns probabilities instead of actual class labels). The model predicts a very low likelihood that these test samples are fraudulent claims, and the predictions are in line with the true labels.
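Because the endpoint returns probabilities rather than class labels, you can apply a decision threshold to convert them into fraud/not-fraud labels. The following is a minimal sketch using the prediction_df created above; the 0.5 threshold is an illustrative choice, not part of the original tutorial.

# Convert the probability strings to floats and apply a decision threshold
threshold = 0.5  # illustrative cutoff; tune it for your precision/recall needs
prediction_df["Probability"] = prediction_df["Prediction"].astype(float)
prediction_df["PredictedLabel"] = (prediction_df["Probability"] > threshold).astype(int)
prediction_df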
  • You can monitor the endpoint metrics (see Monitor a Serverless Endpoint for all metrics emitted by a serverless endpoint), such as Invocations and ModelSetupTime, using Amazon CloudWatch. From the SageMaker console, select Endpoints from the left navigation pane. In the Endpoints section, select the endpoint name.
  • On the endpoint details page, select View invocation metrics under the Monitor section.
  • The Metrics page shows multiple metrics that summarize the functioning of the endpoint. You can choose different time periods during which to assess endpoint invocation performance, and select the checkbox next to any metric to graph its trend over the chosen time period.
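You can also retrieve these metrics programmatically. The following is a minimal sketch that pulls the Invocations metric for the last hour through the CloudWatch API, assuming the region and endpoint_name variables from the earlier notebook cells.

from datetime import datetime, timedelta, timezone

cw_client = boto3.client("cloudwatch", region_name=region)

# Fetch the number of endpoint invocations, summed in 5-minute buckets
response = cw_client.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="Invocations",
    Dimensions=[
        {"Name": "EndpointName", "Value": endpoint_name},
        {"Name": "VariantName", "Value": "Alltraffic"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Sum"],
)

# Print the datapoints in chronological order
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])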

Step 5: Clean up resources

In the following steps, you clean up the resources that you created in this tutorial. It is a best practice to delete resources you are no longer using so that you don't continue to incur charges for them.

5.1 – Delete the model, endpoint configuration, and endpoint you created in this tutorial by running the following code block in your notebook.

# Delete model
sm_client.delete_model(ModelName=model_name)

# Delete endpoint configuration
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)

# Delete endpoint
sm_client.delete_endpoint(EndpointName=endpoint_name)

5.2 – Delete SageMaker Studio user and domain

  • On the SageMaker Studio console, select studio-user, and then for each app listed under Apps, choose Delete app. Follow the on-screen prompts to confirm the delete operation, and wait until the status shows as Deleted.
  • When the status of all the apps changes to Deleted, choose Edit on the bottom right.
  • On the General Settings dialog box choose Delete user.
  • When your SageMaker Studio user is deleted, go to the SageMaker console. Select Domains and then StudioDomain, and choose Edit.
  • On the General Settings dialog box, choose Delete domain. Follow the on-screen prompts to confirm the delete operation.
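Alternatively, you can delete the Studio domain programmatically. The following is a minimal sketch using the SageMaker client from the earlier cells; run it from an environment outside the Studio domain after all apps and user profiles have been deleted. It assumes a single domain in the Region, and the RetentionPolicy value shown permanently deletes the domain's EFS home directories, so review it before running.

# Find the domain ID, then delete the domain
domains = sm_client.list_domains()["Domains"]
domain_id = domains[0]["DomainId"]  # assumes a single Studio domain in the Region

sm_client.delete_domain(DomainId=domain_id,
                        RetentionPolicy={"HomeEfsFileSystem": "Delete"})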

If you ran the CloudFormation template in Step 1 to create a new SageMaker Studio domain, continue with the following steps to delete the domain, user, and the resources created by the CloudFormation template.

Open the CloudFormation console. In the CloudFormation pane, choose Stacks. From the status dropdown list, select Active. Under Stack name, choose CFN-SM-IM-Lambda-catalog to open the stack details page.

On the CFN-SM-IM-Lambda-catalog stack details page, choose Delete to delete the stack along with the resources it created in Step 1.

Conclusion

Congratulations! You have finished the Deploy a Machine Learning Model to a Serverless Inference Endpoint tutorial.
 
You have successfully used Amazon SageMaker Studio to create a SageMaker model and deploy it to a serverless inference endpoint. You used the AWS SDK-Boto3 API to invoke the endpoint and test it by running sample inferences.
 
You can continue your machine learning journey with SageMaker by following the Next steps section below.


Next steps