Train a Machine Learning Model

TUTORIAL

Overview

In this tutorial, you'll learn how to train, tune, and evaluate a machine learning (ML) model using Amazon SageMaker Studio and Amazon SageMaker Clarify.

Amazon SageMaker Studio is an integrated development environment (IDE) for ML that provides a fully managed Jupyter notebook interface in which you can perform end-to-end ML lifecycle tasks. Using SageMaker Studio, you can create and explore datasets; prepare training data; build, train, and tune models; and deploy trained models for inference—all in one place. With Amazon SageMaker Clarify, you can have greater visibility into your training data and models so you can identify and limit bias and explain predictions.

For this tutorial, you'll use a synthetically generated auto insurance claims dataset. The inputs are the training, validation, and test datasets, each containing details and extracted features about claims and customers along with a fraud column indicating whether a claim was fraudulent or otherwise. You'll use the open source XGBoost framework to build a binary classification model on this synthetic dataset to predict the likelihood of a claim being fraudulent. You'll also evaluate the trained model by running bias and feature importance reports, deploy the model for testing, and run sample inference to evaluate model performance and explain predictions.

What you will accomplish

In this guide, you will:

  • Build, train, and tune a model using script mode
  • Detect bias in ML models and understand model predictions
  • Deploy the trained model to a real-time inference endpoint for testing
  • Evaluate the model by generating sample predictions and understanding feature impact

Prerequisites

Before starting this guide, you will need:

  • AWS experience: Beginner
  • Time to complete: 2 hours
  • Cost to complete: See SageMaker pricing to estimate cost for this tutorial.
  • Requires: You must be logged into an AWS account.
  • Services used: Amazon SageMaker Studio, Amazon SageMaker Clarify
  • Last updated: May 3, 2022

Implementation

Step 1: Set up your Amazon SageMaker Studio domain

With Amazon SageMaker, you can deploy a model visually using the console or programmatically using either SageMaker Studio or SageMaker notebooks. In this tutorial, you deploy the model programmatically using a SageMaker Studio notebook, which requires a SageMaker Studio domain.

An AWS account can have only one SageMaker Studio domain per Region. If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio setup guide to attach the required AWS IAM policies to your SageMaker Studio account, then skip Step 1, and proceed directly to Step 2. 

If you don't have an existing SageMaker Studio domain, continue with Step 1 to run an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.

Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-catalog; do not change it. This stack takes about 10 minutes to create all the resources.

This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC. 

Select I acknowledge that AWS CloudFormation might create IAM resources, and then choose Create stack.

On the CloudFormation pane, choose Stacks. It takes about 10 minutes for the stack to be created. When the stack is created, the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE.

Step 2: Set up a SageMaker Studio notebook

In this step, you'll launch a new SageMaker Studio notebook, install the necessary open source libraries, and set up the SageMaker variables required to interact with other services, including Amazon Simple Storage Service (Amazon S3).

Enter SageMaker Studio into the console search bar, and then choose SageMaker Studio.

Choose US East (N. Virginia) from the Region dropdown list on the upper right corner of the SageMaker console. For Launch app, select Studio to open SageMaker Studio using the studio-user profile.

Open the SageMaker Studio interface. On the navigation bar, choose File, New, Notebook.

In the Set up notebook environment dialog box, under Image, select Data Science. The Python 3 kernel is selected automatically. Choose Select.

The kernel on the top right corner of the notebook should now display Python 3 (Data Science).

To install specific versions of the open source XGBoost and Pandas libraries, copy and paste the following code snippet into a cell in the notebook, and press Shift+Enter to run the current cell. Ignore any warnings to restart the kernel or any dependency conflict errors.

%pip install -q  xgboost==1.3.1 pandas==1.0.5

You also need to instantiate the S3 client object and the locations inside your default S3 bucket where contents such as metrics and model artifacts are uploaded. To do this, copy and paste the following code example into a cell in the notebook and run it. 

import pandas as pd
import boto3
import sagemaker
import json
import joblib
from sagemaker.xgboost.estimator import XGBoost
from sagemaker.tuner import (
    IntegerParameter,
    ContinuousParameter,
    HyperparameterTuner
)
from sagemaker.inputs import TrainingInput
from sagemaker.image_uris import retrieve
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

# Setting SageMaker variables
sess = sagemaker.Session()
write_bucket = sess.default_bucket()
write_prefix = "fraud-detect-demo"

region = sess.boto_region_name
s3_client = boto3.client("s3", region_name=region)

sagemaker_role = sagemaker.get_execution_role()
sagemaker_client = boto3.client("sagemaker")
read_bucket = "sagemaker-sample-files"
read_prefix = "datasets/tabular/synthetic_automobile_claims" 


# Setting S3 location for read and write operations
train_data_key = f"{read_prefix}/train.csv"
test_data_key = f"{read_prefix}/test.csv"
validation_data_key = f"{read_prefix}/validation.csv"
model_key = f"{write_prefix}/model"
output_key = f"{write_prefix}/output"


train_data_uri = f"s3://{read_bucket}/{train_data_key}"
test_data_uri = f"s3://{read_bucket}/{test_data_key}"
validation_data_uri = f"s3://{read_bucket}/{validation_data_key}"
model_uri = f"s3://{write_bucket}/{model_key}"
output_uri = f"s3://{write_bucket}/{output_key}"
estimator_output_uri = f"s3://{write_bucket}/{write_prefix}/training_jobs"
bias_report_output_uri = f"s3://{write_bucket}/{write_prefix}/clarify-output/bias"
explainability_report_output_uri = f"s3://{write_bucket}/{write_prefix}/clarify-output/explainability"

Notice that the write bucket name is derived from the SageMaker session object. Your default bucket has the name sagemaker-<your-Region>-<your-account-id>. This bucket is where all training artifacts are uploaded. The datasets that you use for training exist in a public S3 bucket named sagemaker-sample-files, which has been specified as the read bucket. Note that the SageMaker XGBoost framework being imported is not the open-source framework you installed in the previous step. This is the built-in framework with a Docker container image that you use to scale up model training.
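
The S3 URIs above are plain strings of the form s3://<bucket>/<key>. As a minimal sketch of how such a URI breaks down into the bucket and key that S3 client calls expect, the helper below splits one apart. The function name split_s3_uri is hypothetical and is not part of boto3 or the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Same values the tutorial assigns to read_bucket and read_prefix
read_bucket = "sagemaker-sample-files"
read_prefix = "datasets/tabular/synthetic_automobile_claims"
train_data_uri = f"s3://{read_bucket}/{read_prefix}/train.csv"

bucket, key = split_s3_uri(train_data_uri)
print(bucket)
print(key)
```

The two printed values correspond to read_bucket and train_data_key in the cell above.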

Copy and paste the following code block to set the model name and the instance types and counts for training, inference, and SageMaker Clarify. These settings allow you to manage the training and inference processes by using the appropriate instance type and count.

tuning_job_name_prefix = "xgbtune" 
training_job_name_prefix = "xgbtrain"

xgb_model_name = "fraud-detect-xgb-model"
endpoint_name_prefix = "xgb-fraud-model-dev"
train_instance_count = 1
train_instance_type = "ml.m4.xlarge"
predictor_instance_count = 1
predictor_instance_type = "ml.m4.xlarge"
clarify_instance_count = 1
clarify_instance_type = "ml.m4.xlarge"

Step 3: Launch hyperparameter tuning jobs in script mode

With SageMaker Studio you can bring your own logic within Python scripts to be used for training. By encapsulating training logic in a script, you can incorporate custom training routines and model configurations while still using common ML framework containers maintained by AWS. In this tutorial, you prepare a training script which uses the open source XGBoost framework supported by the AWS provided XGBoost container and launch hyperparameter tuning jobs at scale. To train the model, you use the column fraud as the target column.

The first level of script mode is the ability to define your own training process in a self-contained, customized Python script and to use that script as the entry point when defining your SageMaker estimator. Copy and paste the following code block to write a Python script encapsulating the model training logic.

%%writefile xgboost_train.py

import argparse
import os
import joblib
import json
import pandas as pd
import xgboost as xgb
from sklearn.metrics import roc_auc_score

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # Hyperparameters and algorithm parameters are described here
    parser.add_argument("--num_round", type=int, default=100)
    parser.add_argument("--max_depth", type=int, default=3)
    parser.add_argument("--eta", type=float, default=0.2)
    parser.add_argument("--subsample", type=float, default=0.9)
    parser.add_argument("--colsample_bytree", type=float, default=0.8)
    parser.add_argument("--objective", type=str, default="binary:logistic")
    parser.add_argument("--eval_metric", type=str, default="auc")
    parser.add_argument("--nfold", type=int, default=3)
    parser.add_argument("--early_stopping_rounds", type=int, default=3)
    

    # SageMaker specific arguments. Defaults are set in the environment variables
    # Location of input training data
    parser.add_argument("--train_data_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    # Location of input validation data
    parser.add_argument("--validation_data_dir", type=str, default=os.environ.get("SM_CHANNEL_VALIDATION"))
    # Location where trained model will be stored. Default set by SageMaker, /opt/ml/model
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    # Location where model artifacts will be stored. Default set by SageMaker, /opt/ml/output/data
    parser.add_argument("--output_data_dir", type=str, default=os.environ.get("SM_OUTPUT_DATA_DIR"))
    
    args = parser.parse_args()

    data_train = pd.read_csv(f"{args.train_data_dir}/train.csv")
    train = data_train.drop("fraud", axis=1)
    label_train = pd.DataFrame(data_train["fraud"])
    dtrain = xgb.DMatrix(train, label=label_train)
    
    
    data_validation = pd.read_csv(f"{args.validation_data_dir}/validation.csv")
    validation = data_validation.drop("fraud", axis=1)
    label_validation = pd.DataFrame(data_validation["fraud"])
    dvalidation = xgb.DMatrix(validation, label=label_validation)

    params = {"max_depth": args.max_depth,
              "eta": args.eta,
              "objective": args.objective,
              "subsample" : args.subsample,
              "colsample_bytree":args.colsample_bytree
             }
    
    num_boost_round = args.num_round
    nfold = args.nfold
    early_stopping_rounds = args.early_stopping_rounds
    
    cv_results = xgb.cv(
        params=params,
        dtrain=dtrain,
        num_boost_round=num_boost_round,
        nfold=nfold,
        early_stopping_rounds=early_stopping_rounds,
        metrics=["auc"],
        seed=42,
    )
    
    model = xgb.train(params=params, dtrain=dtrain, num_boost_round=len(cv_results))
    
    train_pred = model.predict(dtrain)
    validation_pred = model.predict(dvalidation)
    
    train_auc = roc_auc_score(label_train, train_pred)
    validation_auc = roc_auc_score(label_validation, validation_pred)
    
    print(f"[0]#011train-auc:{train_auc:.2f}")
    print(f"[0]#011validation-auc:{validation_auc:.2f}")

    metrics_data = {"hyperparameters" : params,
                    "binary_classification_metrics": {"validation:auc": {"value": validation_auc},
                                                      "train:auc": {"value": train_auc}
                                                     }
                   }
              
    # Save the evaluation metrics to the location specified by output_data_dir
    metrics_location = args.output_data_dir + "/metrics.json"
    
    # Save the model to the location specified by model_dir
    model_location = args.model_dir + "/xgboost-model"

    with open(metrics_location, "w") as f:
        json.dump(metrics_data, f)

    with open(model_location, "wb") as f:
        joblib.dump(model, f)

Notice how the script imports the open source XGBoost library you installed earlier.
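
One detail of the script that is easy to miss: the final model is trained with num_boost_round=len(cv_results). When xgb.cv stops early, the returned history is cut back to the best round, so its length is the number of rounds worth keeping. The following is a rough pure-Python sketch of that idea under simplified assumptions. The function rounds_kept and the AUC values are hypothetical, and the sketch is not XGBoost's actual implementation.

```python
def rounds_kept(history, patience, maximize=True):
    """Return how many rounds remain after early stopping truncates `history`.

    Training stops once the metric has failed to improve for `patience`
    consecutive rounds; the history is then cut back to the best round.
    If training never stops early, every round is kept.
    """
    best_index = 0
    for i in range(1, len(history)):
        if maximize:
            improved = history[i] > history[best_index]
        else:
            improved = history[i] < history[best_index]
        if improved:
            best_index = i
        elif i - best_index >= patience:
            return best_index + 1  # stopped early: keep rounds up to the best one
    return len(history)  # never stopped: keep every round


# Hypothetical mean test AUC per boosting round from cross-validation
cv_auc = [0.70, 0.74, 0.78, 0.77, 0.775, 0.76, 0.79]
print(rounds_kept(cv_auc, patience=3))
```

With a patience of 3 (the script's default for early_stopping_rounds), the search stops after three rounds without improvement and never sees the later, higher value. That is the usual trade-off of early stopping: less compute in exchange for possibly missing a late improvement.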

SageMaker runs the entry point script and supplies all input parameters such as model configuration details and input and output paths as command line arguments. The script uses the argparse Python library to capture the supplied arguments.

Your training script runs inside a Docker container and SageMaker automatically downloads training and validation datasets from Amazon S3 to local paths inside the container. These locations can be accessed through environment variables. For an exhaustive list of the SageMaker environment variables, see Environment variables.
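
The pattern described above (command line arguments for hyperparameters, environment variables as defaults for paths) can be tried outside SageMaker. The sketch below is a trimmed-down illustration, not the tutorial's training script. The function name parse_training_args is hypothetical, and the environment variables are set by hand here only to simulate what the container would provide.

```python
import argparse
import os


def parse_training_args(argv):
    """Parse hyperparameters and data locations the way a script-mode entry point might."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max_depth", type=int, default=3)
    parser.add_argument("--eta", type=float, default=0.2)
    # os.environ.get returns None when the variable is unset, for example
    # when the script runs outside a SageMaker training container.
    parser.add_argument("--train_data_dir", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--model_dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    return parser.parse_args(argv)


# Simulate the container environment (assumed paths, set here only for the demo)
os.environ["SM_CHANNEL_TRAIN"] = "/opt/ml/input/data/train"
os.environ["SM_MODEL_DIR"] = "/opt/ml/model"

# Hyperparameters arrive as command line arguments and override the defaults
args = parse_training_args(["--max_depth", "5", "--eta", "0.1"])
print(args.max_depth, args.eta, args.train_data_dir, args.model_dir)
```

Because the path arguments fall back to environment variables, the same script can be tested locally by passing --train_data_dir and --model_dir explicitly.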

Once you have prepared your training script, you can instantiate a SageMaker estimator. You use the AWS managed XGBoost estimator, since it manages the XGBoost container that can run your custom script. To instantiate the XGBoost estimator, copy and paste the following code.

# SageMaker estimator

# Set static hyperparameters that will not be tuned
static_hyperparams = {  
                        "eval_metric" : "auc",
                        "objective": "binary:logistic",
                        "num_round": "5"
                      }

xgb_estimator = XGBoost(
                        entry_point="xgboost_train.py",
                        output_path=estimator_output_uri,
                        code_location=estimator_output_uri,
                        hyperparameters=static_hyperparams,
                        role=sagemaker_role,
                        instance_count=train_instance_count,
                        instance_type=train_instance_type,
                        framework_version="1.3-1",
                        base_job_name=training_job_name_prefix
                    )

You can specify the static configuration parameters when specifying the estimator. In this tutorial, you use the area under the receiver operating characteristic curve (ROC-AUC) as the evaluation metric. To control the time it takes to run, the number of rounds has been set to 5.

The custom script and the training instance configurations are passed to the estimator object as arguments. The XGBoost version is chosen to match the one you installed earlier.
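
For readers new to the evaluation metric, ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on a toy example. It is an illustrative pure-Python version, not the roc_auc_score implementation the training script imports, and it is quadratic in the number of samples.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75: three of the four pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why the tuner maximizes this metric.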

 

You tune four XGBoost hyperparameters in this tutorial:

  • eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative.
  • subsample: Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the training data prior to growing trees. Using different subsets for every boosting iteration helps prevent overfitting.
  • colsample_bytree: Fraction of features used to generate each tree of the boosting process. Using a subset of features to create each tree introduces more randomness in the modeling process, improving the generalizability.
  • max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfitted.
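
To make the effect of eta more concrete, the toy sketch below boosts the simplest possible learner, a constant, on squared error. Each round fits the mean of the remaining residuals and adds only an eta fraction of it. This is a deliberately simplified illustration of shrinkage, assuming no regularization and a starting prediction of zero; it is not how XGBoost builds trees.

```python
def boost_constant(targets, eta, num_round):
    """Boost constant 'trees' on squared error and return the prediction per round.

    Each round fits the mean of the current residuals (the best constant
    under squared error) and adds only an `eta` fraction of it.
    """
    prediction = 0.0
    history = []
    for _ in range(num_round):
        residual_mean = sum(t - prediction for t in targets) / len(targets)
        prediction += eta * residual_mean
        history.append(prediction)
    return history


targets = [1.0, 2.0, 3.0]  # toy regression targets with mean 2.0
print(boost_constant(targets, eta=1.0, num_round=3))  # jumps to the mean in one round
print(boost_constant(targets, eta=0.2, num_round=3))  # approaches the mean gradually
```

With eta=1.0 the first round absorbs the whole residual, while with eta=0.2 each round closes only a fifth of the remaining gap. Smaller values therefore need more rounds but make each individual step more conservative.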

Copy and paste the following code block to set up the range of the preceding hyperparameters to search from.

# Setting ranges of hyperparameters to be tuned
hyperparameter_ranges = {
    "eta": ContinuousParameter(0, 1),
    "subsample": ContinuousParameter(0.7, 0.95),
    "colsample_bytree": ContinuousParameter(0.7, 0.95),
    "max_depth": IntegerParameter(1, 5)
}

Copy and paste the following code block to set up the hyperparameter tuner. SageMaker runs Bayesian optimization routines as default for the search process. In this tutorial you use the random search approach to reduce the runtime. The parameters are tuned based on the AUC performance of the model on the validation dataset.

objective_metric_name = "validation:auc"

# Setting up tuner object
tuner_config_dict = {
                     "estimator" : xgb_estimator,
                     "max_jobs" : 5,
                     "max_parallel_jobs" : 2,
                     "objective_metric_name" : objective_metric_name,
                     "hyperparameter_ranges" : hyperparameter_ranges,
                     "base_tuning_job_name" : tuning_job_name_prefix,
                     "strategy" : "Random"
                    }
tuner = HyperparameterTuner(**tuner_config_dict)
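
Conceptually, the random strategy draws each candidate configuration independently from the declared ranges. The sketch below mimics that with Python's random module, using uniform sampling as a simplifying assumption. The function sample_random_config and the tuple-based range format are hypothetical, and the service's actual sampling and scaling may differ.

```python
import random


def sample_random_config(ranges, rng):
    """Draw one hyperparameter configuration uniformly from `ranges`.

    `ranges` maps a name to ("continuous", low, high) or ("integer", low, high).
    """
    config = {}
    for name, (kind, low, high) in ranges.items():
        if kind == "continuous":
            config[name] = rng.uniform(low, high)
        elif kind == "integer":
            config[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            raise ValueError(f"Unknown range type: {kind}")
    return config


# Mirrors the bounds in hyperparameter_ranges
ranges = {
    "eta": ("continuous", 0.0, 1.0),
    "subsample": ("continuous", 0.7, 0.95),
    "colsample_bytree": ("continuous", 0.7, 0.95),
    "max_depth": ("integer", 1, 5),
}

rng = random.Random(0)  # fixed seed so the demo is repeatable
candidates = [sample_random_config(ranges, rng) for _ in range(5)]  # one per job, as with max_jobs = 5
for candidate in candidates:
    print(candidate)
```

Because the draws are independent, random search jobs can all run in parallel, whereas Bayesian optimization uses the results of earlier jobs to choose later ones.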

You can call the fit() method on the tuner object to launch hyperparameter tuning jobs. For fitting the tuner, you can specify the different input channels. This tutorial provides train and validation channels. Copy and paste the following code block to launch hyperparameter tuning jobs. This takes approximately 13 minutes to complete.

# Setting the input channels for tuning job
s3_input_train = TrainingInput(s3_data=train_data_uri, content_type="csv", s3_data_type="S3Prefix")
s3_input_validation = TrainingInput(s3_data=validation_data_uri, content_type="csv", s3_data_type="S3Prefix")

tuner.fit(inputs={"train": s3_input_train, "validation": s3_input_validation}, include_cls_metadata=False)
tuner.wait()

The launched tuning jobs are visible from the SageMaker console under Hyperparameter tuning jobs (note that the tuning job names as shown in the attached images will not match with what you see because of different timestamps).

 

 

Once the tuning is complete, you can access a summary of results. Copy and paste the following code block to retrieve the tuning job results in a pandas dataframe arranged in descending order of performance.

# Summary of tuning results ordered in descending order of performance
df_tuner = sagemaker.HyperparameterTuningJobAnalytics(tuner.latest_tuning_job.job_name).dataframe()
df_tuner = df_tuner[df_tuner["FinalObjectiveValue"]>-float('inf')].sort_values("FinalObjectiveValue", ascending=False)
df_tuner

You can inspect the combination of hyperparameters that had the best performance.
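
The filter-and-sort logic in the cell above can also be expressed without pandas. The sketch below applies it to a few made-up records: rows whose objective value is negative infinity are dropped, and the rest are ranked from best to worst. The job names and values are hypothetical and are not results from this tutorial.

```python
import math

# Hypothetical tuning results; the keys mirror columns of the analytics dataframe
results = [
    {"TrainingJobName": "job-a", "FinalObjectiveValue": 0.81},
    {"TrainingJobName": "job-b", "FinalObjectiveValue": -math.inf},  # no usable metric
    {"TrainingJobName": "job-c", "FinalObjectiveValue": 0.86},
]

# Keep only jobs with a finite objective, then sort from best to worst
ranked = sorted(
    (r for r in results if r["FinalObjectiveValue"] > -math.inf),
    key=lambda r: r["FinalObjectiveValue"],
    reverse=True,
)
best = ranked[0]
print(best["TrainingJobName"])
```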

 

 

Step 4: Check model for biases and explain model predictions using SageMaker Clarify

Once you have a trained model, it is important to understand if there is any inherent bias in the model or the data before deployment. Model predictions can be a source of bias (for example, if they make predictions that more frequently produce a negative result for one group than another). SageMaker Clarify helps explain how a trained model makes predictions using a feature attribution approach. This tutorial focuses on posttraining bias metrics and on SHAP values for model explainability. Specifically, the following common tasks are covered:

  • Data and model bias detection
  • Model explainability using feature importance values
  • Impact of features and local explanations for single data samples

Before SageMaker Clarify can perform model bias detection, it requires a SageMaker model that SageMaker Clarify deploys to an ephemeral endpoint as part of the analyses. The endpoint is then deleted once SageMaker Clarify analyses are completed. Copy and paste the following code block to create a SageMaker model from the best training job identified from the tuning job.

tuner_job_info = sagemaker_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)

model_matches = sagemaker_client.list_models(NameContains=xgb_model_name)["Models"]

if not model_matches:
    _ = sess.create_model_from_job(
            name=xgb_model_name,
            training_job_name=tuner_job_info['BestTrainingJob']["TrainingJobName"],
            role=sagemaker_role,
            image_uri=tuner_job_info['TrainingJobDefinition']["AlgorithmSpecification"]["TrainingImage"]
            )
else:
    print(f"Model {xgb_model_name} already exists.")

To run bias detection, SageMaker Clarify expects multiple component configurations to be set up. You can find more details at Amazon SageMaker Clarify. For this tutorial, apart from the standard configurations, you set up SageMaker Clarify to detect if the data is statistically biased against females by checking if the target is skewed towards a value based on the customer gender. Copy and paste the following code to set up the SageMaker Clarify configuration.

train_df = pd.read_csv(train_data_uri)
train_df_cols = train_df.columns.to_list()

clarify_processor = sagemaker.clarify.SageMakerClarifyProcessor(
    role=sagemaker_role,
    instance_count=clarify_instance_count,
    instance_type=clarify_instance_type,
    sagemaker_session=sess,
)

# Data config
bias_data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path=train_data_uri,
    s3_output_path=bias_report_output_uri,
    label="fraud",
    headers=train_df_cols,
    dataset_type="text/csv",
)

# Model config
model_config = sagemaker.clarify.ModelConfig(
    model_name=xgb_model_name,
    instance_type=train_instance_type,
    instance_count=1,
    accept_type="text/csv",
)

# Model predictions config to get binary labels from probabilities
predictions_config = sagemaker.clarify.ModelPredictedLabelConfig(probability_threshold=0.5)

# Bias config
bias_config = sagemaker.clarify.BiasConfig(
    label_values_or_threshold=[0],
    facet_name="customer_gender_female",
    facet_values_or_threshold=[1],
)

Within SageMaker Clarify, pretraining metrics show pre-existing bias in the data, while posttraining metrics show bias in the predictions from the model. Using the SageMaker SDK, you can specify across which groups you want to check bias and which bias metrics to consider. For the purposes of this tutorial, you use Class Imbalance (CI) and Difference in Positive Proportions in Predicted Labels (DPPL) as exemplars of pretraining and posttraining bias statistics, respectively. You can find details of other bias metrics at Measure Pretraining Bias and Posttraining Data and Model Bias. Copy and paste the following code block to run SageMaker Clarify and generate bias reports. The chosen bias metrics are passed on as arguments to the run_bias method. This code takes approximately 12 minutes to complete.

clarify_processor.run_bias(
    data_config=bias_data_config,
    bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predictions_config,
    pre_training_methods=["CI"],
    post_training_methods=["DPPL"]
    )

clarify_bias_job_name = clarify_processor.latest_job.name
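
To give a feel for what the two requested metrics measure, the sketch below computes them on a tiny hand-made sample. It assumes the commonly documented definitions: CI = (n_a - n_d) / (n_a + n_d), where n_d counts members of the facet of interest and n_a counts everyone else, and DPPL = q_a - q_d, the difference in the proportion of positive predicted outcomes between the two groups. The function names and the toy data are hypothetical, and this is not the SageMaker Clarify implementation.

```python
def class_imbalance(facet):
    """CI = (n_a - n_d) / (n_a + n_d), where facet value 1 marks group d."""
    n_d = sum(1 for f in facet if f == 1)
    n_a = len(facet) - n_d
    return (n_a - n_d) / (n_a + n_d)


def dppl(facet, predicted, positive_label):
    """DPPL = q_a - q_d, the gap in predicted positive-outcome rates between groups."""
    preds_d = [p for f, p in zip(facet, predicted) if f == 1]
    preds_a = [p for f, p in zip(facet, predicted) if f != 1]
    q_a = sum(1 for p in preds_a if p == positive_label) / len(preds_a)
    q_d = sum(1 for p in preds_d if p == positive_label) / len(preds_d)
    return q_a - q_d


# Toy data: 1 in `facet` stands for customer_gender_female == 1
facet = [1, 1, 0, 0, 0, 0, 0, 0]
predicted = [0, 1, 0, 0, 0, 0, 0, 1]  # thresholded model outputs (1 = fraud)

print(class_imbalance(facet))  # 0.5: group d is under-represented
# positive_label=0 matches label_values_or_threshold=[0] in the bias config
print(dppl(facet, predicted, positive_label=0))
```

Both metrics range from -1 to 1, and values near zero indicate balance. A positive DPPL in this setup means the model assigns the favorable outcome (not fraud) more often to the other group than to the facet of interest.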

The SageMaker Clarify outputs are saved to your default S3 bucket. Copy and paste the following code to download the SageMaker Clarify report in PDF format from Amazon S3 to your local directory in SageMaker Studio.

# Copy bias report and view locally
!aws s3 cp s3://{write_bucket}/{write_prefix}/clarify-output/bias/report.pdf ./clarify_bias_output.pdf

In the PDF report, based on the pretraining and posttraining bias metrics, the dataset does seem to have class imbalance with respect to the customer gender feature. Such imbalances can be rectified by applying techniques such as SMOTE to re-create the training dataset. You can also use SageMaker Data Wrangler and specify one of the multiple options including SMOTE that are available within the service to balance training datasets. For more details, see Data Wrangler Balance Data. For brevity, this step is not included in this tutorial.

In addition to data bias, SageMaker Clarify can also analyze the trained model and create a model explainability report based on feature importance. SageMaker Clarify uses SHAP values to explain the contribution that each input feature makes to the final prediction. Copy and paste the following code block to configure and run a model explainability analysis. This code block takes approximately 14 minutes to complete.

explainability_data_config = sagemaker.clarify.DataConfig(
    s3_data_input_path=train_data_uri,
    s3_output_path=explainability_report_output_uri,
    label="fraud",
    headers=train_df_cols,
    dataset_type="text/csv",
)

# Use mean of train dataset as baseline data point
shap_baseline = [list(train_df.drop(["fraud"], axis=1).mean())]

shap_config = sagemaker.clarify.SHAPConfig(
    baseline=shap_baseline,
    num_samples=500,
    agg_method="mean_abs",
    save_local_shap_values=True,
)

clarify_processor.run_explainability(
    data_config=explainability_data_config,
    model_config=model_config,
    explainability_config=shap_config
)
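
The role of the baseline in the SHAP configuration is easier to see on a tiny example. The sketch below computes exact Shapley values by enumerating every coalition of features, with features outside a coalition set to their baseline value. This brute-force approach is what sampling-based SHAP methods approximate. The toy_model function and its inputs are hypothetical, and the code is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Features outside a coalition take their baseline value. The cost is
    exponential in the number of features, so this is only for tiny examples.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi += weight * (value(members | {i}) - value(members))
        phis.append(phi)
    return phis


def toy_model(features):
    """Hypothetical scoring function with an interaction between a and c."""
    a, b, c = features
    return 2.0 * a + 1.0 * b + a * c


x = [1.0, 3.0, 2.0]
baseline = [0.0, 1.0, 0.0]  # plays the role of shap_baseline
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The last line illustrates a useful sanity check: the attributions add up to the difference between the prediction for the sample and the prediction for the baseline.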

Copy and paste the following code to download the SageMaker Clarify explainability report in PDF format from Amazon S3 to your local directory in SageMaker Studio.

# Copy explainability report and view
!aws s3 cp s3://{write_bucket}/{write_prefix}/clarify-output/explainability/report.pdf ./clarify_explainability_output.pdf

The report contains feature importance charts showing how the input features contribute to model predictions. For the model trained in this tutorial, the num_injuries feature appears to play the most important role in generating predictions, closely followed by the customer_gender_male feature. Such feature rankings provide important insights into the prediction mechanism and drive model refinement and development with fair and explainable use of ML.
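
The ranking in such a chart comes from aggregating per-sample SHAP values into one score per feature. With agg_method="mean_abs", as configured earlier, that score is the mean of the absolute local values. The sketch below reproduces the aggregation on made-up numbers. The feature names and values are hypothetical and are not taken from the report.

```python
def mean_abs_importance(local_shap):
    """Average |SHAP| per feature over all rows, then rank features high to low."""
    n_rows = len(local_shap)
    features = local_shap[0].keys()
    importance = {f: sum(abs(row[f]) for row in local_shap) / n_rows for f in features}
    return sorted(importance.items(), key=lambda item: item[1], reverse=True)


# Hypothetical local SHAP values for three samples
local_shap = [
    {"feature_a": 0.30, "feature_b": -0.10},
    {"feature_a": -0.20, "feature_b": 0.05},
    {"feature_a": 0.10, "feature_b": -0.15},
]
for name, score in mean_abs_importance(local_shap):
    print(name, round(score, 3))
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling out, so a feature that pushes predictions strongly in both directions still ranks as important.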

 

 

The bias and explainability analysis results can also be viewed in SageMaker Studio. Choose the SageMaker Resources icon, select Experiments and trials from the dropdown list, and then choose Unassigned trial components.

 

 

Select the explainability report named clarify-explainability-<datetimestamp>.

 

 

On the Explainability tab, you can visualize the feature importance chart. You can also download the report by choosing Export PDF report.

 

 

The explainability report generated by SageMaker Clarify also provides a file called out.csv containing local SHAP values for individual samples. Copy and paste the following code block to use this file to visualize the explanation (the impact that each feature has on the prediction of your model) for any single example.

import matplotlib.pyplot as plt
%matplotlib inline

# Load the local SHAP values written by the explainability job
local_explanations_out = pd.read_csv(explainability_report_output_uri + "/explanations_shap/out.csv")
# Strip the "_label0" suffix so that the columns show plain feature names
feature_names = [c.replace("_label0", "") for c in local_explanations_out.columns]
local_explanations_out.columns = feature_names

selected_example = 100
print("Example number:", selected_example)

local_explanations_out.iloc[selected_example].plot(
    kind="bar", title="Local explanation for the example number " + str(selected_example), rot=60, figsize=(20, 8)
);

For the chosen example (the sample at index 100 of the training dataset that the explainability job analyzed), the total claim amount, gender, and number of injuries contributed the most towards the prediction.

 

 

Step 5: Deploy the model to a real-time inference endpoint

In this step, you deploy the best model obtained from the hyperparameter tuning job to a real-time inference endpoint and then use the endpoint to generate predictions. There are multiple methods to deploy a trained model such as the SageMaker SDK, AWS SDK - Boto3, and the SageMaker console. For more information, see Deploy Models for Inference in the Amazon SageMaker documentation. In this example, you deploy the model to a real-time endpoint using the SageMaker SDK.

 

Copy and paste the following code block to deploy the best model.

best_train_job_name = tuner.best_training_job()

model_path = estimator_output_uri + '/' + best_train_job_name + '/output/model.tar.gz'
training_image = retrieve(framework="xgboost", region=region, version="1.3-1")
create_model_config = {"model_data":model_path,
                       "role":sagemaker_role,
                       "image_uri":training_image,
                       "name":endpoint_name_prefix,
                       "predictor_cls":sagemaker.predictor.Predictor
                       }
# Create a SageMaker model
model = sagemaker.model.Model(**create_model_config)

# Deploy the best model and get access to a SageMaker Predictor
predictor = model.deploy(initial_instance_count=predictor_instance_count, 
                         instance_type=predictor_instance_type,
                         serializer=CSVSerializer(),
                         deserializer=CSVDeserializer())
print(f"\nModel deployed at endpoint : {model.endpoint_name}")

The code uses the best training job name to retrieve the model from Amazon S3. XGBoost can accept input data either in text/libsvm or text/csv formats. The input datasets used in this tutorial are in CSV format and therefore the deployment configuration includes a CSVSerializer which converts CSV inputs to byte streams and a CSVDeserializer that converts native model output in byte streams back to the CSV format for our consumption. On completion, the code block returns the name of the endpoint to which the model has been deployed. The deployment also returns a SageMaker Predictor that can be used to invoke the endpoint to run predictions as shown in the next section.
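
As a rough picture of what the serializer and deserializer do on each side of the request, the sketch below converts a row of feature values into a CSV line and parses a CSV response body back into Python values. The helper names to_csv_payload and from_csv_response and the sample values are hypothetical. This illustrates the idea and is not the SageMaker SDK's implementation.

```python
import csv
import io


def to_csv_payload(row):
    """Join one row of feature values into a single CSV line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="").writerow(row)
    return buffer.getvalue()


def from_csv_response(body):
    """Decode a CSV response body into a list of rows of strings."""
    text = body.decode("utf-8")
    return [row for row in csv.reader(io.StringIO(text)) if row]


payload = to_csv_payload([1200.5, 3, 0, 1])
print(payload)  # 1200.5,3,0,1

rows = from_csv_response(b"0.0123\n")
score = float(rows[0][0])  # values arrive as strings, hence the float() call
print(score)
```

The nested list returned by the parsing step is the reason a prediction is typically read with two indexes and then converted to a float.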

 

You can check out the deployed endpoint from the SageMaker Studio interface by clicking on the SageMaker Resources icon and selecting Endpoints from the dropdown list.

You can also inspect endpoints through the SageMaker console under Inference, Endpoints.

Now that the model has been deployed to an endpoint, you can invoke it by calling the REST API directly (not described in this tutorial), through the AWS SDK, through a graphical interface in SageMaker Studio, or by using the SageMaker Python SDK. In this tutorial, you use the SageMaker Predictor made available through the deploy step to get real-time model predictions on one or more test samples. Copy and paste the following code block to invoke the endpoint and send a single sample of test data.

# Sample test data
test_df = pd.read_csv(test_data_uri)
payload = test_df.drop(["fraud"], axis=1).iloc[0].to_list()
print(f"Model predicted score : {float(predictor.predict(payload)[0][0]):.3f}, True label : {test_df['fraud'].iloc[0]}")

The output of the cell shows the true label and the predicted score as sent back by the model endpoint. Since the predicted probability is very low, the test sample was correctly labeled as not fraud by the model.

Step 6: Clean up resources

It is a best practice to delete resources that you are no longer using so that you don't incur unintended charges.

To delete the model and endpoint, copy and paste the following code into the notebook.

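# Delete the model created for SageMaker Clarify (ignore the error if it no longer exists)
try:
    sess.delete_model(xgb_model_name)
except Exception:
    pass

# Delete the model created for deployment
sess.delete_model(model.name)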

# Delete inference endpoint config
sess.delete_endpoint_config(endpoint_config_name=predictor._get_endpoint_config_name())

# Delete inference endpoint
sess.delete_endpoint(endpoint_name=model.endpoint_name)

To delete the S3 bucket, do the following: 

  • Open the Amazon S3 console. On the navigation bar, choose Buckets, sagemaker-<your-Region>-<your-account-id>, and then select the checkbox next to fraud-detect-demo. Then, choose Delete.
  • On the Delete objects dialog box, verify that you have selected the proper object to delete and enter permanently delete into the Permanently delete objects confirmation box. 
  • Once this is complete and the bucket is empty, you can delete the sagemaker-<your-Region>-<your-account-id> bucket by following the same procedure again.

The Data Science kernel used for running the notebook image in this tutorial will accumulate charges until you either stop the kernel or perform the following steps to delete the apps. For more information, see Shut Down Resources in the Amazon SageMaker Developer Guide.

To delete the SageMaker Studio apps, do the following: On the SageMaker Studio console, choose studio-user, and then delete all the apps listed under Apps by choosing Delete app. Wait until the Status changes to Deleted.

If you used an existing SageMaker Studio domain in Step 1, skip the rest of Step 6 and proceed directly to the conclusion section. 

If you ran the CloudFormation template in Step 1 to create a new SageMaker Studio domain, continue with the following steps to delete the domain, user, and the resources created by the CloudFormation template.  

To open the CloudFormation console, enter CloudFormation into the AWS console search bar, and choose CloudFormation from the search results.

In the CloudFormation pane, choose Stacks. From the status dropdown list, select Active. Under Stack name, choose CFN-SM-IM-Lambda-catalog to open the stack details page.

On the CFN-SM-IM-Lambda-catalog stack details page, choose Delete to delete the stack along with the resources it created in Step 1.

Conclusion

Congratulations! You have finished the Train a Machine Learning Model tutorial. 

In this tutorial, you used Amazon SageMaker Studio to train a binary classification model in script mode. You used the open source XGBoost library with the AWS managed XGBoost container to train and tune the model using SageMaker hyperparameter tuning jobs. You also analyzed bias and model explainability using SageMaker Clarify and used the reports to assess the feature impact on individual predictions. Finally, you used the SageMaker SDK to deploy the model to a real-time inference endpoint and tested it with a sample payload.

You can continue your data scientist journey with Amazon SageMaker by following the next steps below.

Create an ML model automatically

Learn how to use AutoML to develop ML models without writing code.

Deploy a trained model

Learn how to deploy a trained ML model for inference.

Find more hands-on tutorials

Explore other machine learning tutorials to dive deeper.