Overview
What you will accomplish
In this guide, you will:
- Create a training experiment using SageMaker Autopilot
- Explore the different stages of the training experiment
- Identify and deploy the best performing model from the training experiment
- Predict with your deployed model
Prerequisites
Before starting this guide, you will need:
- An AWS account: If you don't already have an account, follow the Setting Up Your AWS Environment getting started guide for a quick overview.
AWS experience
Beginner
Time to complete
45 minutes
Cost to complete
See SageMaker pricing to estimate cost for this tutorial.
Requires
You must be logged into an AWS account.
Services used
Amazon SageMaker Autopilot
Last updated
July 12, 2022
Implementation
For this workflow, you will use a bank marketing dataset. It represents a marketing campaign that was run by a major financial services institution to promote certificate of deposit enrollment. The target column, y, records whether the campaign succeeded for a given customer. The dataset used in this tutorial is small, but you can follow the same steps to process large datasets.
Step 1: Set up Amazon SageMaker Studio domain
An AWS account can have only one SageMaker Studio domain per AWS Region. If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio setup guide to attach the required AWS IAM policies to your SageMaker Studio account, then skip Step 1, and proceed directly to Step 2.
If you don't have an existing SageMaker Studio domain, continue with Step 1 to run an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.
Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-catalog; do not change it. The stack takes about 10 minutes to create all the resources.
This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC.
Select I acknowledge that AWS CloudFormation might create IAM resources, and then choose Create stack.
On the CloudFormation pane, choose Stacks. When the stack is created, the status of the stack should change from CREATE_IN_PROGRESS to CREATE_COMPLETE.
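If you prefer to check the stack status from code rather than the console, the wait can be sketched as below. This is a minimal sketch, not part of the original tutorial: the helper name wait_for_stack and the polling interval are illustrative assumptions, and the client is expected to be a boto3 CloudFormation client created with your own credentials.

```python
import time

# Status values CloudFormation reports for a stack that is being created.
SUCCESS_STATUS = "CREATE_COMPLETE"
FAILURE_STATUSES = {
    "CREATE_FAILED",
    "ROLLBACK_IN_PROGRESS",
    "ROLLBACK_COMPLETE",
    "ROLLBACK_FAILED",
}


def wait_for_stack(cfn_client, stack_name, poll_seconds=30, max_polls=40):
    """Poll describe_stacks until creation finishes; return the last status seen.

    wait_for_stack is a hypothetical helper written for this guide.
    """
    status = None
    for _ in range(max_polls):
        stack = cfn_client.describe_stacks(StackName=stack_name)["Stacks"][0]
        status = stack["StackStatus"]
        if status == SUCCESS_STATUS:
            return status
        if status in FAILURE_STATUSES:
            raise RuntimeError(f"Stack {stack_name} ended with status {status}")
        time.sleep(poll_seconds)
    return status


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# cfn = boto3.client("cloudformation", region_name="us-east-1")
# print(wait_for_stack(cfn, "CFN-SM-IM-Lambda-catalog"))
```

The function takes the client as a parameter so that it can be exercised with a stand-in object before you point it at a real account.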
Enter SageMaker Studio into the CloudFormation console search bar, and then choose SageMaker Studio.
Choose US East (N. Virginia) from the Region dropdown list on the upper right corner of the SageMaker console. For Launch app, select Studio to open SageMaker Studio using the studio-user profile.
Step 2: Start a new SageMaker Autopilot experiment
Developing and testing a large number of candidate models is crucial for machine learning (ML) projects. Amazon SageMaker Autopilot helps by generating different model candidates and automatically choosing the best model based on your data. In this step, you will configure a SageMaker Autopilot experiment to predict success from a financial services marketing campaign. This dataset represents a marketing campaign that was run by a major financial services institution to promote certificate of deposit enrollment.
To start a new SageMaker Autopilot experiment, click the + icon to access a new launcher window. In the Launcher window, scroll down to ML tasks and components. Click on the + icon for New Autopilot experiment.
Next, you’ll name your experiment. Click in the Experiment name box and type autopilot-experiment as the name.
Next, you’ll connect the experiment to data that is staged in S3. Click the box Enter S3 bucket location. In the S3 bucket address box, paste the following S3 path: s3://sagemaker-sample-files/datasets/tabular/uci_bank_marketing/bank-additional-full.csv
Leave the manifest file option set to Off. In the Target dropdown, select y as the target feature which our model will attempt to predict.
In the Output data location (S3 bucket) table, choose your own S3 bucket. In the Dataset directory name field, type sagemaker/tutorial-autopilot/output. This is where the output data will be saved once the experiment completes.
Leave the Auto deploy option on and the Auto deploy endpoint field blank. This will automatically deploy our model as an API endpoint and assign a name.
Next, a number of optional advanced settings allow you to manually set experiment parameters, such as the problem type, how the experiment is run, runtime details, IAM access, encryption, security, and others. Click the runtime button to show the optional runtime settings.
For this tutorial, decrease the number of Max candidates from 250 to 5. This runs fewer models and finishes more quickly. A full experiment is the best approach for truly optimizing your model, but it can take hours to complete. Leave the remaining optional settings as is.
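The settings chosen in the console above correspond roughly to a request for the SageMaker CreateAutoMLJob API. The following is a minimal sketch under stated assumptions: the helper build_autopilot_request is hypothetical, and the bucket name and role ARN in the usage comment are placeholders that you would replace with your own values.

```python
def build_autopilot_request(job_name, input_s3_uri, target, output_s3_uri,
                            role_arn, max_candidates=5):
    """Assemble keyword arguments for the boto3 create_auto_ml_job call.

    build_autopilot_request is a hypothetical helper written for this guide.
    """
    return {
        "AutoMLJobName": job_name,
        "InputDataConfig": [
            {
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": input_s3_uri,
                    }
                },
                # Column that the model will attempt to predict.
                "TargetAttributeName": target,
            }
        ],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # Limit the number of candidates, as done in the console.
        "AutoMLJobConfig": {"CompletionCriteria": {"MaxCandidates": max_candidates}},
        # Deploy the best model and let SageMaker generate the endpoint name.
        "ModelDeployConfig": {"AutoGenerateEndpointName": True},
        "RoleArn": role_arn,
    }


request = build_autopilot_request(
    job_name="autopilot-experiment",
    input_s3_uri="s3://sagemaker-sample-files/datasets/tabular/uci_bank_marketing/bank-additional-full.csv",
    target="y",
    output_s3_uri="s3://YOUR-BUCKET/sagemaker/tutorial-autopilot/output",  # placeholder bucket
    role_arn="arn:aws:iam::111122223333:role/YOUR-SAGEMAKER-ROLE",  # placeholder role
)

# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# boto3.client("sagemaker", region_name="us-east-1").create_auto_ml_job(**request)
```

Building the request as a plain dictionary first makes it easy to review the configuration before any job is started.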
Click the Create Experiment button to start the first stage of the SageMaker Autopilot experiment. SageMaker Autopilot will begin to run through the phases of an experiment. In the experiment window, you can track progress through the phases of preprocessing, candidate definitions, feature engineering, model tuning, explainability, and insights. If you see a popup notification asking "Are you sure you want to deploy the best model?", click yes.
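The progress shown in the experiment window can also be read from the DescribeAutoMLJob API. The sketch below assumes a boto3 SageMaker client is passed in; job_progress is a hypothetical helper name, and the exact phase strings returned by the service may differ from the labels shown in the Studio UI.

```python
# Overall job statuses after which the experiment makes no further progress.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def job_progress(sm_client, job_name):
    """Return (status, phase, finished) for an Autopilot job.

    job_progress is a hypothetical helper written for this guide.
    """
    description = sm_client.describe_auto_ml_job(AutoMLJobName=job_name)
    status = description["AutoMLJobStatus"]          # overall job state
    phase = description["AutoMLJobSecondaryStatus"]  # current experiment phase
    return status, phase, status in TERMINAL_STATUSES


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# sm = boto3.client("sagemaker", region_name="us-east-1")
# print(job_progress(sm, "autopilot-experiment"))
```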
Step 3: Interpret model performance
Now that the experiment is complete and you have a model, the next step is to use SageMaker Autopilot to analyze the model's performance.
Open the top-ranking model to obtain more details on the model's performance and metadata. From the list of models, highlight the first one and right-click to bring up the model options. Click on Open in model details to review the model's performance statistics.
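The ranking shown in the model list can be reproduced from candidate records such as those returned by the ListCandidatesForAutoMLJob API. This is a minimal sketch: rank_candidates is a hypothetical helper, the candidate names and metric values below are made up for illustration, and it assumes an objective metric where higher is better.

```python
def rank_candidates(candidates, maximize=True):
    """Sort candidate records by their final objective metric value.

    rank_candidates is a hypothetical helper written for this guide.
    Set maximize=False for metrics where lower is better.
    """
    return sorted(
        candidates,
        key=lambda c: c["FinalAutoMLJobObjectiveMetric"]["Value"],
        reverse=maximize,
    )


# Made-up records in the shape of the API response, for illustration only.
sample_candidates = [
    {"CandidateName": "candidate-a",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:f1", "Value": 0.58}},
    {"CandidateName": "candidate-b",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:f1", "Value": 0.61}},
    {"CandidateName": "candidate-c",
     "FinalAutoMLJobObjectiveMetric": {"MetricName": "validation:f1", "Value": 0.49}},
]

best = rank_candidates(sample_candidates)[0]
print(best["CandidateName"])

# Example of fetching real records (requires AWS credentials):
# import boto3
# sm = boto3.client("sagemaker", region_name="us-east-1")
# real = sm.list_candidates_for_auto_ml_job(AutoMLJobName="autopilot-experiment")["Candidates"]
```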
In the new window, click on Explainability. The first view you see is called Feature Importance and represents the aggregated SHAP value for each feature across each instance in the dataset. The feature importance score is an important part of model explainability because it shows what features tend to influence the predictions the most in the dataset. In this use case, the customer duration or tenure and employment variation rate are the top two fields for driving the model's outcome.
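To make the idea of an aggregated SHAP value concrete, the sketch below averages the absolute per-record SHAP values for each feature. This is an assumption about the aggregation, chosen because it is a common convention; the source does not state the exact formula Autopilot uses, and the numbers here are made up.

```python
def aggregate_shap(shap_rows, feature_names):
    """Return (feature, mean absolute SHAP value) pairs, most important first.

    aggregate_shap is a hypothetical helper written for this guide.
    Each row in shap_rows holds one SHAP value per feature for one record.
    """
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for index, value in enumerate(row):
            totals[index] += abs(value)
    means = [total / len(shap_rows) for total in totals]
    return sorted(zip(feature_names, means), key=lambda pair: pair[1], reverse=True)


# Made-up SHAP values for two records and two features.
importance = aggregate_shap(
    shap_rows=[[0.5, -0.25], [-0.5, 0.25]],
    feature_names=["duration", "emp.var.rate"],
)
print(importance)
```

Taking the absolute value keeps positive and negative contributions from cancelling out, so a feature that pushes predictions strongly in both directions still ranks as important.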
Now, click on the tab Performance. You will find detailed information on the model’s performance, including recall, precision, and accuracy. You can also interpret model performance and decide if additional model tuning is needed.
Next, visualizations are provided to further illustrate model performance. First, look at the confusion matrix. The confusion matrix is commonly used to understand how the model labels are divided among the predicted and true classes. In this case, the diagonal elements show the number of correctly predicted labels and the off-diagonal elements show the misclassified records. A confusion matrix is useful for analyzing misclassifications due to false positives and false negatives.
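The four cells of a binary confusion matrix can be counted by hand, as in this small sketch. The labels "yes" and "no" and the sample predictions are illustrative assumptions, and confusion_counts is a hypothetical helper.

```python
def confusion_counts(y_true, y_pred, positive="yes"):
    """Count true/false positives and negatives for a binary classifier.

    confusion_counts is a hypothetical helper written for this guide.
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            counts["tp"] += 1  # diagonal: correctly predicted positive
        elif predicted == positive:
            counts["fp"] += 1  # off-diagonal: false positive
        elif actual == positive:
            counts["fn"] += 1  # off-diagonal: false negative
        else:
            counts["tn"] += 1  # diagonal: correctly predicted negative
    return counts


# Made-up labels for five records.
actual_labels = ["yes", "no", "yes", "no", "no"]
predicted_labels = ["yes", "no", "no", "yes", "no"]
matrix = confusion_counts(actual_labels, predicted_labels)
print(matrix)
```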
Next, look at the precision versus recall curve. This curve treats the predicted label as the result of applying a probability threshold, and shows the trade-off between model precision and recall at various thresholds. SageMaker Autopilot automatically optimizes these two parameters to provide the best model.
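The trade-off along the precision versus recall curve can be seen by recomputing both metrics at different thresholds. The sketch below uses made-up labels and scores; precision_recall_at is a hypothetical helper, and returning a precision of 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall_at(y_true, scores, threshold):
    """Compute (precision, recall) when scores >= threshold count as positive.

    precision_recall_at is a hypothetical helper written for this guide.
    """
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
    # Convention assumed here: precision is 1.0 when there are no positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up true labels (1 = positive) and predicted probabilities.
labels = [1, 0, 1, 0, 1]
probabilities = [0.9, 0.8, 0.6, 0.3, 0.2]

for threshold in (0.85, 0.5, 0.1):
    precision, recall = precision_recall_at(labels, probabilities, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

With this sample data, raising the threshold increases precision and lowers recall, which is the trade-off the curve visualizes.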
Next, look at the curve labelled Receiver Operating Characteristic (ROC). This curve shows the relationship between the true positive rate and the false positive rate over a variety of potential probability thresholds. A diagonal line represents a hypothetical model based on random guessing. The more this curve pulls to the upper left of the chart, the better the model will perform.
The dashed line represents a model with no predictive value, which is often called the null model. The null model would randomly assign a 0/1 label, and its area under the ROC curve would be 0.5, meaning that it ranks a randomly chosen positive record above a randomly chosen negative record only 50% of the time.
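The area under the ROC curve equals the fraction of positive/negative pairs in which the positive record receives the higher score, with ties counted as half. The sketch below computes it that way on made-up data; roc_auc is a hypothetical helper, and this pairwise approach is chosen for clarity rather than speed.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    roc_auc is a hypothetical helper written for this guide.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute AUC")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# A model that separates the classes perfectly.
perfect_auc = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
# A model that gives every record the same score carries no information.
uninformative_auc = roc_auc([1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])
print(perfect_auc, uninformative_auc)
```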
Next, click on the tab Artifacts. You can find the SageMaker Autopilot experiment’s supporting assets, including feature engineering code, input data locations, and explainability artifacts.
Step 4: Test the SageMaker model endpoint
Now that you have reviewed the model’s details, test the endpoint.
To know where to send the request, look up the model endpoint's name. On the left pane, click the SageMaker Resources icon. In the SageMaker resources pane, select Endpoints. Click on the endpoint associated with the experiment name you created at the start of this tutorial. This will bring up the Endpoint Details window. Record the endpoint name, then open a Python 3 notebook in SageMaker Studio and run the following code, replacing the value of ENDPOINT_NAME with the name you recorded.
import csv
import io
import boto3

# Define the endpoint's name. Replace this value with the name you recorded.
ENDPOINT_NAME = 'autopilot-experiment-6d00f17b55464fc49c45d74362f284ce'
runtime = boto3.client('sagemaker-runtime')

# Define a test record: one value per feature column, without the target y.
values = [45, "blue-collar", "married", "basic.9y", "unknown", "yes", "no", "telephone", "may", "mon", 461, 1, 999, 0, "nonexistent", 1.1, 93.994, -36.4, 4.857, 5191.0]

# Serialize the record as a single CSV row so the body matches ContentType='text/csv'.
buffer = io.StringIO()
csv.writer(buffer).writerow(values)
payload = buffer.getvalue().strip()

# Submit an API request and capture the response object.
response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='text/csv',
    Body=payload
)

# Print the model endpoint's output.
print(response['Body'].read().decode())
Congratulations! You have learned how to use SageMaker Autopilot to automatically train and deploy a machine learning model.
Step 5: Clean up your AWS resources
It is a best practice to delete resources that you are no longer using so that you don't incur unintended charges.
If you ran the CloudFormation template in Step 1 to create a new SageMaker Studio domain, continue with the following steps to delete the domain, user, and the resources created by the CloudFormation template.
To open the CloudFormation console, enter CloudFormation into the AWS console search bar, and choose CloudFormation from the search results.
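In the CloudFormation pane, choose Stacks. From the status drop down list, select Active. Under Stack name, choose CFN-SM-IM-Lambda-catalog to open the stack details page. On the stack details page, choose Delete to delete the stack along with the resources it created in Step 1.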
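The endpoint deployed by the experiment also continues to incur charges until it is deleted. As a hedged sketch of scripted cleanup, the helpers below delete an endpoint together with its endpoint configuration and request deletion of the CloudFormation stack. Both helper names are hypothetical, the clients are assumed to be boto3 clients created with your own credentials, and the sketch does not remove models or the S3 output data.

```python
def delete_endpoint_and_config(sm_client, endpoint_name):
    """Delete an endpoint and the endpoint configuration it was created from.

    delete_endpoint_and_config is a hypothetical helper written for this guide.
    """
    description = sm_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = description["EndpointConfigName"]
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=config_name)
    return config_name


def delete_stack(cfn_client, stack_name="CFN-SM-IM-Lambda-catalog"):
    """Request deletion of the CloudFormation stack created in Step 1."""
    cfn_client.delete_stack(StackName=stack_name)


# Example usage (requires AWS credentials, so it is left commented out):
# import boto3
# sm = boto3.client("sagemaker", region_name="us-east-1")
# cfn = boto3.client("cloudformation", region_name="us-east-1")
# delete_endpoint_and_config(sm, "YOUR-ENDPOINT-NAME")  # placeholder endpoint name
# delete_stack(cfn)
```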
Conclusion
Congratulations! You have now completed the Automatically Create Machine Learning Models tutorial.
You have successfully used SageMaker Autopilot to automatically build, train, and tune models, and then deploy the best candidate model to make predictions.
Next steps
Explore SageMaker Autopilot documentation