Automatically Create Machine Learning Models
TUTORIAL
Overview
What you will accomplish
In this guide, you will:
- Create a training experiment using SageMaker Autopilot
- Explore the different stages of the training experiment
- Identify and deploy the best performing model from the training experiment
- Predict with your deployed model
Prerequisites
Before starting this guide, you will need:
- An AWS account: If you don't already have an account, follow the Setting Up Your AWS Environment getting started guide for a quick overview.
AWS experience: Beginner
Time to complete: 45 minutes
Cost to complete: See SageMaker pricing to estimate cost for this tutorial.
Requires: You must be logged into an AWS account.
Services used: Amazon SageMaker Autopilot
Last updated: April 25, 2023
Implementation
For this workflow, you will use a dataset from a financial services marketing campaign, described in Step 2. For the purposes of this tutorial, we have selected a small portion of the dataset. However, you can follow the same steps in this tutorial to process large datasets.
Step 1: Set up Amazon SageMaker Studio domain
An AWS account can have only one SageMaker Studio domain per AWS Region. If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio setup guide to attach the required AWS IAM policies to your SageMaker Studio account, then skip Step 1, and proceed directly to Step 2.
If you don't have an existing SageMaker Studio domain, continue with Step 1 to run an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.
This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC.
Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-catalog; do not change it. The stack takes about 10 minutes to create all the resources.
Select I acknowledge that AWS CloudFormation might create IAM resources, and then choose Create stack.
On the CloudFormation pane, choose Stacks. When the stack is created, the status of the stack should change from CREATE_IN_PROGRESS to CREATE_COMPLETE.
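The same status check can be done from code instead of the console. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; the helper name wait_for_stack and its polling defaults are illustrative, not part of the tutorial. It polls the CloudFormation describe_stacks call until the stack leaves CREATE_IN_PROGRESS.

```python
import time


def wait_for_stack(cfn_client, stack_name, poll_seconds=30, max_polls=40):
    """Poll describe_stacks until the stack is no longer CREATE_IN_PROGRESS.

    cfn_client is expected to be a boto3 CloudFormation client.
    Returns the final status string, for example CREATE_COMPLETE.
    """
    for _ in range(max_polls):
        stacks = cfn_client.describe_stacks(StackName=stack_name)["Stacks"]
        status = stacks[0]["StackStatus"]
        if status != "CREATE_IN_PROGRESS":
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{stack_name} is still CREATE_IN_PROGRESS")


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   cfn = boto3.client("cloudformation", region_name="us-east-1")
#   print(wait_for_stack(cfn, "CFN-SM-IM-Lambda-catalog"))
```

Any status other than CREATE_COMPLETE (for example ROLLBACK_COMPLETE) means the stack did not finish successfully and should be inspected in the console.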
Enter SageMaker Studio into the AWS console search bar, and then choose SageMaker Studio.
Choose US East (N. Virginia) from the Region dropdown list on the upper right corner of the SageMaker console. For Launch app, select Studio to open SageMaker Studio using the studio-user profile.
Step 2: Start a new SageMaker Autopilot experiment
Developing and testing a large number of candidate models is crucial for machine learning (ML) projects. Amazon SageMaker Autopilot helps by generating different model candidates and automatically choosing the best model based on your data. In this step, you will configure a SageMaker Autopilot experiment to predict success from a financial services marketing campaign. This dataset represents a marketing campaign that was run by a major financial services institution to promote certificate of deposit enrollment.
For Deployment and advanced settings, you can keep everything as default. (Optionally, you can specify the Auto deploy endpoint name and select the machine learning problem type from the dropdown list.) Then select Next: Review and create.
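For readers who prefer code to the console, an experiment with roughly the same settings can be described as a request to the SageMaker create_auto_ml_job API. This is a sketch under stated assumptions: the helper name, the S3 locations, the role ARN, and the target column name "y" are all placeholders, not values from this tutorial. Leaving out the problem type lets Autopilot infer it, which matches keeping the console defaults.

```python
def build_autopilot_request(job_name, input_s3_uri, output_s3_uri,
                            target_column, role_arn, endpoint_name=None):
    """Assemble keyword arguments for the SageMaker create_auto_ml_job call."""
    request = {
        "AutoMLJobName": job_name,
        "InputDataConfig": [{
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "TargetAttributeName": target_column,
        }],
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "RoleArn": role_arn,
    }
    if endpoint_name is not None:
        # Mirrors the optional "Auto deploy endpoint name" console setting.
        request["ModelDeployConfig"] = {
            "AutoGenerateEndpointName": False,
            "EndpointName": endpoint_name,
        }
    return request


# Placeholder values; replace them with your own bucket, role, and column.
request = build_autopilot_request(
    job_name="autopilot-experiment",
    input_s3_uri="s3://example-bucket/input/",
    output_s3_uri="s3://example-bucket/output/",
    target_column="y",
    role_arn="arn:aws:iam::123456789012:role/example-sagemaker-role",
)

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   sm = boto3.client("sagemaker", region_name="us-east-1")
#   sm.create_auto_ml_job(**request)
```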
Step 3: Interpret model performance
Now that the experiment is complete and you have a model, the next step is to interpret its performance. In this step, you will use SageMaker Autopilot to analyze the model's performance.
Open the top-ranking model to obtain more details on its performance and metadata. From the list of models, highlight the first one and right-click to bring up the model options. Choose Open in model details to review the model's performance statistics.
In the new window, click on Explainability. The first view you see is called Feature Importance and represents the aggregated SHAP value for each feature across each instance in the dataset. The feature importance score is an important part of model explainability because it shows what features tend to influence the predictions the most in the dataset. In this use case, the customer duration or tenure and employment variation rate are the top two fields for driving the model's outcome.
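To make the idea of an aggregated SHAP value concrete, here is a small sketch that averages the absolute SHAP value of each feature over all instances, which is one common way to compute global feature importance. It is an assumption that this matches the exact aggregation Autopilot uses, and the feature names and numbers below are made up for illustration.

```python
def global_feature_importance(shap_rows, feature_names):
    """Mean absolute SHAP value per feature, sorted from most to least important."""
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for i, value in enumerate(row):
            totals[i] += abs(value)
    scores = {name: total / len(shap_rows)
              for name, total in zip(feature_names, totals)}
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Made-up SHAP values: one row per instance, one column per feature.
features = ["duration", "emp.var.rate", "age"]
shap_rows = [
    [0.40, -0.20, 0.05],
    [-0.30, 0.25, -0.05],
    [0.50, -0.15, 0.02],
]

importance = global_feature_importance(shap_rows, features)
for name, score in importance.items():
    print(f"{name}: {score:.3f}")
```

Taking the absolute value before averaging keeps positive and negative contributions from canceling out, so a feature that pushes predictions strongly in both directions still ranks as important.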
Now, click on the tab Performance. You will find detailed information on the model’s performance, including recall, precision, and accuracy. You can also interpret model performance and decide if additional model tuning is needed.
Next, visualizations are provided to further illustrate model performance. First, look at the confusion matrix. The confusion matrix is commonly used to understand how the model labels are divided among the predicted and true classes. In this case, the diagonal elements show the number of correctly predicted labels and the off-diagonal elements show the misclassified records. A confusion matrix is useful for analyzing misclassifications due to false positives and false negatives.
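The relationship between the confusion matrix and the precision, recall, and accuracy figures can be sketched in a few lines. The labels and predictions below are toy values invented for illustration, not output from the tutorial's model.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # diagonal: correctly predicted positive
        elif pred == positive:
            fp += 1  # off-diagonal: false positive
        elif truth == positive:
            fn += 1  # off-diagonal: false negative
        else:
            tn += 1  # diagonal: correctly predicted negative
    return tp, fp, fn, tn


def summarize(tp, fp, fn, tn):
    """Derive precision, recall, and accuracy from the four counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Toy labels and predictions (1 = positive class, 0 = negative class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(y_true, y_pred)
print(counts)              # (3, 1, 1, 3)
print(summarize(*counts))  # precision, recall, and accuracy are all 0.75
```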
Next, look at the precision versus recall curve. This curve shows the trade-off between model precision and recall as the probability threshold used to assign the label varies. SageMaker Autopilot automatically optimizes these two parameters to provide the best model.
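The trade-off can be seen by recomputing precision and recall at a few thresholds. This sketch uses invented labels and predicted probabilities; each threshold yields one point on a precision versus recall curve.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are labeled positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    # With no positive predictions, precision is reported as 1.0 by convention.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy true labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

for threshold in (0.2, 0.5, 0.8):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, raising the threshold from 0.2 to 0.8 raises precision from 0.60 to 1.00 while recall falls from 1.00 to 0.33.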
On the receiver operating characteristic (ROC) curve, the dashed line represents a model with no predictive value, which is often called the null model. The null model would randomly assign a 0/1 label, and its area under the ROC curve would be 0.5, meaning it ranks records no better than chance.
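The 0.5 figure for the null model follows from one interpretation of the area under the ROC curve: the probability that a randomly chosen positive record scores higher than a randomly chosen negative one, with ties counted as half. The sketch below uses that pairwise definition on toy data. As a deterministic stand-in for a model with no predictive value, it gives every record the same score, which is an illustrative simplification of random labeling.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels with an informative score and an uninformative constant score.
y_true = [0, 0, 1, 0, 1, 1]
informative = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
uninformative = [0.5] * len(y_true)

print(roc_auc(y_true, uninformative))         # 0.5
print(round(roc_auc(y_true, informative), 3)) # 0.889
```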
Next, click on the tab Artifacts. You can find the SageMaker Autopilot experiment’s supporting assets, including feature engineering code, input data locations, and explainability artifacts.
Step 4: Test the SageMaker model endpoint
Now that you have reviewed the model’s details, test the endpoint.
To know where to send the request, look up the model endpoint’s name. On the left pane, click the SageMaker Resources icon. In the SageMaker resources pane, select Endpoints. Click on the endpoint associated with the experiment name you created at the start of this tutorial. This will bring up the Endpoint Details window. Record the endpoint name and navigate back to the Python 3 notebook.
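The endpoint name can also be looked up from the notebook rather than the Studio UI. This is a sketch that assumes boto3 and valid credentials; the helper name find_endpoints is illustrative, and the name fragment you pass should come from your own experiment name.

```python
def find_endpoints(sm_client, name_fragment):
    """Return (name, status) pairs for endpoints whose name contains name_fragment.

    sm_client is expected to be a boto3 SageMaker client.
    """
    response = sm_client.list_endpoints(NameContains=name_fragment, MaxResults=50)
    return [(e["EndpointName"], e["EndpointStatus"])
            for e in response["Endpoints"]]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   sm = boto3.client("sagemaker", region_name="us-east-1")
#   for name, status in find_endpoints(sm, "autopilot-experiment"):
#       print(name, status)
```

An endpoint can serve requests only when its status is InService.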
import json

import boto3

# Define the endpoint's name.
ENDPOINT_NAME = 'autopilot-experiment-6d00f17b55464fc49c45d74362f284ce'
runtime = boto3.client('sagemaker-runtime')

# Define a test payload to send to your endpoint.
payload = {
    "data": {
        "features": {
            "values": [45, "blue-collar", "married", "basic.9y", "unknown", "yes", "no", "telephone", "may", "mon", 461, 1, 999, 0, "nonexistent", 1.1, 93.994, -36.4, 4.857, 5191.0]
        }
    }
}

# Submit an API request and capture the response object.
response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='application/json',
    Body=json.dumps(payload)
)

# Print the model endpoint's output.
print(response['Body'].read().decode())
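If the request above is rejected because of its content type, one alternative worth trying is to send the same feature values as a single CSV row. Whether your endpoint accepts text/csv is an assumption to verify against the model's inference container. The helper name to_csv_row is illustrative.

```python
import csv
import io


def to_csv_row(values):
    """Serialize one list of feature values as a single CSV line."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(values)
    # Drop the trailing line terminator added by csv.writer.
    return buffer.getvalue().rstrip("\r\n")


values = [45, "blue-collar", "married", "basic.9y", "unknown", "yes", "no",
          "telephone", "may", "mon", 461, 1, 999, 0, "nonexistent",
          1.1, 93.994, -36.4, 4.857, 5191.0]
csv_body = to_csv_row(values)
print(csv_body)

# Usage (requires boto3, AWS credentials, and a deployed endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   response = runtime.invoke_endpoint(
#       EndpointName="<your-endpoint-name>",
#       ContentType="text/csv",
#       Body=csv_body,
#   )
#   print(response["Body"].read().decode())
```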
Congratulations! You have learned how to use SageMaker Autopilot to automatically train and deploy a machine learning model.
Step 5: Clean up your AWS resources
It is a best practice to delete resources that you are no longer using so that you don't incur unintended charges.
If you ran the CloudFormation template in Step 1 to create a new SageMaker Studio domain, continue with the following steps to delete the domain, user, and the resources created by the CloudFormation template.
To open the CloudFormation console, enter CloudFormation into the AWS console search bar, and choose CloudFormation from the search results.
In the CloudFormation pane, choose Stacks. From the status drop down list, select Active. Under Stack name, choose CFN-SM-IM-Lambda-catalog to open the stack details page.
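Stack deletion can also be scripted. The sketch below assumes boto3 and valid credentials, and the helper name is illustrative. It requests deletion of the stack and then blocks on the CloudFormation stack_delete_complete waiter. Deletion may fail if resources inside the stack are still in use, so treat this as a starting point rather than a complete cleanup procedure.

```python
def delete_stack_and_wait(cfn_client, stack_name):
    """Request deletion of a CloudFormation stack and wait until it is gone.

    cfn_client is expected to be a boto3 CloudFormation client.
    """
    cfn_client.delete_stack(StackName=stack_name)
    waiter = cfn_client.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)
    return stack_name


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   cfn = boto3.client("cloudformation", region_name="us-east-1")
#   delete_stack_and_wait(cfn, "CFN-SM-IM-Lambda-catalog")
```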
Conclusion
Congratulations! You have now completed the Automatically Create Machine Learning Models tutorial.
You have successfully used SageMaker Autopilot to automatically build, train, and tune models, and then deploy the best candidate model to make predictions.
Next steps
Explore SageMaker Autopilot documentation