Generate Machine Learning Predictions Without Writing Code
What you will accomplish
In this tutorial, you will:
- Import datasets
- Select the target variable for classification
- Inspect datasets visually
- Build an ML model with the SageMaker Canvas Quick Build feature
- Understand model features and metrics
- Generate and understand bulk and single predictions
Before starting this tutorial, you will need:
- An AWS account: If you don't already have an account, follow the Setting Up Your AWS Environment getting started guide for a quick overview.
Time to complete
Cost to complete
See SageMaker Canvas pricing to estimate cost for this tutorial.
You must be logged into an AWS account.
Amazon SageMaker Canvas
June 28, 2022
In this tutorial, you will build an ML model that can predict the estimated time of arrival (ETA) of shipments (measured in days). You will use a dataset that contains complete shipping data for delivered products, including estimated time, shipping, priority, carrier, and origin.
Step 1: Set up Amazon SageMaker Studio domain
An AWS account can have only one SageMaker Studio domain per AWS Region. If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio setup guide to attach the required AWS IAM policies to your SageMaker Studio account, then skip Step 1, and proceed directly to Step 2.
If you don't have an existing SageMaker Studio domain, continue with Step 1 to run an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.
Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-catalog; do not change it. This stack takes about 10 minutes to create all the resources.
This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC.
Select I acknowledge that AWS CloudFormation might create IAM resources, and then choose Create stack.
On the CloudFormation pane, choose Stacks. It takes about 10 minutes for the stack to be created. When the stack is created, the status of the stack should change from CREATE_IN_PROGRESS to CREATE_COMPLETE.
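If you prefer to script the wait instead of watching the console, the status change described above can be sketched as a small polling loop. This is only an illustration: get_status is a hypothetical callable that you would supply yourself (for example, a wrapper around the CloudFormation DescribeStacks API), and the polling interval and timeout are assumed values, not part of the tutorial.

```python
import time
from typing import Callable


def wait_for_stack(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 1200.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the stack leaves CREATE_IN_PROGRESS, then return its status.

    get_status is a hypothetical caller-supplied function (an assumption of
    this sketch) that returns the current stack status as a string.
    """
    waited = 0.0
    while True:
        status = get_status()
        if status != "CREATE_IN_PROGRESS":
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"stack still {status} after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence standing in for real CloudFormation responses.
_simulated = iter(["CREATE_IN_PROGRESS", "CREATE_IN_PROGRESS", "CREATE_COMPLETE"])
final_status = wait_for_stack(lambda: next(_simulated), sleep=lambda _seconds: None)
print(final_status)
```

The function returns any status other than CREATE_IN_PROGRESS, so the caller should still check that the result is CREATE_COMPLETE rather than a failure or rollback status.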
Step 2: Log in to SageMaker Canvas and upload the datasets to an Amazon S3 bucket
Enter SageMaker Canvas in the search bar in the AWS console and go to SageMaker Canvas.
Choose Canvas under Control panel from the navigation pane and then choose US East (N. Virginia) from the Region dropdown list on the top right.
- On the SageMaker Canvas page, choose Launch SageMaker Canvas.
- On the Control Panel page, select Canvas from the Launch app dropdown list beside the studio-user.
- The SageMaker Canvas Creating application screen will be displayed. The application will take a few minutes to load.
If this is your first time using SageMaker in the US East (N. Virginia) Region (us-east-1), SageMaker Canvas creates an Amazon S3 bucket with a name that uses the following pattern: sagemaker-<your-Region>-<your-account-id>. Before proceeding with the rest of the tutorial, download the datasets below and save them to your local computer. You will then upload the datasets to the default S3 bucket that SageMaker Canvas has created for you.
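To see what the bucket name will look like for your account, the naming pattern can be expanded with a few lines of Python. This is a minimal sketch: the function name default_canvas_bucket is made up for illustration, and 111122223333 is a placeholder account ID, not a real one.

```python
def default_canvas_bucket(region: str, account_id: str) -> str:
    """Expand the pattern sagemaker-<your-Region>-<your-account-id>."""
    # AWS account IDs are 12-digit numbers; reject anything else early.
    if len(account_id) != 12 or not account_id.isdigit():
        raise ValueError("account_id must be a 12-digit string")
    return f"sagemaker-{region}-{account_id}"


# Placeholder account ID used purely for illustration.
bucket = default_canvas_bucket("us-east-1", "111122223333")
print(bucket)
```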
In the AWS console search bar, enter S3 and then select S3.
When the S3 console opens, you can find the default bucket that SageMaker Canvas has created for you under the Buckets section. Choose the bucket named sagemaker-<your-Region>-<your-account-id> and, on the next page, choose Upload.
On the Upload page, choose Add files, and select the two datasets you downloaded in the previous step. Scroll to the bottom of the page and choose Upload. SageMaker Canvas will access these files before building the model.
Step 3: Set up SageMaker Canvas for automatic model building
Import data into SageMaker Canvas for visual inspection and model building.
Import the dataset into SageMaker Canvas.
- On the SageMaker Canvas interface, choose Datasets on the left pane and then choose + Import.
- Select the Amazon S3 bucket named sagemaker-<your-Region>-<your-account-id> where you uploaded the datasets in the previous step. Select the shipping_logs.csv and product_descriptions.csv datasets by selecting the checkboxes to their left. Two new buttons appear at the bottom of your page: Preview all and Import data. Choose Preview all. This allows you to see a 100-row preview of the datasets.
- After you check the datasets, choose Import data to import them into SageMaker Canvas.
On the SageMaker Canvas page, under the Datasets section, you will find the two datasets that you imported. Choose Join data.
On the Join Datasets page, drag the two datasets from the left panel onto the right pane. Select the join icon between the two datasets. A pop-up showing details about the join will appear. Make sure that the join type is Inner and the joining column is ProductId. Choose Save & close and then choose Import data.
- On the Import data dialog box, enter the name ConsolidatedShippingData in the Import dataset name field and choose Import data.
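To make the effect of the Inner join type concrete, here is a minimal sketch in plain Python. Only the ProductId join column comes from the tutorial; the sample rows, the Category column, and the inner_join helper are hypothetical and exist only to show which rows survive the join.

```python
# Hypothetical sample rows; the real datasets have more columns and rows.
shipping_logs = [
    {"OrderID": "o-1", "ProductId": "p-1", "ActualShippingDays": 5},
    {"OrderID": "o-2", "ProductId": "p-2", "ActualShippingDays": 9},
    {"OrderID": "o-3", "ProductId": "p-9", "ActualShippingDays": 4},  # no matching product
]
product_descriptions = [
    {"ProductId": "p-1", "Category": "laptop"},
    {"ProductId": "p-2", "Category": "monitor"},
    {"ProductId": "p-3", "Category": "keyboard"},  # never shipped
]


def inner_join(left, right, key):
    """Return merged rows for every left/right pair that shares the same key value."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


consolidated = inner_join(shipping_logs, product_descriptions, "ProductId")
for row in consolidated:
    print(row)
```

An inner join keeps only rows whose ProductId appears in both datasets, so the unmatched order (p-9) and the unshipped product (p-3) are both dropped from the consolidated result.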
Step 4: Build, train, and analyze an ML model
Set up the target variable, visually inspect the properties of the data, and initiate the model-building process.
In the navigation pane in SageMaker Canvas, choose Models, and choose + New model. In the Create new model dialog box, enter ShippingForecast in the Model name field and choose Create.
The model view page consists of four tabs which represent the steps involved in building a model and getting predictions. The tabs are:
- Select – Set up the input data.
- Build – Build the ML model.
- Analyze – Analyze the model output and features.
- Predict – Run predictions in bulk or on a single sample.
On the Select tab, choose the radio button for the ConsolidatedShippingData dataset that you created in the previous step. This dataset contains 16 columns and 10,000 rows, and the tab shows a high-level description of the dataset shape and size. Choose Select dataset.
After you select the dataset, SageMaker Canvas automatically moves to the Build phase. On this tab, choose the target column, which is ActualShippingDays in this example. Since this column contains the historical number of days required for goods to arrive, it is suitable to be used as the target column.
- Once the target column is selected, SageMaker Canvas automatically tries to infer the problem type. Because you are interested in how many days it will take for the goods to reach the customer, this is a regression (numeric prediction) problem. Regression estimates the value of a dependent target variable based on one or more other variables or attributes that are correlated with it. In this case, SageMaker Canvas may initially classify this use case as a Time series forecasting problem because it detected a column with dates in the dataset. If it does, change the problem type to Numeric by using the Change type link at the center of the page.
You'll notice that SageMaker Canvas provides dataset statistics, including missing and mismatched values, unique values, and mean and median values for each of the columns in the dataset. You can use these statistics to decide which columns to drop. If you do not want to use a particular column for prediction, clear the checkbox to its left.
- Notice that the correlation of XShippingDistance and YShippingDistance columns with the target is negligible.
- Because features with negligible correlation with the target are not informative enough for the prediction task at hand, you can drop the XShippingDistance and YShippingDistance columns. You can also drop the ProductID and OrderID columns because they are primary keys and are not expected to contain any valuable information. To drop a column, clear its checkbox.
- You can select the vertical bars icon to inspect the distributions of the columns. This is useful in highlighting imbalances and potential bias in the data.
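The idea of dropping columns whose correlation with the target is negligible can be illustrated with a short calculation. This is a sketch under stated assumptions: it uses the Pearson correlation coefficient (the tutorial does not say which measure SageMaker Canvas computes), the sample values are invented, and the 0.1 cutoff is an arbitrary illustrative threshold.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented sample values; only the column names come from the tutorial.
target = [3, 5, 7, 9, 11, 13]  # ActualShippingDays
columns = {
    "ExpectedShippingDays": [2, 5, 6, 9, 10, 13],
    "XShippingDistance": [1, -1, 0, 0, -1, 1],
}

correlations = {name: pearson(values, target) for name, values in columns.items()}

# Keep only columns whose absolute correlation reaches the illustrative cutoff.
THRESHOLD = 0.1
kept = [name for name, r in correlations.items() if abs(r) >= THRESHOLD]
print(correlations)
print(kept)
```

With these sample values, ExpectedShippingDays tracks the target closely and is kept, while XShippingDistance has no linear relationship with it and is dropped.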
For purposes of this tutorial, choose Quick build to begin model building. This process takes less than 5 minutes to complete.
After model building is complete, SageMaker Canvas automatically switches to the Analyze tab to show the quick training results. The SageMaker Canvas model built using Quick build can predict the number of shipping days within +/- 1.233 days of the actual value. Machine learning introduces some stochasticity in the process of training models, which can lead to different results across different builds. Therefore, the exact metric values that you see might be different.
On the Overview tab, SageMaker Canvas shows the Column impact or the estimated importance of each input column in predicting the target column. In this example, the ExpectedShippingDays column has the most significant impact on predicting the number of shipping days. On the right panel, you can see the direction of impact of a feature as well. For example, the higher the value of ExpectedShippingDays, the more positive its impact on the number of shipping days prediction.
On the Scoring tab, you can see a plot representing the best-fit regression line for ActualShippingDays. On average, the model prediction differs by +/- 1.233 from the actual value of ActualShippingDays. The Scoring section for numeric prediction shows a line that indicates the model's predicted value in relation to the data used to make predictions. The width of the purple band around the line indicates the RMSE (root mean square error) range, and the values that the model predicts often fall within this range. For a deeper understanding of the model performance, choose the Advanced metrics link on the right to display the Advanced metrics page.
- The metrics shown on the Advanced metrics page are R², mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE). The Advanced metrics page also shows plots for visual inspection of the model performance. One image shows a graph of the residuals or errors. The horizontal line indicates an error of 0 or a perfect prediction. The blue dots are the errors. Their distance from the horizontal line indicates the magnitude of the errors.
- Scrolling down on the Advanced metrics page, you can see an error density plot which shows the distribution of the errors and their spread with respect to MAE and RMSE of the model. An error density with a shape similar to a normal distribution is indicative of good model performance.
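The metrics listed above have standard textbook definitions, which the following sketch computes on a handful of invented values. The actual and predicted numbers are made up for illustration, and the tutorial does not state exactly how SageMaker Canvas computes these values internally, so treat this as a way to build intuition rather than a reproduction of the Canvas output.

```python
import math

# Invented values standing in for ActualShippingDays and the model's predictions.
actual = [4.0, 6.0, 5.0, 8.0, 10.0]
predicted = [5.0, 5.0, 6.0, 7.0, 12.0]

n = len(actual)
residuals = [p - a for p, a in zip(predicted, actual)]  # error of each prediction

# Mean absolute error: average size of the errors.
mae = sum(abs(e) for e in residuals) / n

# Root mean square error: penalizes large errors more heavily than MAE.
rmse = math.sqrt(sum(e * e for e in residuals) / n)

# Mean absolute percentage error: average error relative to the actual value.
mape = 100.0 * sum(abs(e) / abs(a) for e, a in zip(residuals, actual)) / n

# R squared: share of the target's variance explained by the predictions.
mean_actual = sum(actual) / n
ss_res = sum(e * e for e in residuals)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1.0 - ss_res / ss_tot

# Fraction of predictions that fall within +/- RMSE of the actual value.
within_band = sum(abs(e) <= rmse for e in residuals) / n

print(f"MAE={mae:.3f} RMSE={rmse:.3f} MAPE={mape:.2f}% R2={r2:.3f} within_band={within_band:.0%}")
```

In this sample, four of the five predictions fall inside the +/- RMSE band, which mirrors the statement that predicted values often, but not always, fall within that range.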
Step 5: Generate model predictions
Now that you have a regression model, you can either use the model to run predictions, or you can create a new version of this model to train with the Standard build process. In this step, you use SageMaker Canvas to generate predictions, both single and in bulk, from a dataset.
To start generating predictions, choose the Predict button at the bottom of the Analyze page, or choose the Predict tab.
- On the Predict page, Batch prediction is already selected. Choose Select dataset and then select the ConsolidatedShippingData dataset. In actual ML workflows, this dataset should be separate from the training dataset. However, for simplicity, you use the same dataset to demonstrate how SageMaker Canvas generates predictions. Choose Generate predictions.
- After a few seconds, the prediction is done. Hover over the predictions dataset name or status, choose the options icon, and select Preview to see a preview of the predictions. You can also choose Download to download a CSV file containing the full output. SageMaker Canvas returns a prediction for each row of data. In this tutorial, the feature with the highest importance is ExpectedShippingDays, so it is presented beside the predictions for a visual comparison.
- On the Predict page, you can generate predictions for a single sample by selecting Single prediction. SageMaker Canvas presents an interface in which you can manually enter values for each of the input variables used in the model. This type of analysis is ideal for what-if scenarios where you want to know how the prediction changes when one or more variables increase or decrease in value. With the prediction of the single set of column values, SageMaker Canvas provides individual feature importance. This indicates the columns with the highest influence towards the current sample prediction.
Standard build is outside the scope of this tutorial. However, training in that mode follows steps similar to the ones outlined here.
You can start by giving the model a name such as ShippingForecastStandardModel. Then, on the Build tab, choose Standard build instead of Quick build and proceed through the remaining steps. Standard build also lets you share the trained model with data scientists through SageMaker Studio, which allows collaboration, quick model refinement, and iteration. The sharing option is available in the Analyze tab once model training is complete.
Step 6: Clean up your AWS resources
It is a best practice to delete resources that you are no longer using so that you don't incur unintended charges.
Navigate to the S3 console and select the Buckets menu option.
Delete the objects that you uploaded for this tutorial. Select the name of the bucket you have been working with for this tutorial. Select the checkbox to the left of each object name, then select the Delete button. On the Delete objects page, verify that you have selected the proper objects to delete and type permanently delete into the Permanently delete objects confirmation box. Then, select the Delete objects button to continue. A banner then indicates whether the deletion was successful.
On the SageMaker Canvas main page, choose Models. On the right pane, the model you built is visible. Choose the vertical ellipsis to the right of the View option and select Delete model.
On the SageMaker Studio console, select studio-user and for each app listed under Apps, choose Delete app. Follow on-screen prompts to confirm the delete operation. Wait until status shows as Deleted.
If you used an existing SageMaker Studio domain in Step 1, skip the rest of Step 6 and proceed directly to the conclusion section.
If you ran the CloudFormation template in Step 1 to create a new SageMaker Studio domain, continue with the following steps to delete the domain, user, and the resources created by the CloudFormation template.
To open the CloudFormation console, enter CloudFormation into the AWS console search bar, and choose CloudFormation from the search results.
In the CloudFormation pane, choose Stacks. From the status dropdown list, select Active. Under Stack name, choose CFN-SM-IM-Lambda-catalog to open the stack details page.
On the CFN-SM-IM-Lambda-catalog stack details page, choose Delete to delete the stack along with the resources it created in Step 1.
Congratulations! You have finished the Generate Machine Learning Predictions Without Writing Code tutorial.
You have successfully used Amazon SageMaker Canvas to import and prepare a dataset for ML from Amazon S3, select the target variable, build an ML model using the Quick build mode, and generate predictions using the visual interface.