Generate Machine Learning Predictions Without Writing Code

TUTORIAL

Overview

In this tutorial, you learn how to use Amazon SageMaker Canvas to build machine learning (ML) models and generate accurate predictions without writing a single line of code.
 
Amazon SageMaker Canvas is a visual point-and-click interface that expands the use of ML to business analysts, helping them to make business decisions without any ML experience. With SageMaker Canvas, business analysts can build ML models and generate predictions on their own. As a SageMaker Canvas user, you can import data from disparate sources; pick the target variables needed for predictions; and prepare and analyze data. Using the built-in AutoML capabilities, in just a few clicks you can build an ML model and generate accurate predictions, either single or in bulk, to assist with business decisions.

What you will accomplish

In this tutorial, you will:

  • Import datasets
  • Select the target variable for classification
  • Inspect datasets visually
  • Build an ML model with the SageMaker Canvas Quick Build feature
  • Understand model features and metrics
  • Generate and understand bulk and single predictions

Prerequisites

Before starting this tutorial, you will need:

  • AWS experience: Beginner
  • Time to complete: 20 minutes
  • Cost to complete: Less than $1.00, free tier eligible
  • Requires: You must be logged into an AWS account.
  • Services used: Amazon SageMaker Canvas
  • Last updated: April 25, 2023

Implementation

In this tutorial, you will build an ML model that predicts the estimated time of arrival (ETA) of shipments, measured in days. You will use a dataset that contains complete shipping data for delivered products, including estimated time, shipping priority, carrier, and origin.

 

Step 1: Set up Amazon SageMaker Studio domain

If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio setup guide to attach the required AWS IAM policies to your SageMaker Studio account, then skip Step 1, and proceed directly to Step 2.
 
If you don't have an existing SageMaker Studio domain, continue with Step 1 to run an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.
1.1    Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-Catalog; do not change it. The stack takes about 10 minutes to create all the resources.

This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC.

Step 2: Log into SageMaker Canvas and upload dataset to Amazon S3 bucket

2.1    Enter SageMaker Canvas in the search bar in the AWS console and go to SageMaker Canvas.

2.2    Choose Canvas under Getting started from the left pane and then choose US East (N. Virginia) from the Region dropdown list on the top right.

  • On the SageMaker Canvas page, choose Open Canvas.
  • The SageMaker Canvas Creating application screen will be displayed. The application will take a few minutes to load.
2.3    If this is your first time using SageMaker in the US East (N. Virginia) Region (us-east-1), SageMaker Canvas creates an Amazon S3 bucket with a name that uses the following pattern: sagemaker-<your-Region>-<your-account-id>. Before proceeding with the rest of the tutorial, download the datasets below and save them to your local computer. You will then upload the datasets to the default S3 bucket that SageMaker Canvas has created for you.

2.4    Download the shipping log and products datasets from the following links.

2.5    In the AWS console search bar, enter S3 and then select S3.
2.6    When the S3 console opens, you can find your default bucket that SageMaker Canvas has created for you under the Buckets section. Choose the bucket named sagemaker-<your-Region>-<your-account-id> and in the next page choose Upload.
2.7    On the Upload page, choose Add files, and select the two datasets you downloaded in the previous step. Scroll to the bottom of the page and choose Upload. SageMaker Canvas will access these files before building the model.
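
If you would rather script the upload in steps 2.5 to 2.7 than use the console, the following is a minimal sketch. It assumes you pass in an S3 client from the AWS SDK for Python (boto3) and that the two CSV files are on your local disk. The helper names default_bucket_name and upload_datasets are illustrative and are not part of any AWS API.

```python
from pathlib import Path


def default_bucket_name(region: str, account_id: str) -> str:
    """Build the bucket name from the pattern described in step 2.3."""
    return f"sagemaker-{region}-{account_id}"


def upload_datasets(s3_client, bucket: str, paths: list[str]) -> list[str]:
    """Upload each local file to the bucket root and return the object keys.

    s3_client is expected to behave like boto3.client("s3").
    """
    keys = []
    for path in paths:
        key = Path(path).name  # store each object under its bare file name
        s3_client.upload_file(Filename=str(path), Bucket=bucket, Key=key)
        keys.append(key)
    return keys
```

With boto3 installed and credentials configured, you could call upload_datasets(boto3.client("s3"), default_bucket_name("us-east-1", "<your-account-id>"), ["shipping_logs.csv", "product_descriptions.csv"]).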

Step 3: Set up SageMaker Canvas for automatic model building

Import data into SageMaker Canvas for visual inspection and model building.

3.1    Import the dataset into SageMaker Canvas

  • On the SageMaker Canvas interface, choose Datasets on the left pane and then choose + Import.
  • Select the Amazon S3 bucket named sagemaker-<your-Region>-<your-account-id> where you uploaded the datasets in the previous step. Select the shipping_logs.csv and product_descriptions.csv datasets by selecting the checkboxes to their left. Two new buttons appear at the bottom of your page: Preview all and Import data. Choose Preview all. This allows you to see a 100-row preview of the datasets. Select the arrows to view the preview.
  • After you check the datasets, choose Import data to import them into SageMaker Canvas.
3.2    On the SageMaker Canvas page, under the Datasets section, you will find the two datasets that you imported. Choose Join data.

3.3    On the Join Datasets page, drag the two datasets from the left panel onto the right pane. Select the join icon between the two datasets. A pop-up showing details about the join will appear. Make sure that the join type is Inner and the joining column is ProductId. Choose Save & close and then choose Import data.

On the Import data dialog box, enter the name ConsolidatedShippingData in the Import dataset name field and choose Import data.
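
To make the effect of the Inner join on ProductId concrete, here is a small sketch in plain Python. The rows and the ProductCategory column are made up for illustration; only ProductId, OrderID, and ActualShippingDays are names taken from this tutorial. This is not how SageMaker Canvas implements the join internally.

```python
def inner_join(left, right, key):
    """Return one merged row for every left/right pair that shares a key value."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample rows standing in for shipping_logs.csv.
shipping_logs = [
    {"OrderID": "o1", "ProductId": "p1", "ActualShippingDays": 5},
    {"OrderID": "o2", "ProductId": "p2", "ActualShippingDays": 3},
    {"OrderID": "o3", "ProductId": "p9", "ActualShippingDays": 7},  # no matching product
]
# Hypothetical sample rows standing in for product_descriptions.csv.
products = [
    {"ProductId": "p1", "ProductCategory": "toys"},
    {"ProductId": "p2", "ProductCategory": "books"},
    {"ProductId": "p3", "ProductCategory": "garden"},  # never shipped
]

consolidated = inner_join(shipping_logs, products, "ProductId")
for row in consolidated:
    print(row)
```

An inner join keeps only the rows whose ProductId appears in both datasets, so the order for p9 and the product p3 are both dropped from the consolidated result.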

Step 4: Build, train, and analyze an ML model

Set up the target variable, visually inspect the properties of the data and initiate the model building process.
4.1    In the navigation pane in SageMaker Canvas, choose Models, and choose + New model. In the Create new model dialog box, enter ShippingForecast in the Model name field and choose Create.

4.2    The model view page consists of four tabs which represent the steps involved in building a model and getting predictions. The tabs are:

  • Select – Set up the input data.
  • Build – Build the ML model.
  • Analyze – Analyze the model output and features.
  • Predict – Run predictions in bulk or on a single sample.

On the Select tab, choose the radio button for the ConsolidatedShippingData dataset that you created in the previous step. The tab shows a high-level description of the dataset's shape and size: 16 columns and 10,000 rows. Choose Select dataset.

4.3    After you select the dataset, SageMaker Canvas automatically moves to the Build phase. On this tab, choose the target column, which is ActualShippingDays in this example. Since this column contains the historical number of days required for goods to arrive, it is suitable to be used as the target column.
4.4    Once the target column is selected, SageMaker Canvas automatically tries to infer the problem type. Because you are interested in how many days it will take for the goods to arrive for the customer, this is a regression, or numeric prediction, problem. Regression estimates the values of a dependent target variable based on one or more other variables or attributes that are correlated with it. In this case, SageMaker Canvas may initially classify the use case as a Time series forecasting problem because it detected a column with dates in the dataset. If so, change the problem type to Numeric by using the Change type link at the right of the page.

4.5    SageMaker Canvas provides dataset statistics, including missing and mismatched values, unique values, and mean and median values for each of the columns in the dataset. You can use these statistics to decide which columns to drop. If you do not want to use a particular column for prediction, clear its checkbox on the left.

  • Notice that the correlation of XShippingDistance and YShippingDistance columns with the target is negligible.
  • Because features with negligible correlation with the target are not informative enough for the prediction task at hand, you can drop the XShippingDistance and YShippingDistance columns. You can also drop the ProductId and OrderID columns because they are primary keys and are not expected to contain any valuable information. To drop a column, clear its checkbox.
  • You can select the vertical bars icon to inspect the distributions of the columns. This is useful in highlighting imbalances and potential bias in the data.
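
The idea of dropping columns with negligible correlation can be sketched with a Pearson correlation computed by hand. The sample values and the 0.1 threshold below are assumptions made for illustration; the tutorial does not state how SageMaker Canvas computes the correlation it displays.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def weakly_correlated(columns, target, threshold=0.1):
    """Names of columns whose absolute correlation with the target is below threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) < threshold
    ]


# Hypothetical values; real data would come from ConsolidatedShippingData.
actual_shipping_days = [2, 4, 6, 8]
features = {
    "ExpectedShippingDays": [1, 3, 5, 7],
    "XShippingDistance": [1, -1, -1, 1],
}
print(weakly_correlated(features, actual_shipping_days))
```

Columns returned by weakly_correlated are candidates for dropping, in the same way that you clear the checkboxes in the Build tab.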

4.6    After you complete data exploration, you can train a model. In SageMaker Canvas, there are two methods for training:  Quick build and Standard build. The Quick build usually takes 2-15 minutes to build the model, whereas the standard build usually takes 2-4 hours and generally has a higher accuracy. Quick build trains fewer combinations of models and hyperparameters to prioritize speed. This is especially applicable in cases like this tutorial, where the goal is to quickly develop a prototype model for the chosen use case.

For purposes of this tutorial, choose Quick build to begin model building. A popup window will appear about validating your data.  Choose Start quick build to begin model training. This process takes less than 5 minutes to complete.

4.7    After model building is complete, SageMaker Canvas automatically switches to the Analyze tab to show the quick training results. The SageMaker Canvas model built using Quick build can predict the number of shipping days within +/-1.148 of the actual value. Machine learning introduces some stochasticity into the process of training models, which can lead to different results across builds. Therefore, the exact metric values that you see might be different.
4.8    On the Overview tab, SageMaker Canvas shows the Column impact or the estimated importance of each input column in predicting the target column. In this example, the ExpectedShippingDays column has the most significant impact on predicting the number of shipping days. On the right panel, you can see the direction of impact of a feature as well. For example, the higher the value of ExpectedShippingDays, the more positive its impact on the number of shipping days prediction.

4.9    On the Scoring tab, you can see a plot representing the best-fit regression line for ActualShippingDays. On average, the model prediction differs from the actual value of ActualShippingDays by +/- 1.148. The Scoring section for numeric prediction shows a line that indicates the model's predicted value in relation to the data used to make predictions. Predicted values are often within +/- the RMSE (root mean square error) of the actual values. The width of the purple band around the line indicates the RMSE range, and the predicted values often fall within that range. For a deeper understanding of the model performance, choose the Advanced metrics link on the right to display the Advanced metrics page.

  • The various metrics shown on the Advanced metrics page are R2, mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE). The Advanced metrics page also shows plots for visual inspection of the model performance. One image shows a graph of the residuals or errors. The horizontal line indicates an error of 0 or a perfect prediction. The blue dots are the errors. Their distance from the horizontal line indicates the magnitude of the errors.
  • Scrolling down on the Advanced metrics page, you can see an error density plot which shows the distribution of the errors and their spread with respect to MAE and RMSE of the model. An error density with a shape similar to a normal distribution is indicative of good model performance.
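
For readers who want to see what the four advanced metrics measure, the following sketch computes them from their standard textbook definitions. The sample values are made up, and the tutorial does not show SageMaker Canvas's own implementation, so treat this as an illustration rather than a reproduction of the Advanced metrics page.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, RMSE, MAPE (in percent), and R2 for paired values.

    MAPE divides by each actual value, so actual must not contain zeros.
    """
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]  # residuals
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}


# Hypothetical actual and predicted shipping days.
metrics = regression_metrics([2, 4, 6, 8], [3, 3, 7, 7])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The residuals computed in the first step are the same quantities plotted as dots on the residuals graph: a residual of 0 corresponds to a point on the horizontal line.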

Step 5: Generate model predictions

Now that you have a regression model, you can either use the model to run predictions, or you can create a new version of this model to train with the Standard build process. In this step, you use SageMaker Canvas to generate predictions, both single and in bulk, from a dataset.

5.1    To start generating predictions, choose the Predict button at the bottom of the Analyze page, or choose the Predict tab.

  • On the Predict page, Batch prediction is already selected. Choose Select dataset and then select the ConsolidatedShippingData dataset. In actual ML workflows, this dataset should be separate from the training dataset. However, for simplicity, you use the same dataset to demonstrate how SageMaker Canvas generates predictions. Choose Generate predictions.
  • After a few seconds, the prediction is done. Hover over the predictions dataset name or status, choose the options icon, and select Preview to see a preview of the predictions. You can also choose Download to download a CSV file containing the full output. SageMaker Canvas returns a prediction for each row of data. The feature with the highest importance, which is ExpectedShippingDays in this tutorial, is presented beside the predictions for a visual comparison.
  • On the Predict page, you can generate predictions for a single sample by selecting Single prediction. SageMaker Canvas presents an interface in which you can manually enter values for each of the input variables used in the model. This type of analysis is ideal for what-if scenarios where you want to know how the prediction changes when one or more variables increase or decrease in value. With the prediction of the single set of column values, SageMaker Canvas provides individual feature importance. This indicates the columns with the highest influence towards the current sample prediction.
5.2    After the model building process, SageMaker Canvas uploads all artifacts including the trained model saved as a pickle file, metrics, datasets, and predictions to your default S3 bucket under a location named Canvas/studio-user.  You can inspect the contents and use them as necessary for further development.
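
To inspect those artifacts programmatically, you could list the objects under the Canvas/studio-user location. This is a sketch that assumes you pass in a boto3 S3 client; the function name list_canvas_artifacts is illustrative, and the exact keys you see depend on your own builds.

```python
def list_canvas_artifacts(s3_client, bucket: str, prefix: str = "Canvas/studio-user"):
    """Return the keys of all objects stored under the given prefix.

    s3_client is expected to behave like boto3.client("s3"). A paginator is
    used because a single list_objects_v2 call returns at most 1,000 keys.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no objects match.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```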
5.3    For the scope of this tutorial, the Standard model is not covered. However, training in this mode is similar to the steps outlined in this tutorial.

You can start by giving the model a name such as ShippingForecastStandardModel. Then, on the Build tab, choose Standard build instead of Quick build and proceed through the remaining steps. The Standard build mode also lets you share the trained model with data scientists through SageMaker Studio, which allows collaboration, quick model refinement, and iteration. To share a model, choose Share in the top right, select the user you want to share the model with in the pop-up, and then choose Share.

Step 6: Clean up your AWS resources

It is a best practice to delete resources that you are no longer using so that you don't incur unintended charges.

6.1    Navigate to the S3 console and choose Buckets. Navigate to your bucket named sagemaker-<your-Region>-<your-account-id> and select the check box to the left of all of the files and folders. Next, choose Delete.

  • On the Delete objects page, verify that you have selected the proper objects to delete. In the Permanently delete objects section, confirm by entering permanently delete in the text field and choose Delete objects. After the deletion completes and the bucket is empty, you can delete the S3 bucket itself by following the same process. A success banner appears after deletion is complete.
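
The same cleanup can be scripted. The sketch below assumes a boto3 S3 client is passed in and that the bucket is not versioned (a versioned bucket would also need its object versions removed). The function name empty_and_delete_bucket is illustrative. The operation is destructive, so double-check the bucket name before running anything like this.

```python
def empty_and_delete_bucket(s3_client, bucket: str, batch_size: int = 1000) -> int:
    """Delete every object in the bucket, then the bucket; return the object count.

    s3_client is expected to behave like boto3.client("s3").
    delete_objects accepts at most 1,000 keys per request, hence the batching.
    """
    keys = []
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    for start in range(0, len(keys), batch_size):
        batch = [{"Key": key} for key in keys[start:start + batch_size]]
        # Per-key failures are reported in the response rather than raised.
        s3_client.delete_objects(Bucket=bucket, Delete={"Objects": batch})

    # A bucket can only be deleted once it is empty.
    s3_client.delete_bucket(Bucket=bucket)
    return len(keys)
```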

 

6.2    On the SageMaker Canvas main page, choose Models. On the right pane, the model you built is visible. Choose the vertical ellipsis to the right of the View option and select Delete model.
6.3    After the model is deleted, choose Log out to end your Canvas session.
6.4    If you used an existing SageMaker Studio domain in Step 1, skip the rest of Step 6 and proceed directly to the conclusion section.
 
If you ran the CloudFormation template in Step 1 to create a new SageMaker Studio domain, continue with the following steps to delete the domain, user, and the resources created by the CloudFormation template. 
6.5    To open the CloudFormation console, enter CloudFormation into the AWS console search bar, and choose CloudFormation from the search results.
6.6    In the CloudFormation pane, choose Stacks. From the status dropdown list, select Active. Under Stack name, choose CFN-SM-IM-Lambda-Catalog to open the stack details page.
6.7    On the CFN-SM-IM-Lambda-Catalog stack details page, choose Delete to delete the stack along with the resources it created in Step 1.
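
As an alternative to the console steps, the stack deletion can be sketched with the AWS SDK for Python. The example assumes you pass in a boto3 CloudFormation client created for the US East (N. Virginia) Region and the stack name you used in Step 1; the function name delete_stack_and_wait is illustrative.

```python
def delete_stack_and_wait(cfn_client, stack_name: str) -> None:
    """Request deletion of a CloudFormation stack and block until it completes.

    cfn_client is expected to behave like boto3.client("cloudformation").
    """
    cfn_client.delete_stack(StackName=stack_name)
    # The waiter polls the stack status and raises if the deletion fails.
    waiter = cfn_client.get_waiter("stack_delete_complete")
    waiter.wait(StackName=stack_name)
```

With boto3 installed, a call might look like delete_stack_and_wait(boto3.client("cloudformation", region_name="us-east-1"), "<your-stack-name>").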

Conclusion

Congratulations! You have finished the Generate Machine Learning Predictions Without Writing Code tutorial.
 

You have successfully used Amazon SageMaker Canvas to import and prepare a dataset for ML from Amazon S3, select the target variable, build an ML model using the Quick build mode, and generate bulk and single predictions using the visual interface.

Next steps