Prepare Training Data for Machine Learning with Minimal Code

TUTORIAL

Overview

In this tutorial, you will learn how to prepare data for machine learning (ML) using Amazon SageMaker Data Wrangler.
 
Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes. Using SageMaker Data Wrangler, you can simplify data preparation and feature engineering and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface.
 
In this tutorial, you will use Amazon SageMaker Data Wrangler to prepare data to train a credit risk prediction model. You will use a version of the German credit risk dataset found in the UCI Machine Learning Repository. The data consists of a thousand records, each containing an individual's information, including demographics, employment details, and financial data. In addition, each record includes a credit risk field labeled high or low. You will upload the data into Amazon Simple Storage Service (Amazon S3), create a new SageMaker Data Wrangler flow, transform the data, check the data for bias, and lastly save the output to Amazon S3 to be used later for ML training.

What you will accomplish

In this guide, you will:

  • Visualize and analyze data to understand key relationships
  • Apply transformations to clean up the data and generate new features
  • Automatically generate notebooks for repeatable data preparation workflows

Prerequisites

Before starting this tutorial, you will need:

 AWS experience

Beginner

 Time to complete

30 minutes

 Cost to complete

See Amazon SageMaker pricing to estimate cost for this tutorial.

 Requires

You must be logged into an AWS account.

 Services used

Amazon SageMaker Data Wrangler

 Last updated

July 1, 2022

Implementation

Step 1: Set up your Amazon SageMaker Studio domain

Amazon SageMaker Data Wrangler runs inside SageMaker Studio. In this tutorial, you prepare data using the SageMaker Data Wrangler visual interface, which requires a SageMaker Studio domain.

An AWS account can have only one SageMaker Studio domain per Region. If you already have a SageMaker Studio domain in the US East (N. Virginia) Region, follow the SageMaker Studio setup guide to attach the required AWS IAM policies to your SageMaker Studio account, then skip Step 1, and proceed directly to Step 2. 

If you don't have an existing SageMaker Studio domain, continue with Step 1 to run an AWS CloudFormation template that creates a SageMaker Studio domain and adds the permissions required for the rest of this tutorial.

Choose the AWS CloudFormation stack link. This link opens the AWS CloudFormation console and creates your SageMaker Studio domain and a user named studio-user. It also adds the required permissions to your SageMaker Studio account. In the CloudFormation console, confirm that US East (N. Virginia) is the Region displayed in the upper right corner. The stack name should be CFN-SM-IM-Lambda-catalog; do not change it. This stack takes about 10 minutes to create all the resources.

This stack assumes that you already have a public VPC set up in your account. If you do not have a public VPC, see VPC with a single public subnet to learn how to create a public VPC. 

Select I acknowledge that AWS CloudFormation might create IAM resources, and then choose Create stack.

On the CloudFormation pane, choose Stacks. It takes about 10 minutes for the stack to be created. When the stack is created, the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE.

Step 2: Create a new SageMaker Data Wrangler flow

SageMaker Data Wrangler accepts data from a wide variety of sources, including Amazon S3, Amazon Athena, Amazon Redshift, Snowflake, and Databricks. In this step, you will create a new SageMaker Data Wrangler flow using the UCI German credit risk dataset stored in Amazon S3. This dataset contains demographic and financial information about individuals along with a label indicating the credit risk level of the individual.

Enter SageMaker Studio into the console search bar, and then choose SageMaker Studio.

Choose US East (N. Virginia) from the Region dropdown list on the upper right corner of the SageMaker console. For Launch app, select Studio to open SageMaker Studio using the studio-user profile.

Open the SageMaker Studio interface. On the navigation bar, choose File, New, Data Wrangler Flow.

In the Import tab, under Import data, choose Amazon S3.

 


In the S3 URI path field, enter s3://sagemaker-sample-files/datasets/tabular/uci_statlog_german_credit_data/german_credit_data.csv, and then choose Go. Under Object name, choose german_credit_data.csv, and then choose Import.
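Before building the flow, it can help to see what a pandas view of this kind of data looks like. The following is a minimal sketch using a small in-memory sample; the column names and values are illustrative assumptions, not the exact schema of german_credit_data.csv.

```python
# A hedged sketch: inspecting a few rows shaped like the German credit
# dataset with pandas. Column names and values here are assumptions.
import io
import pandas as pd

sample_csv = io.StringIO(
    "age,status_sex,savings,creditamount,risk\n"
    "35,male:single,little,2500,low risk\n"
    "28,female:divorced,moderate,4200,high risk\n"
)

df = pd.read_csv(sample_csv)
print(df.shape)  # (2, 5)
print(df.dtypes)
```

Loading a small sample like this is a quick way to confirm column names and data types before designing transformations in the visual interface.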

 

Step 3: Profile the data

In this step, you use SageMaker Data Wrangler to assess the quality of the training dataset. You can use the Quick Model feature to roughly estimate the expected prediction quality and the predictive power of the features in your dataset.

 

On the Data Flow tab, in the data flow diagram, choose the + icon, Add analysis.


Under the Create analysis panel, for Analysis type, select Histogram.


For X axis, select age.

For Color by, select risk.

Choose Preview to generate a histogram of the age field, color-coded by credit risk.

Choose Save to save this analysis to the flow.

 


To understand how well the dataset is suited to train a model that predicts the risk target variable, run the Quick Model analysis. From the Analysis tab, choose Create new analysis.

Step 3

Under the Create analysis pane, for Analysis type, choose Quick Model. For Label, select risk, and then choose Preview. The Quick Model pane shows you a brief overview of the model used and some basic statistics, including the F1 score and feature importance, to help you evaluate the quality of the dataset. Choose Save.
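The F1 score that Quick Model reports is the harmonic mean of precision and recall. The following stdlib-only sketch computes it from hypothetical binary labels and predictions, to make the metric concrete.

```python
# A minimal sketch of the F1 score that Quick Model reports, computed
# from hypothetical binary labels and predictions (stdlib only).
def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
print(round(f1_score(y_true, y_pred), 3))  # 0.75
```

An F1 score closer to 1 suggests the features carry useful signal for the label; a low score suggests the dataset may need more cleaning or better features.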


Step 4: Add transformations to the data flow

SageMaker Data Wrangler simplifies data processing by providing a visual interface with which you can add a wide variety of pre-built transformations. You can also write your custom transformations using SageMaker Data Wrangler. In this step, you flatten complex string data, encode categories, rename columns, and drop unnecessary columns using the visual editor. You then split the status_sex column into two new columns, marital_status and sex.

 

To navigate to the data flow diagram, choose Data flow.

On the data flow diagram, choose the + icon, Add transform.

Under the ALL STEPS pane, choose Add step.

From the ADD TRANSFORM list, choose Search and edit, which is a transformation used to manipulate string data.

Under the SEARCH AND EDIT pane, for Transform, select Split string by delimiter. For Input columns, select status_sex. In the Delimiter box, enter the : symbol. In the Output column, enter vec. Choose Preview, and then Add.

This transformation creates a new column called vec at the end of the dataframe by splitting the status_sex column. The status_sex column contains strings delimited by colons, and the new vec column contains vectors delimited by commas.
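In pandas terms, this split looks roughly like the following sketch; the status_sex values are hypothetical, colon-delimited as described above.

```python
# A hedged pandas sketch of the Split string by delimiter transform.
# The status_sex values here are hypothetical examples.
import pandas as pd

df = pd.DataFrame({"status_sex": ["male:single", "female:divorced"]})

# Split each string on ":" into a list-valued column, mirroring vec.
df["vec"] = df["status_sex"].str.split(":")
print(df["vec"].tolist())  # [['male', 'single'], ['female', 'divorced']]
```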

To split the vec column and create two new columns, sex_split_0 and sex_split_1:

Under ALL STEPS, choose + Add step.

From the ADD TRANSFORM list, choose Manage vectors.

Under the MANAGE VECTORS pane, for Transform, select Flatten. For Input columns, select vec. For output_prefix, enter sex_split.

Choose Preview, then Add.

 

 


To rename the columns created by the split transformation:

Under the ALL STEPS pane, choose + Add step.

From the ADD TRANSFORM list, choose Manage columns.

Under the MANAGE COLUMNS pane, for Transform, select Rename column. For Input column, select sex_split_0. In the New name box, enter sex.

Choose Preview, then Add.

Repeat this procedure to rename sex_split_1 to marital_status.
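The flatten-and-rename sequence above can be sketched in pandas as follows; the sample values are hypothetical.

```python
# A hedged pandas sketch of the Flatten and Rename column steps: expand
# the list-valued vec column into sex_split_0 / sex_split_1, then rename.
import pandas as pd

df = pd.DataFrame({"vec": [["male", "single"], ["female", "divorced"]]})

# Flatten: one output column per vector position, prefixed with sex_split.
flat = pd.DataFrame(df["vec"].tolist(), columns=["sex_split_0", "sex_split_1"])
df = pd.concat([df, flat], axis=1)

# Rename: give the split columns meaningful names.
df = df.rename(columns={"sex_split_0": "sex", "sex_split_1": "marital_status"})
print(list(df.columns))  # ['vec', 'sex', 'marital_status']
```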

Step 5: Add categorical encoding

In this step, you create a modeling target and encode categorical variables. Categorical encoding transforms string data type categories into numerical labels. It's a common preprocessing task because the numerical labels can be used in a wide variety of model types.

In the dataset, the credit risk classification is represented by the strings high risk and low risk. In this step, you convert this classification to a binary representation, 0 or 1. 

Under the ALL STEPS pane, choose + Add Step. From the ADD TRANSFORM list, choose Encode categorical. SageMaker Data Wrangler provides three transformation types: Ordinal encode, One hot encode, and Similarity encode. Under the ENCODE CATEGORICAL pane, for Transform, leave the default Ordinal encode. For Input columns, select risk. In the Output column, enter target. Ignore the Invalid handling strategy box for this tutorial. Choose Preview, then Add.
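In pandas, the ordinal encoding of risk into a binary target can be sketched like this; the exact label strings and the 0/1 assignment are assumptions, since Data Wrangler picks its own ordinal mapping.

```python
# A hedged pandas sketch of ordinal-encoding the risk column into a
# binary target. The label strings and 0/1 assignment are assumptions.
import pandas as pd

df = pd.DataFrame({"risk": ["low risk", "high risk", "low risk"]})

risk_map = {"low risk": 0, "high risk": 1}
df["target"] = df["risk"].map(risk_map)
print(df["target"].tolist())  # [0, 1, 0]
```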

To encode the savings categorical column as you did in the previous procedure, but this time using a custom transformation written in Python and Pandas: 
 
Under the ALL STEPS pane, choose + Add step. From the ADD TRANSFORM list, choose Custom transform. Custom transformations give you fine-grained control over the transformation. Under the TRANSFORMS pane, select Python (Pandas) from the kernel dropdown list. The dataframe is available in this environment as the variable df. Copy and paste the following code into the code block. Choose Preview to check the output, and then Add.
# Table is available as variable 'df'
savings_map = {"unknown": 0, "little": 1, "moderate": 2, "high": 3, "very high": 4}
df["savings"] = df["savings"].map(savings_map).fillna(df["savings"])

Use the Encode categorical transformation to encode the remaining columns, housing, job, sex, and marital_status as follows: Under ALL STEPS, choose + Add Step. From the ADD TRANSFORM list, choose Encode categorical. Under the ENCODE CATEGORICAL pane, for Transform, leave the default Ordinal encode. For Input columns, select housing, job, sex, and marital_status. Leave the Output column blank so that the encoded values replace the categorical values. Choose Preview, then Add.

To normalize the distribution of the numerical column creditamount, apply a scaler: Under the ALL STEPS pane, choose + Add step. From the ADD TRANSFORM list, choose Process numeric. For Scaler, select the default option Standard scaler. For Input columns, select creditamount. Choose Preview, and then Add.
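Standard scaling subtracts the column mean and divides by the standard deviation, so the scaled column is centered at zero. A hedged pandas sketch, with hypothetical amounts:

```python
# A hedged sketch of the Standard scaler transform: subtract the mean
# and divide by the standard deviation. Sample amounts are hypothetical.
import pandas as pd

df = pd.DataFrame({"creditamount": [1000.0, 2500.0, 4000.0, 7000.0]})

col = df["creditamount"]
df["creditamount"] = (col - col.mean()) / col.std()
print(df["creditamount"].tolist())
```

After scaling, the column mean is (up to floating-point error) zero, which keeps large-magnitude features like credit amount from dominating distance-based or gradient-based models.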

To drop the original columns that you transformed: Under the ALL STEPS pane, choose + Add step. From the ADD TRANSFORM list, choose Manage columns. Under the MANAGE COLUMNS pane, for Transform, select Drop Column. For Columns to drop, select status_sex, existingchecking, employmentsince, risk, and vec. Choose Preview, and then Add.

 

 

Step 6: Run a data bias check

In this step, you check your data for bias using Amazon SageMaker Clarify, which provides greater visibility into your training data and models so you can identify and limit bias and explain predictions.

Choose Data flow in the upper left to return to the data flow diagram. Choose the + icon, Add analysis. In the Create analysis pane, for Analysis type, select Bias Report. For Analysis name, enter any name. For Select the column your model predicts (target), select target. Leave the Value checkbox selected. In the Predicted value(s) box, enter 1. For Select the column to analyze for bias, select sex. For Choose bias metrics, keep the default selections. Choose Check for bias.

 


After a few seconds, SageMaker Clarify generates a report that shows how the dataset scores on a number of bias-related metrics, including Class Imbalance (CI) and Difference in Positive Proportions in Labels (DPL). In this case, the data is slightly biased in terms of sex (CI of -0.38) and not very biased in terms of labels (DPL of 0.075). Based on this report, you might consider a bias remediation method, such as using SageMaker Data Wrangler’s built-in SMOTE transformation. For the purpose of this tutorial, skip the remediation step. Choose Save to save the bias report to the data flow.
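To make these two metrics concrete, the following sketch computes CI and DPL by hand with pandas on a tiny hypothetical sample; the group labels, which group counts as advantaged, and the values are all assumptions for illustration.

```python
# A hedged pandas sketch of two bias metrics from the report:
# Class Imbalance (CI) and Difference in Positive Proportions in
# Labels (DPL), for a hypothetical sex facet and binary target.
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female"],
    "target": [1, 0, 1, 0, 1],
})

n_a = (df["sex"] == "male").sum()    # size of one group (assumption)
n_d = (df["sex"] == "female").sum()  # size of the other group
ci = (n_a - n_d) / (n_a + n_d)       # ranges from -1 to 1; 0 = balanced

q_a = df.loc[df["sex"] == "male", "target"].mean()    # positive rate, group a
q_d = df.loc[df["sex"] == "female", "target"].mean()  # positive rate, group d
dpl = q_a - q_d                       # 0 = equal positive label proportions

print(round(ci, 3), round(dpl, 3))
```

Values near zero for both metrics indicate the groups are similarly sized and receive positive labels at similar rates.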

Step 7: Export your data flow

Export your data flow to a Jupyter notebook to run the steps as SageMaker Processing jobs. These steps process the data according to your defined data flow and store the outputs in Amazon S3 or Amazon SageMaker Feature Store.

From the data flow diagram, choose the + icon, Export to, Amazon S3 (via Jupyter Notebook). This creates a notebook in SageMaker Studio where you can run the generated SageMaker Processing jobs to create the transformed dataset. Run this notebook to store the results in the default S3 bucket.

 

Step 8: Clean up the resources

It is a best practice to delete resources that you are no longer using so that you don't incur unintended charges.

To delete the S3 bucket, do the following: 

  • Open the Amazon S3 console. On the navigation bar, choose Buckets, sagemaker-<your-Region>-<your-account-id>, and then select the checkbox next to data_wrangler_flows. Then, choose Delete.
  • On the Delete objects dialog box, verify that you have selected the proper object to delete and enter permanently delete into the Permanently delete objects confirmation box. 
  • Once this is complete and the bucket is empty, you can delete the sagemaker-<your-Region>-<your-account-id> bucket by following the same procedure again.

The Data Science kernel used to run the notebook in this tutorial accumulates charges until you either stop the kernel or perform the following steps to delete the apps. For more information, see Shut Down Resources in the Amazon SageMaker Developer Guide.

To delete the SageMaker Studio apps, do the following: On the SageMaker Studio console, choose studio-user, and then delete all the apps listed under Apps by choosing Delete app. Wait until the Status changes to Deleted.

If you used an existing SageMaker Studio domain in Step 1, skip the rest of Step 8 and proceed directly to the conclusion section. 

If you ran the CloudFormation template in Step 1 to create a new SageMaker Studio domain, continue with the following steps to delete the domain, user, and the resources created by the CloudFormation template.  

To open the CloudFormation console, enter CloudFormation into the AWS console search bar, and choose CloudFormation from the search results.

In the CloudFormation pane, choose Stacks. From the status dropdown list, select Active. Under Stack name, choose CFN-SM-IM-Lambda-catalog to open the stack details page.

On the CFN-SM-IM-Lambda-catalog stack details page, choose Delete to delete the stack along with the resources it created in Step 1.

Conclusion

Congratulations! You have completed the Prepare Training Data for Machine Learning with Minimal Code tutorial.

You have successfully used Amazon SageMaker Data Wrangler to prepare data for training a machine learning model. SageMaker Data Wrangler offers 300+ preconfigured data transformations, such as column type conversion, one-hot encoding, missing-value imputation with mean or median, column re-scaling, and date/time embeddings, so you can transform your data into formats that models can use effectively without writing a single line of code.


Train a deep learning model

Learn how to build, train, and tune a TensorFlow deep learning model.

Create an ML model automatically

Learn how to use AutoML to develop ML models without writing code.

Find more hands-on tutorials

Explore other machine learning tutorials to dive deeper.