AWS Machine Learning Blog

Authoring custom transformations in Amazon SageMaker Data Wrangler using NLTK and SciPy

“Instead of focusing on the code, companies should focus on developing systematic engineering practices for improving data in ways that are reliable, efficient, and systematic. In other words, companies need to move from a model-centric approach to a data-centric approach.” – Andrew Ng

A data-centric AI approach involves building AI systems with quality data, which requires thorough data preparation and feature engineering. This can be a tedious task involving data collection, discovery, profiling, cleansing, structuring, transforming, enriching, validating, and securely storing the data.

Amazon SageMaker Data Wrangler is a capability of Amazon SageMaker, available in Amazon SageMaker Studio, that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data using little to no coding. You can integrate a Data Wrangler data preparation flow into your machine learning (ML) workflows to simplify data preprocessing and feature engineering, taking data preparation to production faster without the need to author PySpark code, install Apache Spark, or spin up clusters.

For scenarios where you need to add your own custom scripts for data transformations, you can write your transformation logic in Pandas, PySpark, or PySpark SQL using the Data Wrangler custom transform capability. Data Wrangler now also supports the NLTK and SciPy libraries for authoring custom transformations to prepare text data for ML and perform constraint optimization.

In this post, we discuss how you can write a custom transformation in NLTK to prepare text data for ML. We also share example custom transformations using other common frameworks such as NumPy, SciPy, and scikit-learn, as well as AWS AI services. For this exercise, we use the Titanic dataset, a popular dataset in the ML community, which has now been added as a sample dataset within Data Wrangler.

Solution overview

Data Wrangler provides over 40 built-in connectors for importing data. After data is imported, you can build your data analysis and transformations using over 300 built-in transformations. You can then generate industrialized pipelines to push the features to Amazon Simple Storage Service (Amazon S3) or Amazon SageMaker Feature Store. The following diagram shows the end-to-end high-level architecture.

Prerequisites

Data Wrangler is a SageMaker feature available within Amazon SageMaker Studio. You can follow the Studio onboarding process to spin up the Studio environment and notebooks. Although you can choose from a few authentication methods, the simplest way to create a Studio domain is to follow the Quick start instructions. The Quick start uses the same default settings as the standard Studio setup. You can also choose to onboard using AWS IAM Identity Center (successor to AWS Single Sign-On) for authentication (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

Import the Titanic dataset

Start your Studio environment and create a new Data Wrangler flow. You can either import your own dataset or use a sample dataset (Titanic) as shown in the following screenshot. Data Wrangler allows you to import datasets from different data sources. For our use case, we import the sample dataset from an S3 bucket.

Once imported, you will see two nodes (the source node and the data type node) in the data flow. Data Wrangler automatically identifies the data type for all the columns in the dataset.

Custom transformations with NLTK

For data preparation and feature engineering with Data Wrangler, you can use over 300 built-in transformations or build your own custom transformations. Custom transforms are written as separate steps in Data Wrangler and become part of the .flow file. The custom transform feature supports Python (Pandas), PySpark, and PySpark SQL as different steps in code snippets. After notebook files (.ipynb) are generated from the .flow file, or the .flow file is used as a recipe, the custom transform code snippets persist without requiring any changes. This design allows custom transforms to become part of a SageMaker Processing job for processing massive datasets with custom transformations.
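
To make this concrete, the following is a minimal sketch of what a PySpark custom transform step could look like (it is not part of the flow built in this post; it assumes the table is exposed to the step as the variable df, as in the Pandas steps shown later, and that the lowercase Titanic columns sibsp and parch exist):

# Table is available as variable `df` (a Spark DataFrame in a PySpark step)
from pyspark.sql import functions as F

# Hypothetical engineered feature: family size from siblings/spouses plus parents/children plus the passenger
df = df.withColumn("family_size", F.col("sibsp") + F.col("parch") + F.lit(1))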

The Titanic dataset has a couple of features (name and home.dest) that contain text information. We use NLTK to split the name column, extract the last name, and print the frequency of last names. NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength natural language processing (NLP) libraries.

To add a new transform, complete the following steps:

  1. Choose the plus sign and choose Add Transform.
  2. Choose Add Step and choose Custom Transform.

You can create a custom transform using Pandas, PySpark, Python user-defined functions, and PySpark SQL.

  1. Choose Python (Pandas) and add the following code to extract the last name from the name column:
    import nltk
    # Download the tokenizer models used by word_tokenize
    nltk.download('punkt')
    tokens = [nltk.word_tokenize(name) for name in df['name']]
    
    # Names are formatted "Last, Title First", so the first token is the last name
    df['last_name'] = [token[0] for token in tokens]
  2. Choose Preview to review the results.

The following screenshot shows the last_name column extracted.

  1. Add another custom transform step to identify the frequency distribution of the last names, using the following code:
    import nltk
    fd = nltk.FreqDist(df["last_name"])
    print(fd.most_common(10))
  2. Choose Preview to review the results of the frequency.
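
The frequency distribution above is printed for inspection only. To persist it as a feature, a minimal sketch (the last_name_count column name is hypothetical) could map each passenger's last name to its frequency:

import nltk

fd = nltk.FreqDist(df["last_name"])
# Add the frequency of each passenger's last name as a new feature column
df["last_name_count"] = df["last_name"].map(lambda name: fd[name])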

Custom transformations with AWS AI services

AWS pre-trained AI services provide ready-made intelligence for your applications and workflows. AWS AI services easily integrate with your applications to address many common use cases. You can now use the capabilities of AWS AI services as a custom transform step in Data Wrangler.

Amazon Comprehend uses NLP to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document.

We use Amazon Comprehend to extract the entities from the name column. Complete the following steps:

  1. Add a custom transform step.
  2. Choose Python (Pandas).
  3. Enter the following code to extract the entities:
    import boto3
    comprehend = boto3.client("comprehend")
    
    response = comprehend.detect_entities(LanguageCode = 'en', Text = df['name'].iloc[0])
    
    for entity in response['Entities']:
        print(entity['Type'] + ":" + entity["Text"])
  4. Choose Preview and visualize the results.
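
The preceding snippet calls Amazon Comprehend on only the first row (iloc[0]). A minimal sketch for applying it across the whole column follows (the entity_types column name and helper function are hypothetical; note that each row results in a separate API call, which has cost and throughput implications):

import boto3

comprehend = boto3.client("comprehend")

def entity_types(text):
    # Call Amazon Comprehend for a single passenger name and return the detected entity types
    response = comprehend.detect_entities(LanguageCode='en', Text=text)
    return ", ".join(entity['Type'] for entity in response['Entities'])

# Hypothetical new column with the entity types found in each passenger name
df['entity_types'] = df['name'].apply(entity_types)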

We have now added three custom transforms in Data Wrangler.

  1. Choose Data Flow to visualize the end-to-end data flow.

Custom transformations with NumPy and SciPy

NumPy is an open-source library for Python offering comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more. SciPy is an open-source Python library used for scientific computing and technical computing, containing modules for optimization, linear algebra, integration, interpolation, special functions, fast Fourier transform (FFT), signal and image processing, solvers, and more.

Data Wrangler custom transforms allow you to combine Python, PySpark, and SQL as different steps. In the following Data Wrangler flow, different functions from Python packages, NumPy, and SciPy are applied on the Titanic dataset as multiple steps.

NumPy transformations

The fare column of the Titanic dataset contains the boarding fares of different passengers. The histogram of the fare column shows a uniform distribution, except for the last bin. By applying NumPy transformations like log or square root, we can change the shape of the distribution (as shown by the square root transformation).

See the following code:

import pandas as pd
import numpy as np

# fare_interpolate is assumed to be the fare column after an earlier missing-value (interpolation) transform step
df["fare_log"] = np.log(df["fare_interpolate"])
df["fare_sqrt"] = np.sqrt(df["fare_interpolate"])
df["fare_cbrt"] = np.cbrt(df["fare_interpolate"])

SciPy transformations

SciPy functions like z-score are applied as part of the custom transform to standardize the fare distribution using its mean and standard deviation.

See the following code:

df["fare_zscore"] = zscore(df["fare_interpolate"])
from scipy.stats import zscore

Constraint optimization with NumPy and SciPy

Data Wrangler custom transforms can handle advanced transformations like constraint optimization applying SciPy optimize functions and combining SciPy with NumPy. In the following example, fare as a function of age doesn’t show any observable trend. However, constraint optimization can transform fare as a function of age. The constraint condition in this case is that the new total fare remains the same as the old total fare. Data Wrangler custom transforms allow you to run the SciPy optimize function to determine the optimal coefficient that can transform fare as a function of age under constraint conditions.

The optimization routine, the objective, and the constraints can each be defined as separate functions when formulating constraint optimization in a Data Wrangler custom transform using SciPy and NumPy. Custom transforms can also use the different solver methods that are available as part of the SciPy optimize package. A new transformed variable is then generated by multiplying the optimal coefficient with the original column and adding it to the existing columns in Data Wrangler. See the following code:

import numpy as np
import scipy.optimize as opt
import pandas as pd

# fare_interpolate and age_interpolate are assumed to come from earlier missing-value handling steps
df2 = pd.DataFrame({"Y": df["fare_interpolate"], "X1": df["age_interpolate"]})

# optimization definition
def main(df2):
    x0 = [0.1]
    res = opt.minimize(fun=obj, x0=x0, args=(df2,), method="SLSQP", bounds=[(0, 50)], constraints=cons)
    return res

# objective function
def obj(x0, df2):
    sumSquares = np.sum(df2["Y"] - x0 * df2["X1"])
    return sumSquares

# constraints
def constraint1(x0):
    sum_cons1 = np.sum(df2["Y"] - x0 * df2["X1"]) - 0
    return sum_cons1

con1 = {'type': 'eq', 'fun': constraint1}
cons = [con1]

print(main(df2))

# Multiply the optimal coefficient (res.x[0]) by the age column to get the transformed fare
df["new_fare_age_optimized"] = main(df2).x[0] * df2["X1"]

The Data Wrangler custom transform UI can display the results of SciPy optimize functions, such as the value of the optimal coefficient (or coefficients).

Custom transformations with scikit-learn

scikit-learn is a Python module for machine learning built on top of SciPy. It’s an open-source ML library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes. One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, preprocessing with a discretizer can introduce nonlinearity to linear models.

In the following code, we use KBinsDiscretizer to discretize the age column into 10 bins:

# Table is available as variable `df`
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
# discretization transform the raw data
df = df.dropna()
kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
ages = np.array(df["age"]).reshape(-1, 1)
df["age"] = kbins.fit_transform(ages)
print(kbins.bin_edges_)

You can see the bin edges printed in the following screenshot.
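
If you prefer one-hot encoded bins rather than ordinal codes, as mentioned at the start of this section, the following is a minimal sketch of an alternative custom transform step (the age_bin_* column names are hypothetical):

# Table is available as variable `df`
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

# Alternative to the ordinal encoding above: one-hot encode the age bins
df = df.dropna()
kbins = KBinsDiscretizer(n_bins=10, encode='onehot-dense', strategy='uniform')
onehot = kbins.fit_transform(np.array(df["age"]).reshape(-1, 1))
for i in range(onehot.shape[1]):
    df[f"age_bin_{i}"] = onehot[:, i]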

One-hot encoding

Values in the embarked column are categorical. Therefore, we have to represent these strings as numerical values in order to perform classification with our model. One option is a one-hot encoding transform; in the following example, we first use scikit-learn's LabelEncoder to map each category to an integer.

There are three values for embarked: S, C, and Q. We represent these with numbers. See the following code:

# Table is available as variable `df`
from sklearn.preprocessing import LabelEncoder

le_embarked = LabelEncoder()
le_embarked.fit(df["embarked"])

encoded_embarked_training = le_embarked.transform(df["embarked"])
df["embarked"] = encoded_embarked_training

Clean up

When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees.

Data Wrangler automatically saves your data flow every 60 seconds. To avoid losing work, save your data flow before shutting Data Wrangler down.

  1. To save your data flow in Studio, choose File, then choose Save Data Wrangler Flow.
  2. To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
  3. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  4. Choose Shut down all to confirm.

Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.

Conclusion

In this post, we demonstrated how you can use custom transformations in Data Wrangler. We used the libraries and frameworks within the Data Wrangler container to extend the built-in data transformation capabilities. The examples in this post represent only a subset of the frameworks you can use. The transformations in the Data Wrangler flow can now be scaled into a pipeline for DataOps.

To learn more about using data flows with Data Wrangler, refer to Create and Use a Data Wrangler Flow and Amazon SageMaker Pricing. To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler. To learn more about Autopilot and AutoML on SageMaker, visit Automate model development with Amazon SageMaker Autopilot.


About the authors

Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience in end-to-end designs and solutions for machine learning; business analytics within financial, operational, and marketing analytics; healthcare; supply chain; and IoT. Outside work, Sovik enjoys traveling and watching movies.

Abigail is a Software Development Engineer at Amazon SageMaker. She is passionate about helping customers prepare their data in Data Wrangler and building distributed machine learning systems. In her free time, Abigail enjoys traveling, hiking, skiing, and baking.