Prepare JSON and ORC data, balance and encode data sets, and launch data processing jobs in one click with Amazon SageMaker Data Wrangler

Posted On: Feb 2, 2022

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. With SageMaker Data Wrangler’s data selection tool, you can quickly select data from multiple data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, AWS Lake Formation, Amazon SageMaker Feature Store, and Snowflake.

Today we are announcing the general availability of support for the JSON, JSONL, and ORC file formats in Data Wrangler. You can now browse, preview, and import data in these file formats using Data Wrangler. The ORC file format provides a highly efficient way to store Hive data; however, because ORC is a binary format, it can be difficult to preview this data using a text editor. With support for the ORC file format in Data Wrangler, you can now browse data in an ORC file just as you would a CSV file. To learn more about importing ORC files and preparing JSON data with Data Wrangler, read the blog.
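
For readers less familiar with how JSON and JSONL differ, the following is a minimal sketch using only the Python standard library. It is not Data Wrangler code; the helper name read_json_lines and the sample records are hypothetical and chosen only for illustration.

```python
import io
import json


def read_json_lines(stream):
    """Parse a JSON Lines stream: one JSON document per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]


# A JSON file holds a single document (here, an array of records) ...
json_text = '[{"id": 1, "state": "WA"}, {"id": 2, "state": "TX"}]'
# ... while a JSONL file holds one complete record per line.
jsonl_text = '{"id": 1, "state": "WA"}\n{"id": 2, "state": "TX"}\n'

from_json = json.loads(json_text)
from_jsonl = read_json_lines(io.StringIO(jsonl_text))

# Both layouts describe the same two records.
print(from_json == from_jsonl)
```

ORC, by contrast, is a binary columnar format, so reading it outside Data Wrangler typically requires a library such as pyarrow rather than a line-by-line parser.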

Additionally, we are announcing the general availability of several new transforms: transforms to handle class imbalance in your data sets, transforms to process columns with arrays and JSON-formatted strings, and a similarity encoding transform to efficiently encode categorical data with high cardinality. These transforms add to Data Wrangler’s collection of over 300 transforms, which includes many transforms for processing time series data. The new transforms are described below:

Balance Data. Data sets are frequently imbalanced, favoring one target class over another. The new balance transform can help you oversample a sparse minority class depending on your requirements. Additionally, you can now generate new samples of the minority class using the synthetic minority oversampling technique (SMOTE), which is now generally available in Data Wrangler. SMOTE automatically generates new observations of your minority class from groups of similar rows in your data set. To learn more about how to handle imbalanced data sets with Data Wrangler, read the blog.
Handle Structured Columns. For columns that contain arrays, a new explode array transform generates a new row for each value in the array. For JSON-formatted strings, a new flatten structured column transform creates new columns for each key-value pair in the JSON-formatted string. To learn more about handling structured columns with Data Wrangler, read the blog.
Encode Categorical Variables. Using a new similarity encoding transform, you can now efficiently encode categorical variables with high cardinality. Data scientists frequently apply one-hot encoding to their categorical variables, which converts each categorical value into a separate column. For example, one-hot encoding can turn a single column of US states into 50 new binary-valued variables (one for each state). With similarity encoding now available in Data Wrangler, you can encode a categorical variable into a much smaller number of columns while retaining or possibly increasing model performance.
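
To make the transforms above more concrete, here is a minimal, standard-library Python sketch of the ideas behind them. It is not Data Wrangler's implementation, and every function name and sample row is hypothetical. Two assumptions are worth stating plainly: random oversampling stands in for the balance transform (it duplicates rows and is not SMOTE, which synthesizes new ones), and character n-gram Jaccard similarity against a short list of prototype values stands in for similarity encoding, whose exact algorithm in Data Wrangler is not described here.

```python
import json
import random


def random_oversample(rows, label_key, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one.

    Assumption: plain random oversampling, not SMOTE.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


def explode_array(rows, column):
    """Emit one output row per element of the array stored in `column`.

    In this sketch, rows whose array is empty produce no output rows.
    """
    return [{**row, column: item} for row in rows for item in row[column]]


def flatten_json(rows, column):
    """Replace a JSON-formatted string column with one column per top-level key."""
    flattened = []
    for row in rows:
        rest = {k: v for k, v in row.items() if k != column}
        parsed = json.loads(row[column])
        flattened.append({**rest, **{f"{column}_{k}": v for k, v in parsed.items()}})
    return flattened


def ngrams(text, n=3):
    """Return the set of character n-grams of a space-padded, lowercased string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity_encode(value, prototypes, n=3):
    """Encode a string as its n-gram Jaccard similarity to each prototype.

    Assumption: an illustrative stand-in for similarity encoding. The output
    has one column per prototype rather than one column per distinct value.
    """
    grams = ngrams(value, n)
    encoded = []
    for prototype in prototypes:
        proto_grams = ngrams(prototype, n)
        union = grams | proto_grams
        encoded.append(len(grams & proto_grams) / len(union) if union else 0.0)
    return encoded


rows = [
    {"id": 1, "label": "fraud", "tags": ["a", "b"], "meta": '{"city": "Austin", "plan": "pro"}'},
    {"id": 2, "label": "ok", "tags": ["c"], "meta": '{"city": "Boston", "plan": "free"}'},
    {"id": 3, "label": "ok", "tags": [], "meta": '{"city": "Denver", "plan": "free"}'},
    {"id": 4, "label": "ok", "tags": ["d"], "meta": '{"city": "Austin", "plan": "pro"}'},
]

print(len(random_oversample(rows, "label")))   # rows after balancing the classes
print(explode_array(rows, "tags"))             # one row per tag
print(flatten_json(rows, "meta")[0])           # meta_city and meta_plan columns
print(similarity_encode("Californa", ["California", "Texas"]))
```

In the last call, the misspelled "Californa" scores much closer to "California" than to "Texas", which is the property that lets a handful of similarity columns replace one binary column per distinct value.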

Finally, we are announcing the general availability of a one-click “Create job” experience to launch data processing jobs. Starting today, you can click the “Create job” button to start a data processing job using the steps specified in your Data Wrangler flow. You can still use the data processing notebooks in Data Wrangler to launch data processing jobs and integrate Data Wrangler into your MLOps pipelines. To learn more about how to launch a data processing job with Data Wrangler, read the blog.

To get started with the new capabilities of Amazon SageMaker Data Wrangler, upgrade to the latest release, open Amazon SageMaker Studio, and click File > New > Flow from the menu or “new data flow” from the SageMaker Studio launcher. To learn more about the new features, view the documentation.
