Posted On: Nov 30, 2022

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes. With Data Wrangler, you can simplify data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, visualization, cleansing, and preparation, from a low-code visual interface. Many ML practitioners want to explore datasets directly in notebooks to spot potential data-quality issues, such as missing information, extreme values, skewed datasets, or biases, so they can correct those issues and prepare data for training ML models faster. ML practitioners can spend weeks writing boilerplate code to visualize and examine different parts of their dataset to identify and fix potential issues.

Starting today, Data Wrangler offers a built-in data preparation capability in Amazon SageMaker Studio notebooks that allows ML practitioners to visually review data characteristics, identify issues, and remediate data-quality problems in just a few clicks, directly within the notebooks. When users display a data frame (a tabular representation of data) in their notebooks, SageMaker Studio notebooks automatically generate charts to help users understand their data distribution patterns, identify potential issues such as incorrect data, missing data, or outliers, and suggest data transformations to fix these issues. The new capability also enables users to identify data-quality issues in the target column that affect ML model performance, such as imbalanced data or mixed data types, and suggests data transformations to fix them. Once the ML practitioner selects a data transformation, SageMaker Studio notebooks generate the corresponding code within the notebook so the data transformation can be reapplied every time the notebook is run.
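For a sense of the boilerplate this capability is meant to replace, the following is a minimal hand-written sketch of the kinds of checks described above (missing data, outliers, target imbalance) and of one repeatable transformation. It uses plain pandas and is not the Data Wrangler API or the code it generates; the function names `summarize_quality` and `impute_median`, the 1.5 × IQR outlier rule, and the median imputation are all illustrative assumptions.

```python
import pandas as pd


def summarize_quality(df: pd.DataFrame, target: str) -> dict:
    """Collect a few basic data-quality signals (hypothetical helper)."""
    # Numeric feature columns, excluding the target column.
    numeric = df.select_dtypes(include="number").drop(columns=[target], errors="ignore")

    # Fraction of missing values per column.
    missing = df.isna().mean().to_dict()

    # Count values lying more than 1.5 * IQR outside the quartiles
    # (an assumed rule of thumb, not necessarily what Data Wrangler uses).
    outliers = {}
    for col in numeric.columns:
        q1, q3 = numeric[col].quantile([0.25, 0.75])
        spread = 1.5 * (q3 - q1)
        mask = (numeric[col] < q1 - spread) | (numeric[col] > q3 + spread)
        outliers[col] = int(mask.sum())

    # Share of each class in the target column, to reveal imbalance.
    class_share = df[target].value_counts(normalize=True).to_dict()

    return {"missing": missing, "outliers": outliers, "class_share": class_share}


def impute_median(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df with missing values in `column` set to its median."""
    out = df.copy()
    out[column] = out[column].fillna(out[column].median())
    return out
```

Calling `summarize_quality(df, "label")` on a data frame with a `label` target returns the three summaries as a dictionary, and `impute_median(df, "age")` is the sort of transformation step that, once written into a notebook cell, is reapplied on every run.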

This feature is generally available at no additional charge in all AWS Regions where SageMaker Studio notebooks are currently supported.