Posted On: Apr 27, 2022
Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. With SageMaker Data Wrangler’s data selection tool, you can quickly select data from multiple data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, AWS Lake Formation, Amazon SageMaker Feature Store, Databricks Delta Lake, and Snowflake.
Today we are announcing the general availability of a Data Quality and Insights Report feature within Data Wrangler. Previously, to get insights into data and data quality for ML, data scientists had to write a significant amount of code to import, process, and analyze the data and then export the results, a time-consuming and laborious process. With the new report, data scientists can get these insights in a few clicks. The report automatically verifies data quality and detects abnormalities in your data. Data scientists and data engineers can use it to apply domain knowledge quickly and efficiently when processing datasets for ML model training.
The report includes the following sections:
- Summary statistics. This section provides insights into the number of rows, features, % missing, % valid, duplicate rows, and a breakdown of the type of feature (e.g. numeric vs. text).
- Data Quality Warnings. This section provides warnings that point to abnormalities in the data and includes items such as: presence of small minority class, high target cardinality, rare target label, imbalanced class distribution, skewed target, heavy tailed target, outliers in target, regression frequent label, invalid values and more.
- Target Column Insights. This section provides statistics on the target column including % valid, % missing, % outliers, univariate statistics such as min/median/max, and also presents examples of observations with outlier or invalid target values.
- Quick Model. The data insights report automatically trains a model on your data to provide a directional check on feature engineering progress and provides associated model statistics in the report.
- Feature Importance. This section ranks features by importance, which is calculated automatically when the data quality and insights report is prepared.
- Anomalous and duplicate rows. The data quality and insights report detects anomalous samples using the Isolation Forest algorithm and also surfaces duplicate rows that may be present in the dataset.
- Feature details. This section provides summary statistics for each feature in the data set as well as the corresponding distribution of the target variable.
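To make a few of these checks concrete, the following is a minimal Python sketch of the kind of statistics the report surfaces: row count, % missing per feature, duplicate rows, and the share of the rarest target label. It is an illustration only and not Data Wrangler's implementation; the `summarize` function, the sample records, and the `churned` target column are hypothetical names chosen for this example.

```python
from collections import Counter


def summarize(rows, target):
    """Compute a few report-style statistics for a list of dict records.

    Hypothetical helper for illustration; not part of any AWS API.
    Values are assumed to be hashable (numbers, strings, None).
    """
    if not rows:
        raise ValueError("rows must not be empty")

    n_rows = len(rows)
    columns = sorted({key for row in rows for key in row})

    # Percentage of missing (None or absent) values per column.
    pct_missing = {
        col: 100.0 * sum(1 for row in rows if row.get(col) is None) / n_rows
        for col in columns
    }

    # Count rows that exactly repeat an earlier row.
    seen = Counter(tuple(row.get(col) for col in columns) for row in rows)
    duplicate_rows = sum(count - 1 for count in seen.values())

    # Share of the rarest label among rows with a non-missing target.
    labels = Counter(
        row[target] for row in rows if row.get(target) is not None
    )
    minority_share = (
        min(labels.values()) / sum(labels.values()) if labels else None
    )

    return {
        "n_rows": n_rows,
        "n_features": len(columns),
        "pct_missing": pct_missing,
        "duplicate_rows": duplicate_rows,
        "minority_share": minority_share,
    }


# Made-up sample data for the example.
rows = [
    {"age": 34, "plan": "basic", "churned": "no"},
    {"age": None, "plan": "pro", "churned": "no"},
    {"age": 34, "plan": "basic", "churned": "no"},
    {"age": 51, "plan": "pro", "churned": "yes"},
]
report = summarize(rows, target="churned")
print(report)
```

A low `minority_share` corresponds loosely to the "small minority class" and "imbalanced class distribution" warnings listed above. Anomaly detection with Isolation Forest is omitted here because it requires a third-party library.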
To learn more about how to create the data quality and insights report and how to use it as part of your data preparation workflow, read the blog.
To get started with the new capabilities of Amazon SageMaker Data Wrangler, open Amazon SageMaker Studio after upgrading to the latest release and click File > New > Flow from the top menu or “New data flow” from the SageMaker Studio Launcher. To learn more about the new features, view the documentation.