Posted On: Oct 14, 2021

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface.

Starting today, you can query data on Amazon Athena using workgroups, join datasets on multiple keys, visualize feature correlation and duplicate rows, and provide customer managed keys when exporting your data flows, all of which make it easier and faster to prepare data for ML. These features are described in detail below:

  • Support for Athena workgroups. Amazon Athena workgroups are a resource type that separates query execution and query history between users, teams, or applications running under the same AWS account. You can now query data with Athena from SageMaker Data Wrangler using the workgroup of your choice.
  • Two new visualizations to help with data preparation:
    • With SageMaker Data Wrangler’s feature correlation visualization, you can calculate the correlation between features in your dataset and view the results as a correlation matrix.
    • With the new duplicate row detection visualization, you can quickly check whether your dataset contains any duplicate rows.
  • Multi-key joins. You can now specify multiple columns as join keys when joining two datasets in SageMaker Data Wrangler. You can also delete intermediate steps inside SageMaker Data Wrangler flows.
  • Support for customer managed keys (CMKs) using AWS Key Management Service (AWS KMS). You can now specify a KMS key both when using the “Export to S3” feature and in the notebooks exported from SageMaker Data Wrangler.
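
For readers who want a concrete picture of the correlation matrix, duplicate row detection, and multi-key join described above, here is a minimal plain-Python sketch of the underlying ideas. It is an illustration only and not how SageMaker Data Wrangler implements these features. The helper functions, the sample `orders` and `customers` records, and the choice of Pearson correlation are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(rows, columns):
    """Pairwise Pearson correlation for the given numeric columns."""
    data = {c: [row[c] for row in rows] for c in columns}
    return {a: {b: pearson(data[a], data[b]) for b in columns} for a in columns}


def duplicate_rows(rows):
    """Map each row that appears more than once to its occurrence count."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def multi_key_join(left, right, keys):
    """Inner join two lists of dicts on several key columns at once."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        # A row matches only if every key column agrees.
        for match in index.get(tuple(row[k] for k in keys), []):
            joined.append({**match, **row})
    return joined


# Hypothetical sample data, invented for this sketch.
orders = [
    {"customer_id": 1, "region": "us", "amount": 10.0, "items": 1},
    {"customer_id": 1, "region": "eu", "amount": 20.0, "items": 2},
    {"customer_id": 2, "region": "us", "amount": 30.0, "items": 3},
    {"customer_id": 2, "region": "us", "amount": 30.0, "items": 3},  # exact duplicate
]
customers = [
    {"customer_id": 1, "region": "us", "segment": "retail"},
    {"customer_id": 2, "region": "us", "segment": "wholesale"},
]

matrix = correlation_matrix(orders, ["amount", "items"])
dupes = duplicate_rows(orders)
joined = multi_key_join(orders, customers, ["customer_id", "region"])
```

Joining on both `customer_id` and `region` drops the order placed in the `eu` region, because no customer record matches on both columns. A single-key join on `customer_id` alone would have kept it.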

To get started with the new capabilities of Amazon SageMaker Data Wrangler, open Amazon SageMaker Studio after upgrading to the latest release and choose File > New > Flow from the menu, or choose “new data flow” from the SageMaker Studio launcher. To learn more about the new features, view the documentation.