Posted On: Apr 1, 2022

Amazon SageMaker Data Wrangler reduces the time that it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. You can import data from multiple data sources such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, and Snowflake. Starting today, you can use Databricks as a data source in SageMaker Data Wrangler to prepare data stored in Databricks for machine learning. Databricks, an AWS Partner, helps organizations prepare their data for analytics, empower data science and data-driven decisions across the organization, and rapidly adopt ML.

With Databricks as a data source for SageMaker Data Wrangler, you can connect to Databricks, interactively query data stored in Databricks using SQL, and preview data before importing it. Additionally, you can join your data in Databricks with data stored in Amazon S3 and with data queried through Amazon Athena, Amazon Redshift, and Snowflake to create the right dataset for your ML use case. Once you import the data, you can explore and analyze it with SageMaker Data Wrangler built-in visualizations to identify potential errors and extreme values. You can cleanse your data and engineer features with 300+ built-in data transformations, including ML-specific transformations such as one-hot encoding and balancing data, without writing a single line of code. You can also detect bias with Amazon SageMaker Clarify, find target leakage, and run “what if” analysis with a quick model to understand feature importance and surface other data quality issues that would affect the ML model, all before training and deploying models into production. Finally, you can export the processed data directly to Amazon SageMaker Feature Store or to Amazon S3 in a few clicks to train ML models with SageMaker Autopilot or SageMaker Training. You can also export your data preparation workflow to run on larger datasets as a SageMaker Processing job or as a step in Amazon SageMaker Pipelines.
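
SageMaker Data Wrangler applies these transformations from its visual interface, so no code is required. For readers unfamiliar with the two ML-specific transformations named above, the following Python sketch illustrates roughly what one-hot encoding and balancing data do to a dataset. It is only a conceptual illustration, not how Data Wrangler implements them: the function names, the sample rows, and the choice of random oversampling as the balancing method are all assumptions made for this example.

```python
import random
from collections import Counter


def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical column being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


def oversample_minority(rows, label, seed=0):
    """Balance classes by randomly duplicating rows of under-represented labels."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label], []).append(row)
    # Every class is grown to the size of the largest class.
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced


# Hypothetical sample data: a categorical feature and an imbalanced label.
rows = [
    {"plan": "basic", "churned": 0},
    {"plan": "pro", "churned": 0},
    {"plan": "basic", "churned": 0},
    {"plan": "pro", "churned": 1},
]

encoded = one_hot_encode(rows, "plan")
balanced = oversample_minority(encoded, "churned")
label_counts = Counter(row["churned"] for row in balanced)
```

After running the sketch, each row carries `plan_basic` and `plan_pro` indicator columns in place of `plan`, and `label_counts` shows the same number of rows for each value of `churned`.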

To learn more about Databricks integration with SageMaker Data Wrangler, view our blog or AWS documentation. To get started with SageMaker Data Wrangler, visit our AWS documentation and pricing page.