Amazon SageMaker Data Wrangler
The fastest and easiest way to prepare data for machine learning
Select data, understand data insights, and transform data to prepare it for ML in minutes.
Quickly estimate ML model accuracy and diagnose issues before models are deployed into production.
Take data preparation to production faster without the need to author PySpark code, install Apache Spark, or spin up clusters.
Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface. You can use SQL to select the data you want from a wide variety of data sources and import it quickly. Next, you can use the Data Quality and Insights report to automatically verify data quality and detect anomalies, such as duplicate rows and target leakage. SageMaker Data Wrangler contains over 300 built-in data transformations so you can quickly transform data without writing any code. Once you have completed your data preparation workflow, you can scale it to your full datasets using SageMaker data processing jobs; train, tune, and deploy models using SageMaker Autopilot; or deploy your data preparation flow for inference, all from the SageMaker Data Wrangler UI.
Access, select, and query data faster
With the SageMaker Data Wrangler data selection tool, you can quickly access and select data from a wide variety of popular sources (such as Amazon Simple Storage Service [S3], Amazon Athena, Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks Delta Lake) and over 40 other third-party sources (such as Salesforce, SAP, Facebook Ads, and Google Analytics). You can also write queries for data sources using SQL and import data directly into SageMaker from various file formats, such as CSV, Parquet, ORC, and JSON, and database tables.
Generate data insights and understand data quality
SageMaker Data Wrangler provides a Data Quality and Insights report that automatically verifies data quality (such as missing values, duplicate rows, and data types) and helps detect anomalies (such as outliers, class imbalance, and data leakage) in your data. Once you can effectively verify data quality, you can quickly apply domain knowledge to process datasets for ML model training.
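To make two of the checks named above concrete, here is a minimal sketch in plain Python that counts missing values per column and fully duplicated rows. It is an illustration only, not how the Data Quality and Insights report is implemented; the function name `quality_summary` and the sample columns `age` and `plan` are hypothetical.

```python
from collections import Counter

def quality_summary(rows, columns):
    """Count missing values per column and fully duplicated rows (illustrative only)."""
    # Treat None and the empty string as missing.
    missing = {c: sum(1 for r in rows if r.get(c) in (None, "")) for c in columns}
    # A row counts as a duplicate when an identical tuple of values appeared earlier.
    seen = Counter(tuple(r.get(c) for c in columns) for r in rows)
    duplicates = sum(n - 1 for n in seen.values() if n > 1)
    return {"missing": missing, "duplicate_rows": duplicates}

# Hypothetical sample data.
rows = [
    {"age": 34, "plan": "basic"},
    {"age": None, "plan": "pro"},
    {"age": 34, "plan": "basic"},
    {"age": 51, "plan": ""},
]
summary = quality_summary(rows, ["age", "plan"])
print(summary)
```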
Understand your data with visualizations
SageMaker Data Wrangler helps you understand your data and identify potential errors and extreme values with a set of robust preconfigured visualization templates. Histograms, scatter plots, box and whisker plots, line plots, and bar charts are all available out of the box to apply to your data. More advanced ML-specific visualizations (such as bias report, feature correlation, multicollinearity, target leakage, and time series) show feature importance and feature correlations. You can access them by selecting the corresponding tools in the Analysis tab.
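As a rough illustration of the idea behind a feature correlation or target leakage check, the sketch below computes the Pearson correlation between each feature and the target and flags features that track the target almost perfectly. This is a simplified stand-in, not the metric SageMaker Data Wrangler uses; the feature names, sample values, and the 0.95 threshold are all assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)

# Hypothetical data: "refund_issued" mirrors the target, a typical leakage pattern.
target = [0, 0, 1, 1, 1, 0]
features = {
    "refund_issued": [0, 0, 1, 1, 1, 0],
    "account_age": [3, 8, 5, 1, 9, 4],
}
THRESHOLD = 0.95  # assumed cutoff for "suspiciously predictive"
flagged = [name for name, col in features.items()
           if abs(pearson(col, target)) > THRESHOLD]
print(flagged)
```

A feature flagged this way is worth inspecting before training, since it may encode information that is only available after the outcome is known.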
Transform data more efficiently
SageMaker Data Wrangler offers a selection of 300+ prebuilt, PySpark-based data transformations so you can transform your data and scale your data preparation workflow without writing a single line of code. Preconfigured transformations cover common use cases such as flattening JSON files, deleting duplicate rows, imputing missing data with the mean or median, one-hot encoding, and time-series–specific transformers to accelerate the preparation of time-series data for ML. You can also author custom transformations in PySpark, SQL, and Pandas. SageMaker Data Wrangler also offers a rich library of code snippets to make it easier to author these custom transformations.
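Two of the transformations mentioned above, median imputation and one-hot encoding, can be sketched in a few lines of plain Python. This shows what the operations do, not how the built-in PySpark-based transformations are written; the helper names, the `is_` column prefix, and the sample values are hypothetical.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]

def one_hot(values):
    """Expand a categorical column into one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Hypothetical sample columns.
ages = [34, None, 51, 40]
plans = ["basic", "pro", "basic", "team"]

imputed = impute_median(ages)
encoded = one_hot(plans)
print(imputed)
print(encoded[1])
```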
Understand the predictive power of your data
The SageMaker Data Wrangler Quick Model feature provides an estimate of the expected predictive power of your data. Quick Model automatically splits your data into training and testing datasets and trains an XGBoost model with default hyperparameters on the training data. Based on the task you are solving (for example, classification or regression), SageMaker Data Wrangler provides a model summary, feature summary, and confusion matrix, which help you quickly iterate on your data preparation flows.
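The split-then-evaluate pattern described above can be sketched as follows. To stay dependency-free, this sketch swaps in a majority-class baseline where Quick Model trains XGBoost, so it illustrates only the train/test split and the confusion matrix, not the model itself. The function names, the 25% test fraction, and the synthetic `visits` data are assumptions.

```python
import random
from collections import Counter

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle deterministically, then hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def confusion_matrix(actual, predicted):
    """Map each (actual, predicted) label pair to its count."""
    return Counter(zip(actual, predicted))

# Hypothetical labeled data: (features, label) pairs.
rows = [({"visits": v}, int(v > 5)) for v in range(12)]
train, test = train_test_split(rows)

# Stand-in model (NOT XGBoost): always predict the most common training label.
majority = Counter(label for _, label in train).most_common(1)[0][0]
actual = [label for _, label in test]
predicted = [majority] * len(test)

matrix = confusion_matrix(actual, predicted)
accuracy = sum(n for (a, p), n in matrix.items() if a == p) / len(test)
print(dict(matrix), accuracy)
```

A baseline like this gives a floor to compare against: if a real model barely beats it, the prepared features probably carry little signal.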
Automate and deploy ML data preparation workflows
With the SageMaker Data Wrangler UI, you can scale your data preparation to large datasets without the need to author PySpark code, install Apache Spark, or spin up clusters. You can launch or schedule a job to quickly process your data or export it to a SageMaker Studio notebook. SageMaker Data Wrangler offers several export options, including Amazon SageMaker Data Wrangler jobs, Amazon SageMaker Feature Store, Amazon SageMaker Autopilot, and Amazon SageMaker Pipelines, so you can integrate your data preparation flow into your ML workflow. Alternatively, you can deploy your data preparation workflow to a SageMaker hosted endpoint.
“At INVISTA, we are driven by transformation and look to develop products and technologies that benefit customers around the globe. We see machine learning as a way to improve the customer experience, but with datasets that span hundreds of millions of rows, we needed a solution to help us prepare data, and develop, deploy, and manage ML models at scale…With Amazon SageMaker Data Wrangler, we can now interactively select, clean, explore, and understand our data effectively, empowering our data science team to create feature engineering pipelines that can scale effortlessly to datasets that span hundreds of millions of rows… with Amazon SageMaker Data Wrangler, we can operationalize our ML workflows faster.”
Caleb Wilkinson, Lead Data Scientist - INVISTA
“Using ML, 3M is improving tried-and-tested products, like sandpaper, and driving innovation in several other spaces, including healthcare. As we plan to scale machine learning to more areas of 3M, we see the amount of data and models growing rapidly – doubling every year. We are enthusiastic about the new SageMaker features because they will help us scale. Amazon SageMaker Data Wrangler makes it much easier to prepare data for model training, and Amazon SageMaker Feature Store will eliminate the need to create the same model features over and over. Finally, Amazon SageMaker Pipelines will help us automate data prep, model building, and model deployment into an end to end workflow so we can speed time to market for our models. Our researchers are looking forward to taking advantage of the new speed of science at 3M.”
David Frazee, Technical Director - 3M Corporate Systems Research Lab
“Amazon SageMaker Data Wrangler enables us to hit the ground running to address our data preparation needs with a rich collection of transformation tools that accelerate the process of machine learning data preparation needed to take new products to market. In turn, our clients benefit from the rate at which we scale deployed models enabling us to deliver measurable, sustainable results that meet the needs of our clients in a matter of days rather than months.”
Frank Farrall, Principal, AI Ecosystems and Platforms Leader - Deloitte
“As an AWS Premier Consulting Partner, our engineering teams are working very closely with AWS to build innovative solutions to help our customers continuously improve the efficiency of their operations. Machine learning is the core of our innovative solutions, but our data preparation workflow involves sophisticated data preparation techniques which, as a result, take a significant amount of time to become operationalized in a production environment. With Amazon SageMaker Data Wrangler, our data scientists can complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, which helps us accelerate the data preparation process and easily prepare our data for machine learning. With Amazon SageMaker Data Wrangler, we can prepare data for machine learning faster.”
Shigekazu Ohmoto, Senior Managing Director - NRI Japan
“As our footprint in the population health management market continues to expand into more health payors, providers, pharmacy benefit managers, and other healthcare organizations, we needed a solution to automate end to end processes for data sources that feed our machine learning models, including claims data, enrollment data, and pharmacy data. With Amazon SageMaker Data Wrangler, we can now accelerate the time it takes to aggregate and prepare data for machine learning using a set of workflows that are easier to validate and reuse. This has dramatically improved the delivery time and quality of our models, increased the effectiveness of our data scientists, and reduced data preparation time by nearly 50%. In addition, SageMaker Data Wrangler has helped us save multiple machine learning iterations and significant GPU time, speeding the entire end to end process for our clients as we can now build data marts with thousands of features including pharmacy, diagnosis codes, ER visits, inpatient stays, as well as demographic and other social determinants. With SageMaker Data Wrangler, we can transform our data with superior efficiency for building training datasets, generate data insights on datasets prior to running machine learning models, and prepare real-world data for inference/predictions at scale.”
Lucas Merrow, CEO - Equilibrium Point IoT