Why SageMaker Data Wrangler?
Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface. You can use SQL to select the data that you want from various data sources and import it quickly. Next, you can use the data quality and insights report to automatically verify data quality and detect anomalies, such as duplicate rows and target leakage. SageMaker Data Wrangler contains over 300 built-in data transformations, so you can quickly transform data without writing any code.
Benefits of SageMaker Data Wrangler
Data preparation in minutes
Improve model accuracy
Access, select, and query data faster
With the SageMaker Data Wrangler data selection tool, you can quickly access and select your tabular and image data from various popular sources (such as Amazon Simple Storage Service [Amazon S3], Amazon Athena, Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks) and over 50 other third-party sources (such as Salesforce, SAP, Facebook Ads, and Google Analytics). You can also write queries for data sources using SQL and import data directly into SageMaker from various file formats, such as CSV, Parquet, and JSON, and database tables.
Generate data insights and understand data quality
SageMaker Data Wrangler provides a data quality and insights report that automatically verifies data quality (such as missing values, duplicate rows, and data types) and helps detect anomalies (such as outliers, class imbalance, and data leakage) in your data. Once you can effectively verify data quality, you can quickly apply domain knowledge to process datasets for ML model training.
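To make the checks named above concrete, here is a minimal pandas sketch of the same kinds of indicators (missing values, duplicate rows, data types, and class balance). It is an illustration only, not how the data quality and insights report is implemented; the `quality_summary` function, the sample columns, and the `churned` target are all hypothetical.

```python
import pandas as pd


def quality_summary(df: pd.DataFrame, target: str) -> dict:
    """Compute a few basic data quality indicators for a tabular dataset."""
    return {
        # Number of missing values in each column
        "missing_values": df.isna().sum().to_dict(),
        # Rows that exactly duplicate an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Inferred data type of each column
        "dtypes": df.dtypes.astype(str).to_dict(),
        # Share of each target class, useful for spotting class imbalance
        "class_balance": df[target].value_counts(normalize=True).to_dict(),
    }


# Hypothetical sample data: one missing age and one duplicated row
sample = pd.DataFrame(
    {
        "age": [34, 51, None, 34],
        "plan": ["basic", "pro", "basic", "basic"],
        "churned": [0, 1, 0, 0],
    }
)
report = quality_summary(sample, target="churned")
```

A summary like this only surfaces simple issues; checks such as outlier detection or target leakage need additional analysis.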
Understand your data with visualizations
SageMaker Data Wrangler helps you understand your data and identify potential errors and extreme values with a set of robust preconfigured visualization templates. Histograms, scatter plots, box and whisker plots, line plots, and bar charts are all built in. More advanced ML-specific visualizations and analyses (such as bias report, feature correlation, multicollinearity, target leakage, and time series) are also available to show feature importance and feature correlations. You can access these tools from the Analysis tab.
Transform data more efficiently
SageMaker Data Wrangler offers a selection of over 300 prebuilt, PySpark-based data transformations, so you can transform your data and scale your data preparation workflow without writing a single line of code. Preconfigured transformations cover common use cases such as flattening JSON files, deleting duplicate rows, imputing missing data with the mean or median, and one-hot encoding, and include time-series–specific transformers to accelerate the preparation of time-series data for ML. For your image data, SageMaker Data Wrangler offers common image augmentations (such as Blur, Enhance, and Resize) and cleaning operations (such as dropping corrupted images and duplicates). You can also author custom transformations in PySpark, SQL, and Pandas. For computer vision (CV) use cases, SageMaker Data Wrangler provides image libraries (imgaug and OpenCV) for creating custom transforms, along with a rich library of code snippets to streamline custom transformation authoring.
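As a rough illustration of what a custom Pandas transformation covering three of the use cases above might look like (deleting duplicate rows, imputing missing values with the median, and one-hot encoding), consider the sketch below. The `prepare` function and the `income` and `segment` columns are hypothetical, and the code runs as plain pandas rather than inside SageMaker Data Wrangler.

```python
import pandas as pd


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Apply three common preparation steps to a small tabular dataset."""
    # Delete exact duplicate rows
    out = df.drop_duplicates().reset_index(drop=True)
    # Impute missing numeric values with the column median
    out["income"] = out["income"].fillna(out["income"].median())
    # One-hot encode the categorical column as 0/1 integer columns
    out = pd.get_dummies(out, columns=["segment"], dtype=int)
    return out


# Hypothetical sample data: one missing income and one duplicated row
raw = pd.DataFrame(
    {
        "income": [40000.0, None, 60000.0, 40000.0],
        "segment": ["retail", "online", "retail", "retail"],
    }
)
clean = prepare(raw)
```

Duplicates are dropped before imputation so that repeated rows do not skew the median.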
Understand the predictive power of your data
The SageMaker Data Wrangler Quick Model feature provides an estimate of the expected predictive power of your data. Quick Model automatically splits your data into training and testing datasets and trains an XGBoost model with default hyperparameters on the training data. Based on the task that you are solving (for example, classification or regression), SageMaker Data Wrangler provides a model summary, feature summary, and confusion matrix, which help you quickly iterate on your data preparation flows.
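The general split, train, and evaluate pattern described above can be sketched as follows. This is not the Quick Model implementation: it uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, a synthetic dataset, and an assumed 80/20 split, none of which come from the feature itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a prepared binary classification dataset
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hold out 20% of the rows for testing (the ratio is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train a gradient-boosted tree model with default hyperparameters
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Confusion matrix on the held-out rows (rows: actual, columns: predicted)
cm = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])

# Per-feature importance scores, comparable in spirit to a feature summary
importances = model.feature_importances_
```

Re-running a quick check like this after each change to the preparation flow shows whether the change helped or hurt the held-out results.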
Automate and deploy ML data preparation workflows
With the SageMaker Data Wrangler UI, you can scale to large datasets without the need to author PySpark code, install Apache Spark, or spin up clusters. You can launch or schedule a job to quickly process your data or export it to a SageMaker Studio notebook. SageMaker Data Wrangler offers several export options, including SageMaker Data Wrangler jobs, SageMaker Feature Store, and SageMaker Pipelines, so you can integrate your data preparation flow into your ML workflow. Alternatively, you can deploy your data preparation workflow to a SageMaker hosted endpoint. Finally, you can export data directly to SageMaker Canvas to train ML models using a visual interface.
"At INVISTA, we are driven by transformation and look to develop products and technologies that benefit customers around the globe. We see ML as a way to improve the customer experience. But, with datasets that span hundreds of millions of rows, we needed a solution to help us prepare data, and develop, deploy, and manage ML models at scale. With Amazon SageMaker Data Wrangler, we can now interactively select, clean, explore, and understand our data effectively, empowering our data science team to create feature engineering pipelines that can scale effortlessly to datasets that span hundreds of millions of rows. With Amazon SageMaker Data Wrangler, we can operationalize our ML workflows faster."
Caleb Wilkinson, Former Lead Data Scientist, INVISTA
"Using ML, 3M is improving tried-and-tested products, like sandpaper, and driving innovation in several other spaces, including healthcare. As we plan to scale ML to more areas of 3M, we see the amount of data and models growing rapidly—doubling every year. We are enthusiastic about the new SageMaker features because they will help us scale. Amazon SageMaker Data Wrangler makes it much easier to prepare data for model training, and Amazon SageMaker Feature Store will eliminate the need to create the same model features over and over. Finally, Amazon SageMaker Pipelines will help us automate data prep, model building, and model deployment into an end-to-end workflow so we can speed time to market for our models. Our researchers are looking forward to taking advantage of the new speed of science at 3M."
David Frazee, Former Technical Director, 3M Corporate Systems Research Lab
"Amazon SageMaker Data Wrangler enables us to hit the ground running to address our data preparation needs with a rich collection of transformation tools that accelerate the process of ML data preparation needed to take new products to market. In turn, our clients benefit from the rate at which we scale deployed models, enabling us to deliver measurable, sustainable results that meet the needs of our clients in a matter of days rather than months."
Frank Farrall, Principal, AI Ecosystems and Platforms Leader, Deloitte
"As an AWS Premier Consulting Partner, our engineering teams are working very closely with AWS to build innovative solutions to help our customers continuously improve the efficiency of their operations. ML is the core of our innovative solutions, but our data preparation workflow involves sophisticated data preparation techniques which, as a result, take a significant amount of time to become operationalized in a production environment. With Amazon SageMaker Data Wrangler, our data scientists can complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, which helps us accelerate the data preparation process and easily prepare our data for ML. With Amazon SageMaker Data Wrangler, we can prepare data for ML faster."
Shigekazu Ohmoto, Senior Corporate Managing Director, NRI Japan
"As our footprint in the population health management market continues to expand into more health payors, providers, pharmacy benefit managers, and other healthcare organizations, we needed a solution to automate end-to-end processes for data sources that feed our ML models, including claims data, enrollment data, and pharmacy data. With Amazon SageMaker Data Wrangler, we can now accelerate the time it takes to aggregate and prepare data for ML using a set of workflows that are easier to validate and reuse. This has dramatically improved the delivery time and quality of our models, increased the effectiveness of our data scientists, and reduced data preparation time by nearly 50%. In addition, SageMaker Data Wrangler has helped us save multiple ML iterations and significant GPU time, speeding the entire end-to-end process for our clients as we can now build data marts with thousands of features including pharmacy, diagnosis codes, ER visits, inpatient stays, as well as demographic and other social determinants. With SageMaker Data Wrangler, we can transform our data with superior efficiency for building training datasets, generate data insights on datasets prior to running ML models, and prepare real-world data for inference/predictions at scale."
Lucas Merrow, CEO, Equilibrium Point IoT