Select data, explore data insights, and transform data to prepare it for ML in minutes
Quickly estimate ML model accuracy and diagnose issues before models are deployed into production
Move data preparation workflows to production with a single click and automate them
Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. Using SageMaker Data Wrangler’s data selection tool, you can choose the data you want from various data sources and import it with a single click. Once data is imported, you can use the data quality and insights report to automatically verify data quality and detect abnormalities, such as duplicate rows and target leakage. SageMaker Data Wrangler contains over 300 built-in data transformations so you can quickly normalize, transform, and combine features without having to write any code. With SageMaker Data Wrangler’s visualization templates, you can quickly preview the results and verify that these transformations were applied as you intended by viewing them in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. Once your data is prepared, you can build fully automated ML workflows with Amazon SageMaker Pipelines and save your features for reuse in Amazon SageMaker Feature Store.
How it works
Select and query data with just a few clicks
With SageMaker Data Wrangler’s data selection tool, you can quickly select data from multiple data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks Delta Lake. You can also write queries for data sources and import data directly into SageMaker from various formats, such as CSV, Parquet, ORC, and JSON files, as well as database tables.
Generate data insights and understand data quality
SageMaker Data Wrangler provides a quality and insights report that automatically verifies data quality and helps detect abnormalities in your data. Once you can effectively verify data quality, you can quickly apply domain knowledge to process datasets for ML model training.
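To make the two abnormalities named in the overview (duplicate rows and target leakage) concrete, here is a minimal plain-Python sketch of what such checks look for. It is not how the data quality and insights report is implemented; the function names, the sample rows, and the 0.99 threshold are assumptions chosen for illustration, and the leakage test is a deliberately crude heuristic.

```python
from collections import Counter

# Hypothetical sample: "refunded" perfectly predicts the target "churned".
SAMPLE_ROWS = [
    {"age": 34, "refunded": "yes", "churned": 1},
    {"age": 34, "refunded": "no", "churned": 0},
    {"age": 51, "refunded": "yes", "churned": 1},
    {"age": 51, "refunded": "no", "churned": 0},
    {"age": 34, "refunded": "yes", "churned": 1},  # duplicate of row 1
    {"age": 51, "refunded": "no", "churned": 0},   # duplicate of row 4
]


def find_duplicate_rows(rows):
    """Return each row that occurs more than once, mapped to its count."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def leakage_suspects(rows, target, threshold=0.99):
    """Flag features whose values almost perfectly determine the target.

    Crude heuristic (an assumption of this sketch): predict the majority
    target for each feature value and flag the feature if that rule is
    right on at least `threshold` of the rows.
    """
    suspects = []
    for col in (c for c in rows[0] if c != target):
        by_value = {}
        for row in rows:
            by_value.setdefault(row[col], Counter())[row[target]] += 1
        if len(by_value) == len(rows):
            continue  # all-unique columns (e.g. IDs) are trivially "pure"
        hits = sum(max(c.values()) for c in by_value.values())
        if hits / len(rows) >= threshold:
            suspects.append(col)
    return suspects


duplicates = find_duplicate_rows(SAMPLE_ROWS)
suspects = leakage_suspects(SAMPLE_ROWS, target="churned")
print(f"{len(duplicates)} duplicated rows; leakage suspects: {suspects}")
```

A flagged feature is only a candidate for review: domain knowledge decides whether it is genuinely unavailable at prediction time.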
Understand your data with visualizations
SageMaker Data Wrangler helps you understand your data and identify potential errors and extreme values with a set of robust pre-configured visualization templates. Histograms, scatter plots, box and whisker plots, line plots, and bar charts are all available. Templates such as the histogram make it simple to create and edit your own visualizations without writing code. You can also use Amazon SageMaker Clarify to detect potential bias during data preparation, after model training, and in your deployed ML model.
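As an illustration of what a box and whisker plot reveals about extreme values, the sketch below applies the common 1.5 x IQR whisker rule in plain Python. The function name, the multiplier default, and the sample values are assumptions for this example; the source does not state which rule the visualization templates use.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying outside the box-plot whiskers.

    Whiskers are placed k * IQR below the first quartile and above the
    third quartile; k = 1.5 is the conventional choice.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical column with one obviously extreme reading.
readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))
```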
Easily transform data
SageMaker Data Wrangler offers a selection of 300+ pre-configured data transformations so you can transform your data into formats that can be effectively used for models without writing a single line of code. Pre-configured transformations cover common use cases such as imputing missing data with the mean or median, one-hot encoding, and time-series-specific transformers that accelerate the preparation of time series data for ML. You can also author custom transformations in PySpark, SQL, and Pandas.
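For readers who want to see what two of these transformations do, the following Pandas sketch imputes missing values with the median and then one-hot encodes a categorical column. It is an illustrative equivalent written with standard pandas calls, not the built-in Data Wrangler implementation; the function name, column names, and sample data are hypothetical.

```python
import pandas as pd


def impute_and_encode(df, numeric_col, categorical_col):
    """Fill missing numeric values with the median, then one-hot encode."""
    out = df.copy()  # leave the caller's DataFrame untouched
    out[numeric_col] = out[numeric_col].fillna(out[numeric_col].median())
    # One indicator column per category, stored as 0/1 integers.
    return pd.get_dummies(out, columns=[categorical_col], dtype=int)


# Hypothetical input with one missing age.
customers = pd.DataFrame(
    {"age": [20.0, None, 40.0], "plan": ["basic", "pro", "basic"]}
)
prepared = impute_and_encode(customers, "age", "plan")
print(prepared)
```

Logic of this kind could also be adapted into a custom Pandas transformation when the pre-configured options do not fit a dataset.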
Diagnose and fix ML data preparation issues faster
SageMaker Data Wrangler enables you to quickly identify inconsistencies in your data preparation workflow and diagnose issues before models are deployed into production. You can quickly identify if your prepared data will result in an accurate model so you can determine if additional feature engineering is needed to improve performance.
Automate ML data preparation workflows
With one click, you can export your data preparation workflow to a SageMaker Data Wrangler job notebook, a SageMaker Autopilot training experiment, a SageMaker Pipelines notebook, or a code script, or you can create a SageMaker Data Wrangler processing job. SageMaker Data Wrangler provides a unified experience between data preparation and ML model training in SageMaker Autopilot. With just a few clicks, you can automatically build, train, and tune ML models. SageMaker Data Wrangler also seamlessly integrates your data preparation workflow with Amazon SageMaker Pipelines to automate model deployment and management. In addition, SageMaker Data Wrangler publishes features in Amazon SageMaker Feature Store so you can share features across your team and others can reuse them for their own models and analysis.
“At INVISTA, we are driven by transformation and look to develop products and technologies that benefit customers around the globe. We see machine learning as a way to improve the customer experience, but with datasets that span hundreds of millions of rows, we needed a solution to help us prepare data, and develop, deploy, and manage ML models at scale…With Amazon SageMaker Data Wrangler, we can now interactively select, clean, explore, and understand our data effectively, empowering our data science team to create feature engineering pipelines that can scale effortlessly to datasets that span hundreds of millions of rows… with Amazon SageMaker Data Wrangler, we can operationalize our ML workflows faster.”
Caleb Wilkinson, Lead Data Scientist - INVISTA
“Using ML, 3M is improving tried-and-tested products, like sandpaper, and driving innovation in several other spaces, including healthcare. As we plan to scale machine learning to more areas of 3M, we see the amount of data and models growing rapidly – doubling every year. We are enthusiastic about the new SageMaker features because they will help us scale. Amazon SageMaker Data Wrangler makes it much easier to prepare data for model training, and Amazon SageMaker Feature Store will eliminate the need to create the same model features over and over. Finally, Amazon SageMaker Pipelines will help us automate data prep, model building, and model deployment into an end to end workflow so we can speed time to market for our models. Our researchers are looking forward to taking advantage of the new speed of science at 3M.”
David Frazee, Technical Director - 3M Corporate Systems Research Lab
“Amazon SageMaker Data Wrangler enables us to hit the ground running to address our data preparation needs with a rich collection of transformation tools that accelerate the process of machine learning data preparation needed to take new products to market. In turn, our clients benefit from the rate at which we scale deployed models enabling us to deliver measurable, sustainable results that meet the needs of our clients in a matter of days rather than months.”
Frank Farrall, Principal, AI Ecosystems and Platforms Leader - Deloitte
“As an AWS Premier Consulting Partner, our engineering teams are working very closely with AWS to build innovative solutions to help our customers continuously improve the efficiency of their operations. Machine learning is the core of our innovative solutions, but our data preparation workflow involves sophisticated data preparation techniques which, as a result, take a significant amount of time to become operationalized in a production environment. With Amazon SageMaker Data Wrangler, our data scientists can complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, which helps us accelerate the data preparation process and easily prepare our data for machine learning. With Amazon SageMaker Data Wrangler, we can prepare data for machine learning faster.”
Shigekazu Ohmoto, Senior Managing Director - NRI Japan
“As our footprint in the population health management market continues to expand into more health payors, providers, pharmacy benefit managers, and other healthcare organizations, we needed a solution to automate end to end processes for data sources that feed our machine learning models, including claims data, enrollment data, and pharmacy data. With Amazon SageMaker Data Wrangler, we can now accelerate the time it takes to aggregate and prepare data for machine learning using a set of workflows that are easier to validate and reuse. This has dramatically improved the delivery time and quality of our models, increased the effectiveness of our data scientists, and reduced data preparation time by nearly 50%. In addition, SageMaker Data Wrangler has helped us save multiple machine learning iterations and significant GPU time, speeding the entire end to end process for our clients as we can now build data marts with thousands of features including pharmacy, diagnosis codes, ER visits, inpatient stays, as well as demographic and other social determinants. With SageMaker Data Wrangler, we can transform our data with superior efficiency for building training datasets, generate data insights on datasets prior to running machine learning models, and prepare real-world data for inference/predictions at scale.”
Lucas Merrow, CEO - Equilibrium Point IoT