Posted On: Sep 30, 2022
Amazon SageMaker Data Wrangler reduces the time to aggregate and prepare data for machine learning (ML) from weeks to minutes. Amazon SageMaker Autopilot automatically builds, trains, and tunes the best ML models based on your data, while allowing you to maintain full control and visibility. Data Wrangler provides a unified data preparation and model training experience with Amazon SageMaker Autopilot in just a few clicks. This integration is now enhanced to include and reuse Data Wrangler feature transforms, such as missing-value imputers and ordinal/one-hot encoders, alongside the Autopilot models for ML inference. When you prepare data in Data Wrangler and train a model by invoking Autopilot, you can now deploy the trained model along with all the Data Wrangler feature transforms as a SageMaker Serial Inference Pipeline. The pipeline automatically preprocesses raw data at inference time by reusing the Data Wrangler feature transforms. This feature is currently supported only for Data Wrangler flows that do not use join, group-by, concatenate, or time-series transformations.
Before this launch, Autopilot models trained on data prepared in Data Wrangler required that any data presented for inference first be preprocessed in SageMaker Data Wrangler, whether predictions were made in real-time or batch mode. Starting today, after preparing data with Data Wrangler and training a model in SageMaker Autopilot, you can either make batch predictions that include the Data Wrangler transforms or deploy the trained model along with those transforms behind a SageMaker endpoint. Because the transforms are included automatically, manual data preprocessing is no longer needed for either real-time or batch inference.
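Under the hood, a Serial Inference Pipeline is a SageMaker model whose `Containers` list is executed in sequence, so raw input flows through the Data Wrangler transform container before reaching the Autopilot model container. The sketch below builds such a `CreateModel` request as a plain dictionary; all names, ARNs, S3 URLs, and image URIs are hypothetical placeholders, and an actual deployment would pass the request to `boto3.client("sagemaker").create_model(...)`.

```python
# Sketch of a SageMaker Serial Inference Pipeline: the Data Wrangler
# preprocessing container is chained ahead of the Autopilot model
# container. All identifiers below are hypothetical placeholders.

def build_pipeline_model_request(model_name, role_arn,
                                 dw_image_uri, dw_model_data_url,
                                 autopilot_image_uri, autopilot_model_data_url):
    """Build a create_model request whose Containers run in order:
    raw input -> Data Wrangler transforms -> Autopilot model."""
    return {
        "ModelName": model_name,
        "ExecutionRoleArn": role_arn,
        "Containers": [
            # First container reapplies the Data Wrangler feature transforms.
            {"Image": dw_image_uri, "ModelDataUrl": dw_model_data_url},
            # Second container runs the Autopilot-trained model on the
            # transformed features.
            {"Image": autopilot_image_uri,
             "ModelDataUrl": autopilot_model_data_url},
        ],
    }

request = build_pipeline_model_request(
    model_name="dw-autopilot-pipeline",                        # hypothetical
    role_arn="arn:aws:iam::111122223333:role/SageMakerRole",   # hypothetical
    dw_image_uri="<data-wrangler-container-image>",            # placeholder
    dw_model_data_url="s3://my-bucket/dw-transforms/model.tar.gz",
    autopilot_image_uri="<autopilot-container-image>",         # placeholder
    autopilot_model_data_url="s3://my-bucket/autopilot/model.tar.gz",
)
```

At inference time, a request sent to an endpoint (or batch transform job) backed by this model is handled by each container in turn, which is what removes the manual preprocessing step.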
This new experience is now available in all regions where both SageMaker Data Wrangler and SageMaker Autopilot are available. To get started, see Automatically Train Models on Your Data Flow or review the blog post.