Posted On: Jun 9, 2022
Today we are announcing the general availability of splitting data into train and test sets with Amazon SageMaker Data Wrangler. Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. With SageMaker Data Wrangler’s data selection tool, you can quickly select data from multiple data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks Delta Lake.
Starting today, you can split your data into train and test sets in just a few clicks with Data Wrangler. Previously, data scientists had to write code to split their data into train and test sets before training ML models. With SageMaker Data Wrangler’s new train-test split transform, you can now split your data into train, test, and validation sets for use in downstream model training and validation. SageMaker Data Wrangler also provides several types of splits, including randomized, ordered, stratified, and key-based splits, along with the option to specify how much data should go in each split. For example, if you create a random split of your data into a training set and a test set, you can train a machine learning model on the training set and then evaluate it on the test set. Evaluating a model on data it saw during training gives a biased, typically overly optimistic, estimate of its performance, so setting test data aside before training is crucial. Evaluating model accuracy on the held-out test set then provides a more realistic estimate of how the model will perform on unseen data.
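To make the difference between a randomized and a stratified split concrete, here is a minimal sketch in plain Python. It is an illustration only, not Data Wrangler's implementation or API: the function names `random_split` and `stratified_split`, their parameters, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split each label group separately so both sets keep similar label proportions."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    train, test = [], []
    for offset, label in enumerate(sorted(groups)):
        g_train, g_test = random_split(groups[label], test_fraction, seed + offset)
        train.extend(g_train)
        test.extend(g_test)
    return train, test


# Hypothetical toy data: 80 rows labeled 0 and 20 rows labeled 1.
rows = [(i, 0) for i in range(80)] + [(i, 1) for i in range(80, 100)]

train, test = random_split(rows, test_fraction=0.2, seed=42)
s_train, s_test = stratified_split(
    rows, label_of=lambda row: row[1], test_fraction=0.2, seed=42
)
```

In this sketch the random split holds out 20 rows without regard to label, so the share of the minority label in the test set can vary from seed to seed. The stratified split holds out 20% of each label group, which keeps the label proportions of the test set close to those of the full data.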