Posted On: Apr 27, 2022

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization from a single visual interface. With SageMaker Data Wrangler’s data selection tool, you can quickly select data from multiple data sources, such as Amazon S3, Amazon Athena, Amazon Redshift, AWS Lake Formation, Amazon SageMaker Feature Store, Databricks Delta Lake, and Snowflake.

Today we are announcing the general availability of random sampling when importing data from Amazon S3, along with new transforms to create random or stratified samples of your data sets, in Amazon SageMaker Data Wrangler in Amazon SageMaker Studio. Previously, you had to write code to create random or stratified samples of your data when preparing it for ML applications. With the random sampling option on import, you can now create a random sample of your data in S3 as you import it into Data Wrangler. Additionally, with the new sampling transforms, you can create the following types of samples of your data set:

  • Random sample. Random samples are helpful when you have a data set that is too large to prepare interactively. With the random sampling transform, you can randomly sample a proportion of your data set to prepare it for machine learning.
  • Stratified sample. Stratified samples are helpful when your data contains a rare event (such as fraudulent credit card transactions, which account for much less than one percent of all credit card transactions) and you want to preserve the proportion of the rare event in your sampled data set.
  • First K sample. First K samples create a sample using the first K rows of your data set, where K is a number you choose. For example, if K is 1,000, then a sample would be created containing the first 1,000 rows of your data set. First K samples are helpful when you only need the correct column schema to prepare your data. An additional benefit of First K sampling is that it is an extremely time-efficient operation.
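
To make the difference between the three sample types concrete, here is a minimal Python sketch using only the standard library. It is an illustration of the sampling ideas described above, not Data Wrangler's implementation or API: the function names, the `transactions` data, and the `is_fraud` field are all hypothetical and chosen only for this example.

```python
import random
from collections import defaultdict


def random_sample(rows, fraction, seed=0):
    """Keep a uniformly random `fraction` of the rows."""
    rng = random.Random(seed)
    return rng.sample(rows, round(len(rows) * fraction))


def stratified_sample(rows, key, fraction, seed=0):
    """Sample `fraction` of the rows within each group of `key`,
    so each group's share of the sample matches its share of the data."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, round(len(members) * fraction)))
    return sample


def first_k_sample(rows, k):
    """Keep the first k rows; no shuffling or full scan is needed."""
    return rows[:k]


# Hypothetical data: 10,000 transactions, 50 of them (0.5%) fraudulent.
transactions = [{"amount": i, "is_fraud": i % 200 == 0} for i in range(10_000)]

stratified = stratified_sample(transactions, "is_fraud", 0.1)
fraud_rate = sum(row["is_fraud"] for row in stratified) / len(stratified)
print(fraud_rate)  # 0.005, the same rate as in the full data set
```

In this sketch, the stratified sample always keeps 5 of the 50 fraudulent rows, whereas a plain random sample of the same size could by chance contain more or fewer, and the First K sample contains only the fraudulent rows that happen to fall within the first K rows.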

To learn more about how to sample your data with Amazon SageMaker Data Wrangler, read the blog.

To get started with the new capabilities of Amazon SageMaker Data Wrangler, open Amazon SageMaker Studio after upgrading to the latest release and click File > New > Flow from the menu or “new data flow” in the SageMaker Studio launcher. To learn more about the new features, view the documentation.