Posted On: Oct 21, 2022
Today, we are excited to announce support for dynamically referencing different datasets stored in Amazon S3 through the use of parameters in Amazon SageMaker Data Wrangler. Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes. With Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. Previously, customers did not have an easy way to dynamically refer to datasets when running Data Wrangler processing jobs on a schedule. Customers also lacked an easy way to filter the files in an S3 bucket down to those that should be processed. Finally, customers lacked a simple way to change data sources when running a Data Wrangler processing job from the Create Job workflow or from a Data Wrangler processing notebook.
With support for parameterized datasets in Data Wrangler, you can use parameters to specify which datasets to process with your Data Wrangler flow. A parameter is a variable that you can save in your Data Wrangler flow. You can use date-time parameters to refer to datasets within a specific date-time range. With pattern parameters, you can specify a Python regular expression to match filenames that conform to a specific pattern. String and number parameters match filenames containing a given string or numeric value. You can access parameters in Data Wrangler by clicking the node “+” menu and selecting “Edit dataset”. Highlighting any part of the S3 path brings up the “Create custom parameter” menu, which you can use to add a new parameter. The full list of parameters can be accessed by clicking the “{{ }}” icon next to the S3 path.
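To illustrate the idea behind pattern parameters, here is a minimal standalone sketch of how a Python regular expression can select matching filenames from a set of S3 keys. The key names and the expression are hypothetical examples, not Data Wrangler's internal API; Data Wrangler applies the pattern for you when you define the parameter in the flow.

```python
import re

# Hypothetical S3 keys; in practice these would come from listing your bucket.
keys = [
    "sales/2022/10/orders-2022-10-01.csv",
    "sales/2022/10/orders-2022-10-02.csv",
    "sales/2022/10/readme.txt",
]

# A pattern parameter holds a Python regular expression; this example
# matches daily order files for October 2022.
pattern = re.compile(r"orders-2022-10-\d{2}\.csv$")

matched = [k for k in keys if pattern.search(k)]
print(matched)
```

Only the two `orders-…csv` files match; `readme.txt` is filtered out, which is the kind of narrowing a pattern parameter gives you over an S3 prefix.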
This feature is generally available in all AWS Regions that Data Wrangler currently supports, at no additional charge. To get started scheduling your data processing jobs with SageMaker Data Wrangler, read the AWS documentation.