Posted On: Oct 17, 2022
Today, we are excited to announce support for refitting transforms in Amazon SageMaker Data Wrangler. To make data usable by algorithms such as XGBoost, data scientists must convert non-numeric values to numeric values using transforms such as one-hot encoding. Because the output of a transform like one-hot encoding depends on the data it was computed from, these transforms are frequently referred to as fitted transforms. Fitted transforms must be updated, or refit, as the data changes over time. Likewise, a transform fitted on a sample data set must be refit to account for differences between the sample and the larger data set. Transforms like one-hot encoding also generate additional information, which must be tracked and captured in the data preparation pipeline. Omitting or incorrectly tracking this information can lead to errors in the data preparation process. Without support for refitting transforms, many data scientists did not have an easy way to specify whether to reuse a fitted version of a transform or to refit the transform on new data. They also lacked an easy way to generate updated versions of their transformation pipelines when refitting on new data sets.
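To make the idea of a fitted transform concrete, here is a minimal sketch in plain Python. It is not Data Wrangler's implementation or API: the `OneHotEncoder` class and the `sample` and `full` lists are hypothetical names chosen for illustration. The sketch assumes that a category missing from the fitted parameters encodes as all zeros, which is one common convention.

```python
from typing import List


class OneHotEncoder:
    """Toy one-hot encoder (hypothetical; for illustration only)."""

    def __init__(self) -> None:
        # The fitted state: the categories learned from the data.
        self.categories: List[str] = []

    def fit(self, values: List[str]) -> "OneHotEncoder":
        # The data-dependent step: learn the category list from the input.
        self.categories = sorted(set(values))
        return self

    def transform(self, values: List[str]) -> List[List[int]]:
        # Assumption: a category not seen during fit encodes as all zeros.
        return [[1 if v == c else 0 for c in self.categories] for v in values]


sample = ["red", "green", "red"]   # small sample used while building the flow
full = ["red", "green", "blue"]    # larger data set with a new category

# Fit on the sample, then reuse the fitted parameters on the full data.
encoder = OneHotEncoder().fit(sample)
reused = encoder.transform(full)

# Refit on the full data so the new category gets its own column.
refit_encoder = OneHotEncoder().fit(full)
refit = refit_encoder.transform(full)

print(encoder.categories, reused)
print(refit_encoder.categories, refit)
```

When the sample-fitted encoder is reused, "blue" has no column of its own, so its row carries no signal. After refitting, the encoder has three columns. This is the kind of difference between sample and full data that refitting is meant to absorb.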
Data Wrangler now tracks fitted transforms in data flows for all applicable transforms, making it easier to prepare new data as required. Users can specify whether to reuse existing fitted transforms or refit them on their data. The refit feature is available both in the Data Wrangler visual interface, when launching a Data Wrangler processing job, and in the create job notebook. Simply select “refit” under “trained parameters” in the create job workflow to refit the transforms in your flow. Data Wrangler will also automatically generate a new flow file containing the updated values for the refit transforms.
This feature is generally available, at no additional charge, in all AWS Regions that Data Wrangler currently supports. To get started with SageMaker Data Wrangler, read the AWS documentation.