Posted On: Nov 30, 2022
Today, we are excited to announce support for deploying data preparation flows created in Data Wrangler to real-time and batch serial inference pipelines, and additional configurations for Data Wrangler processing jobs in Amazon SageMaker Data Wrangler.
Amazon SageMaker Data Wrangler reduces the time it takes to prototype and deploy data processing workloads to production, and integrates easily with CI/CD pipelines and MLOps production environments through SageMaker Processing APIs. When running and scheduling data processing workloads with Data Wrangler to prepare data for training ML models, customers asked for the ability to customize Spark memory and output partition settings for their data preparation workloads at scale. In addition, once customers process their data and train an ML model, they need to deploy both the data transformation pipeline and the ML model behind a SageMaker endpoint for real-time and batch inference use cases. Previously, customers had to create data processing scripts from scratch to run the same data processing steps at inference that were applied when training the model, and once their model was deployed they had to keep their training and deployment scripts in sync.
With this release, you can now easily configure Spark memory settings and the output partition format when running a Data Wrangler processing job to process data at scale. After preparing your data and training an ML model, you can deploy your data transformation pipeline (also called a “data flow”) together with an ML model as part of a serial inference pipeline for both batch and real-time inference applications. You can also register your Data Wrangler data flows with SageMaker Model Registry. To begin deploying your Data Wrangler flow for real-time inference, choose “Export to > Inference Pipeline (via Jupyter Notebook)” from the Data Flow view in Data Wrangler. Spark memory settings can be configured as part of the Create job workflow, and partitions can be configured in the destination node settings.
This feature is generally available at no additional charge in all AWS Regions that Data Wrangler currently supports. To get started with SageMaker Data Wrangler, read the blog and the AWS documentation.