Posted On: Nov 2, 2023
You can now launch Amazon SageMaker Data Wrangler from Amazon EMR Studio for low-code data preparation for machine learning (ML). Amazon EMR is the cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto. Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes. The new integration provides a simplified way to launch SageMaker Data Wrangler from EMR Studio and prepare data for ML without writing code.
Analyzing, transforming, and preparing large amounts of data is a critical part, and also the most time-consuming part, of the ML workflow. Starting today, customers can launch SageMaker Data Wrangler from EMR Studio to discover and connect to existing EMR clusters. They can then use the Data Wrangler visual interface to analyze data with the Data Quality and Insights report, clean data, and create features for ML using 300+ transformations backed by Spark. They can scale to process very large datasets with distributed processing jobs, automate data preparation using the built-in scheduling capability, or integrate with SageMaker Pipelines for end-to-end training or inference workflows. They can also train and deploy ML models automatically from the visual interface through the SageMaker Autopilot integration in SageMaker Data Wrangler.
The new integration is available in all commercial AWS Regions where EMR and SageMaker Data Wrangler are available. For more information, see the AWS technical documentation.