Posted On: Dec 8, 2022

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. Starting today, you can connect to Presto running on Amazon EMR as a big data query engine to bring in very large datasets and prepare data for ML in minutes in the Data Wrangler visual interface.

Analyzing, transforming, and preparing large amounts of data is a critical part, and also the most time-consuming part, of the ML workflow. Data scientists and data engineers use Apache Spark, Apache Hive, and Presto running on Amazon EMR for large-scale data preparation. Starting today, customers can use a visual interface in Data Wrangler to discover and connect to existing EMR clusters that expose a Presto endpoint. They can browse databases, tables, and schemas, and author Presto queries to select, preview, and create a dataset for ML. They can then use the Data Wrangler visual interface to analyze data with the Data Quality and Insights report, and to clean data and create features for ML using 300+ built-in transformations backed by Spark, without the need to author Spark code. They can automatically train and deploy ML models through the integration with SageMaker Autopilot. Finally, they can scale to process very large datasets with distributed processing jobs, automate data preparation using the built-in scheduling capability, and run data preparation in production workflows for training or inference with SageMaker Pipelines.
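
As a rough illustration of the kind of Presto query a user might author to select and preview a dataset, the following Python sketch assembles a simple SELECT statement with a row limit. The helper name `build_preview_query` and the catalog, schema, table, and column names (`hive`, `sales`, `orders`, `order_id`, `amount`) are hypothetical and not part of Data Wrangler; the sketch only assumes Presto's standard `catalog.schema.table` naming and only produces the query text.

```python
import re

# Plain identifiers only: letters, digits, and underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check_identifier(name):
    """Reject names that are not plain identifiers, to avoid malformed SQL."""
    if not _IDENTIFIER.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_preview_query(catalog, schema, table, columns, limit=100):
    """Build a Presto SELECT that previews a limited number of rows.

    Hypothetical helper for illustration; it is not a Data Wrangler API.
    """
    cols = ", ".join(_check_identifier(c) for c in columns) if columns else "*"
    source = ".".join(_check_identifier(p) for p in (catalog, schema, table))
    return f"SELECT {cols} FROM {source} LIMIT {int(limit)}"


# Example with made-up names: preview two columns from a hypothetical table.
query = build_preview_query("hive", "sales", "orders", ["order_id", "amount"], limit=50)
print(query)
```

The resulting string could be pasted into a query editor to preview a sample before importing the full dataset; the names would need to match whatever the connected cluster actually exposes.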

Data Wrangler supports EMR Presto in all the Regions currently supported by Data Wrangler at no additional charge. To learn more, see this blog post and the AWS technical documentation.