Posted On: Mar 10, 2023
Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio. Data Wrangler enables you to access data from a wide variety of popular sources (Amazon S3, Amazon Athena, Amazon Redshift, Amazon EMR Presto, Snowflake) and over 40 other third-party sources. Starting today, you can connect to Amazon EMR Hive as a big data query engine to bring in very large datasets for ML.
Aggregating and preparing large amounts of data is a critical part of the ML workflow. Data scientists and data engineers use Apache Spark, Apache Hive, and Presto running on Amazon EMR for large-scale data processing. Starting today, customers can use Data Wrangler’s visual interface to discover and connect to existing EMR clusters running a Hive endpoint. They can browse databases, tables, and schemas, and author Hive queries to select, preview, and create a dataset using Data Wrangler’s SQL explorer. They can then visually analyze data and create ML features without writing any code, using 300+ built-in analyses and transformations backed by Spark. Customers can also train and deploy models with SageMaker Autopilot, schedule jobs, or operationalize data preparation in a SageMaker Pipeline from the Data Wrangler visual interface.
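As a rough illustration of the kind of Hive query one might author in the SQL explorer to preview a table before creating a dataset, here is a minimal Python sketch that only assembles the query text. The helper `build_hive_preview_query` and the names `sales_db`, `orders`, `order_id`, and `amount` are hypothetical and not part of Data Wrangler or any AWS API; the sketch does not connect to an EMR cluster.

```python
def build_hive_preview_query(database, table, columns, limit=100):
    """Return a HiveQL SELECT statement that previews a few columns of a table.

    Hypothetical helper for illustration only; it builds a string and
    does not talk to Hive, EMR, or Data Wrangler.
    """
    if not columns:
        raise ValueError("at least one column is required")
    if limit <= 0:
        raise ValueError("limit must be positive")
    # HiveQL quotes identifiers with backticks.
    quoted_columns = ", ".join(f"`{c}`" for c in columns)
    return f"SELECT {quoted_columns} FROM `{database}`.`{table}` LIMIT {limit}"


# Assumed database, table, and column names.
query = build_hive_preview_query("sales_db", "orders", ["order_id", "amount"], limit=50)
print(query)
```

Keeping a `LIMIT` on the preview query is a sensible default when the underlying tables are very large; the full dataset can be selected once the columns look right.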
Data Wrangler supports EMR Hive in all AWS Regions where Data Wrangler is currently available. To learn more, see this blog post and the AWS technical documentation.