Posted On: Aug 22, 2023
Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio. SageMaker Data Wrangler enables you to access data from a wide variety of popular sources including Amazon S3, Amazon Athena, Amazon Redshift, Amazon EMR, Snowflake, and over 50 other third-party sources. Starting today, you can use role-based access control with AWS Lake Formation in EMR Hive and Presto connections to create datasets for ML in SageMaker Data Wrangler.
Once administrators configure EMR role-based access with Lake Formation and grant data access to the IAM role used in SageMaker Studio, you can connect from SageMaker Data Wrangler to EMR using the same IAM role to authenticate and authorize with Lake Formation. You can use EMR Hive and Presto connections to browse data in your S3 data lake managed by Lake Formation and create a dataset for ML. You can then quickly assess data quality, clean the data, and create features using SageMaker Data Wrangler’s visual interface and 300+ built-in analyses and data transformations, backed by Spark, without writing code. You can also train and deploy models with SageMaker Autopilot, and operationalize the data preparation process in a feature engineering, training, or inference pipeline through integration with SageMaker Pipelines, all from SageMaker Data Wrangler.
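As a rough illustration of the administrator step described above, the sketch below shows one way a Lake Formation table grant for the SageMaker Studio IAM role might be scripted with boto3. It is a minimal sketch, not the documented setup procedure: the role ARN, database name, and table name are hypothetical placeholders, the helper function names are invented for this example, and the choice of SELECT and DESCRIBE permissions is an assumption about what a read-only dataset import would need.

```python
from typing import Any, Dict, List, Optional


def build_grant_request(
    role_arn: str,
    database: str,
    table: str,
    permissions: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for Lake Formation's GrantPermissions call.

    Defaults to read-only access (SELECT, DESCRIBE), which is an assumption
    made for this example rather than a documented requirement.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions) if permissions else ["SELECT", "DESCRIBE"],
    }


def grant_table_access(role_arn: str, database: str, table: str) -> Dict[str, Any]:
    """Send the grant to Lake Formation (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder works without the AWS SDK

    client = boto3.client("lakeformation")
    return client.grant_permissions(**build_grant_request(role_arn, database, table))


if __name__ == "__main__":
    # Hypothetical identifiers; replace with your own role, database, and table.
    request = build_grant_request(
        "arn:aws:iam::111122223333:role/ExampleStudioRole",
        "sales_db",
        "orders",
    )
    print(request)
```

Running the script only prints the request payload; calling `grant_table_access` would actually issue the grant and must be run by a principal that has Lake Formation administrator rights on the table.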
SageMaker Data Wrangler supports EMR and Lake Formation in all AWS Regions where Data Wrangler is currently available. To learn more, see this blog post and the AWS technical documentation.