Posted On: Sep 22, 2022

Amazon SageMaker Data Wrangler reduces the time that it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. You can import data from multiple data sources such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Snowflake, and 26 Federated Query data sources supported by Amazon Athena. Starting today, customers importing data from Athena data sources can configure the S3 query output location and the data retention period to control where and for how long Athena stores the intermediary data.

Amazon Athena is an interactive query service that makes it easy to browse the AWS Glue Data Catalog and to analyze data directly in Amazon S3 and 26 Federated Query data sources using standard SQL. Data Wrangler supports Athena workgroups, which can provide a custom S3 query output location. Starting today, you can specify a custom S3 location for Athena query outputs or continue to use the existing default bucket in Data Wrangler. Athena query output now has a default data retention period of 5 days, which helps control storage costs. You can change this data retention period to match your needs and your organization’s data security guidelines. Once you import the data through Athena, you can use the Data Wrangler visual interface to join data from multiple sources, and explore and analyze your data with the Data Quality and Insights report and other built-in visualizations to identify potential errors and extreme values. You can quickly cleanse your data and engineer features with 300+ built-in data transformations. You can create a job to process a larger dataset, or kick off a SageMaker Autopilot training job directly from Data Wrangler to automatically find the best model for your business problem using the prepared data.
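For readers who want to see what a custom query output location and a retention period look like outside the Data Wrangler interface, the following is a minimal sketch in Python. It only builds the request payloads: a `ResultConfiguration` for Athena and an S3 lifecycle expiration rule. The bucket name, prefix, rule ID, and helper function names are hypothetical, and this is not a description of how Data Wrangler applies these settings internally.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/prefix/' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def athena_result_configuration(output_location: str) -> dict:
    """Build the ResultConfiguration payload for an Athena query."""
    split_s3_uri(output_location)  # raises if the URI is malformed
    return {"OutputLocation": output_location}


def expiration_lifecycle(prefix: str, days: int = 5) -> dict:
    """Build an S3 lifecycle configuration that expires objects under
    `prefix` after `days` days (5 mirrors the default in the announcement)."""
    if days < 1:
        raise ValueError("retention period must be at least 1 day")
    return {
        "Rules": [
            {
                "ID": f"expire-athena-output-{days}d",  # hypothetical rule ID
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Hypothetical location; replace with a bucket you own.
OUTPUT_LOCATION = "s3://example-bucket/athena/output/"
bucket, prefix = split_s3_uri(OUTPUT_LOCATION)
result_config = athena_result_configuration(OUTPUT_LOCATION)
lifecycle = expiration_lifecycle(prefix, days=5)

# With boto3 installed and credentials configured, the payloads could be
# passed to the corresponding API calls, for example:
#   boto3.client("athena").start_query_execution(
#       QueryString="SELECT 1", ResultConfiguration=result_config)
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket=bucket, LifecycleConfiguration=lifecycle)
# Note that put_bucket_lifecycle_configuration replaces any lifecycle
# configuration already set on the bucket.
```

Scoping the expiration rule to the query output prefix, rather than the whole bucket, keeps the retention period from affecting other data stored in the same bucket.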

These features are generally available in all the AWS Regions that Data Wrangler currently supports, at no additional charge. To get started with SageMaker Data Wrangler, visit the blog and the AWS documentation.