Posted On: May 6, 2022
Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. SageMaker Data Wrangler runs on ml.m5.4xlarge by default. It includes built-in data transforms and analyses written in PySpark, so you can efficiently process large data sets, up to hundreds of gigabytes (GB) of data, on the default instance.
Starting today, you can use additional M5 or R5 instance types with more CPU or memory in SageMaker Data Wrangler to improve performance for your data preparation workloads. Amazon EC2 M5 instances offer a balance of compute, memory, and networking resources for a broad range of workloads. Amazon EC2 R5 instances are memory-optimized instances. Both M5 and R5 instance types are well suited for CPU- and memory-intensive applications, such as running built-in transforms on very large data sets (up to terabytes (TB) of data) or applying custom transforms written in pandas to medium data sets (up to tens of GB).
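As an illustration of the kind of custom pandas transform mentioned above, here is a minimal sketch. It assumes the data is available as a pandas DataFrame; the column names (price, quantity, price_per_unit) and the function name are hypothetical and are not part of SageMaker Data Wrangler. Because pandas holds the whole DataFrame in memory, transforms like this are the ones most likely to benefit from an instance with more memory.

```python
import pandas as pd


def add_price_per_unit(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical custom transform: clean rows and derive a ratio column."""
    # Drop rows where either input column is missing.
    out = df.dropna(subset=["price", "quantity"]).copy()
    # Keep only rows with a positive quantity to avoid division by zero.
    out = out[out["quantity"] > 0].copy()
    # Derive the new feature from the two existing columns.
    out["price_per_unit"] = out["price"] / out["quantity"]
    return out


# Small in-memory sample standing in for a real data set (made-up values).
sample = pd.DataFrame(
    {"price": [10.0, 20.0, None, 5.0], "quantity": [2, 4, 1, 0]}
)
result = add_price_per_unit(sample)
```

The function returns a new DataFrame rather than modifying its input, so the original data stays intact if a later step needs it.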
To learn more about the newly supported instances with Amazon SageMaker Data Wrangler, visit the blog, the AWS documentation, and the pricing page. To get started with SageMaker Data Wrangler, visit the AWS documentation.