Posted On: Aug 24, 2023

We are excited to announce the preview of Amazon SageMaker Profiler, an advanced observability tool for large deep learning workloads. With this new capability, you can access granular, compute hardware-related profiling insights for optimizing model training performance.

For customers developing large deep learning models for computer vision, NLP, or foundation model use cases, the number of compute instances needed and the associated costs are significant. These customers need visibility into active kernel times, launch latency, and other timelines related to GPU and CPU processes. SageMaker Profiler helps identify optimization opportunities through GPU and CPU utilization metrics, high-resolution GPU/CPU trace plots, custom annotations, and visibility into mixed precision utilization. It also helps users find bottlenecks caused by uneven resource utilization. In addition, it adds less overhead during training and scales to longer profiling durations and a larger number of profiled training instances per workload. Together, these capabilities give data scientists more reliable insights when optimizing hardware performance for large-scale distributed training workloads.

Amazon SageMaker Profiler is available in the following regions, with default compute instance support: US East (Ohio), US East (N. Virginia), US West (Oregon), Europe (Frankfurt), and Europe (Ireland). During this preview, SageMaker Profiler is available at no cost to customers in supported regions.

To learn more, please see the ML blog and documentation page.