Announcing new capabilities for Amazon SageMaker Debugger with real-time monitoring of system resources and profiling training jobs

Posted on: Dec 8, 2020

We’re excited to announce new capabilities for Amazon SageMaker Debugger: real-time monitoring of system resources and profiling of training jobs, so you can use resources more efficiently. With these capabilities, you now get automatic recommendations for reallocating resources for your training jobs, helping you reduce training time and cost.

Amazon SageMaker Debugger is a capability of Amazon SageMaker that makes it easy to train ML models faster by capturing real-time metrics, such as gradients and weights, during training. This gives you transparency into the training process, so you can detect and correct anomalies such as losses not decreasing, overfitting, and overtraining. SageMaker Debugger provides built-in techniques, called rules, for analyzing emitted data, including the tensors that are critical to the success of a training job. For example, rules can help you identify why your ML model predicts a right traffic signal as left even though it trained to over 90% accuracy.

With the new profiling capabilities, SageMaker Debugger now automatically monitors system resources such as CPU, GPU, network, I/O, and memory, providing a complete view of resource utilization for training jobs. You can also profile your entire training job, or portions of it, to emit detailed framework metrics during different phases of the job. Framework metrics are captured from within the training script and include step duration, data loading, preprocessing, and operator execution time on CPUs and GPUs. SageMaker Debugger correlates system and framework metrics, which helps you identify possible root causes of issues such as GPU utilization dropping to zero, so you can inspect your training script and troubleshoot accordingly. You can then reallocate resources based on recommendations from the profiling report to improve training time and reduce costs. You can capture and monitor metrics and insights programmatically using the SageMaker Python SDK or visually through Amazon SageMaker Studio.
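To make the idea of correlating system and framework metrics concrete, here is a minimal sketch in plain Python. It does not use the SageMaker Python SDK or reflect how SageMaker Debugger is implemented. The `SystemSample` and `FrameworkEvent` types, the `correlate_idle_gpu` helper, the 5% idle threshold, and all sample values are hypothetical, chosen only to illustrate how a drop in GPU utilization might be traced back to a training phase.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SystemSample:
    """One system-metric reading (hypothetical shape, not the SDK's)."""
    timestamp: float        # seconds since the job started
    gpu_utilization: float  # percent, 0-100


@dataclass
class FrameworkEvent:
    """One framework-metric interval reported by the training script."""
    phase: str    # e.g. "data_loading" or "forward_backward"
    start: float  # seconds since the job started (inclusive)
    end: float    # seconds since the job started (exclusive)


def phase_at(events: List[FrameworkEvent], timestamp: float) -> Optional[str]:
    """Return the phase whose interval contains the timestamp, if any."""
    for event in events:
        if event.start <= timestamp < event.end:
            return event.phase
    return None


def correlate_idle_gpu(
    samples: List[SystemSample],
    events: List[FrameworkEvent],
    idle_threshold: float = 5.0,  # assumed cutoff for "GPU is idle"
) -> List[Tuple[float, Optional[str]]]:
    """Pair each low-GPU-utilization sample with the phase active at that time."""
    return [
        (sample.timestamp, phase_at(events, sample.timestamp))
        for sample in samples
        if sample.gpu_utilization <= idle_threshold
    ]


# Hypothetical readings: GPU utilization drops to zero between 0.5 s and 1.0 s.
samples = [
    SystemSample(0.0, 92.0),
    SystemSample(0.5, 0.0),
    SystemSample(1.0, 0.0),
    SystemSample(1.5, 88.0),
]

# Hypothetical framework intervals covering the same time window.
events = [
    FrameworkEvent("forward_backward", 0.0, 0.4),
    FrameworkEvent("data_loading", 0.4, 1.2),
    FrameworkEvent("forward_backward", 1.2, 2.0),
]

idle = correlate_idle_gpu(samples, events)
for timestamp, phase in idle:
    print(f"GPU idle at {timestamp:.1f}s during phase: {phase}")
```

In this made-up trace, both idle samples fall inside the data-loading interval, which would point to the input pipeline rather than the model code as the place to look first.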

Amazon SageMaker Debugger is now generally available in all AWS regions in the Americas and Europe and in some regions in Asia Pacific, with additional regions coming soon. Read the documentation for more information and for sample notebooks. To learn how to use the new profiling functionality in SageMaker Debugger, visit the blog post.