Amazon SageMaker Debugger

Optimize ML models with real-time monitoring of training metrics and system resources

Amazon SageMaker Debugger makes it easy to optimize machine learning (ML) models by capturing training metrics in real-time such as data loss during regression and sending alerts when anomalies are detected. This helps you immediately rectify inaccurate model predictions such as an incorrect identification of an image. SageMaker Debugger automatically stops the training process when the desired accuracy is achieved, reducing the time and cost of training ML models.

NEW! Amazon SageMaker Debugger can now help you train models faster by automatically profiling and monitoring system resource utilization and sending alerts when resource bottlenecks such as over-utilized CPUs are identified. You can visually monitor and profile system resources including CPUs, GPUs, network, and memory during training within Amazon SageMaker Studio so you can continuously improve resource utilization. SageMaker Debugger correlates system resource usage to different phases of the training job and for specific points in time during training, and provides recommendations on how to adjust resource utilization to help you re-allocate resources for maximum efficiency. Monitoring and profiling works across all leading deep learning frameworks including PyTorch and TensorFlow, without requiring any code changes in your training scripts. Monitoring and profiling of system resources happens in real-time, helping you optimize your ML models faster and at scale.

Automatic detection, analysis, and alerts

Amazon SageMaker Debugger can reduce troubleshooting during training from days to minutes by automatically detecting and alerting you to remediate common training errors such as gradient values becoming too large or too small. Alerts can be viewed in Amazon SageMaker Studio or configured through Amazon CloudWatch. Additionally, the SageMaker Debugger SDK enables you to automatically detect new classes of model-specific errors such as data sampling, hyperparameter values, and out of bound values.

Monitoring and profiling

Amazon SageMaker Debugger automatically monitors utilization of system resources such as GPUs, CPUs, network, and memory, and profiles your training jobs to collect detailed ML framework metrics. You can inspect all resource metrics visually through SageMaker Studio. Anomalies in resource utilization are correlated to specific operations for identification of bottlenecks such as over-utilized CPUs so you can take corrective action quickly. Additionally, a detailed report can be downloaded for offline analysis. Training runs can be profiled either at the start of the training job or at any point when training is in progress.

Built-in analytics

Amazon SageMaker Debugger comes with built-in analytics that automatically analyze data emitted during training such as inputs, outputs, and transformations known as tensors. As a result, you can detect whether a model is overfitting or overtraining, whether gradients are getting too large or too small, whether GPU resources are underutilized, and other bottlenecks during training. With SageMaker Debugger, you can also create your own custom conditions to test for specific behavior in your training jobs. These conditions can invoke actions such as stopping a training job and sending an SMS or email. Early stopping of training jobs will help reduce training costs for suboptimal models and develop better prototypes faster.

Broad support across ML algorithms and DL frameworks

Amazon SageMaker Debugger supports ML frameworks including TensorFlow, PyTorch, Apache MXNet, Keras, and XGBoost. SageMaker’s built-in containers for these frameworks come pre-installed with SageMaker Debugger, enabling you to monitor, profile, and debug your training scripts easily. By default, SageMaker Debugger monitors system hardware utilization and losses during training without writing additional code to monitor each resource separately.

Integration with AWS Lambda

Amazon SageMaker Debugger is integrated with AWS Lambda so you can act on results from alerts. For example, AWS Lambda functions can automatically stop a training job when a non-converging action such as losses continuously increasing rather than decreasing over time, is detected. AWS Lambda provides notifications to stop training jobs so you can reduce costs and achieve desired results during the early stages of ML development and training.


Mobileye is a global leader in driver assistance and autonomous vehicle technology with over 60M vehicles with Mobileye technology.

“Many of the assisted driving and autonomous vehicle technologies that we develop at Mobileye, (officially known as Mobileye, an Intel Company), rely on training deep neural network models to detect a wide variety of road artifacts, including vehicles, pedestrians, speed bumps, road signs and more. Often, these models train on extremely large datasets, on multiple machines, and for periods of up to several days. For us, at Mobileye, it is imperative that we have a toolkit of advanced performance profiling capabilities, for analyzing the flow of data across the network, CPU, and GPU resources, and for pinpointing performance issues. Amazon SageMaker Debugger’s profiling functionality provides just that, taking performance profiling out of the domain of a few specialized experts, and empowering our algorithm developers to maximize training resource utilization, accelerate model convergence, and reduce cost.”

Chaim Rand, ML Algorithm Developer - Mobileye, an Intel company

Autodesk is a global leader in 3D design, engineering, and entertainment software. Autodesk helps users unlock their creativity to solve design challenges and turn ideas into realities.

“At Autodesk, we leverage machine learning to enhance our design and manufacturing solutions to enable greater degrees of creative freedom for our customers. Using machine learning, we developed a new filter that identifies and groups outcomes with similar visual characteristics to make it easier to find the best options. Amazon SageMaker Debugger allows us to iterate on this model much more efficiently by helping close the feedback loop, saving valuable data scientist time and cutting training hours by more than 75%.”

Alexander Carlson, ML Engineer - Autodesk

Change Healthcare is a leading independent healthcare technology company that provides data and analytics-driven solutions to improve clinical, financial, and patient engagement outcomes in the U.S. healthcare system.

"At Change Healthcare, we are continuously working with our healthcare providers to remove inefficiencies from the processing of healthcare claims. We often receive claim forms from our healthcare providers which have unreadable labels and fixing these forms manually adds time and cost to the claim settlement process. We have developed a multi-layer deep learning model that superimposes labels that helps us with this process. Amazon SageMaker Debugger helps us improve the accuracy of the model with rapid iterations. With SageMaker Debugger, we can gain deeper insights into tensors, achieve resilient model training, assist in detecting inconsistencies in real-time, and tune the model parameters for better accuracy."

Jayant Thomas, Sr. Director, AI Engineering - Change Healthcare

Get started with Amazon SageMaker Debugger