AWS Machine Learning Blog

Easily monitor and visualize metrics while training models on Amazon SageMaker

Data scientists and developers can now quickly and easily access, monitor, and visualize metrics that are computed while training machine learning models on Amazon SageMaker. You can now specify the metrics you want to track by using the AWS Management Console for Amazon SageMaker or by using the Amazon SageMaker Python SDK APIs. After the model training starts, Amazon SageMaker will automatically monitor and stream the specified metrics in real time to the Amazon CloudWatch console for visualizing time-series curves, such as loss curves and accuracy curves. You can also access the metrics programmatically using Amazon SageMaker Python SDK APIs.

Model training is an iterative process of teaching a model to make predictions by presenting examples from a training dataset. Typically, a training algorithm computes several metrics, such as training loss and prediction accuracy, that help diagnose whether the model is learning well and will generalize well for making predictions on unseen data. This diagnosis is especially helpful when you are tuning your model’s hyperparameters or evaluating whether your model has the potential to be deployed to production.

Now let’s dive into a few examples so you can see how to monitor and visualize these metrics on Amazon SageMaker.

Amazon SageMaker algorithms provide built-in support for metrics

All Amazon SageMaker built-in algorithms automatically compute and emit a variety of model training, evaluation, and validation metrics. For example, the Amazon SageMaker Object2Vec algorithm emits the validation:cross_entropy metric. Object2Vec is a supervised learning algorithm that can learn low-dimensional dense embeddings of high-dimensional objects such as words, phrases, and sentences. It also learns how similar two embeddings are in vector space, a technique that has applications in assessing whether a given pair of sentences is semantically similar. The validation:cross_entropy metric emitted by the algorithm measures the extent to which the predictions made by the model diverge from the actual labels in the validation dataset. If the model is learning well, the cross_entropy should decrease as model training progresses.
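To make the metric concrete, here is a minimal NumPy sketch of binary cross-entropy. This illustrates the general formula only, not the algorithm’s internal implementation:

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predicted probabilities to avoid log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# A confident, correct prediction yields low loss; a confident, wrong one yields high loss
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.1])))  # ~0.105
print(binary_cross_entropy(np.array([1, 0]), np.array([0.1, 0.9])))  # ~2.303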

Now let’s walk through the AWS Management Console step by step. We’ll also show you how to use the code snippets from the sample notebook for training an Amazon SageMaker Object2Vec model.

Step 1: Start the training job on Amazon SageMaker

The sample notebook has step-by-step instructions for creating the training job. You can find all the metrics emitted by the training algorithm in the AWS Management Console. Open the Amazon SageMaker console and choose Training Jobs in the left navigation pane. Then choose the training job name to open the details page for that training job.
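For reference, starting the training job from the notebook looks roughly like the following sketch, assuming the v1-style Amazon SageMaker Python SDK interfaces. The bucket paths, instance type, and hyperparameter values here are placeholders; the sample notebook has the exact configuration:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.amazon.amazon_estimator import get_image_uri

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Resolve the regional container image for the built-in Object2Vec algorithm
container = get_image_uri(sess.boto_region_name, 'object2vec')

estimator = Estimator(
    container,
    role,
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge',          # placeholder instance type
    output_path='s3://<your-bucket>/output',     # placeholder output location
    sagemaker_session=sess)

# Placeholder hyperparameters; see the sample notebook for real values
estimator.set_hyperparameters(enc_dim=512, mlp_dim=256, epochs=20)

# Channels point at your prepared training and validation data in Amazon S3
estimator.fit({'train': 's3://<your-bucket>/train',
               'validation': 's3://<your-bucket>/validation'})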

On the training job details page, scroll down to the Metrics section to find all the metrics published by the training algorithm to your Amazon CloudWatch Logs and Amazon CloudWatch metrics streams. You can use the regex pattern that appears next to each metric to quickly parse and filter the metric values from the CloudWatch log files that Amazon SageMaker creates.
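For example, you could apply such a regex to a log line yourself. Both the log line and the pattern below are hypothetical, for illustration only; use the exact pattern shown in the console for your metric:

import re

log_line = '#quality_metric: host=algo-1, validation cross_entropy <loss>=0.2723'  # hypothetical log line
pattern = r'validation cross_entropy <loss>=(\S+)'                                 # hypothetical regex

match = re.search(pattern, log_line)
if match:
    print(float(match.group(1)))  # 0.2723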

In the next step, we’ll show you how to avoid manually parsing the log files and instead monitor the metrics directly on your Amazon CloudWatch metrics dashboard.

Step 2: Visit the Amazon CloudWatch metrics dashboard to monitor and visualize the metrics

The training job details page now has a direct link to the Amazon CloudWatch metrics dashboard for the metrics emitted by the training algorithm.

Choose the link to go to your Amazon CloudWatch metrics dashboard. Use this dashboard to select the validation:cross_entropy metric for graphing and visualization.
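If you prefer to query Amazon CloudWatch programmatically, you can also discover the published training metrics with boto3. Here is a minimal sketch; /aws/sagemaker/TrainingJobs is the namespace SageMaker publishes training job metrics to, and the exact dimensions for your job are visible in the CloudWatch console graph:

import boto3

cloudwatch = boto3.client('cloudwatch')

# List the published instances of the metric, along with their dimensions
metrics = cloudwatch.list_metrics(
    Namespace='/aws/sagemaker/TrainingJobs',
    MetricName='validation:cross_entropy')

for metric in metrics['Metrics']:
    print(metric['MetricName'], metric['Dimensions'])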

Step 3: Use the Amazon SageMaker Python SDK APIs to visualize metrics

You can also visualize the metrics inline in your Amazon SageMaker Jupyter notebooks using the Amazon SageMaker Python SDK APIs. Here is a sample code snippet.

%matplotlib inline
from sagemaker.analytics import TrainingJobAnalytics

training_job_name = '<insert job name>'
metric_name = 'validation:cross_entropy'

# Fetch the metric values emitted by the training job as a pandas DataFrame
metrics_dataframe = TrainingJobAnalytics(
    training_job_name=training_job_name,
    metric_names=[metric_name]).dataframe()

# Plot the metric as a time series
ax = metrics_dataframe.plot(kind='line', figsize=(12, 5),
                            x='timestamp', y='value', style='b.', legend=False)
ax.set_ylabel(metric_name);

Step 4: Use the DescribeTrainingJob API action

In addition to visualizing the running value of the metric, you can also access the final value of the metric using the DescribeTrainingJob API action.
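For example, with boto3 you can read the final values from the FinalMetricDataList field of the response:

import boto3

sm_client = boto3.client('sagemaker')
response = sm_client.describe_training_job(TrainingJobName='<insert job name>')

# FinalMetricDataList holds the last value emitted for each tracked metric
for metric in response['FinalMetricDataList']:
    print(metric['MetricName'], metric['Value'])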

Monitoring and visualizing metrics for your own training algorithm

If you are training a model on Amazon SageMaker using one of the built-in deep learning framework containers, such as the TensorFlow or PyTorch containers, or running your own algorithm container, you can now easily specify the metrics that you want Amazon SageMaker to monitor and publish to your Amazon CloudWatch metrics dashboard.

Using the Amazon SageMaker console

While you are creating your model training job on the console, you can now specify the regex pattern for the metrics that your algorithm or model training script publishes to logs. Amazon SageMaker will automatically parse the metrics from logs and publish them to your Amazon CloudWatch metrics dashboard for graphing and visualization.
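For example, suppose that at the end of each epoch your training script prints a line in a stable, parseable format (the format and names below are hypothetical, for illustration):

# Inside your training loop
print('epoch={}; train_loss={};'.format(epoch, epoch_loss))

A metric definition with the name train:loss and the regex train_loss=([0-9.]+) would then let Amazon SageMaker extract the value from each matching log line and publish it to Amazon CloudWatch.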

Using the AWS SDK

You can also add MetricDefinitions for the metrics you want to track when creating a training job with the CreateTrainingJob API action.

trainingJobParams = {
   "AlgorithmSpecification": {
      "TrainingImage": "string",
      "TrainingInputMode": "string",
      "MetricDefinitions": [
         {
            "Name": "validation:rmse",
            "Regex": ".*\\[[0-9]+\\].*#011validation-rmse:(\\S+)"
         },
         {
            "Name": "validation:auc",
            "Regex": ".*\\[[0-9]+\\].*#011validation-auc:(\\S+)"
         },
         {
            "Name": "train:auc",
            "Regex": ".*\\[[0-9]+\\]#011train-auc:(\\S+).*"
         }
      ]
   },
   ...
}
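If you use the Amazon SageMaker Python SDK rather than calling the API directly, you can pass the same definitions through the Estimator's metric_definitions parameter. A minimal sketch, with placeholder image, role, and instance type:

from sagemaker.estimator import Estimator

estimator = Estimator(
    '<your algorithm container image>',   # placeholder training image URI
    '<your IAM role ARN>',                # placeholder execution role
    train_instance_count=1,
    train_instance_type='ml.m5.xlarge',
    metric_definitions=[
        {'Name': 'validation:rmse',
         'Regex': '.*\\[[0-9]+\\].*#011validation-rmse:(\\S+)'},
        {'Name': 'train:auc',
         'Regex': '.*\\[[0-9]+\\]#011train-auc:(\\S+).*'}])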

Get started with more examples and developer support

Now that you have seen examples of how to monitor and visualize metrics on Amazon SageMaker, you can try out the sample notebook that we mentioned earlier or add metrics visualization to your own training algorithm. You can refer to our developer guide for a complete listing of the metrics computed by the built-in Amazon SageMaker algorithms, or post your questions on our developer forum. Happy modeling!


About the Authors

Sifei Li is a Software Engineer in Amazon AI where she’s working on building Amazon Machine Learning Platforms and was part of the launch team for Amazon SageMaker.

Sumit Thakur is a Senior Product Manager for AWS Machine Learning Platforms, where he loves working on products that make it easy for customers to get started with machine learning in the cloud. He is the product manager for Amazon SageMaker and the AWS Deep Learning AMI. In his spare time, he likes connecting with nature and watching sci-fi TV series.

Andrew Packer is a Software Engineer in Amazon AI where he is excited about building scalable, distributed machine learning infrastructure for the masses. In his spare time, he likes playing guitar and exploring the PNW.