AWS News Blog

Amazon SageMaker Debugger – Debug Your Machine Learning Models

Today, we’re extremely happy to announce Amazon SageMaker Debugger, a new capability of Amazon SageMaker that automatically identifies complex issues developing in machine learning (ML) training jobs.

Building and training ML models is a mix of science and craft (some would even say witchcraft). From collecting and preparing data sets to experimenting with different algorithms to figuring out optimal training parameters (the dreaded hyperparameters), ML practitioners need to clear quite a few hurdles to deliver high-performance models. This is the very reason why we built Amazon SageMaker: a modular, fully managed service that simplifies and speeds up ML workflows.

As I keep finding out, ML seems to be one of Mr. Murphy’s favorite hangouts, and everything that may possibly go wrong often does! In particular, many obscure issues can happen during the training process, preventing your model from correctly extracting and learning patterns present in your data set. I’m not talking about software bugs in ML libraries (although they do happen too): most failed training jobs are caused by an inappropriate initialization of parameters, a poor combination of hyperparameters, a design issue in your own code, etc.

To make things worse, these issues are rarely visible immediately: they grow over time, slowly but surely ruining your training process and yielding low-accuracy models. Let’s face it, even if you’re a bona fide expert, it’s devilishly difficult and time-consuming to identify them and hunt them down, which is why we built Amazon SageMaker Debugger.

Let me tell you more.

Introducing Amazon SageMaker Debugger
In your existing training code for TensorFlow, Keras, Apache MXNet, PyTorch and XGBoost, you can use the new SageMaker Debugger SDK to save internal model state at periodic intervals; as you can guess, it will be stored in Amazon Simple Storage Service (Amazon S3).

This state is composed of:

  • The parameters being learned by the model, e.g. weights and biases for neural networks,
  • The changes applied to these parameters by the optimizer, aka gradients,
  • The optimization parameters themselves,
  • Scalar values, e.g. accuracies and losses,
  • The output of each layer,
  • Etc.

Each specific set of values – say, the sequence of gradients flowing over time through a specific neural network layer – is saved independently, and referred to as a tensor. Tensors are organized in collections (weights, gradients, etc.), and you can decide which ones you want to save during training. Then, using the SageMaker SDK and its estimators, you configure your training job as usual, passing additional parameters defining the rules you want SageMaker Debugger to apply.
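
For example, which collections to save and how often to save them can also be declared on the estimator side with the SageMaker Python SDK. Here is a minimal sketch using DebuggerHookConfig and CollectionConfig; the bucket path and save interval are illustrative assumptions, and depending on the framework version this may require a Debugger-enabled training container.

from sagemaker.debugger import DebuggerHookConfig, CollectionConfig

# Illustrative values: the bucket path and save interval are assumptions
hook_config = DebuggerHookConfig(
    s3_output_path='s3://my-bucket/debugger-tensors',
    collection_configs=[
        CollectionConfig(name='weights'),
        CollectionConfig(name='gradients', parameters={'save_interval': '10'}),
        CollectionConfig(name='losses')
    ]
)

# The configuration is then passed to the estimator, e.g.
# estimator = TensorFlow(..., debugger_hook_config=hook_config)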

A rule is a piece of Python code that analyzes tensors for the model in training, looking for specific unwanted conditions. Pre-defined rules are available for common problems such as exploding/vanishing tensors (parameters reaching NaN or zero values), exploding/vanishing gradients, loss not changing, and more. Of course, you can also write your own rules, as sketched below.
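
As a sketch of what a custom rule looks like: it is a class deriving from the smdebug Rule base class that implements invoke_at_step(), returning True when the unwanted condition is met. The threshold value and the use of the gradients collection below are illustrative, and the API names follow the smdebug documentation, which may differ slightly from the exact version available at the time of this post.

from smdebug.rules.rule import Rule

class GradientThresholdRule(Rule):
    def __init__(self, base_trial, threshold=10.0):
        super().__init__(base_trial)
        self.threshold = float(threshold)

    def invoke_at_step(self, step):
        # Trigger the rule if the mean absolute gradient exceeds the threshold
        for tname in self.base_trial.tensor_names(collection="gradients"):
            abs_mean = self.base_trial.tensor(tname).reduction_value(step, "mean", abs=True)
            if abs_mean > self.threshold:
                return True
        return False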

Once the SageMaker estimator is configured, you can launch the training job. Immediately, it fires up a debug job for each rule that you configured, and they start inspecting available tensors. If a debug job detects a problem, it stops and logs additional information. A CloudWatch Events event is also sent, should you want to trigger additional automated steps.
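
For instance, you could route that event to an AWS Lambda function that stops the training job as soon as a rule reports an issue. The sketch below assumes the event detail mirrors the DescribeTrainingJob output (TrainingJobName and DebugRuleEvaluationStatuses fields); treat the exact event shape as an assumption to verify against the CloudWatch Events documentation.

import boto3

sm = boto3.client('sagemaker')

def lambda_handler(event, context):
    # Assumption: the event detail carries the training job name and rule statuses
    detail = event.get('detail', {})
    job_name = detail.get('TrainingJobName')
    statuses = detail.get('DebugRuleEvaluationStatuses', [])

    # Stop the training job if any Debugger rule found issues
    if job_name and any(s.get('RuleEvaluationStatus') == 'IssuesFound' for s in statuses):
        sm.stop_training_job(TrainingJobName=job_name)
        print(f'Stopping training job {job_name}')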

So now you know that your deep learning job suffers from, say, vanishing gradients. With a little brainstorming and experience, you’ll know where to look: maybe the neural network is too deep? Maybe your learning rate is too small? As the internal state has been saved to S3, you can now use the SageMaker Debugger SDK to explore the evolution of tensors over time, confirm your hypothesis and fix the root cause.

Let’s see SageMaker Debugger in action with a quick demo.

Debugging Machine Learning Models with Amazon SageMaker Debugger
At the core of SageMaker Debugger is the ability to capture tensors during training. This requires a little bit of instrumentation in your training code, in order to select the tensor collections you want to save, the frequency at which you want to save them, and whether you want to save the values themselves or a reduction (mean, max, etc.).

For this purpose, the SageMaker Debugger SDK provides simple APIs for each framework that it supports. Let me show you how this works with a simple TensorFlow script, trying to fit a 2-dimensional linear regression model. Of course, you’ll find more examples in this GitHub repository.

Let’s take a look at the initial code:

import argparse
import numpy as np
import tensorflow as tf
import random

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str, help="S3 path for the model")
parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001)
parser.add_argument('--steps', type=int, help="Number of steps to run", default=100)
parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0)

args = parser.parse_args()

with tf.name_scope('initialize'):
    # 2-dimensional input sample
    x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
    # Initial weights: [10, 10]
    w = tf.Variable(initial_value=[[10.], [10.]], name='weight1')
    # True weights, i.e. the ones we're trying to learn
    w0 = [[1], [1.]]
with tf.name_scope('multiply'):
    # Compute true label
    y = tf.matmul(x, w0)
    # Compute "predicted" label
    y_hat = tf.matmul(x, w)
with tf.name_scope('loss'):
    # Compute loss
    loss = tf.reduce_mean((y_hat - y) ** 2, name="loss")

optimizer = tf.train.AdamOptimizer(args.lr)
optimizer_op = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(args.steps):
        x_ = np.random.random((10, 2)) * args.scale
        _loss, opt = sess.run([loss, optimizer_op], {x: x_})
        print(f'Step={i}, Loss={_loss}')

Let’s train this script using the TensorFlow Estimator. I’m using SageMaker local mode, which is a great way to quickly iterate on experimental code.

import sagemaker
from sagemaker.tensorflow import TensorFlow

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000}

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-simple-demo',
    train_instance_count=1,
    train_instance_type='local',
    entry_point='script-v1.py',
    framework_version='1.13.1',
    py_version='py3',
    script_mode=True,
    hyperparameters=bad_hyperparameters)

estimator.fit()

Looking at the training log, I can see that things did not go well.

Step=0, Loss=7.883463958023267e+23
algo-1-hrvqg_1 | Step=1, Loss=9.502028841062608e+23
algo-1-hrvqg_1 | Step=2, Loss=nan
algo-1-hrvqg_1 | Step=3, Loss=nan
algo-1-hrvqg_1 | Step=4, Loss=nan
algo-1-hrvqg_1 | Step=5, Loss=nan
algo-1-hrvqg_1 | Step=6, Loss=nan
algo-1-hrvqg_1 | Step=7, Loss=nan
algo-1-hrvqg_1 | Step=8, Loss=nan
algo-1-hrvqg_1 | Step=9, Loss=nan

Loss does not decrease at all; it explodes to astronomical values and then turns into NaN… This looks like an exploding tensor problem, which is covered by one of the built-in rules in SageMaker Debugger. Let’s get to work.

Using the Amazon SageMaker Debugger SDK
In order to capture tensors, I need to instrument the training script with:

  • A SaveConfig object specifying the frequency at which tensors should be saved,
  • A SessionHook object attached to the TensorFlow session, putting everything together and saving required tensors during training,
  • An (optional) ReductionConfig object, listing tensor reductions that should be saved instead of full tensors,
  • An (optional) optimizer wrapper to capture gradients.

Here’s the updated code, with extra command line arguments for SageMaker Debugger parameters.

import argparse
import numpy as np
import tensorflow as tf
import random
import smdebug.tensorflow as smd

parser = argparse.ArgumentParser()
parser.add_argument('--model_dir', type=str, help="S3 path for the model")
parser.add_argument('--lr', type=float, help="Learning Rate", default=0.001)
parser.add_argument('--steps', type=int, help="Number of steps to run", default=100)
parser.add_argument('--scale', type=float, help="Scaling factor for inputs", default=1.0)
parser.add_argument('--debug_path', type=str, default='/opt/ml/output/tensors')
parser.add_argument('--debug_frequency', type=int, help="How often to save tensor data", default=10)
feature_parser = parser.add_mutually_exclusive_group(required=False)
feature_parser.add_argument('--reductions', dest='reductions', action='store_true', help="save reductions of tensors instead of saving full tensors")
feature_parser.add_argument('--no_reductions', dest='reductions', action='store_false', help="save full tensors")
args = parser.parse_args()

reduc = smd.ReductionConfig(reductions=['mean'], abs_reductions=['max'], norms=['l1']) if args.reductions else None

hook = smd.SessionHook(out_dir=args.debug_path,
                       include_collections=['weights', 'gradients', 'losses'],
                       save_config=smd.SaveConfig(save_interval=args.debug_frequency),
                       reduction_config=reduc)

with tf.name_scope('initialize'):
    # 2-dimensional input sample
    x = tf.placeholder(shape=(None, 2), dtype=tf.float32)
    # Initial weights: [10, 10]
    w = tf.Variable(initial_value=[[10.], [10.]], name='weight1')
    # True weights, i.e. the ones we're trying to learn
    w0 = [[1], [1.]]
with tf.name_scope('multiply'):
    # Compute true label
    y = tf.matmul(x, w0)
    # Compute "predicted" label
    y_hat = tf.matmul(x, w)
with tf.name_scope('loss'):
    # Compute loss
    loss = tf.reduce_mean((y_hat - y) ** 2, name="loss")
    hook.add_to_collection('losses', loss)

optimizer = tf.train.AdamOptimizer(args.lr)
optimizer = hook.wrap_optimizer(optimizer)
optimizer_op = optimizer.minimize(loss)

hook.set_mode(smd.modes.TRAIN)

with tf.train.MonitoredSession(hooks=[hook]) as sess:
    for i in range(args.steps):
        x_ = np.random.random((10, 2)) * args.scale
        _loss, opt = sess.run([loss, optimizer_op], {x: x_})
        print(f'Step={i}, Loss={_loss}')

I also need to modify the TensorFlow Estimator, to use the SageMaker Debugger-enabled training container and to pass additional parameters.

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1}

from sagemaker.debugger import Rule, rule_configs
estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-simple-demo',
    train_instance_count=1,
    train_instance_type='ml.c5.2xlarge',
    image_name=cpu_docker_image_name,
    entry_point='script-v2.py',
    framework_version='1.15',
    py_version='py3',
    script_mode=True,
    hyperparameters=bad_hyperparameters,
    rules = [Rule.sagemaker(rule_configs.exploding_tensor())]
)

estimator.fit()
2019-11-27 10:42:02 Starting - Starting the training job...
2019-11-27 10:42:25 Starting - Launching requested ML instances
********* Debugger Rule Status *********
*
* ExplodingTensor: InProgress 
*
****************************************

Two jobs are running: the actual training job, and a debug job checking for the rule defined in the Estimator. Soon enough, the debug job flags an issue!

Describing the training job, I can get more information on what happened.

import boto3

client = boto3.client('sagemaker')             # SageMaker API client
job_name = estimator.latest_training_job.name  # training job launched above
description = client.describe_training_job(TrainingJobName=job_name)
print(description['DebugRuleEvaluationStatuses'][0]['RuleConfigurationName'])
print(description['DebugRuleEvaluationStatuses'][0]['RuleEvaluationStatus'])

ExplodingTensor
IssuesFound

Let’s take a look at the saved tensors.

Exploring Tensors
I can easily grab the tensors saved in S3 during the training process.

from smdebug.trials import create_trial

s3_output_path = description["DebugConfig"]["DebugHookConfig"]["S3OutputPath"]
trial = create_trial(s3_output_path)

Let’s list available tensors.

trial.tensors()

['loss/loss:0',
'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0',
'initialize/weight1:0']

All values are numpy arrays, and I can easily iterate over them.

tensor = 'gradients/multiply/MatMul_1_grad/tuple/control_dependency_1:0'
for s in list(trial.tensor(tensor).steps()):
    print("Value: ", trial.tensor(tensor).step(s).value)

Value:  [[1.1508383e+23] [1.0809098e+23]]
Value:  [[1.0278440e+23] [1.1347468e+23]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
Value:  [[nan] [nan]]
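
The gradients explode within a couple of steps. If you prefer a visual check, a saved tensor can also be plotted over training steps, e.g. with matplotlib. Here is a minimal sketch for the loss tensor, reusing the same trial calls shown above.

import matplotlib.pyplot as plt

tensor_name = 'loss/loss:0'
steps = list(trial.tensor(tensor_name).steps())
# Collect the loss value saved at each step
values = [trial.tensor(tensor_name).step(s).value for s in steps]

plt.plot(steps, values, marker='o')
plt.xlabel('Training step')
plt.ylabel('Loss')
plt.show()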

As tensor names include the TensorFlow scope defined in the training code, I can easily see that something is wrong with my matrix multiplication.

# Compute true label
y = tf.matmul(x, w0)
# Compute "predicted" label
y_hat = tf.matmul(x, w)

Digging a little deeper, I can see that the x input is multiplied by a scaling parameter, which I set to 100000000000 in the Estimator. The learning rate doesn’t look sane either. Bingo!

x_ = np.random.random((10, 2)) * args.scale

bad_hyperparameters = {'steps': 10, 'lr': 100, 'scale': 100000000000, 'debug_frequency': 1}

As you probably knew all along, setting these hyperparameters to more reasonable values will fix the training issue.
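
For instance, re-running the same Estimator with values that mirror the defaults defined in the training script should keep the ExplodingTensor rule quiet; here is a quick sketch.

good_hyperparameters = {'steps': 100, 'lr': 0.001, 'scale': 1.0, 'debug_frequency': 10}

estimator = TensorFlow(
    role=sagemaker.get_execution_role(),
    base_job_name='debugger-simple-demo',
    train_instance_count=1,
    train_instance_type='ml.c5.2xlarge',
    image_name=cpu_docker_image_name,
    entry_point='script-v2.py',
    framework_version='1.15',
    py_version='py3',
    script_mode=True,
    hyperparameters=good_hyperparameters,
    rules=[Rule.sagemaker(rule_configs.exploding_tensor())]
)

estimator.fit()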

Now Available!
We believe Amazon SageMaker Debugger will help you find and solve training issues quicker, so it’s now your turn to go bug hunting.

Amazon SageMaker Debugger is available today in all commercial regions where Amazon SageMaker is available. Give it a try and please send us feedback, either on the AWS forum for Amazon SageMaker, or through your usual AWS support contacts.

- Julien

Julien Simon

As an Artificial Intelligence & Machine Learning Evangelist for EMEA, Julien focuses on helping developers and enterprises bring their ideas to life.