Amazon SageMaker automatic model tuning produces better models, faster

Amazon SageMaker recently released a feature that allows you to automatically tune the hyperparameter values of your machine learning model to produce more accurate predictions. Hyperparameters are user-defined settings that dictate how an algorithm should behave during training. Examples include how large a decision tree should be grown, the number of clusters desired from a segmentation, or how much you should incrementally update neural network weights as you iterate through the data.

Selecting the right hyperparameter values for your machine learning model is important because it can have an enormous impact on final accuracy and performance. However, the process of setting hyperparameter values can be difficult. The right answer depends on your data. Some algorithms have many different hyperparameters that can be tweaked. Some hyperparameters are more sensitive than others to the values selected. Most have a non-linear relationship between model fit and hyperparameter values.

There are a variety of strategies for selecting hyperparameter values. Some scientists use domain knowledge, heuristics, intuition, or manual experimentation. Other scientists use brute force searches. Some build meta models to predict what hyperparameter values might perform well. To learn more about this approach, see Shahriari et al. (2016). Regardless of the method, the selection of hyperparameter values often requires a specialized skill set that could instead be focused on solving a new machine learning problem.

Amazon SageMaker automatic model tuning eases this task. It uses Gaussian Process regression to predict which hyperparameter values might be most effective at improving fit. It also uses Bayesian optimization to balance exploring the hyperparameter space and exploiting specific hyperparameter values when appropriate. And importantly, automatic model tuning can be used with the Amazon SageMaker built-in algorithms, pre-built deep learning frameworks, and bring-your-own-algorithm containers.

In this blog post, we’ll introduce how to perform hyperparameter tuning within Amazon SageMaker and compare two methods: automatic model tuning and random search. Despite sounding naive, randomly selecting hyperparameter values is often competitive with cutting edge methods and serves as a good baseline both for final fit and the number of jobs needed to reach that.

Problem overview

This post walks through our overall approach. For more details and a full step-by-step walkthrough, see the corresponding example notebook.

The dataset we use is the CIFAR-10 image dataset, which is a common computer vision benchmark. It consists of 60K 32×32 pixel color images (50K train, 10K test) evenly distributed across 10 classes. Its size is small enough to quickly download and train in a single instance. However, the task is challenging enough to highlight differences across a variety of approaches.

We’ll use the Amazon SageMaker MXNet container to train a ResNet-34 convolutional image classification neural network, using stochastic gradient descent to find our network weights. We fix epochs at 50 and mini-batch size at 1,024 on a single ml.p3.8xlarge instance for all job runs. Our goal will be to show how much holdout classification accuracy can improve through hyperparameter tuning, and to see how each tuning method performs.

Please note that running this example end-to-end yourself will cost about $400.

Training

Let’s start simple and train our model once using the default hyperparameter values. We’re using stochastic gradient descent (SGD) to train our convolutional neural network (CNN), which is an iterative method to minimize our training loss by finding the direction to change our network weights that improves training loss. Then SGD makes a small update to the weights in that direction and repeats. We focus on these three hyperparameters:

learning_rate: which controls how large updates will be that we make to the weights.
momentum: which uses information from the direction of our previous update to inform our current update. The default value of 0 means weight updates are based only on the information in the current batch.
wd: which penalizes weights when they grow too large. The default value of 0 means no penalty.

To start a training job on Amazon SageMaker, we’ll create an MXNet estimator and pass in information like the following:

The training script (cifar10.py is standard MXNet Gluon code to define and train our model)
IAM role
Hardware setup
Hyperparameters (the default values for SGD in MXNet)

from sagemaker.mxnet import MXNet

m = MXNet('cifar10.py', 
          role=role, 
          train_instance_count=1, 
          train_instance_type='ml.p3.8xlarge',
          hyperparameters={'batch_size': 1024, 
                           'epochs': 50, 
                           'learning_rate': 0.01, 
                           'momentum': 0.,
                           'wd': 0.})
m.fit('s3://sagemaker-<region>-<account>/data/DEMO-gluon-cifar10')

The logs for this training job show that with the default hyperparameter values, we only get an accuracy of about 53 percent on our validation dataset. CIFAR-10 can be challenging, but we’d like our accuracy better than being right half the time.

Random search

One method of hyperparameter tuning that performs surprisingly well considering that it’s very simple, is to try sets of independent training jobs across a variety of random hyperparameter values within set ranges. For our use case, we’ve created a helper script random_tuner.py to help us do this. You can review the raw code for details, but at its core, we’ll need to supply the following:

A function that trains our MXNet model given a job name and list of hyperparameters. Note, wait is set to false in our fit() call so that we can train multiple jobs at once.

def fit_random(job_name, hyperparameters):
    m = MXNet('cifar10.py', 
              role=role, 
              train_instance_count=1, 
              train_instance_type='ml.p3.8xlarge',
              hyperparameters=hyperparameters)
    m.fit(inputs, wait=False, job_name=job_name)

A dictionary of hyperparameters that defines the ones we want to tune as one of three types (ContinuousParameter, IntegerParameter, or CategoricalParameter) and provides appropriate minimum and maximum ranges or a list of possible values.

import random_tuner as rt

hyperparameters = {'batch_size': 1024,
                   'epochs': 50,
                   'learning_rate': rt.ContinuousParameter(0.001, 0.5),
                   'momentum': rt.ContinuousParameter(0., 0.99),
                   'wd': rt.ContinuousParameter(0., 0.001)}

Next, we can kick off our random search. We’ll define the total number of training jobs to be 120, with up to 8 jobs run in parallel. Theoretically, we could run all 120 jobs at once without any loss in performance. However, there’s some risk in starting that many jobs at once. Even if we successfully tested our code once, it’s possible that different hyperparameter values would exhaust system resources or fail catastrophically for some other reason. If that ended up being the case for a large portion of our hyperparameter space, we might end up paying for a lot of failed jobs. Running 8 jobs is still ambitious, but it allows us to track this and stop the search part way if needed.

jobs = rt.random_search(fit_random,
                        hyperparameters,
                        max_jobs=120,
                        max_parallel_jobs=8)

Summarizing the top and bottom performing jobs we see there’s a huge variation in accuracy. Had we initially (unknowingly) set our learning rate near 0.5, momentum at 0.15, and weight decay to 0.0004, we would have an accuracy just over 20 percent (randomly guessing image classes would produce 10% accuracy). However, we also found many successful hyperparameter value combinations, and reached a peak validation accuracy of 73.7 percent.

Automatic model tuning

Now, let’s see how Amazon SageMaker automatic model tuning compares to random search. We’ll use the tuner functionality already built-into the Amazon SageMaker Python SDK. We’ll define a new MXNet estimator (specifying pre-defined batch_size and epochs, but leaving out the hyperparameters we want to tune).

mt = MXNet('cifar10.py', 
           role=role, 
           train_instance_count=1, 
           train_instance_type='ml.p3.8xlarge',
           hyperparameters={'batch_size': 1024, 
                            'epochs': 50})

Then let’s define our ranges for the hyperparameters we do want to tune.

from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter

hyperparameter_ranges = {'learning_rate': ContinuousParameter(0.001, 0.5),
                         'momentum': ContinuousParameter(0., 0.99),
                         'wd': ContinuousParameter(0., 0.001)}

Finally, we’ll define our objective metric and provide the regex needed to scrape it from our training jobs’ logs in Amazon CloudWatch Logs.

objective_metric_name = 'Validation-accuracy'
metric_definitions = [{'Name': 'Validation-accuracy',
                       'Regex': 'validation: accuracy=([0-9\\.]+)'}]

We pass these into our HyperparameterTuner class and run .fit() on that. This will submit 30 training jobs, two at a time, and Amazon SageMaker will orchestrate and run everything in the background for you. Notice, we used the same 15x ratio for total training jobs to parallel jobs as in random search. This means both should take about the same time to complete. Because automatic model tuning uses previous jobs’ run performance to inform its future hyperparameter values, running fewer jobs at a time can actually produce better model fits.

from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(mt,
                            objective_metric_name,
                            hyperparameter_ranges,
                            metric_definitions,
                            max_jobs=30,
                            max_parallel_jobs=2)

tuner.fit('s3://sagemaker-<region>-<account>/data/DEMO-gluon-cifar10')

Our results show that with one quarter of the total training jobs, Amazon SageMaker automatic model tuning produced a model with better accuracy (74 percent) than random search. Note that on subsequent runs, the overall best accuracy of both methods may change slightly, but in our trials the overall message of “better accuracy in fewer runs” didn’t change.

Conclusion

This blog post shows the importance of hyperparameter tuning. You can use this technique to improve overall model performance. In addition, you can tune using efficient methods like Amazon SageMaker automatic model tuning.

There are ways that we could build on top of or extend our results here. We sought to keep the neural network code straightforward and aligned to our existing examples. However, we could further improve accuracy by moving to a deeper version of ResNet, or trying a different architecture. We could test another tuning method, like grid search. Grid search is an important technique because it’s easy to conceptualize and can help give you an intuition about the hyperparameter space, but it tends to be less effective than random search (Bergstra and Bengio, 2012). You could also try tuning additional hyperparameters. Or, you could use this first round of hyperparameter tuning to inform a secondary round where ranges have been narrowed down further. The most important point to remember is that Amazon SageMaker automatic model tuning works with built-in algorithms, deep learning frameworks, and bring-your-own-container algorithms in Amazon SageMaker, so you can start applying it to whatever model you might be training.

For more information on using Amazon SageMaker automatic model tuning, see the corresponding notebook for this post, our other example notebooks and the documentation.

About the Author

David Arpin is a Data Science Practice Manager for AWS Professional Services. He has managed data science teams within Amazon and worked as a Product Manager on the Amazon SageMaker team.

AWS Machine Learning Blog