Fast CNN Tuning with AWS GPU Instances and SigOpt

Update: SigOpt was acquired by Intel in October of 2020

By Steven Tartakovsky, Michael McCourt, and Scott Clark of SigOpt

Compared with traditional machine learning models, neural networks are computationally more complex and introduce many additional parameters. This often prevents machine learning engineers and data scientists from getting the best performance from their models. In some cases, it might even dissuade data scientists from using neural networks.

In this post, we show how to tune a Convolutional Neural Network (CNN) for a Natural Language Processing (NLP) task 400 times faster than with traditional random search on a CPU. This method also achieves greater accuracy. We accomplish this by using the combined power of SigOpt and NVIDIA GPUs on AWS. To replicate the technical portions of this post, use the associated instructions and code on GitHub.

How MXNet, GPU-enabled AWS P2 instances, and SigOpt work

MXNet is a deep learning framework that machine learning engineers and data scientists can use to quickly create sophisticated deep learning models. MXNet makes it easy to use NVIDIA GPU-enabled AWS P2 instances, which significantly speed up training neural network models. In our example, we observed a 50x decrease in training time compared to training on a CPU. This reduces the average time to train the neural network in this example from 2 hours to less than 3 minutes!

In complex machine learning models and data processing pipelines, like the NLP CNN described in this post, many parameters determine how effective a predictive model will be. Choosing these parameters, fitting the model, and determining how well the model performs is a time-consuming, trial-and-error process called hyperparameter optimization or, more generally, model tuning. Black-box optimization tools like SigOpt increase the efficiency of hyperparameter optimization without introspecting the underlying model or data. SigOpt wraps the underlying pipeline and optimizes the parameters to maximize some metric, such as accuracy.

Although you need domain expertise to prepare data, generate features, and select metrics, you don’t need special knowledge of the problem domain for hyperparameter tuning. SigOpt can significantly speed up and reduce the cost of this tuning step compared to standard hyperparameter tuning approaches like random search and grid search. In our example, SigOpt achieves better results with 10x fewer model trainings than random search. Combined with the decreased training time from using NVIDIA GPU-enabled AWS P2 instances, this results in a total speedup in model tuning of over 400x.

What we did

To show how these tools can get you faster results, we ran them on a sentiment analysis task. We used an open dataset of 10,662 labeled movie reviews from Rotten Tomatoes to predict whether the review is positive (4 or 5) or negative (1 or 2).

We performed the following tasks:

Randomly split the data into a training set (9,662 reviews) and a validation set (1,000 reviews).
Embedded the vocabulary of the entire dataset (as word2vec does).
Trained a CNN using a specific architecture and set of hyperparameters.
Evaluated the predictive performance of the model on the validation set.
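
As an illustration of the first step, here is a minimal sketch of a fixed random split using only the Python standard library. The function name `split_reviews`, the seed, and the placeholder records are assumptions made for this example; they are not taken from the code on GitHub.

```python
import random

def split_reviews(reviews, n_validation=1000, seed=0):
    """Shuffle once with a fixed seed, then hold out n_validation items."""
    rng = random.Random(seed)
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    # The first n_validation shuffled items form the validation set.
    return shuffled[n_validation:], shuffled[:n_validation]

# Placeholder (text, label) pairs standing in for the real reviews.
reviews = [("review %d" % i, i % 2) for i in range(9662 + 1000)]
train, validation = split_reviews(reviews)
print(len(train), len(validation))  # 9662 1000
```

Fixing the seed keeps the training and validation sets identical across every tuning run, which is what makes the optimization strategies comparable.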

In a production setting, a more robust workflow is critical to avoid overfitting of hyperparameters. Cross-validation and adding Gaussian noise to your dataset are some common techniques for avoiding overfitting to any one dataset. To focus only on hyperparameter optimization, we keep the training and validation sets fixed. For best practices for parameter optimization, see this blog.

Using GPUs for 50X faster performance

The first factor of the combined speed up is optimizing hardware. We used Amazon EC2’s P2 instances. A single NVIDIA K80 GPU increases training speed approximately 50X compared to the standard distributed CPU workflow:

NVIDIA GPU	vCPU
3 seconds per epoch	146 seconds per epoch
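
The per-epoch timings translate into the headline numbers with simple arithmetic. The sketch below assumes the tutorial default of 50 epochs per training run; the variable names are ours.

```python
CPU_SECONDS_PER_EPOCH = 146
GPU_SECONDS_PER_EPOCH = 3
DEFAULT_EPOCHS = 50  # default epoch count of the tutorial configuration

speedup = CPU_SECONDS_PER_EPOCH / GPU_SECONDS_PER_EPOCH
cpu_hours = DEFAULT_EPOCHS * CPU_SECONDS_PER_EPOCH / 3600
gpu_minutes = DEFAULT_EPOCHS * GPU_SECONDS_PER_EPOCH / 60

print(f"speedup: {speedup:.1f}x")            # speedup: 48.7x
print(f"CPU: {cpu_hours:.2f} hours")         # CPU: 2.03 hours
print(f"GPU: {gpu_minutes:.1f} minutes")     # GPU: 2.5 minutes
```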

Hyperparameter tuning

We tune hyperparameters in the following categories:

data preprocessing (in other words, translating words to vectors)
CNN architecture definition (for example, filter sizes, number of filters)
stochastic gradient descent parameters (for example, learning rate)
regularization (for example, dropout probability)

In the preprocessing step, we embed all of the words in the dataset into a lower dimensional space of a certain size (similar to what word2vec does). The size of this space is a parameter to be tuned.

The architecture of the CNN contains many tunable parameters. Filter sizes represent an interpretation of the reviews: each one corresponds to the size of the sentence fragment that will be analyzed. In computational linguistics, this is known as n-gram size. This CNN uses three different filter sizes, which represent potentially different n-gram sizes. The number of filters per filter size corresponds to the depth of the filter. Each filter attempts to learn something different from the sentence structure. In the convolutional layer, the activation function is a rectified linear unit and the pooling type is max pooling. The results are then concatenated into a one-dimensional vector, and the last layer is fully connected onto a 2-dimensional output. This output corresponds to the binary classification, to which the softmax function is applied.
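
To make the shapes concrete, the following sketch computes how many positions each filter covers and how long the concatenated vector is. It assumes an unpadded ("valid") convolution and a hypothetical padded sentence length of 56 tokens; neither detail is confirmed by this post, and the function names are ours.

```python
def conv_output_length(sentence_length, filter_size):
    """Number of positions a filter of this width can occupy without padding."""
    return sentence_length - filter_size + 1

def pooled_vector_length(filter_sizes, num_feature_maps):
    """Max pooling keeps one value per feature map for each filter size."""
    return len(filter_sizes) * num_feature_maps

sentence_length = 56        # hypothetical padded sentence length
filter_sizes = (3, 4, 5)    # default filter sizes
num_feature_maps = 100      # default number of feature maps

for size in filter_sizes:
    print(size, conv_output_length(sentence_length, size))  # 3 54, 4 53, 5 52

# Length of the vector fed to the fully connected layer.
print(pooled_vector_length(filter_sizes, num_feature_maps))  # 300
```

Because max pooling reduces each feature map to a single value, the size of the concatenated vector depends only on the number of filter sizes and feature maps, not on the sentence length.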

We use an implementation of the RMSProp (Root Mean Square Propagation) method of gradient descent, provided by MXNet. Hyperparameters include: Learning Rate, Batch Size, Max Grad Norm, and Epochs.

With neural networks, regularization is an extremely important consideration. With a vocabulary size of approximately 20k and a review count of 10k, the raw data is very sparse. The main regularization hyperparameter that we use is dropout at the penultimate layer, with a default value of 0.5. This represents the proportion of the nodes that will not “fire” at each training cycle. For more information on dropout, see this paper, coauthored by Geoffrey Hinton.
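
The effect of the dropout probability can be sketched in a few lines of plain Python. This is an illustration of the general technique, not the MXNet implementation used in the experiments; rescaling the surviving activations by 1 / (1 - p) is a common convention that we assume here.

```python
import random

def dropout(activations, drop_prob, rng):
    """Zero each activation with probability drop_prob and rescale the rest."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
activations = [1.0] * 10000
dropped = dropout(activations, drop_prob=0.5, rng=rng)

# Roughly half of the nodes are silenced on this training cycle.
zero_fraction = dropped.count(0.0) / len(dropped)
print(round(zero_fraction, 2))
```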

The following table lists the hyperparameters and the values they can take. The default column stems from the code in the original MXNet tutorial. The Cartesian product of the intervals given by the low and high values for each hyperparameter defines the hyperparameter space over which the optimization occurs.

Category	Hyperparameter	Type	Defaults	Low	High	Description
Preprocessing	`embed_dim`	integer	300	100	500	The dimensionality of space in which to embed all words in corpus (similar to word2vec)
Gradient descent	`learning rate`	real	0.0005	1.00E-08	1.00E+00	The step size in gradient descent
Gradient descent	`batch_size`	integer	50	10	100	The mini-batch size
Gradient descent	`max_grad_norm`	real	5	1	10	The threshold under which to scale the gradient norm; used to prevent exploding gradients
Gradient descent	`epochs`	integer	50	5	100	The number of passes through the training data
Regularization	`dropout`	real	0.5	0.1	0.9	The fraction of the random parameter updates to ignore per iteration
Architecture	`filter_size_1`	integer	3	1	7	The first CNN layer filter size
Architecture	`filter_size_2`	integer	4	1	7	The second CNN layer filter size
Architecture	`filter_size_3`	integer	5	1	7	The third CNN layer filter size
Architecture	`num_feature_maps`	integer	100	10	200	The number of feature maps between layers
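
The table can be written down as a search space in code. The sketch below is a plain-Python representation, not a call into any client library. The name/type/bounds layout is modeled on how we understand SigOpt parameter definitions and should be checked against the API reference. The identifier `learning_rate` and the helper names are ours, and uniform sampling is a simplification (a real run might sample the learning rate on a log scale).

```python
import random

# (name, type, default, low, high), copied from the table above.
SEARCH_SPACE = [
    ("embed_dim", "int", 300, 100, 500),
    ("learning_rate", "double", 0.0005, 1e-8, 1.0),
    ("batch_size", "int", 50, 10, 100),
    ("max_grad_norm", "double", 5.0, 1.0, 10.0),
    ("epochs", "int", 50, 5, 100),
    ("dropout", "double", 0.5, 0.1, 0.9),
    ("filter_size_1", "int", 3, 1, 7),
    ("filter_size_2", "int", 4, 1, 7),
    ("filter_size_3", "int", 5, 1, 7),
    ("num_feature_maps", "int", 100, 10, 200),
]

def to_parameter(name, kind, low, high):
    """Build one parameter definition as a plain dictionary."""
    return {"name": name, "type": kind, "bounds": {"min": low, "max": high}}

def sample_configuration(space, rng):
    """Draw one configuration uniformly from the box defined by the bounds."""
    config = {}
    for name, kind, _default, low, high in space:
        if kind == "int":
            config[name] = rng.randint(low, high)  # inclusive on both ends
        else:
            config[name] = rng.uniform(low, high)
    return config

parameters = [to_parameter(n, k, lo, hi) for n, k, _d, lo, hi in SEARCH_SPACE]
print(len(parameters))  # 10
print(sample_configuration(SEARCH_SPACE, random.Random(0)))
```

Drawing many configurations with `sample_configuration` is, in essence, the random search baseline that the rest of this post compares against.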

For an explanation of the inner workings of this CNN, see WildML. Some follow-on work focused exclusively on the tuning of hyperparameters of CNNs in NLP. The following excerpt from that paper is especially relevant:

…From the results, one can see that each dataset has its own optimal filter region size. Practically, this suggests performing a coarse grid search over a range of region sizes; the figure here suggests that a reasonable range for sentence classification might be from 1 to 10. However, for datasets comprising longer sentences, such as CR (maximum sentence length is 105, whereas it ranges from 36-56 on the other sentiment datasets used here), the optimal region size may be larger.

….

We recognize that manual and grid search over hyperparameters is sub-optimal, and note that our suggestions here may also inform hyperparameter ranges to explore in random search or Bayesian optimization frameworks.

Data scientists and machine learning engineers often implement complex models and training configurations from the literature (or open source), apply them to new data, and become frustrated with the suboptimal predictive performance. Different datasets have different optimal hyperparameter configurations, so tuning is required to get the most out of a model.

Training and evaluation

SigOpt efficiently suggests different hyperparameter configurations based on feedback on the performance of previous configurations. The model is trained with a proposed configuration, it is evaluated on some validation set, and the performance is reported to SigOpt. This process is repeated as SigOpt trades off exploration (learning more about different configurations) and exploitation (leveraging previous knowledge to achieve better results).
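
The suggest, train, evaluate, report cycle can be sketched as a simple loop. Everything in this example is a stand-in: `RandomSuggester` proposes uniform random points and is not SigOpt's optimizer or API, and `train_and_evaluate` is a hypothetical toy function that replaces actual CNN training.

```python
import random

class RandomSuggester:
    """Stand-in optimizer: proposes random points and remembers the best one."""

    def __init__(self, bounds, seed=0):
        self.bounds = bounds
        self.rng = random.Random(seed)
        self.best = None  # (value, config) of the best observation so far

    def suggest(self):
        return {name: self.rng.uniform(low, high)
                for name, (low, high) in self.bounds.items()}

    def observe(self, config, value):
        if self.best is None or value > self.best[0]:
            self.best = (value, config)

def train_and_evaluate(config):
    """Hypothetical stand-in for training the CNN and scoring it on validation data."""
    return 1.0 - (config["dropout"] - 0.5) ** 2

optimizer = RandomSuggester({"dropout": (0.1, 0.9)})
for _ in range(40):
    config = optimizer.suggest()           # 1. get a proposed configuration
    accuracy = train_and_evaluate(config)  # 2. train and evaluate the model
    optimizer.observe(config, accuracy)    # 3. report the result back

best_value, best_config = optimizer.best
print(best_value, best_config)
```

A model-based optimizer differs from this stand-in in step 1: it uses the reported observations to decide where to sample next, rather than ignoring them.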

Time and cost savings

Throughout this example, we used the same training and validation sets. We compared the performance each method was able to achieve and the corresponding computational costs.

Because training CNNs can be parallelized and accessing GPU-enabled AWS P2 instances is so easy, we were able to test multiple optimization strategies in two different scenarios. The Complex scenario allows tuning the model architecture and the preprocessing and stochastic gradient descent parameters. This expands the model configuration space. In the Basic scenario, we tune only the preprocessing and stochastic gradient descent parameters.

	SigOpt	Random Search	Grid Search
Basic	240 trials	2400 trials	729 trials
Complex	400 trials	4000 trials	59049 trials

When using the default hyperparameters, accuracy on the validation set was 75.7%. Under the Basic scenario (without architecture tuning), SigOpt reached 80.4% accuracy on the validation set after 240 model trainings, and 81.0% in the Complex scenario with 400 model trainings. Random search attained only 79.9% accuracy after 2400 model trainings, and 80.1% accuracy after 4000 model trainings. Grid search resulted in 79.3% accuracy after 729 model trainings.

*Absolute Performance*	SigOpt	Random Search	Grid Search
Basic	.804	.799	.793
Complex	.810	.801	Not feasible

Compared with the default settings, SigOpt improved validation accuracy by roughly 5 percentage points (4.7 in the Basic scenario and 5.3 in the Complex scenario), and achieved these results with far fewer trials than grid and random search.

*Relative Performance*	SigOpt	Random Search	Grid Search
Basic Scenario	+6.2%	+5.6%	+4.8%
Complex Scenario	+7.0%	+5.8%	Not feasible
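
The relative figures are the tuned accuracy divided by the default accuracy, expressed as a percentage gain. The sketch below reproduces them from the unrounded accuracies listed in the detailed results; the function and dictionary names are ours.

```python
DEFAULT_ACCURACY = 75.70  # validation accuracy with default hyperparameters

TUNED_ACCURACY = {
    ("Basic", "SigOpt"): 80.40,
    ("Basic", "Random Search"): 79.94,
    ("Basic", "Grid Search"): 79.30,
    ("Complex", "SigOpt"): 81.00,
    ("Complex", "Random Search"): 80.12,
}

def relative_gain(accuracy, baseline=DEFAULT_ACCURACY):
    """Percentage improvement of accuracy over the baseline."""
    return (accuracy / baseline - 1.0) * 100.0

for (scenario, method), accuracy in TUNED_ACCURACY.items():
    print(f"{scenario:8s} {method:14s} +{relative_gain(accuracy):.1f}%")
```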

The number of configurable parameters increased from 6 in the Basic scenario to 10 in the Complex scenario. SigOpt is capable of effectively tuning this joint space in a linear number of steps (as opposed to exponential). We iterated through the optimization loop 40 times per parameter in each scenario (240 trials for 6 parameters and 400 trials for 10). A coarse grid search requires exponentially more evaluations (3^6 = 729 for Basic and 3^10 = 59,049 for Complex).
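
The trial counts in the earlier table follow from the number of parameters. This sketch reads the "40 times" figure as 40 iterations per parameter, which is our interpretation but matches the reported 240 and 400 trials, and it takes the random search budget as 10 times the SigOpt budget, as in the table.

```python
def grid_trials(num_parameters, points_per_parameter=3):
    """A coarse grid grows exponentially with the number of parameters."""
    return points_per_parameter ** num_parameters

def sigopt_trials(num_parameters, iterations_per_parameter=40):
    """A linear budget: a fixed number of iterations per parameter."""
    return iterations_per_parameter * num_parameters

for scenario, num_parameters in (("Basic", 6), ("Complex", 10)):
    linear = sigopt_trials(num_parameters)
    print(scenario, linear, 10 * linear, grid_trials(num_parameters))
# Basic 240 2400 729
# Complex 400 4000 59049
```

Going from 6 to 10 parameters increases the linear budget by a factor of about 1.7, while the grid grows by a factor of 81.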

As for cost, model training on the NVIDIA-K80-enabled p2.xlarge AWS instances (90¢/hour) was $0.05 per iteration, and a bit over $2.50 on the m4.4xlarge CPU instance (86.2¢/hour).

Cost increases significantly for random search and grid search, without any performance gain. After 2400 random trials on the Basic scenario and 4000 on the Complex scenario, random search still hadn’t beaten SigOpt’s performance. Random search yielded 79.9% accuracy and 80.1% accuracy, respectively, but cost 8 times as much as the SigOpt trials. Grid search was more expensive than SigOpt and produced worse results than random search. Moreover, because of the exponential increase in configurations with grid search, we didn’t run it for the Complex scenario.

*Basic Scenario*	SigOpt	Random Search	Grid Search
NVIDIA K80 GPU	$11 ~ 80.4% acc.	$91 ~ 79.9% acc.	$28 ~ 79.3% acc.
vCPU	$576	$4639	$1401

There are many benefits to using SigOpt and NVIDIA GPUs. SigOpt provides results at 1/8 the cost of random search. Training on the K80 GPU costs roughly 2% of what training on CPUs costs (as observed during the random search). Using SigOpt and GPUs together provides results with only $11 of compute cost, while random search on standard CPU infrastructure was over 400 times more time-consuming and expensive.
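
These ratios can be checked directly against the Basic scenario costs reported in the detailed results. The constant names below are ours.

```python
# Basic scenario compute costs in dollars, from the detailed results.
SIGOPT_GPU_COST = 11.35
RANDOM_GPU_COST = 91.29
RANDOM_CPU_COST = 4638.86

# Random search versus SigOpt on the same GPU hardware.
random_vs_sigopt = RANDOM_GPU_COST / SIGOPT_GPU_COST
# GPU cost as a percentage of CPU cost for the same random search.
gpu_share_of_cpu = 100 * RANDOM_GPU_COST / RANDOM_CPU_COST
# Random search on CPU versus SigOpt on GPU.
combined = RANDOM_CPU_COST / SIGOPT_GPU_COST

print(f"{random_vs_sigopt:.1f}x, {gpu_share_of_cpu:.1f}%, {combined:.0f}x")
```

The three values come out at about 8x, about 2%, and just over 400x, matching the claims above.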

Detailed results

We calculated cost based on two comparable instance types and the total number of epochs involved in each respective optimization run. The CPU instance was an m4.4xlarge instance with no GPUs, which cost $0.862 per hour. This resulted in an average model training speed of 146 seconds per epoch. The GPU instance was a p2.xlarge instance with a single NVIDIA K80 GPU, which cost $0.90 per hour and had an average training speed of 3 seconds per epoch. We used a custom, public AMI (ami-193e860f) for both instances.

Experiment Type	Accuracy	Trials	Epochs	CPU Time	CPU Cost	GPU Time	GPU Cost	Link
Default (No Tuning)	75.70	1	50	2 hours	$1.82	0 hours	$0.04	NA
Grid Search (SGD Only)	79.30	729	38394	64 days	$1401.38	32 hours	$27.58	here
Random Search (SGD Only)	79.94	2400	127092	214 days	$4638.86	106 hours	$91.29	here
SigOpt Search (SGD Only)	80.40	240	15803	27 days	$576.81	13 hours	$11.35	here
Grid Search (SGD + Architecture)	Not Feasible	59049	3109914	5255 days	$113511.86	107 days	$2233.95	NA
Random Search (SGD + Architecture)	80.12	4000	208729	353 days	$7618.61	174 hours	$149.94	here
SigOpt Search (SGD + Architecture)	81.00	400	30060	51 days	$1097.19	25 hours	$21.59	here
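
The cost columns follow from epochs, time per epoch, and hourly price. The sketch below applies that formula with the rounded per-epoch timings quoted above. It lands within roughly 5% of the table values, presumably because the table was computed from unrounded timings. The function name is ours.

```python
def compute_cost(epochs, seconds_per_epoch, dollars_per_hour):
    """Total instance cost for a run with the given number of epochs."""
    hours = epochs * seconds_per_epoch / 3600
    return hours * dollars_per_hour

CPU = (146, 0.862)  # seconds per epoch, dollars per hour (m4.4xlarge)
GPU = (3, 0.90)     # seconds per epoch, dollars per hour (p2.xlarge)

# Default configuration: 1 trial of 50 epochs.
default_cpu = compute_cost(50, *CPU)
default_gpu = compute_cost(50, *GPU)
print(f"${default_cpu:.2f} ${default_gpu:.2f}")

# SigOpt Search (SGD Only): 15803 epochs in total.
sigopt_cpu = compute_cost(15803, *CPU)
sigopt_gpu = compute_cost(15803, *GPU)
print(f"${sigopt_cpu:.2f} ${sigopt_gpu:.2f}")
```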

Analysis

We analyzed how well SigOpt performs against random and grid search by fixing the number of model tuning attempts and seeing how each optimization method performed, on average.

This graph shows the evolution of optimizer performance as each optimization progresses. The interquartile range is plotted along with the median behavior as observed over 20 instances for each strategy. SigOpt consistently outperforms the other strategies.

SigOpt significantly outperforms random search, and the difference becomes more pronounced at higher computational budgets. Namely, the middle range of values that SigOpt reported for this experiment was [79.39, 80.30], and the middle range of values for random search was [79.17, 79.76], providing a 0.6% difference in the median of values.

Conclusion

When designing the modeling process, you have to make significant choices. Although we usually think of neural network hyperparameters only in the context of stochastic gradient descent, SigOpt can also suggest different model architecture parameters. This increases the complexity of the configuration, ultimately providing better results than with a Basic configuration.

At a very high level, while you can configure complex processes, simulations, machine learning pipelines, and neural networks and determine how well they perform, it’s expensive and time-consuming to evaluate configuration choices. SigOpt provides a cost-effective way to do this.

MXNet makes it easy to create and use neural network models. Easy access to high quality compute resources, like AWS instances with NVIDIA GPUs, allows you to quickly train and evaluate your models. Tools like SigOpt enable you to quickly and easily perform model tuning tasks, so that you have the very best model to work with as quickly as possible.

Next steps:

Sign up for a SigOpt account in the AWS Marketplace.
Run the experiments from this post using the example code on GitHub.
Learn more about the research behind SigOpt.