Fast CNN Tuning with AWS GPU Instances and SigOpt
Update: SigOpt was acquired by Intel in October of 2020
By Steven Tartakovsky, Michael McCourt, and Scott Clark of SigOpt
Compared with traditional machine learning models, neural networks are computationally more complex and introduce many additional parameters. This often prevents machine learning engineers and data scientists from getting the best performance from their models. In some cases, it might even dissuade data scientists from using neural networks.
In this post, we show how to tune a Convolutional Neural Network (CNN) for a Natural Language Processing (NLP) task 400 times faster than with traditional random search on a CPU. Additionally, this method also achieves greater accuracy. We accomplish this by using the combined power of SigOpt and NVIDIA GPUs on AWS. To replicate the technical portions of this post, use the associated instructions and code on GitHub.
How MXNet, GPU-enabled AWS P2 instances, and SigOpt work
MXNet is a deep learning framework that machine learning engineers and data scientists can use to quickly create sophisticated deep learning models. MXNet makes it easy to use NVIDIA GPU-enabled AWS P2 instances, which significantly speed up training neural network models. In our example, we observed a 50x decrease in training time compared to training on a CPU. This reduces the average time to train the neural network in this example from 2 hours to less than 3 minutes!
In complex machine learning models and data processing pipelines, like the NLP CNN described in this post, many parameters determine how effective a predictive model will be. Choosing these parameters, fitting the model, and determining how well the model performs is a time-consuming, trial-and-error process called hyperparameter optimization or, more generally, model tuning. Black-box optimization tools like SigOpt increase the efficiency of hyperparameter optimization without introspecting the underlying model or data. SigOpt wraps the underlying pipeline and optimizes the parameters to maximize some metric, such as accuracy.
Although you need domain expertise to prepare data, generate features, and select metrics, you don’t need special knowledge of the problem domain for hyperparameter tuning. SigOpt can significantly speed up and reduce the cost of this tuning step compared to standard hyperparameter tuning approaches like random search and grid search. In our example, SigOpt is able to achieve better results with 10x fewer model trainings compared to random search. Combined with the decreased training time from using NVIDIA GPU-enabled AWS P2 instances this results in a total speed up in model tuning of over 400x.
What we did
To show how these tools can get you faster results, we ran them on a sentiment analysis task. We used an open dataset of 10,622 labeled movie reviews from Rotten Tomatoes to predict whether the review is positive (4 or 5) or negative (1 or 2).
We performed the following tasks:
- Randomly split the data into a training set (9,662 reviews) and a validation set (1,000 reviews).
- Embedded the vocabulary of the entire dataset (as word2vec does).
- Trained a CNN using a specific architecture and set of hyperparameters.
- Evaluated the predictive performance of the model on the validation set.
In a production setting, a more robust workflow is critical to avoid overfitting of hyperparameters. Cross-validation and adding Gaussian noise to your dataset are some common techniques for avoiding overfitting to any one dataset. To focus only on hyperparameter optimization, we keep the training and validation sets fixed. For best practices for parameter optimization, see this blog.
Using GPUs for 50X faster performance
The first factor of the combined speed up is optimizing hardware. We used Amazon EC2’s P2 instances. A single NVIDIA K80 GPU increases training speed approximately 50X compared to the standard distributed CPU workflow:
|3 seconds per epoch||146 seconds per epoch|
We tune hyperparameters in the following categories:
- data preprocessing (in other words, translating words to vectors)
- CNN architecture definition (for example, filter sizes, number of filters)
- stochastic gradient descent parameters (for example, learning rate)
- regularization (for example, dropout probability)
In the preprocessing step, we embed all of the words in the dataset into a lower dimensional space of a certain size (similar to what word2vec does). The size of this space is a parameter to be tuned.
The architecture of the CNN contains many tunable parameters. Filter Sizes represent an interpretation of the reviews that correspond to the size of a sentence fragment that will be analyzed. In computational linguistics, this is known as n-gram size. This CNN uses three different filter sizes, which represent potentially different n-gram sizes. The number of filters per filter size corresponds to the depth of the filter. Each filter attempts to learn something different from the sentence structure. In the convolutional layer, the activation function is a rectified linear unit and the pooling type is max pooling. The results are then concatenated into a single dimensional vector, and the last layer is fully connected onto a 2-dimensional output. This corresponds to the binary classification to which the softmax function is applied.
With neural networks, regularization is an extremely important consideration. With a vocabulary size of approximately 20k and a review count of 10k, the raw data is very sparse. The main hyperparameter that we use is dropout at the penultimate layer, with a default value of .5. This represents a proportion of the nodes that will not “fire” at each training cycle. For more information on dropout, see this paper, coauthored by Geoffrey Hinton.
The following table lists the hyperparameters and the values they can take. The default column stems from the code in the original MXNet tutorial. The Cartesian product of the intervals given by the low and high values for each hyperparameter define the hyperparameter space over which the optimization occurs.
|Preprocessing||embed_dim||integer||300||100||500||The dimensionality of space in which to embed all words in corpus (similar to word2vec)|
|Gradient descent||learning rate||real||0.0005||1.00E-08||1.00E+00||The step size in gradient descent|
|Gradient descent||batch_size||integer||50||10||100||The mini-batch size|
|Gradient descent||max_grad_norm||real||5||1||10||The threshold under which to scale the gradient norm; used to prevent exploding gradients|
|Gradient descent||epochs||integer||50||5||100||The number of passes through the training data|
|Regularization||dropout||real||0.5||0.1||0.9||The fraction of the random parameter updates to ignore per iteration|
|Architecture||filter_size_1||integer||3||1||7||The first CNN layer filter size|
|Architecture||filter_size_2||integer||4||1||7||The second CNN layer filter size|
|Architecture||filter_size_3||integer||5||1||7||The third hidden layer filter size|
|Architecture||num_feature_maps||integer||100||10||200||The number of feature maps between layers|
For an explanation on the inner workings of this CNN, see WildML. Some follow-on work focused exclusively on the tuning of hyperparameters of CNNs in NLPs. The following excerpt from that paper is especially relevant:
…From the results, one can see that each dataset has its own optimal filter region size. Practically, this suggests performing a coarse grid search over a range of region sizes; the figure here suggests that a reasonable range for sentence classification might be from 1 to 10. However, for datasets comprising longer sentences, such as CR (maximum sentence length is 105, whereas it ranges from 36- 56 on the other sentiment datasets used here), the optimal region size may be larger.
We recognize that manual and grid search over hyperparameters is sub-optimal, and note that our suggestions here may also inform hyperparameter ranges to explore in random search or Bayesian optimization frameworks.
Data scientists and machine learning engineers often implement complex models and model training configurations from the literature (or open-source), then try applying it to some new data and become frustrated with the suboptimal predictive capacity. Different datasets have different optimal hyperparameter configurations, requiring tuning to get the most out of the models.
Training and evaluation
SigOpt efficiently suggests different hyperparameter configurations based on feedback on the performance of previous configurations. The model is trained with a proposed configuration, it is evaluated on some validation set, and the performance is reported to SigOpt. This process is repeated as SigOpt trades off exploration (learning more about different configurations) and exploitation (leveraging previous knowledge to achieve better results).
Time and cost savings
Throughout this example, we used the same training and validation sets. We compared the performance each method was able to achieve and the corresponding computational costs.
Because training CNNs can be parallelized and accessing GPU-enabled AWS P2 instances is so easy, we were able to test multiple optimization strategies in two different scenarios. The Complex scenario allows tuning the model architecture and the preprocessing and stochastic gradient descent parameters. This expands the model configuration space. In the Basic scenario, we tune only the preprocessing and stochastic gradient descent parameters.
|SigOpt||Random Search||Grid Search|
|Basic||240 trials||2400 trials||729 trials|
|Complex||400 trials||4000 trials||59049 trials|
When using the default hyperparameters, accuracy on the validation set was 75.7. Under the Basic scenario (without architecture tuning), SigOpt reached 80.4% accuracy on the validation set after 240 model trainings, and 81.0% in the Complex scenario with 400 model trainings. Random search attained only 79.9% accuracy after 2400 model trainings, and 80.1% accuracy after 4000 model trainings. Grid search resulted in 79.3% accuracy after 729 model trainings.
|Absolute Performance||SigOpt||Random Search||Grid Search|
SigOpt got an additional 5% improvement in performance compared with the default settings, and achieved these results with far fewer trials than grid and random search.
|Relative Performance||SigOpt||Random Search||Grid Search|
|Complex Scenario||+7.0%||+5.8%||Not feasible|
The number of configurable parameters increased from 6 in the Basic scenario to 10 in the Complex scenario. SigOpt is capable of effectively tuning this joint space in a linear number of steps (as opposed to exponential). We iterated through the optimization loop 40 times for each. A coarse grid search requires exponentially more evaluations (3^6 for Basic and 3^10 for Complex).
As for cost, model training on the NVIDIA-K80-enabled p2.xlarge AWS instances (90¢/hour was $0.05 per iteration, and a bit over $2.50 on the m4.4xlarge CPU instance (86.2¢/hour).
Cost increases significantly for random search and grid search, without any performance gain. After 2400 random trials on the Basic scenario and 4000 on the Complex scenario, random search still hadn’t beat out SigOpt’s performance. Random search yielded 79.9% accuracy and 80.1% accuracy, respectively, but cost 8 times as much as the SigOpt trials. Grid search was more expensive and produced worse results than random search. Moreover, because of the exponential increase in configurations with grid search, we didn’t run it for the more complex scenario.
|Basic Scenario||SigOpt||Random Search||Grid Search|
|NVIDIA K80 GPU||$11 ~ 80.4% acc.||$91 ~ 79.9% acc.||$28 ~ 79.3 % acc.|
There are many benefits to using SigOpt and NVIDIA GPUs. SigOpt provides results at 1/8 the cost of random search. Using the K80 GPU accelerator for model training is roughly 2% the cost of working without NVIDIA (as observed during the random search). Using SigOpt and GPUs provide results with only $11 of compute cost, while using random search on standard infrastructure was over 400 times more time consuming and expensive.
We calculated cost based on two comparable instance types and the total number of epochs involved in each respective optimization run. The CPU instance was an m4.4xlarge instance with no GPUs, which cost $0.862 per hour. This resulted in an average model training speed of 146 seconds per epoch. The GPU instance was a p2.xlarge instance with an NVIDIA single K80 GPU, which cost $0.90 per hour and has an average training speed of 3 seconds per epoch. We used a custom, public AMI (ami-193e860f) for both instances.
|Experiment Type||Accuracy||Trials||Epochs||CPU Time||CPU Cost||GPU Time||GPU Cost||Link|
|Default (No Tuning)||75.70||1||50||2 hours||$1.82||0 hours||$0.04||NA|
|Grid Search (SGD Only)||79.30||729||38394||64 days||$1401.38||32 hours||$27.58||here|
|Random Search (SGD Only)||79.94||2400||127092||214 days||$4638.86||106 hours||$91.29||here|
|SigOpt Search (SGD Only)||80.40||240||15803||27 days||$576.81||13 hours||$11.35||here|
|Grid Search (SGD + Architecture)||Not Feasible||59049||3109914||5255 days||$113511.86||107 days||$2233.95||NA|
|Random Search (SGD + Architecture)||80.12||4000||208729||353 days||$7618.61||174 hours||$149.94||here|
|SigOpt Search (SGD + Architecture)||81.00||400||30060||51 days||$1097.19||25 hours||$21.59||here|
We analyzed how well SigOpt performs against random and grid search by fixing the number of model tuning attempts and seeing how each optimization method performed, on average.
This graph shows the evolution of optimizer performance as each optimization progresses. The interquartile range is plotted along with the median behavior as observed over 20 instances for each strategy. SigOpt consistently outperforms the other strategies.
SigOpt significantly outperforms random search. That difference becomes more significant at higher computational budgets. Namely, the middle range of values that SigOpt reported for this experiment was [79.39, 80.30], and the middle range of values for random search was [79.17, 79.76] providing a .6% difference in the median of values.
When designing the modeling process, you have to make significant choices. Although we usually think of neural network hyperparameters only in the context of stochastic gradient descent, SigOpt can also suggest different model architecture parameters. This increases the complexity of the configuration, ultimately providing better results than with a Basic configuration.
At a very high level, while you can configure complex processes, simulations, machine learning pipelines, and neural networks and determine how well they perform, it’s expensive and time-consuming to evaluate configuration choices. SigOpt provides a cost effective way to do this.
MXNet makes it easy to create and use neural network models. Easy access to high quality compute resources, like AWS instances with NVIDIA GPUs, allows you to quickly train and evaluate your models. Tools like SigOpt enable you to quickly and easily perform model tuning tasks, so that you have the very best model to work with as quickly as possible.