Speeding up Apache MXNet using the NNPACK library

Apache MXNet is an open source library developers can use to build, train, and re-use deep learning networks. In this blog post, I’ll show you how to speed up inference by using the NNPACK library. Indeed, when GPU inference is not available, adding NNPACK to Apache MXNet might be a simple option to extract more performance from your instances. As always, “your mileage may vary,” and you should always run your own tests.

Before we start, let’s look at some training and inference fundamentals.

Training

Training is the step where a neural network learns to predict the correct label for each sample in the data set. One batch at a time (typically 32 to 256 samples), the data set is fed into the network, which minimizes the total error by adjusting its weights using the backpropagation algorithm.

Going through the full data set is called an epoch. Large networks might be trained for hundreds of epochs to reach the highest accuracy possible. This might take days or even weeks. By using GPUs, with their formidable parallel processing power, training times can be significantly cut down, compared to even the most powerful of CPUs.
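To make batches and epochs concrete, here is a minimal sketch in plain Python. It is not MXNet code: it fits a toy one-variable linear model with mini-batch gradient descent, and the function name, learning rate, and data are all made up for illustration. For such a tiny model the gradients can be written by hand; backpropagation is what computes them automatically for multi-layer networks.

```python
import random

def train(samples, batch_size=4, epochs=200, lr=0.05):
    """Fit y = w*x + b with mini-batch gradient descent (toy example)."""
    w, b = 0.0, 0.0
    data = list(samples)
    rng = random.Random(0)                  # fixed seed for repeatable runs
    for epoch in range(epochs):             # one epoch = one full pass over the data
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradients of the mean squared error over this batch
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # Adjust the weights to reduce the error
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic, noise-free data generated from y = 3x + 1
samples = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(20)]
w, b = train(samples)
print("w = {:.2f}, b = {:.2f}".format(w, b))
```

After enough epochs, the learned values of w and b should end up very close to the 3 and 1 used to generate the data.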

Inference

Inference is the step where you actually use the trained network to predict new data samples. You could be predicting one sample at a time, for example trying to identify objects in a single picture as Amazon Rekognition does, or you could be predicting multiple samples at a time when processing requests coming from multiple users.

Of course, GPUs are equally efficient at inference. However, many systems are not able to accommodate a GPU because of cost, power consumption, or form-factor constraints. Thus, being able to run fast, CPU-based inference remains an important topic. This is where the NNPACK library comes into play because it will help us speed up CPU inference in Apache MXNet.

The NNPACK Library

NNPACK is an open source library available on GitHub. How can it help? Well, you’ve surely read about Convolutional Neural Networks. These networks are built from multiple layers applying convolution and pooling to detect features in the input image.

We won’t go into the actual theory in this post, but let’s just say that NNPACK implements these operations (and others, like matrix multiplication) in a highly-optimized fashion. If you’re curious about the underlying theory, please refer to the research papers mentioned by the author in this Reddit post.
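For a concrete picture of what convolution and pooling compute, here is a deliberately naive sketch in plain Python. It is for illustration only and says nothing about how NNPACK implements these operations internally; the function names, the tiny image, and the kernel are my own assumptions.

```python
def conv2d(image, kernel):
    """Naive stride-1, no-padding 2D convolution (cross-correlation, as in CNN layers)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling over size x size windows."""
    return [[max(feature_map[i + di][j + dj]
                 for di in range(size) for dj in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

# A 5x5 "image" with a dark-to-bright vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
edge_kernel = [[-1, 1], [-1, 1]]   # responds where brightness increases left to right

features = conv2d(image, edge_kernel)
pooled = max_pool(features)
print(features)   # each row is [0, 2, 0, 0]: the edge is detected at column 1
print(pooled)     # [[2, 0], [2, 0]]
```

The nested loops make the cost obvious: every output value needs a full pass over the kernel, which is why optimized implementations of these operations matter so much for inference speed.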

NNPACK is available for Linux and macOS platforms. It’s optimized for Intel x86-64 processors with the AVX2 instruction set, as well as for ARMv7 processors with the NEON instruction set and for ARMv8 processors.
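If you’re not sure whether your CPU supports AVX2, one way to check on Linux is to look at the feature flags in /proc/cpuinfo. The sketch below is a small helper of my own, not part of NNPACK or MXNet, and it assumes the usual Linux layout of that file (a "flags" line on x86, a "Features" line on ARM).

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        # x86 kernels report "flags"; ARM kernels report "Features"
        if key.strip().lower() in ("flags", "features"):
            return set(value.split())
    return set()

def has_avx2(cpuinfo_text):
    """True if the avx2 flag is listed."""
    return "avx2" in cpu_flags(cpuinfo_text)

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AVX2 supported: {}".format(has_avx2(f.read())))
    except (IOError, OSError):
        # /proc/cpuinfo only exists on Linux
        print("/proc/cpuinfo is not available on this platform")
```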

In this post, I use a c5.9xlarge instance running the Deep Learning AMI. Here’s what we’re going to do:

Build the NNPACK library from source.
Build Apache MXNet from source with NNPACK.
Run some image classification benchmarks using a variety of networks.

Let’s get to work.

Building NNPACK

NNPACK uses the Ninja build tool. Unfortunately, the Ubuntu repository does not host the latest version, so we need to build it from source as well.

cd ~
git clone https://github.com/ninja-build/ninja.git && cd ninja
git checkout release
./configure.py --bootstrap
sudo cp ninja /usr/bin

Now let’s prepare the NNPACK build, following the instructions.

cd ~
sudo -H pip install --upgrade git+https://github.com/Maratyszcza/PeachPy
sudo -H pip install --upgrade git+https://github.com/Maratyszcza/confu
git clone https://github.com/Maratyszcza/NNPACK.git
cd NNPACK
confu setup
python ./configure.py

Before we actually build, we need to tweak the configuration file. The reason for this is that NNPACK only builds as a static library whereas MXNet builds as a dynamic library. This means that they won’t link properly. The MXNet documentation suggests using an older version of NNPACK, but there’s another way.

We need to edit the build.ninja file and add the ‘-fPIC’ flag to the cflags and cxxflags variables, in order to build C and C++ files as position-independent code, which is really all we need to link with the MXNet shared library.

cflags = -std=gnu99 -g -pthread -fPIC
cxxflags = -std=gnu++11 -g -pthread -fPIC
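If you’d rather not edit the file by hand, the change can be scripted. The sketch below is a hypothetical helper, not something shipped with NNPACK. It assumes that build.ninja defines cflags and cxxflags on single lines, as in the two lines above, and it only appends the flag when it’s missing.

```python
def add_fpic(ninja_text):
    """Append -fPIC to the cflags and cxxflags definitions if it is missing."""
    patched = []
    for line in ninja_text.splitlines():
        name, sep, _ = line.partition("=")
        is_flags_line = sep == "=" and name.strip() in ("cflags", "cxxflags")
        if is_flags_line and "-fPIC" not in line.split():
            line = line.rstrip() + " -fPIC"
        patched.append(line)
    return "\n".join(patched) + "\n"

def patch_file(path):
    """Rewrite a build.ninja file in place (the path is supplied by the caller)."""
    with open(path) as f:
        text = f.read()
    with open(path, "w") as f:
        f.write(add_fpic(text))

# Demonstration on sample text; other lines are left untouched
sample = (
    "cflags = -std=gnu99 -g -pthread\n"
    "cxxflags = -std=gnu++11 -g -pthread\n"
    "ldflags = -pthread\n"
)
print(add_fpic(sample))
```

Calling patch_file on the build.ninja file in your NNPACK directory applies the same change to the real file. Running it twice is harmless, because lines that already contain the flag are left as they are.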

Now, let’s build NNPACK and run some basic tests.

ninja
ninja smoketest

We’re done with NNPACK. You should see the library in ~/NNPACK/lib.

Building Apache MXNet with NNPACK

First, let’s install dependencies as well as the latest MXNet sources (1.0 at the time of writing). Detailed build instructions are available on the MXNet website.

cd ~
sudo apt-get install -y libopenblas-dev liblapack-dev libopencv-dev
git clone --recursive https://github.com/apache/incubator-mxnet.git
cd incubator-mxnet/
git checkout 1.0.0

Now, we need to configure the MXNet build. You should edit the make/config.mk file and set the variables that follow in order to include NNPACK in the build, as well as the dependencies we installed earlier. Just copy the lines below to the end of the file.

NNPACK = /home/ubuntu/NNPACK
# the additional link flags you want to add
ADD_LDFLAGS = -L$(NNPACK)/lib/ -lnnpack -lpthreadpool
# the additional compile flags you want to add
ADD_CFLAGS = -I$(NNPACK)/include/ -I$(NNPACK)/deps/pthreadpool/include/

USE_NNPACK=1
USE_BLAS=openblas
USE_OPENCV=1

Now, we’re ready to build MXNet. Our instance has 36 vCPUs, so let’s put them to good use.

make -j72
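The value 72 is simply twice the number of vCPUs on this instance. If you’re building on a different machine, a small sketch like this one computes the equivalent command. The two-jobs-per-CPU ratio is only inferred from the command above; it’s a starting point, not an official recommendation.

```python
import multiprocessing

def make_command(jobs_per_cpu=2):
    """Build a 'make -jN' command line scaled to the local CPU count."""
    jobs = jobs_per_cpu * multiprocessing.cpu_count()
    return "make -j{}".format(jobs)

print(make_command())
```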

About four minutes later, the build is complete. Let’s install our new MXNet library and its Python bindings.

sudo apt-get install -y python-dev python-setuptools python-numpy python-pip
cd python
sudo -H pip install --upgrade pip
sudo -H pip install -e .

We can quickly check that we have the proper version by importing MXNet in Python.

Python 2.7.12 (default, Nov 20 2017, 18:23:56)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import mxnet
>>> mxnet.__version__
'1.0.0'

We’re all set. Time to run some benchmarks.

Benchmarking

Benchmarking with a couple of images isn’t going to give us a reliable view on whether NNPACK makes a difference. Fortunately, the MXNet sources include a benchmarking script which feeds randomly generated images in a variety of batch sizes through the following models: AlexNet, VGG16, Inception-BN, Inception v3, ResNet-50, and ResNet-152. Of course, the point here is not to perform predictions, only to measure inference time.
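The measurement idea behind this kind of benchmark is simple. Here is a rough sketch of it, written from scratch rather than taken from the MXNet script: time a number of forward passes for a given batch size, then convert the elapsed time to images per second. The fake_predict function is a stand-in I made up so the example runs without MXNet; in a real benchmark it would be the model’s forward pass.

```python
import timeit

def images_per_second(predict, batch_size, num_batches=10, warmup=2):
    """Time predict(batch_size) and return the throughput in images per second."""
    for _ in range(warmup):                 # discard one-off start-up costs
        predict(batch_size)
    start = timeit.default_timer()
    for _ in range(num_batches):
        predict(batch_size)
    elapsed = timeit.default_timer() - start
    return batch_size * num_batches / elapsed

def fake_predict(batch_size):
    """Stand-in for a forward pass: burns CPU time proportional to the batch size."""
    total = 0
    for i in range(batch_size * 1000):
        total += i * i
    return total

for batch_size in (1, 2, 4, 8):
    rate = images_per_second(fake_predict, batch_size)
    print("batch size {}: {:.0f} images/sec".format(batch_size, rate))
```

The warm-up passes matter: the first few calls often include one-time costs such as memory allocation, which would otherwise distort the numbers for small batch sizes.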

Before we begin, we need to fix a line of code in the script. Our instance doesn’t have a GPU installed (which is the whole point here) and the script is unable to properly detect that fact. Here’s the modification you need to make in ~/incubator-mxnet/example/image-classification/benchmark_score.py. While we’re at it, let’s add additional batch sizes.

#devs = [mx.gpu(0)] if len(get_gpus()) > 0 else []
devs = []
devs.append(mx.cpu())
batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256]

Time to run some benchmarks. Let’s use eight threads for NNPACK, which is the largest recommended value.

cd ~/incubator-mxnet/example/image-classification/
export MXNET_CPU_NNPACK_NTHREADS=8
python benchmark_score.py
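If you’d like to see how the thread count affects your own results, you could repeat the run with several values of MXNET_CPU_NNPACK_NTHREADS. The sketch below only prepares the command and environment for each run. The list of thread counts is my own choice, capped at the eight used above, and the subprocess call is left as a comment so the example runs anywhere.

```python
import os

def nnpack_env(num_threads, base_env=None):
    """Return a copy of the environment with the NNPACK thread count set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_CPU_NNPACK_NTHREADS"] = str(num_threads)
    return env

def benchmark_runs(thread_counts=(1, 2, 4, 8)):
    """Yield one (command, environment) pair per thread count."""
    for n in thread_counts:
        yield ["python", "benchmark_score.py"], nnpack_env(n)

# To launch the runs from the example directory, something like:
#   import subprocess
#   for cmd, env in benchmark_runs():
#       subprocess.check_call(cmd, env=env)
for cmd, env in benchmark_runs():
    print("MXNET_CPU_NNPACK_NTHREADS={} {}".format(
        env["MXNET_CPU_NNPACK_NTHREADS"], " ".join(cmd)))
```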

As a reference, I also ran the same script on an identical instance running the vanilla MXNet 1.0. The graphs that follow plot the number of images per second vs. batch size. As you can guess, higher images per second is better.

As you can see, NNPACK delivers very significant speedups for AlexNet, VGG, and Inception-BN, especially for single picture inference (up to 4x faster).

Note: For reasons beyond the scope of this article, there is no speedup for Inception v3 and ResNet, so I didn’t provide graphs for these networks.

Conclusion

I hope you enjoyed this article, and I welcome your feedback. For more deep learning and Apache MXNet content, feel free to follow me on Medium and Twitter.

About the Author

Julien is the Artificial Intelligence & Machine Learning Evangelist for EMEA. He focuses on helping developers and enterprises bring their ideas to life. In his spare time, he reads the works of JRR Tolkien again and again.