Use pre-trained models with Apache MXNet
In this blog post, I’ll show you how to use multiple pre-trained models with Apache MXNet. Why would you want to try multiple models? Why not just pick the one with the best accuracy? As we will see later in the post, even though these models have been trained on the same data set and optimized for maximum accuracy, they behave slightly differently on specific images. In addition, prediction speed varies from model to model, and that’s an important factor for many applications. By trying a few pre-trained models, you have an opportunity to find one that is a good fit for your business problem.
First, let’s download three image classification models from the Apache MXNet model zoo.
- VGG-16 (research paper), the 2014 localization winner and classification runner-up at the ImageNet Large Scale Visual Recognition Challenge.
- Inception v3 (research paper), an evolution of GoogLeNet, the 2014 winner for object detection.
- ResNet-152 (research paper), the 2015 winner in multiple categories.
For each model, we need to download two files:
- The symbol file containing the JSON definition of the neural network: layers, connections, activation functions, etc.
- The weights file storing values for all connection weights and biases, AKA parameters, learned by the network during the training phase.
Let’s take a look at the first lines of the VGG-16 symbol file. We can see the definition of the input layer (‘data’), the weights and biases for the first convolution layer. A convolution operation is defined (‘conv1_1’) as well as a Rectified Linear Unit activation function (‘relu1_1’).
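To make that structure a little more concrete, here is a minimal sketch that parses a symbol definition with Python's json module and lists the operator and name of each node. The SAMPLE string is a hand-written, simplified excerpt and not the real VGG-16 file, and the field names ("nodes", "op", "name") are assumed to follow the layout commonly found in MXNet symbol files.

```python
import json

# Hand-written, simplified excerpt in the style of an MXNet symbol file.
# This is an illustration, not the actual contents of the VGG-16 file.
SAMPLE = """
{
  "nodes": [
    {"op": "null", "name": "data", "inputs": []},
    {"op": "null", "name": "conv1_1_weight", "inputs": []},
    {"op": "null", "name": "conv1_1_bias", "inputs": []},
    {"op": "Convolution", "name": "conv1_1", "inputs": [[0, 0], [1, 0], [2, 0]]},
    {"op": "Activation", "name": "relu1_1", "inputs": [[3, 0]]}
  ]
}
"""


def summarize_nodes(symbol_json, limit=5):
    """Return (op, name) pairs for the first `limit` nodes of a symbol definition."""
    graph = json.loads(symbol_json)
    return [(node["op"], node["name"]) for node in graph["nodes"][:limit]]


for op, name in summarize_nodes(SAMPLE):
    # Nodes whose op is "null" are inputs or parameters; the others are operators.
    print(f"{op:12s} {name}")
```

To inspect a real file, you could pass the contents of the downloaded symbol file to summarize_nodes() instead of SAMPLE.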
All three models have been pre-trained on the ImageNet data set, which includes over 1.2 million pictures of objects and animals sorted in 1,000 categories. We can view these categories in the synset.txt file.
Now, let’s load a model.
First, we have to load the weights and model description from file. MXNet calls this a checkpoint. It’s good practice to save weights after each training epoch. Once training is complete, we can look at the training log and pick the weights for the best epoch, that is, the one with the highest validation accuracy. It’s quite likely it won’t be the very last one!
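A checkpoint is identified by a file prefix and an epoch number. The sketch below shows the file names that this convention maps to, plus a thin wrapper around mx.model.load_checkpoint(). The "vgg16" prefix and the helper names are assumptions made for illustration.

```python
def checkpoint_files(prefix, epoch):
    """Return the (symbol, params) file names that make up one checkpoint."""
    return f"{prefix}-symbol.json", f"{prefix}-{epoch:04d}.params"


def load_epoch(prefix, epoch):
    """Load the symbol and parameters saved for one epoch (requires MXNet)."""
    import mxnet as mx  # deferred so checkpoint_files() works without MXNet

    # Returns a (symbol, arg_params, aux_params) tuple.
    return mx.model.load_checkpoint(prefix, epoch)


# "vgg16" is a hypothetical prefix; use the one matching your downloaded files.
print(checkpoint_files("vgg16", 0))  # ('vgg16-symbol.json', 'vgg16-0000.params')
```

Picking the best epoch then amounts to passing the corresponding epoch number, provided the matching '.params' file was saved during training.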
After loading is complete, we get a Symbol object and the weights, AKA model parameters. We then create a new Module and assign it the input Symbol. We can select the context in which the model runs. By default, MXNet uses a CPU context, and we keep that default for two reasons:
- First, this will allow you to test the notebook even if your machine is not equipped with a GPU.
- Second, we’re going to predict a single image and we don’t have any specific performance requirements. For production applications where you’d want to predict large batches of images with the best possible throughput, a GPU would definitely be the way to go.
Then, we bind the input Symbol to input data. We have to call it ‘data’ because that’s its name in the input layer of the network (remember the first few lines of the JSON file).
Finally, we define the shape of ‘data’ as 1 x 3 x 224 x 224. 224 x 224 is the image resolution: that’s how the model was trained. 3 is the number of channels: red, green, and blue (in this order). 1 is the batch size: we’ll predict one image at a time.
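Put together, the loading steps described above might look like the following sketch. It is one possible way to wire up the Module API rather than the exact code from the notebook: the function name, the gpu flag, and the choice of label_names=None with allow_missing=True are assumptions suited to inference only.

```python
DATA_NAME = "data"             # must match the input layer name in the symbol file
DATA_SHAPE = (1, 3, 224, 224)  # batch size, channels, height, width


def load_model(prefix, epoch=0, gpu=False):
    """Load a checkpoint and return a Module bound for single-image inference."""
    import mxnet as mx  # deferred so the constants above need no MXNet install

    sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
    # CPU context unless a GPU is explicitly requested.
    ctx = mx.gpu(0) if gpu else mx.cpu()
    # label_names=None: no labels are needed when we only predict.
    mod = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
    mod.bind(for_training=False, data_shapes=[(DATA_NAME, DATA_SHAPE)])
    # allow_missing=True tolerates label entries absent from the saved parameters.
    mod.set_params(arg_params, aux_params, allow_missing=True)
    return mod
```

Calling load_model() with a checkpoint prefix would return a Module ready for forward(), and passing gpu=True would request the first GPU instead of the CPU.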
We also need to load the 1,000 categories stored in the synset.txt file. We’ll need the actual descriptions at prediction time.
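Loading the categories is a matter of reading the file line by line. Here is a minimal sketch, assuming that synset.txt holds one category per line and that line i corresponds to output index i of the model. The function name is hypothetical.

```python
def load_categories(path="synset.txt"):
    """Read one category description per line, dropping trailing whitespace."""
    with open(path, "r") as f:
        return [line.rstrip() for line in f]
```

The returned list can then be indexed directly with the positions of the highest probabilities.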
Now let’s write a function to load an image from file. Remember that the model expects a 4-dimension NDArray holding the red, green, and blue channels of a single 224 x 224 image. We’re going to use the OpenCV library to build this NDArray from our input image.
Here are the steps:
- Read the image: This will return a numpy array shaped as (image height, image width, 3). It has the three channels in BGR order (blue, green, red).
- Convert the image to RGB (red, green, blue).
- Resize the image to 224 x 224.
- Reorder the array axes from (image height, image width, 3) to (3, image height, image width).
- Add a fourth dimension and build the NDArray.
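The steps above can be sketched as follows. The function names are hypothetical, and the axis reordering is done with np.transpose(), which is one of several equivalent ways to move the channel axis to the front.

```python
import numpy as np


def read_rgb_224(filename):
    """Read an image with OpenCV, convert BGR to RGB, and resize to 224 x 224."""
    import cv2  # deferred so to_batch() can be used without OpenCV

    img = cv2.imread(filename)  # shape (height, width, 3), BGR order
    if img is None:
        raise IOError(f"could not read image: {filename}")
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return cv2.resize(img, (224, 224))


def to_batch(img):
    """Turn an (H, W, 3) array into a (1, 3, H, W) array."""
    chw = np.transpose(img, (2, 0, 1))  # (H, W, 3) -> (3, H, W)
    return chw[np.newaxis, :]           # add the batch dimension


def prepare_ndarray(filename):
    """Build the 4-dimension NDArray expected by the model (requires MXNet)."""
    import mxnet as mx

    return mx.nd.array(to_batch(read_rgb_224(filename)))
```

Keeping to_batch() separate from the file reading makes it easy to check the resulting shape on any NumPy array before involving OpenCV or MXNet.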
Let’s take care of prediction. Our parameters are an image, a model, a list of categories, and the number of top categories we’d like to return.
Remember that a Module object must feed data to a model in batches. The common way to do this is to use a data iterator. Here, we want to predict a single image, so although we could use a data iterator, it’d probably be overkill. Instead, let’s create a named tuple, called Batch, which will act as a fake iterator by returning our input NDArray when its ‘data’ attribute is referenced.
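Here is a minimal sketch of that trick. The forward_one() helper is a hypothetical name, and it assumes that the model is a Module that has already been bound and given its parameters.

```python
from collections import namedtuple

# Module.forward() reads its inputs from the `data` attribute of the batch object.
Batch = namedtuple("Batch", ["data"])


def forward_one(model, array):
    """Forward one input through a bound Module; return its first output as NumPy."""
    model.forward(Batch(data=[array]))
    # get_outputs() returns a list of NDArrays; asnumpy() copies the first to NumPy.
    return model.get_outputs()[0].asnumpy()
```

Because the named tuple only needs a ‘data’ attribute holding a list of inputs, no iterator class has to be written for a single image.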
After the image has been forwarded, the model outputs an NDArray holding 1,000 probabilities, corresponding to the 1,000 categories it has been trained on. The NDArray has only one row since batch size is equal to 1.
Let’s turn this into a one-dimensional array with squeeze(). Then, using argsort(), we create a second array holding the indices of these probabilities, sorted from highest to lowest probability. Finally, we return the top n categories and their descriptions.
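As a small sketch of this post-processing, the function below works on a plain NumPy array, so it can be tried without a model. The name top_n and the four-category example are made up for illustration.

```python
import numpy as np


def top_n(prob, categories, n):
    """Return the n most likely (probability, category) pairs, best first."""
    prob = np.squeeze(prob)          # (1, k) -> (k,)
    order = np.argsort(prob)[::-1]   # argsort is ascending, so reverse it
    return [(float(prob[i]), categories[i]) for i in order[:n]]


# Toy example with four categories instead of 1,000.
probs = np.array([[0.10, 0.60, 0.05, 0.25]])
print(top_n(probs, ["cat", "dog", "fish", "bird"], 2))
```

With a real model, prob would be the array returned after the forward pass and categories the list read from synset.txt.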
Time to put everything together. Let’s load all three models.
Before classifying images, let’s take a closer look at some of the VGG-16 parameters we just loaded from the ‘.params’ file. First, let’s print the names of all layers.
For each layer, we see two components: the weights and the biases. Count the weights and you’ll see that there are sixteen layers: thirteen convolutional layers and three fully connected layers. Now you know why this model is called VGG-16.
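To double-check the count, here is a minimal sketch that keeps only the parameter names ending in ‘_weight’. The list of names is generated from the usual VGG-16 layout (conv1_1 through conv5_3, then fc6 to fc8), which is an assumption and is not read from the ‘.params’ file.

```python
def weight_layers(param_names):
    """Return the layer names that own a '<layer>_weight' parameter."""
    suffix = "_weight"
    return [name[:-len(suffix)] for name in param_names if name.endswith(suffix)]


# Assumed VGG-16 layout: number of convolution layers in each of the five blocks.
conv_blocks = {1: 2, 2: 2, 3: 3, 4: 3, 5: 3}
layers = [f"conv{b}_{i}" for b, n in conv_blocks.items() for i in range(1, n + 1)]
layers += ["fc6", "fc7", "fc8"]
names = [f"{layer}_{kind}" for layer in layers for kind in ("weight", "bias")]

print(len(weight_layers(names)))  # 16
```

With a loaded Module, the names could come from the keys of the first dictionary returned by get_params().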
Now let’s print the weights for the last fully connected layer.
Did you notice the shape of this matrix? It’s 1000×4096. This layer contains 1,000 neurons, each of which will store the probability of the image belonging to a specific category. Each neuron is also fully connected to all 4,096 neurons in the previous layer (‘fc7’).
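As a quick worked example of what this shape implies, the sketch below counts the parameters of a fully connected layer: one weight per input for each neuron, plus one bias per neuron. The helper name is hypothetical.

```python
def dense_param_count(outputs, inputs):
    """Number of weights plus biases in a fully connected layer."""
    return outputs * inputs + outputs


# 1,000 neurons, each connected to 4,096 inputs, plus 1,000 biases.
print(dense_param_count(1000, 4096))  # 4097000
```

So this single layer accounts for roughly 4.1 million of the model's parameters.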
OK, enough exploring! Let’s use these models to classify our own images.
Let’s try again with a GPU context this time.
Note: If you get an error about GPU support, either your machine or instance is not equipped with a GPU, or you’re using a version of MXNet that hasn’t been built with GPU support (USE_CUDA=1).
The difference in performance is quite noticeable: between 15x and 20x. If we predicted multiple images at the same time, the gap would widen even more due to the massive parallelism of GPU architectures.
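If you’d like to measure this on your own machine, here is a minimal timing sketch built on the standard library. The helper names are hypothetical. Since MXNet executes operations asynchronously, the timed call should include fetching the output (for instance with asnumpy()), otherwise the measurement may stop before the computation has finished.

```python
import time


def time_ms(fn, *args, **kwargs):
    """Call fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, 1000.0 * (time.perf_counter() - start)


def speedup(cpu_ms, gpu_ms):
    """How many times faster the GPU run was compared to the CPU run."""
    return cpu_ms / gpu_ms


# Stand-in workload; replace it with a prediction call on each context.
_, elapsed = time_ms(sum, range(1000))
print(f"{elapsed:.3f} ms")
```

Timing the same prediction once per context and passing both values to speedup() gives the ratio discussed above. The first call on a context tends to include one-off setup costs, so it may be worth discarding it.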
Now it’s time to try your own images. Just copy them into the same folder as this notebook, update the filename in the cell above, and run the predict() calls again.
Have fun with pre-trained models!
About the Author
Julien is the Artificial Intelligence & Machine Learning Evangelist for EMEA. He focuses on helping developers and enterprises bring their ideas to life. In his spare time, he reads the works of JRR Tolkien again and again.