AWS Machine Learning Blog

AWS Deep Learning AMIs now include Horovod for faster multi-GPU TensorFlow training on Amazon EC2 P3 instances

The AWS Deep Learning AMIs for Ubuntu and Amazon Linux now come with Horovod pre-installed and fully configured. Horovod is a popular open source distributed training framework for scaling TensorFlow training across multiple GPUs.

This is an update to the optimized build of TensorFlow 1.8 that we launched in early May. This custom build of TensorFlow 1.8 is built directly from source with advanced optimizations, and it provides improved training performance compared to stock TensorFlow 1.8 on Amazon EC2 C5 and P3 instances. With the addition of Horovod on the AMI, machine learning developers can further boost their training performance by quickly scaling up TensorFlow training from a single GPU to multiple GPUs on an Amazon EC2 GPU instance such as P3. Developers can achieve higher multi-GPU training performance with fewer code changes compared to the standard TensorFlow distributed training model using parameter servers.

Faster multi-GPU TensorFlow training on your Amazon EC2 P3 instance

Horovod follows the Message Passing Interface (MPI) model, a popular standard for passing messages and managing communication between nodes in a high-performance distributed computing environment. Horovod’s MPI implementation provides a simpler programming model than the parameter-server-based distributed training model, so developers can scale their existing single-GPU training programs with minimal code changes. In addition, Horovod uses the NVIDIA Collective Communications Library (NCCL) installed on the Deep Learning AMI. NCCL provides optimized implementations of multi-GPU communication primitives, such as all-reduce, which run faster on the NVIDIA GPUs that power Amazon EC2 GPU instances.
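To make the all-reduce primitive mentioned above more concrete, here is a minimal pure-Python sketch that simulates a ring all-reduce over a handful of in-memory "workers" and then averages the result, which is how data-parallel training commonly combines per-GPU gradients. It is an illustration only: the function names ring_allreduce and allreduce_average are hypothetical, the workers are plain lists rather than GPUs, and the sketch makes no claim about how NCCL or Horovod implement the operation internally.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a ring all-reduce (sum) over equal-length per-worker buffers.

    Hypothetical helper for illustration; every worker ends up holding the
    element-wise sum of all input buffers.
    """
    n = len(buffers)
    length = len(buffers[0])
    # Split every buffer into n contiguous chunks, described by index bounds.
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]
    data = [list(b) for b in buffers]  # copy so the inputs are not mutated

    # Phase 1 (scatter-reduce): at each step, worker r sends one chunk to its
    # right-hand neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            chunk = (r - step) % n
            lo, hi = bounds[chunk]
            sends.append((chunk, data[r][lo:hi]))  # snapshot before any update
        for r, (chunk, payload) in enumerate(sends):
            dst = (r + 1) % n
            lo, _ = bounds[chunk]
            for i, value in enumerate(payload):
                data[dst][lo + i] += value

    # After phase 1, worker r holds the fully reduced chunk (r + 1) % n.
    # Phase 2 (all-gather): pass the finished chunks around the ring so that
    # every worker receives every reduced chunk.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            chunk = (r + 1 - step) % n
            lo, hi = bounds[chunk]
            sends.append((chunk, data[r][lo:hi]))
        for r, (chunk, payload) in enumerate(sends):
            dst = (r + 1) % n
            lo, hi = bounds[chunk]
            data[dst][lo:hi] = payload

    return data


def allreduce_average(buffers: List[List[float]]) -> List[List[float]]:
    """Average per-worker gradients: all-reduce sum, then divide by worker count."""
    n = len(buffers)
    return [[value / n for value in buf] for buf in ring_allreduce(buffers)]


if __name__ == "__main__":
    # Two simulated workers, each holding its own gradient vector.
    per_worker_gradients = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
    print(allreduce_average(per_worker_gradients))
```

In this scheme each worker exchanges data only with its ring neighbours, and there is no central parameter server that has to receive every gradient.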

In our experiments, training with Horovod was 1.2X faster than training with the standard TensorFlow 1.8 distributed training model. We trained a ResNet-50 model with the ImageNet dataset using our optimized build of TensorFlow 1.8 on the AWS Deep Learning AMI. The AMI provides NVIDIA CUDA 9.0, cuDNN 7.0.5, NCCL 2.1, and OpenMPI 1.10.7. We trained the model in mixed-precision (FP16) mode with a batch size of 2048 on 8 NVIDIA Volta V100 GPUs on a single p3.16xlarge EC2 instance.

Using the standard TensorFlow distributed training model, training on the 8 GPUs of a p3.16xlarge instance gave us a throughput of 4249 images per second, with a total time-to-train of 7.67 hours (27,621 seconds). The training program achieved 75.49% Top-1 validation accuracy in 90 epochs. With Horovod, throughput improved to 5058 images per second (1.2X faster) and total time-to-train came down to 6.36 hours (22,906 seconds), with a Top-1 validation accuracy of 75.59%. You can read the step-by-step guide to designing and running this experiment in our developer guide.
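The 1.2X figure can be checked directly from the numbers reported above. The short sketch below uses only those published values; the variable names are ours, chosen for illustration.

```python
# Figures reported for 8 GPUs on a single p3.16xlarge instance.
baseline_images_per_sec = 4249   # standard TensorFlow distributed training
horovod_images_per_sec = 5058    # with Horovod
baseline_seconds = 27621         # total time-to-train, standard model
horovod_seconds = 22906          # total time-to-train, with Horovod

# Ratio of throughputs and ratio of wall-clock training times.
throughput_speedup = horovod_images_per_sec / baseline_images_per_sec
time_speedup = baseline_seconds / horovod_seconds

# Absolute and relative time saved.
hours_saved = (baseline_seconds - horovod_seconds) / 3600
percent_time_saved = 100 * (baseline_seconds - horovod_seconds) / baseline_seconds

print(f"throughput speedup: {throughput_speedup:.2f}x")
print(f"time-to-train speedup: {time_speedup:.2f}x")
print(f"time saved: {hours_saved:.2f} hours ({percent_time_saved:.1f}%)")
```

Both ratios come out at roughly 1.2, and the time saved is about 1.3 hours, or roughly 17% of the baseline training time.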

The purpose of this experiment is to illustrate the performance and usability benefits of Horovod. Visit the Horovod site for more details on how to use Horovod for faster, easier distributed TensorFlow training.

Getting started with the Deep Learning AMIs

You can quickly get started with the AWS Deep Learning AMIs by using our getting started tutorial. See our developer guide for more tutorials, resources, and release notes. The latest AMIs are now available on the AWS Marketplace. You can also subscribe to our discussion forum to get new launch announcements and post your questions.

About the Author

Sumit Thakur is a Senior Product Manager for AWS Deep Learning. He works on products that make it easy for customers to get started with deep learning in the cloud, with a specific focus on making it easy to use engines on the Deep Learning AMI. In his spare time, he likes connecting with nature and watching sci-fi TV series.