AWS Compute Blog

Deep Learning on AWS Batch

Thanks to my colleague Kiuk Chung for this great post on Deep Learning using AWS Batch.

—-

GPU instances naturally pair with deep learning as neural network algorithms can take advantage of their massive parallel processing power. AWS provides GPU instance families, such as g2 and p2, which allow customers to run scalable GPU workloads. You can leverage such scalability efficiently with AWS Batch.

AWS Batch manages the underlying compute resources on-your behalf, allowing you to focus on modeling tasks without the overhead of resource management. Compute environments (that is, clusters) in AWS Batch are pools of instances in your account, which AWS Batch dynamically scales up and down, provisioning and terminating instances with respect to the numbers of jobs. This minimizes idle instances, which in turn optimizes cost.

Moreover, AWS Batch ensures that submitted jobs are scheduled and placed onto the appropriate instance, hence managing the lifecycle of the jobs. With the addition of customer-provided AMIs, AWS Batch users can now take advantage of this elasticity and convenience for jobs that require GPU.

This post illustrates how you can run GPU-based deep learning workloads on AWS Batch. I walk you through an example of training a convolutional neural network (the LeNet architecture), using Apache MXNet to recognize handwritten digits using the MNIST dataset.

Running an MXNet job in AWS Batch

Apache MXNet is a full-featured, flexibly programmable, and highly scalable deep learning framework that supports state-of-the-art deep models, including convolutional neural networks (CNNs) and long short-term memory networks (LSTMs).

There are three steps to running an AWS Batch job:

  • Create a custom AMI
  • Create AWS Batch entities
  • Submit a training job

Create a custom AMI

Start by creating an AMI that includes the NVIDIA driver and the Amazon ECS agent. In AWS Batch, instances can be launched with the specific AMI of your choice by specifying imageId when you create your compute environment. Because you are running a job that requires GPU, you need an AMI that has the NVIDIA driver installed.

Choose Launch Stack to launch the CloudFormation template in us-east-1 in your account:

As shown below, take note of the AMI value in the Outputs tab of the CloudFormation stack. You use this as the imageId value when creating the compute environment in the next section.

Alternatively, you may follow the AWS Batch documentation to create a GPU-enabled AMI.

Create AWS Batch resources

After you have built the AMI, create the following resources:

A compute environment, is a collection of instances (compute resources) of the same or different instance types. In this case, you create a managed compute environment in which the instances are of type p2.xlarge. For imageId, specify the AMI you built in the previous section.

Then, create a job queue. In AWS Batch, jobs are submitted to a job queue that are associated to an ordered list of compute environments. After a lower order compute environment is filled, jobs spill over to the next compute environment. For this example, you associate a single compute environment to the job queue.

Finally, create a job definition, which is a template for a job specification. For those familiar with Amazon ECS, this is analogous to task definitions. You mount the directory containing the NVIDIA driver on the host to /usr/local/nvidia on the container. You also need to set the privileged flag on the container properties.

The following code creates the aforementioned resources in AWS Batch. For more information, see the AWS Batch User Guide.

git clone https://github.com/awslabs/aws-batch-helpers
cd aws-batch-helpers/gpu-example

python create-batch-entities.py\
 --subnets <subnet1,subnet2,…>\
 --security-groups <sg1,sg2,…>\
 --key-pair <ec2-key-pair>\
 --instance-role <instance-role>\
 --image-id <custom-AMI-image-id>\
 --service-role <service-role-arn> 

Submit a training job

Now you submit a job that trains a convolutional neural network model for handwritten digit recognition. Much like Amazon ECS tasks, jobs in AWS Batch are run as commands in a Docker container. To use MXNet as your deep learning library, you need a Docker image containing MXNet. For this example, use mxnet/python:gpu.

The submit-job.py script submits the job, and tails the output from CloudWatch Logs.

# cd aws-batch-helpers/gpu-example
python submit-job.py --wait

You should see an output that looks like the following:

Submitted job [train_imagenet - e1bccebc-76d9-4cd1-885b-667ef93eb1f5] to the job queue [gpu_queue]
Job [train_imagenet - e1bccebc-76d9-4cd1-885b-667ef93eb1f5] is RUNNING.
Output [train_imagenet/e1bccebc-76d9-4cd1-885b-667ef93eb1f5/12030dd3-0734-42bf-a3d1-d99118b401eb]:
 ================================================================================

[2017-04-25T19:02:57.076Z] INFO:root:Epoch[0] Batch [100]	Speed: 15554.63 samples/sec Train-accuracy=0.861077
[2017-04-25T19:02:57.428Z] INFO:root:Epoch[0] Batch [200]	Speed: 18224.89 samples/sec Train-accuracy=0.954688
[2017-04-25T19:02:57.755Z] INFO:root:Epoch[0] Batch [300]	Speed: 19551.42 samples/sec Train-accuracy=0.965313
[2017-04-25T19:02:58.080Z] INFO:root:Epoch[0] Batch [400]	Speed: 19697.65 samples/sec Train-accuracy=0.969531
[2017-04-25T19:02:58.405Z] INFO:root:Epoch[0] Batch [500]	Speed: 19705.82 samples/sec Train-accuracy=0.968281
[2017-04-25T19:02:58.734Z] INFO:root:Epoch[0] Batch [600]	Speed: 19486.54 samples/sec Train-accuracy=0.971719
[2017-04-25T19:02:59.058Z] INFO:root:Epoch[0] Batch [700]	Speed: 19735.59 samples/sec Train-accuracy=0.973281
[2017-04-25T19:02:59.384Z] INFO:root:Epoch[0] Batch [800]	Speed: 19631.17 samples/sec Train-accuracy=0.976562
[2017-04-25T19:02:59.713Z] INFO:root:Epoch[0] Batch [900]	Speed: 19490.74 samples/sec Train-accuracy=0.979062
[2017-04-25T19:02:59.834Z] INFO:root:Epoch[0] Train-accuracy=0.976774
[2017-04-25T19:02:59.834Z] INFO:root:Epoch[0] Time cost=3.190
[2017-04-25T19:02:59.850Z] INFO:root:Saved checkpoint to "/mnt/model/mnist-0001.params"
[2017-04-25T19:03:00.079Z] INFO:root:Epoch[0] Validation-accuracy=0.969148

================================================================================
Job [train_imagenet - e1bccebc-76d9-4cd1-885b-667ef93eb1f5] SUCCEEDED

In reality, you may want to modify the job command to save the trained model artifact to Amazon S3 so that subsequent prediction jobs can generate predictions against the model. For information about how to reference objects in Amazon S3 in your jobs, see the Creating a Simple “Fetch & Run” AWS Batch Job post.

Conclusion

In this post, I walked you through an example of running a GPU-enabled job in AWS Batch, using MXNet as the deep learning library. AWS Batch exposes primitives to allow you to focus on implementing the most efficient algorithm for your workload. It enables you to manage the lifecycle of submitted jobs and dynamically adapt the infrastructure requirements of your jobs within the specified bounds. It’s easy to take advantage of the horizontal scalability of compute instances provided by AWS in a cost-efficient manner.

MXNet, on the other hand, provides a rich set of highly optimized and scalable building blocks to start implementing your own deep learning algorithms. Together, you can not only solve problems requiring large neural network models, but also cut down on iteration time by harnessing the seemingly unlimited compute resources in Amazon EC2.

With AWS Batch managing the resources on your behalf, you can easily implement workloads such as hyper-parameter optimization to fan out tens or even hundreds of searches in parallel to find the best set of model parameters for your problem space. Moreover, because your jobs are run inside Docker containers, you may choose the tools and libraries that best fit your needs, build a Docker image, and submit your jobs using the image of your choice.

We encourage you to try it yourself and let us know what you think!