AWS Machine Learning Blog

Use the Amazon SageMaker local mode to train on your notebook instance

Amazon SageMaker recently launched support for local training using the pre-built TensorFlow and MXNet containers.  Amazon SageMaker is a flexible machine learning platform that allows you to more effectively build, train, and deploy machine learning models in production.  The Amazon SageMaker training environment is managed: it spins up instances, loads algorithm containers, brings in data from Amazon S3, runs your code, outputs results to Amazon S3, and tears down the cluster, without you having to think about any of it.  The ability to offload training to a separate multi-node GPU cluster is a huge advantage. Even though spinning up new hardware every time is good for repeatability and security, it can add friction when you're testing or debugging your algorithm code.

The Amazon SageMaker deep learning containers allow you to write TensorFlow or MXNet scripts as you typically would. However, now you deploy them to pre-built containers in a managed, production-grade environment for both training and hosting.  Previously, these containers were only available within these Amazon SageMaker-specific environments.  They’ve recently been open sourced, which means you can pull the containers into your working environment and use custom code built into the Amazon SageMaker Python SDK to test your algorithm locally, just by changing a single line of code.  This means that you can iterate and test your work without having to wait for a new training or hosting cluster to be built each time.  Iterating with a small sample of the dataset locally and then scaling to train on the full dataset in a distributed manner is common in machine learning.  Typically this would mean rewriting the entire process and hoping not to introduce any bugs.  The Amazon SageMaker local mode allows you to switch seamlessly between local and distributed, managed training by simply changing one line of code. Everything else works the same.

The local mode in the Amazon SageMaker Python SDK can emulate CPU (single and multi-instance) and GPU (single instance) SageMaker training jobs by changing a single argument in the TensorFlow or MXNet estimators.  To do this, it uses Docker Compose and NVIDIA Docker.  It will also pull the Amazon SageMaker TensorFlow or MXNet containers from Amazon ECR, so you'll need to be able to access a public Amazon ECR repository from your local environment.  If you choose to use a SageMaker notebook instance as your local environment, this script will install the necessary prerequisites.  Otherwise, you can install them yourself, and make sure you've upgraded to the latest version of the SageMaker Python SDK with pip install -U sagemaker.
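Before your first local-mode run, it can save a failed job to confirm the Docker tooling is actually on your PATH. A minimal sanity-check sketch using only the standard library (for 'local_gpu' you'd additionally need nvidia-docker, which this simple check doesn't cover):

```python
import shutil

def missing_local_mode_tools():
    """Return the Docker tools local mode needs that aren't on PATH."""
    required = ['docker', 'docker-compose']  # plus nvidia-docker for GPU training
    return [tool for tool in required if shutil.which(tool) is None]

# An empty list means the basic tooling is in place
print(missing_local_mode_tools())
```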

Example use case

We have two example notebooks showing how to use local mode with MNIST, one each for TensorFlow and MXNet.  However, since Amazon SageMaker also recently announced the availability of larger notebook instance types, let's move to a bigger example that trains an image classifier on 50,000 color images across 4 GPUs.  We'll train a ResNet network using MXNet Gluon on the CIFAR-10 dataset, and we'll do it entirely on an ml.p3.8xlarge notebook instance.  The same code we build here could easily transition to the Amazon SageMaker managed training environment if we want it to run on multiple machines, or if we want to train on a recurring basis without managing any hardware.

Let’s start by creating a new notebook instance. Log into the AWS Management Console, select the Amazon SageMaker service, and choose Create notebook instance from the Amazon SageMaker console dashboard to open the following page.

After the notebook instance is running, you can create a new Jupyter notebook and begin setting up.  Or you can follow along with a predefined notebook here.  We’ll skip some background in order to focus on local mode.  If you’re new to deep learning, this blog post may be helpful in getting started.  Or, see the other SageMaker example notebook, which fits ResNet on CIFAR-10, but trains in the Amazon SageMaker managed training environment.

After you have the prerequisites installed, libraries loaded, and dataset downloaded, you’ll load the dataset to your Amazon S3 bucket.  Note that even though we’re training locally, we’ll still access data from Amazon S3 to maintain consistency with SageMaker training.

inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-gluon-cifar10')
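The call above pushes the local data directory to the session's default S3 bucket and returns the resulting S3 URI, which is what we pass to training later. The returned URI follows a simple s3://bucket/prefix shape; a sketch of that mapping (the bucket name below is just a hypothetical example):

```python
def s3_input_uri(bucket, key_prefix):
    """Build the s3:// URI that upload_data returns for a bucket and key prefix."""
    return 's3://{}/{}'.format(bucket, key_prefix)

# Hypothetical default bucket name, for illustration only
print(s3_input_uri('sagemaker-us-east-1-123456789012', 'data/DEMO-gluon-cifar10'))
# s3://sagemaker-us-east-1-123456789012/data/DEMO-gluon-cifar10
```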

Now, we'll define our MXNet estimator.  The estimator points to the cifar10.py script that contains our network specification and train() function.  It also supplies information about the job, such as the hyperparameters and IAM role.  But most importantly, it sets train_instance_type to 'local_gpu'.  This is the only change required to switch from the SageMaker managed training environment to training entirely within the local notebook instance.

m = MXNet('cifar10.py',
          role=role,
          train_instance_count=1,
          train_instance_type='local_gpu',
          hyperparameters={'batch_size': 1024,
                           'epochs': 50,
                           'learning_rate': 0.1,
                           'momentum': 0.9})

m.fit(inputs)
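Because the instance type is the only thing that changes between environments, it's convenient to keep that switch in one place while iterating. A trivial sketch (the managed instance type shown is just an example choice):

```python
def instance_type(local=True):
    # 'local_gpu' trains on this notebook instance's own GPUs;
    # a managed type such as 'ml.p3.8xlarge' trains on a SageMaker-provisioned cluster
    return 'local_gpu' if local else 'ml.p3.8xlarge'

print(instance_type(local=True))   # local_gpu
print(instance_type(local=False))  # ml.p3.8xlarge
```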

The first time the estimator runs, it needs to download the container image from its Amazon ECR repository, but then training can begin immediately.  There's no need to wait for a separate training cluster to be provisioned.  In addition, on subsequent runs, which may be necessary when iterating and testing, changes to your MXNet or TensorFlow script start running almost instantly.

Training in local mode also allows us to easily monitor metrics like GPU consumption to ensure that our code is written properly to take advantage of the hardware we're using.  And, in this case, we can confirm by running nvidia-smi in a terminal that we're using all four of the ml.p3.8xlarge's GPUs to train our ResNet model very quickly.
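If you'd rather poll utilization from the notebook itself than from a terminal, nvidia-smi's CSV query mode is easy to parse. A sketch, with the parsing helper separated out so it can be exercised even without a GPU attached:

```python
import subprocess

def parse_gpu_utilization(csv_output):
    """Parse 'utilization.gpu' CSV output: one integer percentage per GPU."""
    return [int(line) for line in csv_output.split()]

def gpu_utilization():
    # nvidia-smi prints one line per GPU with this query
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=utilization.gpu',
         '--format=csv,noheader,nounits'])
    return parse_gpu_utilization(out.decode())

# Example: four busy GPUs on an ml.p3.8xlarge might report
print(parse_gpu_utilization('97\n95\n96\n94'))  # [97, 95, 96, 94]
```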

After our estimator finishes training, we can create and test our endpoint locally.  Again, we'll specify 'local_gpu' for the instance_type.

predictor = m.deploy(initial_instance_count=1, instance_type='local_gpu')

Now we can generate a few predictions to confirm that our inference code works.  It's a good idea to do this before deploying to a production endpoint, and you could also use it for a one-time evaluation of model accuracy.

from cifar10_utils import read_images

filenames = ['images/airplane1.png',
             'images/automobile1.png',
             'images/bird1.png',
             'images/cat1.png',
             'images/deer1.png',
             'images/dog1.png',
             'images/frog1.png',
             'images/horse1.png',
             'images/ship1.png',
             'images/truck1.png']

image_data = read_images(filenames)

for i, img in enumerate(image_data):
    response = predictor.predict(img)
    print('image {}: class: {}'.format(i, int(response)))
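The integers the endpoint returns are CIFAR-10 class indices, which follow the dataset's standard label order. A small helper (hypothetical, not part of cifar10_utils) makes the output human-readable:

```python
# Standard CIFAR-10 label order: index 0 is 'airplane', index 9 is 'truck'
CIFAR10_CLASSES = ['airplane', 'automobile', 'bird', 'cat', 'deer',
                   'dog', 'frog', 'horse', 'ship', 'truck']

def class_name(index):
    """Map a CIFAR-10 class index (0-9) to its label."""
    return CIFAR10_CLASSES[int(index)]

print(class_name(0))  # airplane
```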

Our result should look similar to:

image 0: class: 0
image 1: class: 9
image 2: class: 2
image 3: class: 3
image 4: class: 4
image 5: class: 5
image 6: class: 6
image 7: class: 7
image 8: class: 8
image 9: class: 9
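Since our ten test images were fed in class order, the expected class for image i is simply i, and we can score this sample run directly. In the output above only image 1 (the automobile, predicted as class 9, truck) is wrong, so 9 of 10 are correct. A quick sanity check, assuming Python 3:

```python
predicted = [0, 9, 2, 3, 4, 5, 6, 7, 8, 9]  # classes printed above
expected = list(range(10))                   # one test image per class, in order

accuracy = sum(p == e for p, e in zip(predicted, expected)) / len(expected)
print(accuracy)  # 0.9
```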

Now that we’ve validated our training and inference script, we can deploy to the SageMaker managed environments to train at large scale or on a recurring basis, and to generate predictions from a real-time hosted endpoint.

But first, let's clean up the local endpoint, since only one endpoint can be running locally at a time.

m.delete_endpoint()

You can shut down your notebook instance from the Amazon SageMaker console by navigating to the Notebook page and selecting Stop.  This will avoid incurring any compute charges until you choose to start it back up.  Or, you can delete your notebook instance by selecting Actions and Delete.

Conclusion

This blog post shows you how to use the Amazon SageMaker Python SDK local mode on a recently launched multi-GPU notebook instance type to quickly test a large-scale image classification model.  You can use local mode training to accelerate your test and debugging cycle today.  Just make sure you have the latest version of the SageMaker Python SDK, install a few other tools, and change one line of code!


About the Author

David Arpin is AWS’s AI Platforms Selection Leader and has a background in managing Data Science teams and Product Management.