AWS Open Source Blog

Why use Docker containers for machine learning development?

diagram of host machine, container, code, and datasets and checkpoints

I like prototyping on my laptop, as much as the next person. When I want to collaborate, I push my code to GitHub and invite collaborators. And when I want to run experiments and need more compute power, I rent CPU and GPU instances in the cloud, copy my code and dependencies over, and run my experiments. If this process seems familiar, you may wonder: Why bother with Docker containers?

Aren’t containers exotic tools for your colleagues in the operations team, the brave IT professionals who make sure your code runs consistently and reliably, and scales with customer demand? The infrastructure experts who deploy your apps on Kubernetes, and manage them so you don’t have to wake up in the middle of the night to troubleshoot a deployment?

In this article, I’ll try to make a case for why you should consider using Docker containers for machine learning development. In the first half of the article, I’ll discuss key challenges you encounter when working with complex open source machine learning software and how adopting containers will alleviate these pains. Then I’ll walk through setting up a Docker container-based development environment and show how you can use it for collaborating and scaling workloads on a cluster.

Machine learning development environment: Bare necessities

Let’s start with the four basic ingredients you need for a machine learning development environment:

  1. Compute: High-performance CPUs and GPUs to train models.
  2. Storage: For large training datasets and the metadata you generate during training.
  3. Frameworks and libraries: To provide APIs and execution environment for training.
  4. Source control: For collaboration, backup, and automation.

As a machine learning researcher, developer, or data scientist, you can set up an environment with these four ingredients on a single Amazon Elastic Compute Cloud (Amazon EC2) instance or a workstation at home.

Basic ingredients for a machine learning development environment

So, what’s wrong with this setup?

Nothing, really. Most development setups have looked like this for decades: no clusters, no shared file systems.

Except for a small community of researchers in High-Performance Computing (HPC) who develop code and run it on supercomputers, the rest of us rely on our own dedicated machines for development.

As it turns out, machine learning has more in common with HPC than it does with traditional software development. Like HPC workloads, machine learning workloads can benefit from faster execution and quicker experimentation when running on a large cluster. To take advantage of a cluster for machine learning training, you’ll need to make sure your development environment is portable and training is reproducible on a cluster.

Why you need portable training environments

At some point in your machine learning development process, you’ll hit one of these two walls:

  1. You’re experimenting and you have too many variations of your training scripts to run, and you’re bottlenecked by your single machine.
  2. You’re running training on a large model with a large dataset, and it’s not feasible to run on your single machine and get results in a reasonable amount of time.

These are two common reasons why you may want to run machine learning training on a cluster. These are also the reasons why scientists use supercomputers such as Summit to run their experiments. To address the first wall, you can run every model independently and asynchronously on a cluster of computers. To address the second, you can distribute a single model across a cluster and train it faster.

Both these solutions require that you be able to successfully and consistently reproduce your development training setup on a cluster. And that’s challenging because the cluster could be running different operating systems and kernel versions; different GPUs, drivers and runtimes; and different software dependencies than your development machine.

Another reason why you need portable machine learning environments is for collaborative development. Sharing your training scripts with your collaborator through version control is easy. Guaranteeing reproducibility without sharing your full execution environment with code, dependencies, and configurations is harder, as we’ll see in the next section.

Machine learning, open source, and specialized hardware

A challenge with machine learning development environments is that they rely on complex and continuously evolving open source machine learning frameworks and toolkits, and complex and continuously evolving hardware ecosystems. Both are positive qualities that we desire, but they pose short-term challenges.

diagram showing Your code has more dependencies than you think.

How many times have you run machine learning training and asked yourself these questions:

  • Is my code taking advantage of all available resources on CPUs and GPUs?
  • Do I have the right hardware libraries? Are they the right versions?
  • Why does my training code work fine on my machine, but crashes on my colleague’s, when the environments are more or less identical?
  • I updated my drivers today and training is now slower/errors out. Why?

If you examine your machine learning software stack, you will notice that you spend most of your time in the magenta box called My code in the accompanying figure. This includes your training scripts, your utility and helper routines, your collaborators’ code, community contributions, and so on. As if that were not complex enough, you would also notice that your dependencies include:

  • the machine learning framework API that is evolving rapidly;
  • the machine learning framework dependencies, many of which are independent projects;
  • CPU-specific libraries for accelerated math routines;
  • GPU-specific libraries for accelerated math and inter-GPU communication routines; and
  • the GPU driver, which needs to be aligned with the GPU compiler used to compile the above GPU libraries.

Due to the high complexity of an open source machine learning software stack, when you move your code to a collaborator’s machine or a cluster environment, you introduce multiple points of failure. In the figure below, notice that even if you control for changes to your training code and the machine learning framework, there are lower-level changes that you may not account for, resulting in failed experiments.

Ultimately, this costs you the most precious commodity of all— your time.

Migrating training code isn't the same as migrating your entire execution environment. Dependencies potentially introduce multiple points of failure when moving from development environment to training infrastructure

Why not virtual Python environments?

You could argue that virtual environment approaches such as conda and virtualenv address these issues. They do, but only partially. Several non-Python dependencies are not managed by these solutions. Due to the complexity of a typical machine learning stack, a large part of framework dependencies, such as hardware libraries, are outside the scope of virtual environments.
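To make this concrete, here is a minimal sketch of what a virtual environment does and doesn’t capture (paths and library names are illustrative):

```shell
# Create a throwaway virtual environment and record its Python dependencies.
python3 -m venv /tmp/demo-env
/tmp/demo-env/bin/pip freeze > /tmp/requirements.txt

# requirements.txt now pins your Python packages -- but the GPU driver,
# CUDA runtime, cuDNN, and NCCL live on the host, outside the environment.
# This lookup searches the system loader cache, not the virtualenv:
ldconfig -p | grep -i cudnn || echo "cuDNN is a system library, not a pip package"
```

No matter how faithfully you recreate the virtual environment on another machine, everything below the Python layer can still differ.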

Enter containers for machine learning development

Machine learning software is part of a fragmented ecosystem with multiple projects and contributors. That can be a good thing, as everyone benefits from everyone else’s contributions, and developers always have plenty of options. The downside is dealing with problems such as consistency, portability, and dependency management. This is where container technologies come in. In this article, I won’t discuss the general benefits of containers, but I will share how machine learning benefits from them.

Containers can fully encapsulate not just your training code, but the entire dependency stack down to the hardware libraries. What you get is a machine learning development environment that is consistent and portable. With containers, both collaboration and scaling on a cluster become much easier. If you develop code and run training in a container environment, you can conveniently share not just your training scripts, but your entire development environment by pushing your container image into a container registry, and having a collaborator or a cluster management service pull the container image and run it to reproduce your results.

Containers allow you to encapsulate all your dependencies into a single package that you can push to a registry and make available for collaborators and orchestrators on a training cluster

What you should and shouldn’t include in your machine learning development container

There isn’t a single right answer, and how your team operates is up to you, but there are a couple of options for what to include:

  1. Only the machine learning frameworks and dependencies: This is the cleanest approach. Every collaborator gets the same copy of the same execution environment. They can clone their training scripts into the container at runtime or mount a volume that contains the training code.
  2. Machine learning frameworks, dependencies, and training code: This approach is preferred when scaling workloads on a cluster. You get a single executable unit of machine learning software that can be scaled on a cluster. Depending on how you structure your training code, you could allow your scripts to execute variations of training to run hyperparameter search experiments.

Sharing your development container is also easy. You can share it as a:

  1. Container image: This is the easiest option. This allows every collaborator or a cluster management service, such as Kubernetes, to pull a container image, instantiate it, and execute training immediately.
  2. Dockerfile: This is a lightweight option. Dockerfiles contain instructions on what dependencies to download, build, and compile to create a container image. Dockerfiles can be versioned along with your training code. You can automate the process of creating container images from Dockerfiles by using continuous integration services, such as AWS CodeBuild.
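As an illustration, a minimal Dockerfile for option 2 might look like the following. The base image tag and file names here are assumptions; substitute the framework image and training script you actually use.

```shell
# Write a minimal example Dockerfile (base image and file names are placeholders).
cat > Dockerfile <<'EOF'
# Base image is an assumption -- pick the framework image you actually use
FROM tensorflow/tensorflow:latest-gpu

# Project-level dependencies layered on top of the framework image
RUN pip install --no-cache-dir jupyterlab

# Bake the training code in (option 2); omit these lines for option 1
COPY train.py /opt/program/train.py
WORKDIR /opt/program
CMD ["python", "train.py"]
EOF
```

Running `docker build -t my-tf-dev:latest .` against this file then produces a container image you can version, share, and rebuild from scratch.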

Container images for widely used open source machine learning frameworks or libraries are available on Docker Hub and are usually contributed by the framework maintainers. You’ll find TensorFlow, PyTorch, MXNet, and others in their respective repositories. Exercise caution about where you download from and what type of container image you download.

Most upstream repositories build their containers to work everywhere, which means they have to be compatible with most CPU and GPU architectures. If you know exactly what system you’ll be running your container on, you’re better off selecting container images that have been optimized and qualified for your system configuration.

Setting up your machine learning development environment with Jupyter, using Docker containers

AWS hosts AWS Deep Learning Containers with popular open source deep learning frameworks, qualified for compute-optimized CPU and GPU instances. Next, I will explain how to set up a development environment using containers with just a few steps. For the purpose of this example, I will assume you are working with an Amazon EC2 instance.

AWS optimized deep learning container images for TensorFlow, PyTorch, MXNet. Choose a container image based on your requirements - Python 2, Python 3, CPU optimized, GPU optimized, training optimized, inference optimized.

Step 1: Launch your development instance.

C5, P3, and G4 family instances are all ideal for machine learning workloads. The latter two offer up to eight NVIDIA GPUs per instance. For a short guide on launching your instance, read the Getting Started with Amazon EC2 documentation.

When selecting the Amazon Machine Image (AMI), choose the latest Deep Learning AMI, which includes all the latest deep learning frameworks, Docker runtime, and NVIDIA driver and libraries. While it may seem handy to use the deep learning framework natively installed on the AMI, working with deep learning containers gets you one step closer to a more portable environment.

Step 2: SSH to the instance and download a deep learning container.

ssh -i ~/.ssh/<pub_key> ubuntu@<IP_ADDR>

Log in to the Deep Learning Container registry:

$(aws ecr get-login --no-include-email --region YOUR_REGION --registry-ids 763104351884)

Select the framework you’ll be working with from the list on the AWS Deep Learning Container webpage and pull the container.

To pull the latest TensorFlow container with GPU support in the us-west-2 region, run:

docker pull 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:<TAG>

Step 3: Instantiate the container and set up Jupyter.

docker run -it --runtime=nvidia -v $PWD:/projects --network=host --name=tf-dev <IMAGE_URI>

--runtime=nvidia instructs docker to make NVIDIA GPUs on the host available inside the container.

-v instructs docker to mount a directory so that it can be accessed inside the container environment. If you have datasets or code files, make them available on a specific directory and mount it with this option.

--network=host instructs the container to share the host’s network namespace. If the container runs a service at port 9999 (as shown below), this lets you access the service on the same port on the host’s IP.

pip install jupyterlab
jupyter lab --ip=0.0.0.0 --port=9999 --allow-root --NotebookApp.token='' --NotebookApp.password=''

Open up a second terminal window on your client machine and run the following to establish a tunnel at port 9999. This lets you access the Jupyter notebook server running inside the container from your host machine.

ssh -N -L 9999:localhost:9999 -i ~/.ssh/<pub_key> ubuntu@<IP_ADDR>

Open up your favorite browser and enter http://localhost:9999.

And voila, you’ve successfully set up your container-based development environment. Every piece of code you run on this Jupyter notebook will run within the deep learning container environment.

Jupyter lab client on your local laptop or desktop, that is connected to a Jupyter server running inside a Docker container on a powerful EC2 instance with an NVIDIA V100 GPU. Your very own sandboxed development environment.

Step 4: Using the container-based development environment.

Containers are meant to be stateless execution environments, so save your work on mounted directories that you specified with the -v flag when calling docker run. To exit a container, stop the Jupyter server and type exit on the terminal. To restart your stopped container, run:

docker start tf-dev

Set up your tunnel as described in Step 3, and you can resume your development.

Now, let’s say you made changes to the base container—for example, installing Jupyter into the container as in Step 3. The cleanest way to do this is to track all your custom installations and capture it in a Dockerfile. This allows you to recreate a container image with your changes from scratch. It also serves the purpose of documenting your changes and can be versioned along with the rest of your code.

The quicker way to do this with minimal disruption to your development process is to commit those changes into a new container image by running:

sudo docker commit tf-dev my-tf-dev:latest

Note: Container purists will argue that this isn’t a recommended way to save your changes, and that they should be documented in a Dockerfile instead. That’s good advice, and it’s good practice to track your customizations by writing a Dockerfile. If you don’t, the risk is that over time you’ll lose track of your changes and will become reliant on one “working” image, much like relying on a compiled binary with no access to source code.

If you want to share the new container with your collaborators, push it to a container registry, such as Docker Hub or Amazon Elastic Container Registry (Amazon ECR). To push it to an Amazon ECR, first create a registry, log in to it, and push your container:

aws ecr create-repository --repository-name my-tf-dev
$(aws ecr get-login --no-include-email --region <REGION>)
docker tag my-tf-dev:latest <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/my-tf-dev:latest
docker push <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/my-tf-dev:latest
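The full image URI in the tag and push commands is just the registry endpoint plus the repository name. A quick sketch of how it is composed (the account ID and region here are placeholders):

```shell
ACCOUNT_ID=123456789012   # placeholder -- your AWS account ID
REGION=us-west-2          # placeholder -- your region
REPO=my-tf-dev            # repository name created with `aws ecr create-repository`

# Registry endpoint + repository + tag
IMAGE_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/${REPO}:latest"
echo "${IMAGE_URI}"   # prints 123456789012.dkr.ecr.us-west-2.amazonaws.com/my-tf-dev:latest
```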

You can now share this container image with a collaborator, and your code should work as it did on your machine. The added benefit is that you can now use the same container to run large-scale workloads on a cluster. Let’s learn how.

Machine learning training containers and scaling them on clusters

To manage training clusters, popular options include Kubernetes with KubeFlow, or the fully managed Amazon SageMaker.

Most cluster management solutions, such as Kubernetes or Amazon ECS, will schedule and run containers on a cluster. Alternatively, you could use a fully managed service, such as Amazon SageMaker, where instances are provisioned when you need them and torn down automatically when the job is done. It also offers a fully managed suite of services for data labeling, a hosted Jupyter notebook development environment, managed training clusters, hyperparameter optimization, managed model hosting services, and an IDE that ties all of it together.

To leverage these solutions and run machine learning training on a cluster, you must build a container and push it to a registry.

If you’ve incorporated container-based machine learning development as described previously, you can feel assured that the same container environment you’ve been developing on will be scheduled and run at scale on a cluster—no framework version surprises, no dependency surprises.

To run a distributed training job using Kubernetes and KubeFlow on 2 nodes, you’ll need to write up a config file in YAML that looks something like this:


Excerpt of a Kubernetes config file for distributed training with TensorFlow and Horovod API that can be found here on Github. Note that the screenshot does not show the full file.

Under the image section, you’ll specify your Docker image with your training scripts. Under command, you’ll specify the command required for training. Because this is a distributed training job, you’ll run an MPI job with the mpirun command.
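For reference, here is a stripped-down sketch of such a config, modeled on the Kubeflow MPIJob spec. The image URI, replica counts, and training command are placeholders; consult the full file on GitHub for the real schema.

```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tf-distributed-training
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/my-tf-dev:latest
            command: ["mpirun", "-np", "2", "python", "train.py"]
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: worker
            image: <ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/my-tf-dev:latest
```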

You can submit this job to a Kubernetes cluster as follows (assuming a cluster is set up and running, and you have KubeFlow installed):

kubectl apply -f eks_tf_training_job-cpu.yaml

To do the same with Amazon SageMaker on 8 nodes, you’ll use the Amazon SageMaker SDK and submit a distributed training job using the Estimator API as shown below:

from sagemaker.tensorflow import TensorFlow  # Amazon SageMaker Python SDK

distributions = {'mpi': {
                  'enabled': True,
                  'processes_per_host': hvd_processes_per_host,
                  'custom_mpi_options': '-verbose -x NCCL_DEBUG=INFO -x OMPI_MCA_btl_vader_single_copy_mechanism=none'
                  }}
estimator_hvd = TensorFlow(entry_point='<YOUR_TRAINING_SCRIPT>',
                           source_dir           = 'code',
                           role                 = role,
                           image_name           = <YOUR_DOCKER_IMAGE>,
                           hyperparameters      = hyperparameters,
                           train_instance_count = 8,
                           train_instance_type  = 'p3.2xlarge',
                           output_path          = output_path,
                           model_dir            = model_dir,
                           distributions        = distributions)

If you’re interested in the topic of distributed training and would like to try it out yourself, follow instructions in the Distributed Training Workshop.

Distributed training workshop homepage screenshot

To learn more about how you can use both Kubernetes and Amazon SageMaker together, read my blog post “Kubernetes and Amazon SageMaker for machine learning — best of both worlds”.

Be paranoid, but don’t panic

The machine learning community moves fast. New research gets implemented into APIs in open source frameworks within weeks or months of publication. When software evolves this rapidly, keeping up with the latest developments while maintaining the quality, consistency, and reliability of your products can be challenging. So be paranoid, but don’t panic: you’re not alone, and there are plenty of best practices in the community that you can use to make sure you’re benefiting from the latest advances.

Moving to containerized machine learning development is one way to address these challenges, as I hope I’ve explained in this article.

If you have questions, please reach out to me on Twitter, LinkedIn, or leave a comment below.

Shashank Prasanna

Shashank Prasanna is an AI & Machine Learning Developer Advocate at Amazon Web Services (AWS) where he focuses on helping engineers, developers and data scientists solve challenging problems with machine learning. Prior to joining AWS, he worked at NVIDIA, MathWorks (makers of MATLAB & Simulink) and Oracle in product marketing, product management, and software development roles. Shashank holds an M.S. in electrical engineering from Arizona State University.