Distributed training libraries

Complete distributed training up to 40% faster

Fastest and easiest methods for training large deep learning models and datasets
With only a few lines of additional code, add either data parallelism or model parallelism to your PyTorch and TensorFlow training scripts
Optimize distributed training jobs to achieve near-linear scaling efficiency and complete training faster

Amazon SageMaker offers the fastest and easiest methods for training large deep learning models and datasets. Using partitioning algorithms, SageMaker's distributed training libraries automatically split large deep learning models and training datasets across AWS GPU instances in a fraction of the time it takes to do so manually. SageMaker achieves these efficiencies through two techniques: data parallelism and model parallelism. Model parallelism splits a model that is too large to fit on a single GPU into smaller parts and distributes those parts across multiple GPUs for training. Data parallelism splits a large dataset across GPUs so that the pieces train concurrently, which improves training speed.
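
To make the data parallelism idea concrete, here is a minimal sketch in plain Python. It is not SageMaker's implementation: it assumes a toy one-weight model y = w * x with squared error, and the names shard, local_gradient, and data_parallel_step are hypothetical. Each simulated worker computes a gradient on its own shard of the data, and the gradients are then averaged, which is the role an all-reduce plays across real GPUs.

```python
# Toy illustration of data parallelism: every worker holds a full copy of the
# model (here a single weight w) and sees only its own shard of the data.
# All names are hypothetical; this is not the SageMaker library API.

def shard(data, num_workers):
    """Split data into num_workers interleaved shards."""
    return [data[i::num_workers] for i in range(num_workers)]

def local_gradient(w, samples):
    """Mean gradient of the squared error (w*x - y)**2 over one shard."""
    return sum(2 * (w * x - y) * x for x, y in samples) / len(samples)

def data_parallel_step(w, data, num_workers, lr):
    """One synchronous SGD step: local gradients, then an averaged update."""
    grads = [local_gradient(w, s) for s in shard(data, num_workers)]
    avg_grad = sum(grads) / len(grads)  # stands in for an all-reduce
    return w - lr * avg_grad

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples of y = 3x
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=4, lr=0.01)
print(round(w, 3))  # the learned weight approaches 3.0
```

Because the shards here are equally sized, the averaged gradient matches the gradient computed on the full dataset, so the parallel update is equivalent to a single-worker update while each worker touches only a quarter of the data.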

ML use cases such as image classification and text-to-speech demand increasingly large computational resources and datasets. For example, BERT, a state-of-the-art natural language processing (NLP) model released in 2018, uses 340 million parameters. Now, state-of-the-art NLP models, such as T5, GPT-3, and Turing-NLG, have set new accuracy records, but require tens to hundreds of billions of parameters. Training models like T5 or GPT-3 on a single GPU instance can take several days, slowing your ability to deploy the latest iterations into production. Additionally, implementing your own data and model parallelism strategies manually can take weeks of experimentation.

With only a few lines of additional code, you can add either data parallelism or model parallelism to your PyTorch and TensorFlow training scripts and Amazon SageMaker will apply your selected method for you. SageMaker will determine the best approach to split your model by using graph partitioning algorithms to balance the computation of each GPU while minimizing the communication between GPU instances. SageMaker also optimizes your distributed training jobs through algorithms that are designed to fully utilize AWS compute and network infrastructure in order to achieve near-linear scaling efficiency, which allows you to complete training faster than manual implementations.
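
As a rough illustration of what those few lines can look like, the sketch below builds the distribution setting that SageMaker Python SDK framework estimators accept. The dictionary keys follow the SDK documentation as best understood here and should be checked against the SDK version you use. The helper function names, the parameter values, and the train.py script in the commented usage are assumptions made for illustration.

```python
# Sketch of the `distribution` argument passed to a SageMaker Python SDK
# framework estimator. Helper names and values are illustrative assumptions.

def data_parallel_distribution():
    """Settings that enable the SageMaker data parallelism library."""
    return {"smdistributed": {"dataparallel": {"enabled": True}}}

def model_parallel_distribution(partitions, microbatches, processes_per_host):
    """Settings that enable the SageMaker model parallelism library via MPI."""
    return {
        "smdistributed": {
            "modelparallel": {
                "enabled": True,
                "parameters": {
                    "partitions": partitions,      # number of model partitions
                    "microbatches": microbatches,  # microbatches per batch
                },
            }
        },
        "mpi": {"enabled": True, "processes_per_host": processes_per_host},
    }

# Hypothetical usage (needs the `sagemaker` package, an IAM role, and a script):
#   from sagemaker.pytorch import PyTorch
#   estimator = PyTorch(entry_point="train.py", role=role, instance_count=2,
#                       instance_type="ml.p3dn.24xlarge",
#                       distribution=data_parallel_distribution())
print(data_parallel_distribution())
print(model_parallel_distribution(partitions=2, microbatches=4, processes_per_host=8))
```

Keeping the setting in a small helper makes it easy to switch between the two libraries without touching the rest of the job definition.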


Data parallelism library

Reduce training time

Amazon SageMaker reduces training time by making it easy to split training data across GPUs. For example, training Mask R-CNN on p3dn.24xlarge instances runs 25% faster on SageMaker than with open-source data parallelism solutions like Horovod. The reduction in training time is possible because SageMaker manages the GPUs running in parallel to achieve optimal synchronization.

Optimized for AWS

SageMaker's data parallelism library provides communication algorithms that are designed to fully utilize the AWS network and infrastructure to achieve near-linear scaling efficiency. For example, BERT on p3dn.24xlarge instances achieves a scaling efficiency of 90% using SageMaker, a 26% improvement over the same model using Horovod.
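
Scaling efficiency is commonly defined as the throughput measured on N GPUs divided by N times the throughput of a single GPU, so 100% means perfectly linear scaling. The sketch below applies that common definition. This page does not say how its figures were measured, and the throughput numbers in the example are hypothetical, chosen only so that the result comes out at 90%.

```python
def scaling_efficiency(single_gpu_throughput, multi_gpu_throughput, num_gpus):
    """Fraction of ideal linear speedup actually achieved on num_gpus GPUs."""
    ideal = single_gpu_throughput * num_gpus
    return multi_gpu_throughput / ideal

# Hypothetical measurements: 100 samples/s on 1 GPU, 5760 samples/s on 64 GPUs.
efficiency = scaling_efficiency(100.0, 5760.0, 64)
print(f"{efficiency:.0%}")  # 90%
```

At 90% efficiency, 64 GPUs deliver the work of about 57.6 ideal GPUs; the remaining capacity is lost mostly to communication and synchronization.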

Use your existing framework APIs

SageMaker provides data parallelism optimizations through the same APIs that are already common for distributed training, so you are not required to learn a new library. To enable data parallelism, you can use the DistributedDataParallel (DDP) API for PyTorch and the Horovod API for TensorFlow.

Model parallelism library

Automatic and efficient model partitioning

Manually partitioning large models can take weeks of effort for even the most experienced data science teams. Amazon SageMaker can split your model in seconds by profiling it and finding the most efficient way to partition it across GPUs.
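
To give a feel for the partitioning problem, the sketch below solves a deliberately simplified version of it. It is not the algorithm SageMaker uses. It assumes the model is a plain sequence of layers, that each layer has a profiled integer cost, and that communication between GPUs is free. The function partition_layers and the example costs are hypothetical. It splits the layers into contiguous stages so that the busiest GPU carries as little work as possible.

```python
def partition_layers(costs, num_gpus):
    """Split layers into at most num_gpus contiguous stages, minimizing the
    total cost of the most heavily loaded stage. Costs must be integers."""

    def stages_for(limit):
        # Greedily pack layers into stages whose total cost stays <= limit.
        stages, current, load = [], [], 0
        for index, cost in enumerate(costs):
            if current and load + cost > limit:
                stages.append(current)
                current, load = [], 0
            current.append(index)
            load += cost
        stages.append(current)
        return stages

    # Binary search for the smallest per-stage limit that fits on num_gpus.
    lo, hi = max(costs), sum(costs)
    while lo < hi:
        mid = (lo + hi) // 2
        if len(stages_for(mid)) <= num_gpus:
            hi = mid
        else:
            lo = mid + 1
    return stages_for(lo)

layer_costs = [4, 2, 3, 7, 1, 5, 2, 6]  # hypothetical profiled costs per layer
stages = partition_layers(layer_costs, num_gpus=3)
print(stages)                                             # [[0, 1, 2], [3, 4, 5], [6, 7]]
print([sum(layer_costs[i] for i in s) for s in stages])   # [9, 13, 8]
```

A production partitioner also has to weigh memory limits and the cost of moving activations between GPUs, which is why doing this by hand for a large model is slow and error-prone.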

Minimal code changes

Amazon SageMaker requires changing fewer than 10 lines of code in your TensorFlow or PyTorch training scripts to split your models across multiple GPUs. You can reuse existing APIs from TensorFlow and PyTorch to quickly get up and running.

Optimize resources

Amazon SageMaker offers maximum utilization of your GPU instances by splitting your training batches into smaller microbatches. The smaller microbatches are fed to GPUs in an efficient pipeline to keep all GPU devices simultaneously active.
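
The effect of microbatching can be seen in a small simulation. The sketch below is a simplified model rather than the library's scheduler: it assumes a forward-only pipeline in which every stage takes one time step per microbatch, and all function names are hypothetical. At each tick, stage s works on microbatch t - s, so a new microbatch enters the pipeline while earlier ones move on to later GPUs.

```python
def split_into_microbatches(batch, num_microbatches):
    """Split a batch (a list) into equally sized microbatches."""
    size, remainder = divmod(len(batch), num_microbatches)
    if remainder:
        raise ValueError("batch size must be divisible by num_microbatches")
    return [batch[i * size:(i + 1) * size] for i in range(num_microbatches)]

def pipeline_schedule(num_stages, num_microbatches):
    """Return schedule[tick][stage] = microbatch index, or None when idle."""
    ticks = num_stages + num_microbatches - 1
    return [
        [t - s if 0 <= t - s < num_microbatches else None
         for s in range(num_stages)]
        for t in range(ticks)
    ]

def utilization(schedule):
    """Fraction of (tick, stage) slots in which a GPU is doing work."""
    slots = [slot for tick in schedule for slot in tick]
    return sum(slot is not None for slot in slots) / len(slots)

print(split_into_microbatches(list(range(8)), 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
for m in (1, 4, 16):
    schedule = pipeline_schedule(num_stages=4, num_microbatches=m)
    print(m, round(utilization(schedule), 2))
```

With four stages, utilization in this toy model rises from 0.25 with one microbatch to 0.57 with four and 0.84 with sixteen, because the idle time spent filling and draining the pipeline is spread over more useful work.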

Use cases

Object detection

For object detection, model training time is often a bottleneck, slowing data science teams down as they wait several days or weeks for results. For example, autonomous vehicle object detection models need to train on up to thousands of gigabytes of data to improve vehicle perception. SageMaker's data parallelism library can help data science teams efficiently split training data and quickly scale to hundreds or even thousands of GPUs, reducing training time from days to minutes.

Natural language processing

In natural language understanding, data scientists often improve model accuracy by increasing the number of layers and the size of the neural network, resulting in models with billions of parameters such as GPT-2, GPT-3, and T5. Splitting model layers and operations across GPUs can take weeks, but SageMaker's model parallelism library automatically analyzes and splits the model efficiently to enable data science teams to start training large models within minutes.

Computer vision

In computer vision, hardware constraints often force data scientists to pick batch sizes or input sizes that are smaller than they would prefer. For example, bigger inputs may improve model accuracy but may cause out-of-memory errors or force smaller batch sizes that perform poorly. Similarly, larger batch sizes improve GPU utilization and performance but may hinder model accuracy. SageMaker's distributed training libraries offer the flexibility to train models efficiently with smaller batch sizes or with bigger inputs.


Latent Space
“At Latent Space we're building a neural rendered game engine where anyone can create at the speed of thought. Driven by advances in language modelling, we're working to incorporate semantic understanding of both text and images to determine what to generate. Our current focus is on utilizing information retrieval to augment large-scale model training, for which we have sophisticated ML pipelines. This setup presents a challenge on top of distributed training since there are multiple data sources and models being trained at the same time. As such, we're leveraging the new distributed training capabilities in Amazon SageMaker to efficiently scale training for large generative models.”

Sarah Jane Hong, Co-founder & Chief Science Officer, Latent Space

"Turbine is a simulation-driven drug discovery company delivering targeted cancer therapies to patients. We use machine learning to train our in silico human cell model, called Simulated Cell, based on a proprietary network architecture. By accurately predicting various interventions on the molecular level, Simulated Cell helps us to discover new cancer drugs and find combination partners for existing therapies. Training of our simulation is something we continuously iterate on, but on a single machine each training takes days, hindering our ability to iterate on new ideas quickly. We are very excited about distributed training on Amazon SageMaker, which we are expecting to decrease our training times by 90% and to help us focus on our main task: to write a best-of-the-breed codebase for the cell model training. SageMaker ultimately allows us to become more effective in our primary mission: to identify and develop novel cancer drugs for patients."

Kristóf Szalay, CTO, Turbine



Fast training and near-linear scaling with DataParallel in Amazon SageMaker (23:31)


Train billion-parameter models with model parallelism on Amazon SageMaker (28:52)