Posted On: Dec 20, 2023

Today, AWS announces a major version release of the Amazon SageMaker model parallel library (SMP), which is now compatible with PyTorch Fully Sharded Data Parallel (FSDP) APIs and can accelerate deep learning model training by up to 20%. SMP accelerates the training of large models with billions of parameters by automatically partitioning and distributing the model across multiple accelerators and compute instances. You can get started with SMP in minutes and speed up your existing PyTorch FSDP training scripts with just a few lines of code.
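To illustrate the kind of script the announcement refers to, here is a minimal sketch of a native PyTorch FSDP training loop. The commented-out `torch.sagemaker` lines only indicate where SMP's "few lines" of integration would plausibly go; they are an assumption, not the library's confirmed API, and the full integration steps are in the SMP documentation.

```python
# Minimal native PyTorch FSDP training sketch (launch with torchrun, which sets
# RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for init_process_group).
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# import torch.sagemaker as tsm   # assumed SMP v2 entry point, not verified here
# tsm.init()                      # assumed SMP initialization call

def train(model: nn.Module, loader, epochs: int = 1) -> None:
    torch.distributed.init_process_group("nccl")
    local_rank = torch.distributed.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Standard FSDP wrapping: parameters, gradients, and optimizer state are
    # sharded across all ranks in the default process group.
    model = FSDP(model.cuda())
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for _ in range(epochs):
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(model(inputs.cuda()), labels.cuda())
            loss.backward()
            optimizer.step()
```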

PyTorch FSDP is a popular distributed training technique that reduces the memory footprint of training by sharding a model’s weights, gradients, and optimizer states across the accelerators in a cluster. With this release, the SageMaker model parallel library’s new APIs are compatible with and further accelerate PyTorch FSDP training scripts, so customers can easily upgrade their existing workloads when training on SageMaker. With just a few lines of code, customers can enable state-of-the-art techniques such as hybrid sharded data parallelism, which lets them tune the degree of model sharding and thus control the memory and communication requirements of a training job (see the sketch below). This release also extends FSDP’s capabilities with tensor parallel training, enabling models with hundreds of billions of parameters to be trained by partitioning and distributing individual layers of the model across accelerator devices. To get started with SageMaker model parallel, see our documentation.
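For context on what hybrid sharded data parallelism means at the PyTorch level, the sketch below uses native FSDP's `ShardingStrategy.HYBRID_SHARD`, which shards model state within a node and replicates it across nodes. How SMP exposes the sharding degree itself is configured through SageMaker and is described in the documentation; it is not reproduced here.

```python
# Illustrative only: hybrid sharding expressed with native PyTorch FSDP.
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def wrap_hybrid(model: nn.Module) -> FSDP:
    # HYBRID_SHARD shards parameters, gradients, and optimizer state within a
    # node and replicates them across nodes, reducing cross-node communication
    # relative to fully sharding across the whole cluster.
    # Assumes torch.distributed is already initialized (e.g. via torchrun).
    return FSDP(
        model.cuda(),
        sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    )
```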