Amazon SageMaker HyperPod

Reduce time to train foundation models by up to 40% and scale across more than a thousand AI accelerators efficiently

What is SageMaker HyperPod?

Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure. It is pre-configured with SageMaker’s distributed training libraries, which automatically split training workloads across more than a thousand AI accelerators so that they can be processed in parallel for improved model performance. SageMaker HyperPod helps keep your foundation model (FM) training uninterrupted by periodically saving checkpoints. When a hardware failure occurs, it automatically detects the failure, repairs or replaces the faulty instance, and resumes training from the last saved checkpoint, so you do not have to manage this process manually. This resilient environment allows you to train models for weeks or months in a distributed setting without disruption, reducing training time by up to 40%. SageMaker HyperPod is also highly customizable, allowing you to efficiently run and scale FM workloads and easily share compute capacity between different workloads, from large-scale training to inference.
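The checkpoint-and-resume behavior described above can be pictured with a small, generic sketch. This is not HyperPod's API or implementation; the function names, the JSON checkpoint file, the `CHECKPOINT_EVERY` interval, and the `fail_at` fault simulation are all assumptions made for illustration, using only the Python standard library.

```python
import json
import os

CHECKPOINT_EVERY = 10  # hypothetical checkpoint interval, in training steps


def save_checkpoint(path, step, state):
    """Write the checkpoint atomically so a crash never leaves a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (step, state) from the last checkpoint, or a fresh start."""
    if not os.path.exists(path):
        return 0, 0.0
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(path, total_steps, fail_at=None):
    """Run a toy training loop that starts from the last saved checkpoint.

    `fail_at` simulates a hardware fault at the given step (illustrative only).
    """
    step, state = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated hardware failure at step {step}")
        state += 1.0  # stand-in for one optimizer update
        step += 1
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, step, state)
    save_checkpoint(path, step, state)
    return step, state


def run_with_recovery(path, total_steps, fail_at):
    """Restart training after a simulated fault, as a supervisor process might."""
    restarts = 0
    while True:
        try:
            step, state = train(path, total_steps, fail_at)
            return step, state, restarts
        except RuntimeError:
            restarts += 1
            fail_at = None  # pretend the faulty instance has been replaced
```

In this sketch, a fault at step 25 with a checkpoint interval of 10 means only the five steps since the checkpoint at step 20 are repeated, rather than the whole run. The less work is lost per fault, the closer total training time stays to the fault-free case.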

Benefits of SageMaker HyperPod

Amazon SageMaker HyperPod is pre-configured with Amazon SageMaker distributed training libraries, allowing you to automatically split your models and training datasets across AWS cluster instances to help you efficiently scale training workloads.
SageMaker HyperPod supports popular cluster management and job scheduling systems such as Slurm and Amazon Elastic Kubernetes Service (EKS). As you scale your FM training and inference workloads, it provides a superior developer experience, the ability to manage containerized applications, dynamic cluster scaling, and cloud-native integration. In addition, you can share resources between training and inference to further optimize resource utilization.
SageMaker HyperPod enables a more resilient training environment by automatically detecting, diagnosing, and recovering from faults, allowing you to continuously train FMs for months without disruption.
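The first benefit above, splitting training datasets across cluster instances, can be illustrated with a minimal data-sharding sketch. This does not use or reproduce the SageMaker distributed training libraries; `shard_indices` and its parameters are hypothetical names, and the round-robin assignment is only one simple way to partition samples across workers.

```python
def shard_indices(num_samples, num_workers, rank):
    """Return the sample indices assigned to worker `rank` (round-robin)."""
    if num_workers <= 0:
        raise ValueError("num_workers must be positive")
    if not 0 <= rank < num_workers:
        raise ValueError("rank must be in [0, num_workers)")
    # Worker `rank` takes every num_workers-th sample, starting at its own rank.
    return list(range(rank, num_samples, num_workers))


# Ten samples split across four hypothetical workers.
shards = [shard_indices(10, 4, rank) for rank in range(4)]
```

Each sample lands on exactly one worker, and shard sizes differ by at most one, so the workers can process their portions in parallel with roughly equal load.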