Amazon SageMaker HyperPod features

Reduce time to train foundation models by up to 40% and scale across more than a thousand AI accelerators efficiently

Automatic cluster health check and repair

To detect faulty hardware, SageMaker HyperPod regularly runs an array of health checks for GPU and network integrity. If any instances become defective during a training workload, SageMaker HyperPod automatically detects the failure and swaps the faulty nodes for healthy ones.
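The per-node status behind these health checks is also visible through the SageMaker API. Below is a minimal sketch that inspects it with boto3's ListClusterNodes call; the cluster name is hypothetical, and the exact response fields should be verified against the current API reference.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Minimal sketch: list the nodes of a HyperPod cluster and print the health
# status that SageMaker reports for each one. "my-hyperpod-cluster" is a
# hypothetical name; field names reflect the documented response shape and
# should be checked against the API reference.
response = sagemaker.list_cluster_nodes(ClusterName="my-hyperpod-cluster")
for node in response["ClusterNodeSummaries"]:
    status = node["InstanceStatus"]["Status"]  # e.g. Running, Pending, or a failure state
    print(f'{node["InstanceGroupName"]}/{node["InstanceId"]}: {status}')
```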

High-performing distributed training libraries

With SageMaker’s distributed training libraries, you can run highly scalable and cost-effective custom data parallel and model parallel deep learning training jobs. SageMaker HyperPod is preconfigured with SageMaker distributed libraries. With only a few lines of code, you can enable data parallelism in your training scripts. SageMaker HyperPod makes it faster to perform distributed training by automatically splitting your models and training datasets across AWS GPU instances.
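As a rough illustration of "a few lines of code", the sketch below shows the pattern for enabling the SageMaker data parallel library in a PyTorch training script, assuming the smdistributed.dataparallel package that ships with HyperPod's training images. The LOCAL_RANK variable is set by the distributed launcher, and build_model() is a placeholder for your own model code.

```python
import os

import torch
import torch.distributed as dist

# Importing the SageMaker data parallel library registers it as a
# PyTorch distributed backend.
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

def main():
    # Initialize the process group with the SMDDP backend instead of NCCL.
    dist.init_process_group(backend="smddp")

    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    torch.cuda.set_device(local_rank)

    model = build_model().to(local_rank)  # build_model() is a placeholder
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    # ... the rest of the training loop is unchanged ...
```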


Advanced observability for improved performance

You can use built-in ML tools in SageMaker HyperPod to improve model performance. For example, Amazon SageMaker with TensorBoard helps you save development time by visualizing the model architecture to identify and remediate convergence issues, while Amazon SageMaker Debugger captures metrics and profiles training jobs in real time. The integration with Amazon CloudWatch Container Insights provides deeper insights into cluster performance, health, and utilization.
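A minimal sketch of the training-script side of this: write standard TensorBoard summaries to a local log directory that you configure SageMaker's TensorBoard output to collect. The path below and the train_loader/training_step names are placeholders, not a prescribed layout.

```python
from torch.utils.tensorboard import SummaryWriter

# Assumed local output path; use whatever directory you point SageMaker's
# TensorBoard output configuration at.
writer = SummaryWriter(log_dir="/opt/ml/output/tensorboard")

for step, batch in enumerate(train_loader):   # train_loader is a placeholder
    loss = training_step(batch)               # training_step() is a placeholder
    writer.add_scalar("train/loss", loss, global_step=step)

writer.close()
```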

Workload scheduling and orchestration

SageMaker HyperPod lets you schedule and orchestrate workloads with Slurm or Amazon EKS, and the environment is highly customizable: you can select and install any frameworks or tools you need. Clusters are provisioned with the instance type and count you choose, and they are retained for your use across workloads.
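For illustration, the sketch below provisions a HyperPod cluster with a chosen instance type and count through boto3's CreateCluster call. The cluster name, S3 URI, and IAM role ARN are hypothetical, and the request shape should be checked against the current API reference; lifecycle scripts typically set up Slurm, or the cluster can instead be orchestrated through Amazon EKS.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Minimal sketch: create a HyperPod cluster with one instance group.
# All names, ARNs, and the S3 URI below are hypothetical placeholders.
sagemaker.create_cluster(
    ClusterName="my-hyperpod-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 16,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/MyHyperPodRole",
        }
    ],
)
```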

Scalability and optimized resource utilization

You can manage and operate SageMaker HyperPod clusters with a consistent Kubernetes-based administrator experience. This lets you efficiently run and scale FM workloads, from training and fine-tuning to experimentation and inference. You can easily share compute capacity and switch between Slurm and EKS for different types of workloads.
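As one example of that Kubernetes-based administrator view, the sketch below uses the standard Kubernetes Python client to list the nodes of the EKS cluster a HyperPod cluster is attached to and print each node's instance type and readiness. It assumes your kubeconfig already points at that cluster.

```python
from kubernetes import client, config

# Minimal sketch: a Kubernetes-based admin view of cluster capacity.
config.load_kube_config()  # assumes kubeconfig targets the attached EKS cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    # node.kubernetes.io/instance-type is the standard well-known label.
    instance_type = node.metadata.labels.get("node.kubernetes.io/instance-type", "unknown")
    ready = next((c.status for c in node.status.conditions if c.type == "Ready"), "Unknown")
    print(f"{node.metadata.name}\t{instance_type}\tReady={ready}")
```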