Announcing multi-head node support in Slurm for Amazon SageMaker HyperPod clusters

Posted on: Mar 26, 2025

We’re excited to introduce multi-head node support for Amazon SageMaker HyperPod clusters. This new capability improves fault tolerance and availability for large-scale generative AI model development workloads.

When a single head node manages job scheduling and resource allocation, it can become a single point of failure for customers running large-scale AI workloads. If this node fails or becomes unresponsive, the result can be job failures and downtime, ultimately increasing the time to train.

With this launch, customers can configure multiple head nodes within a single HyperPod Slurm cluster: one primary head (controller) node that controls all compute (worker) nodes and manages Slurm operations, plus additional backup head nodes on standby. If the primary head node fails, Slurm automatically transitions cluster operations to a backup node, minimizing downtime and keeping workloads available. Customers can still manage their own accounting databases and Slurm configuration.
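
To make the primary/backup arrangement concrete, here is a minimal Python sketch that renders the slurm.conf entries Slurm itself uses for controller failover. It is an illustration under stated assumptions, not HyperPod's provisioning logic: the helper name, the hostnames, and the shared state path are hypothetical, and on a HyperPod cluster the service configures these settings for you. The three parameters (SlurmctldHost, SlurmctldTimeout, StateSaveLocation) are standard Slurm options.

```python
def render_controller_config(head_nodes, state_dir, timeout_s=120):
    """Return slurm.conf lines declaring a primary and backup controllers.

    Hypothetical helper for illustration only; not part of any AWS or
    Slurm API.
    """
    if not head_nodes:
        raise ValueError("at least one head node is required")
    # Slurm treats the first SlurmctldHost entry as the primary controller;
    # later entries are backups, tried in the order they are listed.
    lines = [f"SlurmctldHost={name}" for name in head_nodes]
    # Seconds a backup waits for the primary to respond before taking over.
    lines.append(f"SlurmctldTimeout={timeout_s}")
    # Controller state must live on storage that every head node can reach,
    # so a backup can resume from the primary's saved state.
    lines.append(f"StateSaveLocation={state_dir}")
    return "\n".join(lines)


# Hostnames and path below are assumptions chosen for the example.
config = render_controller_config(
    ["head-1", "head-2", "head-3"], "/fsx/slurm/state"
)
print(config)
```

On a running cluster, the standard Slurm command `scontrol ping` reports which controllers are currently responding.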

This capability is available in all AWS Regions where SageMaker HyperPod is generally available. To learn more about multi-head node support and set up your first HyperPod cluster with multiple head nodes, visit the Amazon SageMaker HyperPod documentation.