SageMaker HyperPod now supports gang scheduling for distributed training workloads
Amazon SageMaker HyperPod task governance now supports gang scheduling, which ensures all pods required for a distributed training job are ready before training begins. Administrators can configure gang scheduling to prevent wasted compute from partial job runs and avoid deadlocks from jobs waiting for resources.
Distributed AI/ML training jobs that data scientists run on Amazon SageMaker HyperPod clusters using the EKS orchestrator require multiple pods to work together across nodes with pod-to-pod communication. When some pods start but others do not, jobs can hold onto resources without making progress, block other workloads, and increase costs. Gang scheduling resolves this by monitoring all pods in a workload and pulling the workload back if not all pods are ready within a configurable time limit. Pulled-back workloads are automatically requeued so that they do not stall. Administrators can adjust settings in the HyperPod console, such as how long to wait for pods to be ready, how to handle node failures, whether to admit workloads one at a time to avoid deadlocks on busy clusters, and how retries are scheduled.
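The all-or-nothing behavior described above can be illustrated with a minimal Python sketch. This is a conceptual model only, not the HyperPod or EKS API: the function names, the default values, and the exponential backoff used for retries are all assumptions made for illustration, and the actual settings are configured through the HyperPod console.

```python
from typing import Optional, Sequence


def gang_is_ready(
    pod_ready_seconds: Sequence[Optional[float]],
    pods_required: int,
    timeout_seconds: float,
) -> bool:
    """Return True only if every required pod became ready within the timeout.

    pod_ready_seconds holds, per pod, the seconds it took to become ready,
    or None if the pod never became ready (hypothetical input format).
    """
    if len(pod_ready_seconds) < pods_required:
        return False  # some pods were never even scheduled
    return all(
        t is not None and t <= timeout_seconds for t in pod_ready_seconds
    )


def next_retry_delay(
    attempt: int, base_seconds: float = 60.0, max_seconds: float = 3600.0
) -> float:
    """Delay before requeueing after a failed attempt (1-based).

    Exponential backoff is an assumed retry policy, chosen for illustration.
    """
    return min(base_seconds * 2 ** (attempt - 1), max_seconds)


def admit(
    pod_ready_seconds: Sequence[Optional[float]],
    pods_required: int,
    timeout_seconds: float,
    attempt: int,
) -> str:
    """Start the workload if the whole gang is ready; otherwise pull it back."""
    if gang_is_ready(pod_ready_seconds, pods_required, timeout_seconds):
        return "running"
    # Pull back: release every pod's resources and requeue the whole workload.
    return f"requeued (retry in {next_retry_delay(attempt):.0f}s)"
```

For example, with four required pods and a 300-second limit, `admit([10, 20, None, 30], 4, 300, attempt=1)` pulls the workload back because one pod never became ready, while `admit([10, 20, 25, 30], 4, 300, attempt=2)` lets it run. The key design point is that the decision is made for the workload as a whole, so a partially started job never keeps holding resources.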
This capability is currently available for Amazon SageMaker HyperPod clusters using the EKS orchestrator in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Asia Pacific (Jakarta), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Stockholm), Europe (Spain), and South America (São Paulo).
To learn more, visit the SageMaker HyperPod webpage and the HyperPod task governance documentation.