Amazon SageMaker HyperPod now supports continuous provisioning for Slurm-orchestrated clusters
Amazon SageMaker HyperPod now extends continuous provisioning support to clusters using the Slurm orchestrator, enabling greater flexibility and efficiency for enterprise customers running large-scale AI/ML training workloads. AI/ML customers running Slurm-based clusters need to start training quickly, scale seamlessly, perform maintenance without disrupting operations, and have granular visibility into cluster operations. Previously, if any instance group could not be fully provisioned, the entire cluster creation or scaling operation failed and rolled back, causing delays and requiring manual intervention.
With continuous provisioning for Slurm, training jobs can begin immediately on available instances while SageMaker HyperPod provisions the remaining capacity in the background. The system uses priority-based provisioning to bring up the Slurm controller node first, followed by login and worker nodes in parallel, so your cluster reaches an operational state as quickly as possible. HyperPod retries failed node launches asynchronously and adds nodes to the Slurm cluster automatically as they become available, so clusters reach their desired scale without manual intervention. You can now run non-blocking scaling operations across multiple instance groups concurrently: a capacity shortage in one instance group no longer blocks scaling in others. These capabilities help customers reduce time-to-training, maximize resource utilization, and focus on innovation rather than infrastructure management.
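The provisioning order described above can be pictured with a small toy model. This is only an illustrative sketch, not HyperPod's implementation: the InstanceGroup class, the role labels, and the per-round capacity numbers are all hypothetical. It shows the controller group being filled first, the remaining groups being filled independently, and a shortfall being retried on later passes.

```python
from dataclasses import dataclass


@dataclass
class InstanceGroup:
    """Hypothetical stand-in for one instance group in a cluster."""
    name: str
    role: str       # "controller", "login", or "worker" (labels assumed here)
    desired: int
    running: int = 0


def provision_round(groups, capacity):
    """Try to launch the missing nodes of each group in one background pass.

    `capacity` maps group name -> instances obtainable in this pass
    (made-up numbers). A shortage in one group does not affect the others.
    """
    for group in groups:
        missing = group.desired - group.running
        group.running += min(missing, capacity.get(group.name, 0))


def provision(groups, capacity_per_round):
    """Run several passes and record the running count after each one."""
    controllers = [g for g in groups if g.role == "controller"]
    others = [g for g in groups if g.role != "controller"]
    history = []
    for capacity in capacity_per_round:
        # Priority step: the controller group is handled before anything else.
        provision_round(controllers, capacity)
        if all(g.running == g.desired for g in controllers):
            # Login and worker groups are then filled independently.
            provision_round(others, capacity)
        history.append({g.name: g.running for g in groups})
    return history


groups = [
    InstanceGroup("controller", "controller", desired=1),
    InstanceGroup("login", "login", desired=1),
    InstanceGroup("workers", "worker", desired=4),
]
# Pass 1: only 2 of 4 workers are obtainable; pass 2: none; pass 3: the rest.
history = provision(groups, [
    {"controller": 1, "login": 1, "workers": 2},
    {"workers": 0},
    {"workers": 2},
])
for snapshot in history:
    print(snapshot)
```

In this model the cluster is usable after the first pass, with the controller, the login node, and two workers running, and it reaches the desired four workers once capacity appears in the third pass.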
This feature is available for new SageMaker HyperPod clusters that use the Slurm orchestrator. To enable continuous provisioning, set the NodeProvisioningMode parameter to "Continuous" when you create a cluster with the CreateCluster API. You can also enable it when creating new clusters through the AWS CLI or the SageMaker AI console.
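As a rough sketch of what such a request might look like, the Python below assembles the keyword arguments for a CreateCluster call. Only NodeProvisioningMode and its "Continuous" value come from this announcement. The other field names follow the CreateCluster request shape as commonly documented and should be checked against the API reference. The cluster name, group names, instance types, S3 URI, script name, and role ARN are placeholders, and any Slurm-specific settings a real cluster needs are omitted.

```python
import json


def instance_group(name, instance_type, count, lifecycle_s3_uri, execution_role_arn):
    """Build one instance group entry (field names assumed from CreateCluster)."""
    return {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": count,
        "LifeCycleConfig": {
            "SourceS3Uri": lifecycle_s3_uri,
            "OnCreate": "on_create.sh",  # placeholder lifecycle script name
        },
        "ExecutionRole": execution_role_arn,
    }


def build_create_cluster_request(cluster_name, instance_groups):
    """Assemble CreateCluster arguments with continuous provisioning enabled."""
    return {
        "ClusterName": cluster_name,
        "InstanceGroups": instance_groups,
        # The parameter and value named in this announcement.
        "NodeProvisioningMode": "Continuous",
    }


# All values below are placeholders for illustration only.
role_arn = "arn:aws:iam::111122223333:role/ExampleHyperPodRole"
s3_uri = "s3://example-bucket/lifecycle-scripts/"

request = build_create_cluster_request(
    "example-slurm-cluster",
    [
        instance_group("controller", "ml.c5.xlarge", 1, s3_uri, role_arn),
        instance_group("workers", "ml.p5.48xlarge", 4, s3_uri, role_arn),
    ],
)
print(json.dumps(request, indent=2))
```

With AWS credentials configured, a dictionary like this could be passed to boto3 as `boto3.client("sagemaker").create_cluster(**request)`. The sketch stops short of making the call so that it runs without an AWS account.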
This feature is available in all AWS Regions where Amazon SageMaker HyperPod is supported. To learn more about continuous provisioning for Slurm clusters, see the Amazon SageMaker HyperPod User Guide.