Amazon SageMaker HyperPod Slurm clusters now support specifying minimum capacity requirements with continuous provisioning

Posted on: May 27, 2026

Amazon SageMaker HyperPod now supports minimum capacity requirements (MinCount) for clusters using Slurm orchestration with continuous provisioning. With continuous provisioning, HyperPod brings a cluster up with whatever partial capacity is available so you can start your AI/ML jobs quickly, and continues to provision the remaining instances asynchronously in the background. While this provides flexibility, some training workloads cannot start effectively until a minimum number of nodes is available. MinCount lets you specify the minimum number of instances that must be successfully provisioned before an instance group transitions to InService status, giving you greater control over when your cluster becomes available for job scheduling.

This is particularly useful for distributed training workloads using frameworks such as PyTorch FSDP, Megatron-LM, or NVIDIA NeMo, where training jobs are commonly configured with a fixed number of participating nodes and may not start efficiently or correctly with partial cluster capacity. It also benefits teams that need to guarantee a baseline GPU count to meet SLA or cost-efficiency targets before committing to a training run.

To set a minimum capacity threshold (MinCount) for an instance group, specify MinInstanceCount in the CreateCluster or UpdateCluster API request. The instance group remains in Creating or Updating status until the threshold is met. It then transitions to InService, and its nodes become available for Slurm job scheduling. HyperPod continues launching additional instances beyond MinCount until the target count is reached. If MinCount cannot be satisfied within 3 hours, HyperPod automatically rolls back the instance group to its last known good state.
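
The behavior described above can be sketched in a few lines of Python. This is an illustrative model only, not the HyperPod implementation: the helper names build_instance_group and group_status are hypothetical, the placement of MinInstanceCount inside the instance group entry and the rule that it must lie between 1 and the target count are assumptions, and "RolledBack" is a label invented for the sketch rather than a documented status. Consult the CreateCluster API reference for the exact request shape.

```python
from datetime import timedelta

# Rollback window stated in the announcement.
ROLLBACK_WINDOW = timedelta(hours=3)


def build_instance_group(name, instance_type, target_count, min_count):
    """Build a hypothetical instance-group entry for a CreateCluster request.

    Assumption: MinInstanceCount sits next to InstanceCount in the entry.
    """
    # Assumed constraint: the minimum must be positive and not exceed the target.
    if not 0 < min_count <= target_count:
        raise ValueError("min_count must be between 1 and target_count")
    return {
        "InstanceGroupName": name,
        "InstanceType": instance_type,
        "InstanceCount": target_count,   # target capacity
        "MinInstanceCount": min_count,   # threshold for InService
    }


def group_status(provisioned, min_count, elapsed, updating=False):
    """Model the status of an instance group as described in the text."""
    if provisioned >= min_count:
        # Threshold met: nodes become schedulable while provisioning continues.
        return "InService"
    if elapsed >= ROLLBACK_WINDOW:
        # Illustrative label for the automatic rollback, not a documented status.
        return "RolledBack"
    return "Updating" if updating else "Creating"


# Example: a 16-node target that becomes usable once 12 nodes are up.
group = build_instance_group("gpu-workers", "ml.p5.48xlarge", 16, 12)
print(group_status(8, group["MinInstanceCount"], timedelta(hours=1)))
print(group_status(12, group["MinInstanceCount"], timedelta(hours=1)))
```

In this model, a group with 8 of its 12 required instances stays in Creating after one hour, and moves to InService as soon as the twelfth instance is provisioned, even though the target of 16 has not yet been reached.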

MinCount for Slurm clusters with continuous provisioning is available in all AWS Regions where Amazon SageMaker HyperPod is supported. To get started with minimum capacity requirements for your cluster, see Minimum capacity requirements (MinCount) in the Amazon SageMaker AI documentation.