Amazon SageMaker HyperPod now supports API-driven Slurm configuration
Amazon SageMaker HyperPod now supports API-driven Slurm configuration, enabling you to define Slurm topology and shared filesystem configurations directly in the cluster create and update APIs or through the AWS Console. SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs).
With API-driven configuration, you can specify the following directly in the cluster API definition or in the advanced configuration section of the AWS Console: Slurm node types (Controller, Login, and Compute) for cluster instance groups; instance-group-to-partition mappings; and FSx for Lustre and FSx for OpenZFS filesystem mounts per instance group. If you edit partition-to-node mappings directly in Slurm's native configuration files to fine-tune cluster resource assignments, Slurm's partition-to-node configuration can drift from HyperPod's view of the cluster. A new cluster-level SlurmConfigStrategy helps you manage this drift with three options: Managed, Overwrite, and Merge. With the Managed strategy, you manage instance-group-to-partition mappings entirely through the API or Console, and drift in partition-to-node mappings is detected automatically during scale-up and scale-down operations. When drift is detected, cluster updates are paused until you resolve it in one of three ways: switch to the Overwrite strategy to force the API-defined mappings, switch to the Merge strategy to preserve manual customizations, or update the Slurm configuration directly so that it aligns with HyperPod.
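The drift handling described above can be pictured with a small, self-contained Python model. This is an illustrative sketch only, not HyperPod's implementation and not the shape of the CreateCluster or UpdateCluster request: the function names, the dictionary layout (partition name mapped to a set of instance group names), and the exception are hypothetical. In particular, treating Merge as a union of the API-defined and Slurm-side mappings is an assumption about what "preserve manual customizations" means.

```python
from enum import Enum


class SlurmConfigStrategy(Enum):
    # The three option names come from the announcement.
    MANAGED = "Managed"
    OVERWRITE = "Overwrite"
    MERGE = "Merge"


class DriftDetectedError(Exception):
    """Hypothetical error: stands in for 'cluster updates are paused'."""


def find_drift(api_mapping, slurm_mapping):
    """Return {partition: (api_groups, slurm_groups)} for partitions that differ."""
    drift = {}
    for partition in sorted(set(api_mapping) | set(slurm_mapping)):
        api_groups = set(api_mapping.get(partition, ()))
        slurm_groups = set(slurm_mapping.get(partition, ()))
        if api_groups != slurm_groups:
            drift[partition] = (api_groups, slurm_groups)
    return drift


def resolve(api_mapping, slurm_mapping, strategy):
    """Return the effective partition mapping under the given strategy."""
    if strategy is SlurmConfigStrategy.MANAGED:
        drift = find_drift(api_mapping, slurm_mapping)
        if drift:
            # Managed: refuse to proceed until the drift is resolved.
            raise DriftDetectedError(f"drift in partitions: {sorted(drift)}")
        return {p: set(groups) for p, groups in api_mapping.items()}
    if strategy is SlurmConfigStrategy.OVERWRITE:
        # Overwrite: the API-defined mappings win.
        return {p: set(groups) for p, groups in api_mapping.items()}
    # Merge (assumed semantics): keep manual additions alongside API mappings.
    return {
        p: set(api_mapping.get(p, ())) | set(slurm_mapping.get(p, ()))
        for p in set(api_mapping) | set(slurm_mapping)
    }


# Hypothetical example: someone added "compute-b" to "train" by hand in Slurm.
api_defined = {"train": {"compute-a"}, "dev": {"compute-b"}}
slurm_side = {"train": {"compute-a", "compute-b"}, "dev": {"compute-b"}}
```

In this model, `resolve(api_defined, slurm_side, SlurmConfigStrategy.MANAGED)` raises `DriftDetectedError` because the `train` partition differs, while Overwrite returns the API-defined mapping unchanged and Merge keeps the manually added `compute-b` in `train`.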
API-driven Slurm configuration is available in all AWS Regions where SageMaker HyperPod is available. To get started, you can use the AWS Management Console, AWS CLI, AWS CloudFormation, or AWS SDKs. For more information, see the Amazon SageMaker HyperPod documentation for creating clusters using the Console or the CLI, and the API reference for CreateCluster and UpdateCluster.