Amazon SageMaker HyperPod now supports programmatic node reboot and replacement

Posted on: Nov 26, 2025

Today, Amazon SageMaker HyperPod announces the general availability of new APIs that enable programmatic rebooting and replacement of SageMaker HyperPod cluster nodes. SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). The new BatchRebootClusterNodes and BatchReplaceClusterNodes APIs enable customers to programmatically reboot or replace unresponsive or degraded cluster nodes, providing a consistent, orchestrator agnostic approach to node recovery operations.

The new APIs enhance node management capabilities for both Slurm and EKS orchestrated clusters complementing existing node reboot and replacement workflows. Existing orchestrator-specific methods, such as Kubernetes labels for EKS clusters and Slurm commands for Slurm clusters, remain available alongside the newly introduced programmatic capabilities for reboot and replace operations through these purpose-built APIs. When cluster nodes become unresponsive due to issues such as memory overruns or hardware degradation, recovery operations such as node reboots and replacements maybe be necessary and can be initiated through these new APIs. These capabilities are particularly valuable when running time-sensitive workloads. For instance, when a Slurm controller, login or compute node becomes unresponsive, administrators can trigger a reboot operation using the API and monitor its progress to get nodes back to operational status. Similarly, EKS cluster administrators can replace degraded worker nodes programmatically. Each API supports batch operations of up to 25 instances, enabling efficient management of large-scale recovery scenarios.

The reboot and replace APIs are currently supported in three AWS regions where SageMaker HyperPod is available: US East (Ohio), Asia Pacific (Mumbai), and Asia Pacific (Tokyo).The APIs can be accessed through the AWS CLI, SDK, or API calls. For more information, see the Amazon SageMaker HyperPod documentation for BatchRebootClusterNodes and BatchReplaceClusterNodes.