Amazon SageMaker HyperPod now supports node actions from the console

Posted on: Feb 10, 2026

Amazon SageMaker HyperPod now lets you manage individual cluster nodes directly from the AWS Management Console. HyperPod cluster operators managing large-scale AI/ML workloads often need to connect to nodes for troubleshooting, reboot unresponsive instances, or replace degraded nodes. Previously, connecting to a node required manually constructing SSM connection strings, and node recovery actions such as reboot and replace required CLI commands. The console now provides a single interface for all node actions.
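For context, the manually constructed SSM connection string mentioned above can be sketched as follows. This is a minimal illustration and not the console's implementation. It assumes the target format `sagemaker-cluster:<cluster-id>_<instance-group-name>-<instance-id>` described in the HyperPod documentation, and that the cluster ID is the last segment of the cluster ARN. The function names, the ARN, the instance group name, and the instance ID are all hypothetical placeholders.

```python
def hyperpod_ssm_target(cluster_arn: str, instance_group: str, instance_id: str) -> str:
    """Build an SSM target string for a HyperPod node.

    Assumed format: sagemaker-cluster:<cluster-id>_<instance-group>-<instance-id>
    Verify this against the current HyperPod documentation before relying on it.
    """
    # Assumption: the cluster ID is the final path segment of the cluster ARN.
    cluster_id = cluster_arn.rsplit("/", 1)[-1]
    return f"sagemaker-cluster:{cluster_id}_{instance_group}-{instance_id}"


def ssm_start_session_command(target: str, region: str) -> str:
    """Return the AWS CLI command that opens an SSM session to the target."""
    return f"aws ssm start-session --target {target} --region {region}"


if __name__ == "__main__":
    # All identifiers below are made-up placeholders.
    target = hyperpod_ssm_target(
        "arn:aws:sagemaker:us-west-2:111122223333:cluster/abc123example",
        "worker-group-1",
        "i-0123456789abcdef0",
    )
    print(ssm_start_session_command(target, "us-west-2"))
```

The console's pre-populated commands remove the need to assemble this string by hand, which is where typos in the cluster ID or instance group name tend to creep in.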

With node actions in the console, you can connect to any node via AWS Systems Manager (SSM). The console provides pre-populated SSM CLI commands with copy-to-clipboard support, and it can launch an SSM session directly. SageMaker HyperPod clusters already support automatic replacement and reboot of unhealthy instances, but some scenarios, such as memory overruns or undetectable hardware degradation, may require manual intervention. Node actions in the console provide a consistent way to manually reboot nodes to recover from transient issues, delete unhealthy nodes, and replace nodes. Batch operations let you act on multiple nodes at once, so you can resolve node issues in minutes. This capability is especially valuable when running time-sensitive AI training and inference workloads where minimizing downtime is essential.
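For operators who script the same recovery actions outside the console, the batch pattern described above can be sketched roughly as follows. This is an illustrative helper, not an AWS-provided API. `run_node_action`, `chunked`, and the default batch size of 25 are assumptions made for the example. The `operation` argument is any callable that accepts `ClusterName` and `NodeIds` keyword arguments, which is the shape of boto3's SageMaker `batch_delete_cluster_nodes` call. Check the SageMaker API reference for the reboot and replace equivalents and for the per-call node limits.

```python
from typing import Any, Callable, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def run_node_action(
    operation: Callable[..., Dict[str, Any]],
    cluster_name: str,
    node_ids: Sequence[str],
    batch_size: int = 25,  # assumed value; confirm the real per-call limit
) -> List[Dict[str, Any]]:
    """Apply one node action to many nodes, one batch per call.

    Returns the raw response of every call so the caller can inspect
    per-node failures reported by the service.
    """
    responses = []
    for batch in chunked(node_ids, batch_size):
        responses.append(operation(ClusterName=cluster_name, NodeIds=batch))
    return responses


# Hypothetical usage with boto3 (not executed here; requires AWS credentials):
#   import boto3
#   sagemaker = boto3.client("sagemaker")
#   run_node_action(sagemaker.batch_delete_cluster_nodes, "my-cluster", node_ids)
```

Passing the operation in as a callable keeps the batching logic independent of which action is performed, and makes it easy to exercise with a stub before pointing it at a real cluster.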

This feature is available in all AWS Regions where Amazon SageMaker HyperPod is supported. You can perform all of these node actions on the HyperPod cluster management page in the console. To learn more, see the HyperPod documentation on replacing and rebooting nodes and on connecting to a node.