Manage Amazon SageMaker HyperPod clusters using the HyperPod CLI and SDK
Training and deploying large AI models requires advanced distributed computing capabilities, but managing these distributed systems shouldn’t be complex for data scientists and machine learning (ML) practitioners. The command line interface (CLI) and software development kit (SDK) for Amazon SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS) orchestration simplify how you manage cluster infrastructure and use the service’s distributed training and inference capabilities.
The SageMaker HyperPod CLI provides data scientists with an intuitive command-line experience, abstracting away the underlying complexity of distributed systems. Built on top of the SageMaker HyperPod SDK, the CLI offers straightforward commands for managing HyperPod clusters and common workflows like launching training or fine-tuning jobs, deploying inference endpoints, and monitoring cluster performance. This makes it ideal for quick experimentation and iteration.
A layered architecture for simplicity
The HyperPod CLI and SDK follow a multi-layered, shared architecture. The CLI and the Python module serve as user-facing entry points and are both built on top of common SDK components to provide consistent behavior across interfaces. For infrastructure automation, the SDK orchestrates cluster lifecycle management through a combination of AWS CloudFormation stack provisioning and direct AWS API interactions. Training and inference workloads, as well as integrated development environment (IDE) Spaces, are expressed as Kubernetes Custom Resource Definitions (CRDs), which the SDK manages through the Kubernetes API.
In this post, we demonstrate how to use the CLI and the SDK to create and manage SageMaker HyperPod clusters in your AWS account. We walk through a practical example and dive deeper into the user workflow and parameter choices.
This post focuses on cluster creation and management. For a deep dive into using the HyperPod CLI and SDK to submit training jobs and deploy inference endpoints, see our companion post: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.
Prerequisites
To follow the examples in this post, you must have the following prerequisites:
- An AWS account with access to SageMaker HyperPod, Amazon Simple Storage Service (Amazon S3), and Amazon FSx for Lustre.
- Sufficient service quota for creating the HyperPod cluster and instance groups.
- A local environment (either your local machine or a cloud-based compute environment) from which to run the SageMaker HyperPod CLI commands, configured as follows:
  - An operating system based on Linux or macOS.
  - Python 3.8 or later installed.
  - The AWS Command Line Interface (AWS CLI) configured with the appropriate credentials to use the aforementioned services.
Install the SageMaker HyperPod CLI
First, install the latest version of the SageMaker HyperPod CLI and SDK. The examples in this post are based on version 3.5.0. From your local environment, run the following command (you can alternatively install the CLI in a Python virtual environment):
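```bash
# Install (or upgrade) the CLI and SDK from the sagemaker-hyperpod
# package (package name as referenced later in this post)
pip install --upgrade sagemaker-hyperpod
```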
This command sets up the tools needed to interact with SageMaker HyperPod clusters. If you have an existing installation, make sure you have the latest version of the package (3.5.0 or later) to be able to use the features described in this post. To verify that the CLI is installed correctly, run the hyp command and check the output:
The output will look similar to the following illustrative excerpt (the exact text varies by CLI version) and includes instructions on how to use the CLI:
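```text
Usage: hyp [OPTIONS] COMMAND [ARGS]...

Commands:
  init                 Initialize a configuration in the current directory
  configure            Set values in the generated config.yaml
  validate             Validate config.yaml against the defined schema
  create               Submit the cluster stack to CloudFormation
  list                 List resources, such as cluster stacks
  describe             Describe a resource, such as a cluster stack
  update               Update an existing cluster
  delete               Delete a resource, such as a cluster stack
  set-cluster-context  Point the local Kubernetes context at a cluster
```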
For more information on CLI usage and the available commands and respective parameters, see the CLI reference documentation.
The HyperPod CLI provides commands to manage the full lifecycle of HyperPod clusters. The following sections explain how to create new clusters, monitor their creation, modify instance groups, and delete clusters.
Creating a new HyperPod cluster
HyperPod clusters can be created through the AWS Management Console or the HyperPod CLI, both of which provide streamlined experiences for cluster creation. The console offers the easiest and most guided approach, while the CLI is especially useful for customers who prefer a programmatic experience—for example, to enable reproducibility or to build automation around cluster creation. Both methods use the same underlying CloudFormation template, which is available in the SageMaker HyperPod cluster setup GitHub repository. For a walkthrough of the console-based experience, see the cluster creation experience blog post.
Creating a new cluster through the HyperPod CLI follows a configuration-based workflow: the CLI first generates configuration files, which are then edited to match the intended cluster specifications. These files are subsequently submitted as a CloudFormation stack that creates the HyperPod cluster along with the required resources, such as a VPC and an FSx for Lustre file system. Initialize a new cluster configuration by running the following command:

```bash
hyp init cluster-stack
```
This initializes a new cluster configuration in the current directory and generates a config.yaml file that you can use to specify the configuration of the cluster stack. It also creates a README.md with information about the functionality and workflow, as well as a template for the CloudFormation stack parameters in cfn_params.jinja.
The cluster stack’s configuration variables are defined in config.yaml. The following illustrative excerpt is based on the parameters referenced in this post (the generated file contains the full set of variables):
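```yaml
# Illustrative excerpt -- values shown are examples; the generated
# config.yaml contains the full set of variables
resource_name_prefix: hyp-eks-stack  # primary identifier for created resources
kubernetes_version: "1.33"           # Kubernetes version of the EKS cluster
```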
The resource_name_prefix parameter serves as the primary identifier for the AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. During cluster creation, a unique identifier is automatically appended to the prefix value to help ensure resource uniqueness.
The configuration can be edited either directly by opening config.yaml in an editor of your choice or by running the hyp configure command. The following example shows how to specify the Kubernetes version of the Amazon EKS cluster that will be created by the stack:
```bash
hyp configure --kubernetes-version 1.33
```
Updating variables through the CLI commands adds a layer of safety, because each value is validated against the defined schema before it is set in config.yaml.
Besides the Kubernetes version and the resource name prefix, config.yaml exposes further significant parameters, such as the instance group definitions shown below.
There are two important nuances when updating the configuration values through hyp configure commands:

- Underscores (_) in variable names within config.yaml become hyphens (-) in the CLI commands. Thus, kubernetes_version in config.yaml is configured via hyp configure --kubernetes-version in the CLI.
- Variables that contain lists of entries within config.yaml are configured as JSON lists in the CLI command. For example, multiple instance groups are configured within config.yaml as in the following sketch (field names are illustrative; check the generated file for the authoritative schema):
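```yaml
# Illustrative sketch -- field names are assumptions; check the
# generated config.yaml and README.md for the authoritative schema
instance_groups:
  - instance_group_name: worker-group-1
    instance_type: ml.g5.8xlarge
    instance_count: 2
  - instance_group_name: worker-group-2
    instance_type: ml.c5.4xlarge
    instance_count: 1
```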
This translates to the following CLI command:
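```bash
# JSON-list form of the instance groups sketched above (field names
# are assumptions; quote the JSON so the shell passes it verbatim)
hyp configure --instance-groups '[{"instance_group_name": "worker-group-1", "instance_type": "ml.g5.8xlarge", "instance_count": 2}, {"instance_group_name": "worker-group-2", "instance_type": "ml.c5.4xlarge", "instance_count": 1}]'
```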
After you’re done making the desired changes, validate your configuration file by running the following command:

```bash
hyp validate
```
This validates the parameters in config.yaml against the defined schema. If validation succeeds, the CLI confirms that the configuration is valid.
The cluster creation stack can be submitted to CloudFormation by running the following command:

```bash
hyp create --region <region>
```
The hyp create command performs validation and injects values from config.yaml into the cfn_params.jinja template. If no AWS Region is explicitly provided, the command uses the default Region from your AWS credentials configuration. The resolved configuration file and CloudFormation template values are saved to a timestamped subdirectory under the ./run/ directory, providing a lightweight local versioning mechanism to track which configuration was used to create a cluster at a given point in time. You can also choose to commit these artifacts to your version control system to improve reproducibility and auditability. If successful, the command outputs the CloudFormation stack ID, an ARN of the following form:
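```text
arn:aws:cloudformation:<region>:<account-id>:stack/<stack-name>/<unique-id>
```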
Monitoring the HyperPod cluster creation process
You can list the existing CloudFormation stacks by running the following command:

```bash
hyp list cluster-stack --region <region>
```
You can optionally filter the output by stack status by adding the following flag: --status "['CREATE_COMPLETE', 'UPDATE_COMPLETE']".
The output lists the matching stacks along with details such as their status.
Depending on the configuration in config.yaml, several nested stacks are created that cover different aspects of the HyperPod cluster setup, such as the EKSClusterStack, FsxStack, and VPCStack.
You can use the describe command to view details about any of the individual stacks:

```bash
hyp describe cluster-stack <stack-name> --region <region>
```
For example, you can describe the S3EndpointStack nested stack to inspect its status and details.
If any of the stacks show a CREATE_FAILED, ROLLBACK_*, or DELETE_* status, open the CloudFormation page in the console to investigate the root cause. Failed cluster creation stacks are often caused by insufficient service quotas for the cluster itself, the instance groups, or network components such as VPCs or NAT gateways. See the SageMaker HyperPod quotas documentation to learn more about the required quotas.
Connecting to a cluster
After the cluster stack has successfully created the required resources and the status has changed to CREATE_COMPLETE, you can configure the CLI and your local Kubernetes environment to interact with the HyperPod cluster.
Run the following command:

```bash
hyp set-cluster-context --cluster-name <cluster-name> --region <region>
```
The --cluster-name option specifies the name of the HyperPod cluster to connect to, and the --region option specifies the Region where the cluster has been created. Optionally, a specific namespace can be configured using the --namespace parameter. The command updates your local Kubernetes config in ~/.kube/config, so that you can use both the HyperPod CLI and Kubernetes utilities such as kubectl to manage the resources in your HyperPod cluster.
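For example, assuming kubectl is installed in your local environment, you can confirm that the new context works by listing the cluster's nodes:

```bash
# kubectl picks up the context set by hyp set-cluster-context
kubectl get nodes
```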
See our companion blog post for further information about how to use the CLI to submit training jobs and inference deployments to your newly created HyperPod cluster: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.
Modifying an existing HyperPod cluster
The hyp update cluster command modifies the instance groups and node recovery mode of an existing HyperPod cluster. This can be useful if you need to scale your cluster by adding or removing worker nodes, or if you want to change the instance types used by the node groups.
To update the instance groups, run a command similar to the following sketch, adapted with your cluster name and desired instance group settings (the instance group fields shown are illustrative assumptions; see the CLI reference documentation for the exact parameters):
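```bash
# Illustrative sketch -- the --instance-groups fields are assumptions;
# --node-recovery accepts Automatic or None (see below)
hyp update cluster \
  --cluster-name <cluster-name> \
  --instance-groups '[{"instance_group_name": "worker-group-1", "instance_type": "ml.g5.8xlarge", "instance_count": 4}]' \
  --node-recovery Automatic \
  --region <region>
```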
Note that all of the fields in the preceding command are required, even if, for example, only the instance count is modified. You can list the current cluster and instance group configurations to obtain the required values by running the hyp describe cluster <cluster-name> --region <region> command.
The update command returns a confirmation of the submitted changes.
The --node-recovery option lets you configure the node recovery behavior, which can be set to either Automatic or None. For information about the SageMaker HyperPod automatic node recovery feature, see Automatic node recovery.
Deleting an existing HyperPod cluster
To delete an existing HyperPod cluster, run the following command. Note that this action is not reversible:
```bash
hyp delete cluster-stack <stack-name> --region <region>
```
This command removes the specified CloudFormation stack and the associated AWS resources. You can use the optional --retain-resources flag to specify a comma-separated list of logical resource IDs to retain during the deletion process. It’s important to carefully consider which resources you need to retain, because the delete operation cannot be undone.
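For example, a hypothetical invocation that deletes the stack while retaining the FSx nested stack (using the logical resource IDs mentioned earlier) could look like the following:

```bash
# Hypothetical example -- logical resource IDs must match your template
hyp delete cluster-stack <stack-name> --region <region> --retain-resources FsxStack
```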
Before removing any resources, the command asks you to confirm the deletion.
SageMaker HyperPod SDK
SageMaker HyperPod also includes a Python SDK for programmatic access to the features described earlier. The Python SDK is used by the CLI commands and is installed when you install the sagemaker-hyperpod Python package as described in the beginning of this post. The HyperPod CLI is best suited for users who prefer a streamlined, interactive experience for common HyperPod management tasks like creating and monitoring clusters, training jobs, and inference endpoints. It’s particularly helpful for quick prototyping, experimentation, and automating repetitive HyperPod workflows through scripts or continuous integration and delivery (CI/CD) pipelines. In contrast, the HyperPod SDK provides more programmatic control and flexibility, making it the preferred choice when you need to embed HyperPod functionality directly into your application, integrate with other AWS or third-party services, or build complex, customized HyperPod management workflows. Consider the complexity of your use case, the need for automation and integration, and your team’s familiarity with programming languages when deciding whether to use the HyperPod CLI or SDK.
The SageMaker HyperPod CLI GitHub repository shows examples of how cluster creation and management can be implemented using the Python SDK.
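As a rough illustration of the programmatic workflow, the following minimal sketch mirrors the CLI flow shown earlier. The module, class, and method names are hypothetical assumptions, not the published API; refer to the repository's examples for working code:

```python
# Hypothetical sketch -- names below are illustrative assumptions,
# not the published SDK API; see the SageMaker HyperPod CLI GitHub
# repository for real, working examples.
from sagemaker.hyperpod import ClusterStack  # assumed import path

stack = ClusterStack(
    resource_name_prefix="hyp-eks-stack",  # same role as in config.yaml
    kubernetes_version="1.33",
)
stack.validate()                  # schema validation, as with hyp validate
stack.create(region="us-west-2")  # submit the CloudFormation stack
```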
Conclusion
The SageMaker HyperPod CLI and SDK simplify cluster creation and management. With the examples in this post, we’ve demonstrated how these tools provide value through:
- Simplified lifecycle management – From initial configuration to cluster updates and cleanup, the CLI aligns with how teams manage long-running training and inference environments and abstracts away unnecessary complexity.
- Declarative control when needed – The SDK exposes the underlying configuration model, so that teams can codify cluster specifications, instance groups, storage filesystems, and more.
- Integrated observability – Visibility into CloudFormation stacks is available without switching tools, supporting smooth iteration during development and operation.
Getting started with these tools is as straightforward as installing the SageMaker HyperPod package. The SageMaker HyperPod CLI and SDK provide the right level of abstraction for both data scientists looking to quickly experiment with distributed training and ML engineers building production systems.
If you’re interested in how to use the HyperPod CLI and SDK for submitting training jobs and deploying models to your new cluster, make sure to check our companion blog post: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.