Manage Amazon SageMaker HyperPod clusters using the HyperPod CLI and SDK
Training and deploying large AI models requires advanced distributed computing capabilities, but managing these distributed systems shouldn’t be complex for data scientists and machine learning (ML) practitioners. The command line interface (CLI) and software development kit (SDK) for Amazon SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS) orchestration simplify how you manage cluster infrastructure and use the service’s distributed training and inference capabilities.
The SageMaker HyperPod CLI provides data scientists with an intuitive command-line experience, abstracting away the underlying complexity of distributed systems. Built on top of the SageMaker HyperPod SDK, the CLI offers straightforward commands for managing HyperPod clusters and common workflows like launching training or fine-tuning jobs, deploying inference endpoints, and monitoring cluster performance. This makes it ideal for quick experimentation and iteration.
A layered architecture for simplicity
The HyperPod CLI and SDK follow a multi-layered, shared architecture. The CLI and the Python module serve as user-facing entry points and are both built on top of common SDK components to provide consistent behavior across interfaces. For infrastructure automation, the SDK orchestrates cluster lifecycle management through a combination of AWS CloudFormation stack provisioning and direct AWS API interactions. Training and inference workloads, as well as integrated development environment (IDE) Spaces, are expressed as Kubernetes Custom Resource Definitions (CRDs), which the SDK manages through the Kubernetes API.
In this post, we demonstrate how to use the CLI and the SDK to create and manage SageMaker HyperPod clusters in your AWS account. We walk through a practical example and dive deeper into the user workflow and parameter choices.
This post focuses on cluster creation and management. For a deep dive into using the HyperPod CLI and SDK to submit training jobs and deploy inference endpoints, see our companion post: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.
Prerequisites
To follow the examples in this post, you must have the following prerequisites:
- An AWS account with access to SageMaker HyperPod, Amazon Simple Storage Service (Amazon S3), and Amazon FSx for Lustre.
- Sufficient service quota for creating the HyperPod cluster and instance groups.
- A local environment (either your local machine or a cloud-based compute environment) from which to run the SageMaker HyperPod CLI commands, configured as follows:
  - An operating system based on Linux or macOS.
  - Python 3.8 or later installed.
  - The AWS Command Line Interface (AWS CLI) configured with the appropriate credentials to use the aforementioned services.
Install the SageMaker HyperPod CLI
First, install the latest version of the SageMaker HyperPod CLI and SDK. The examples in this post are based on version 3.5.0. From your local environment, run the following command (you can alternatively install the CLI in a Python virtual environment):
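```bash
# Install (or upgrade) the CLI and SDK from the sagemaker-hyperpod
# package (package name as referenced later in this post)
pip install --upgrade sagemaker-hyperpod
```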
This command sets up the tools needed to interact with SageMaker HyperPod clusters. If you have an existing installation, make sure you have the latest version of the package (3.5.0 or later) to be able to use the features described in this post. To verify that the CLI is installed correctly, run the hyp command and check the output:
The output will look similar to the following illustrative excerpt (the exact text varies by CLI version) and includes instructions on how to use the CLI:
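```text
Usage: hyp [OPTIONS] COMMAND [ARGS]...

Commands:
  init                 Initialize a configuration in the current directory
  configure            Set values in the generated config.yaml
  validate             Validate config.yaml against the defined schema
  create               Submit the cluster stack to CloudFormation
  list                 List resources, such as cluster stacks
  describe             Describe a resource, such as a cluster stack
  update               Update an existing cluster
  delete               Delete a resource, such as a cluster stack
  set-cluster-context  Point the local Kubernetes context at a cluster
```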
For more information on CLI usage and the available commands and respective parameters, see the CLI reference documentation.
The HyperPod CLI provides commands to manage the full lifecycle of HyperPod clusters. The following sections explain how to create new clusters, monitor their creation, modify instance groups, and delete clusters.
Creating a new HyperPod cluster
HyperPod clusters can be created through the AWS Management Console or the HyperPod CLI, both of which provide streamlined experiences for cluster creation. The console offers the easiest and most guided approach, while the CLI is especially useful for customers who prefer a programmatic experience—for example, to enable reproducibility or to build automation around cluster creation. Both methods use the same underlying CloudFormation template, which is available in the SageMaker HyperPod cluster setup GitHub repository. For a walkthrough of the console-based experience, see the cluster creation experience blog post.
Creating a new cluster through the HyperPod CLI follows a configuration-based workflow: the CLI first generates configuration files, which are then edited to match the intended cluster specifications. These files are subsequently submitted as a CloudFormation stack that creates the HyperPod cluster along with the required resources, such as a VPC and an FSx for Lustre file system. Initialize a new cluster configuration by running the following command:

```bash
hyp init cluster-stack
```
This initializes a new cluster configuration in the current directory and generates a config.yaml file that you can use to specify the configuration of the cluster stack. It also creates a README.md with information about the functionality and workflow, as well as a template for the CloudFormation stack parameters in cfn_params.jinja.
The cluster stack’s configuration variables are defined in config.yaml. The following illustrative excerpt is based on the parameters referenced in this post (the generated file contains the full set of variables):
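```yaml
# Illustrative excerpt -- values shown are examples; the generated
# config.yaml contains the full set of variables
resource_name_prefix: hyp-eks-stack  # primary identifier for created resources
kubernetes_version: "1.33"           # Kubernetes version of the EKS cluster
```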
The resource_name_prefix parameter serves as the primary identifier for the AWS resources created during deployment. Each deployment must use a unique resource name prefix to avoid conflicts. During cluster creation, a unique identifier is automatically appended to the prefix value to help ensure resource uniqueness.
The configuration can be edited either directly by opening config.yaml in an editor of your choice or by running the hyp configure command. The following example shows how to specify the Kubernetes version of the Amazon EKS cluster that will be created by the stack:
```bash
hyp configure --kubernetes-version 1.33
```
Updating variables through the CLI commands adds a layer of safety, because each value is validated against the defined schema before it is set in config.yaml.
Besides the Kubernetes version and the resource name prefix, config.yaml exposes further significant parameters, such as the instance group definitions shown below.
There are two important nuances when updating the configuration values through hyp configure commands:

- Underscores (_) in variable names within config.yaml become hyphens (-) in the CLI commands. Thus, kubernetes_version in config.yaml is configured via hyp configure --kubernetes-version in the CLI.
- Variables that contain lists of entries within config.yaml are configured as JSON lists in the CLI command. For example, multiple instance groups are configured within config.yaml as in the following sketch (field names are illustrative; check the generated file for the authoritative schema):
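```yaml
# Illustrative sketch -- field names are assumptions; check the
# generated config.yaml and README.md for the authoritative schema
instance_groups:
  - instance_group_name: worker-group-1
    instance_type: ml.g5.8xlarge
    instance_count: 2
  - instance_group_name: worker-group-2
    instance_type: ml.c5.4xlarge
    instance_count: 1
```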
This translates to the following CLI command:
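```bash
# JSON-list form of the instance groups sketched above (field names
# are assumptions; quote the JSON so the shell passes it verbatim)
hyp configure --instance-groups '[{"instance_group_name": "worker-group-1", "instance_type": "ml.g5.8xlarge", "instance_count": 2}, {"instance_group_name": "worker-group-2", "instance_type": "ml.c5.4xlarge", "instance_count": 1}]'
```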
After you’re done making the desired changes, validate your configuration file by running the following command:

```bash
hyp validate
```
This validates the parameters in config.yaml against the defined schema. If validation succeeds, the CLI confirms that the configuration is valid.
The cluster creation stack can be submitted to CloudFormation by running the following command:

```bash
hyp create --region <region>
```
The hyp create command performs validation and injects values from config.yaml into the cfn_params.jinja template. If no AWS Region is explicitly provided, the command uses the default Region from your AWS credentials configuration. The resolved configuration file and CloudFormation template values are saved to a timestamped subdirectory under the ./run/ directory, providing a lightweight local versioning mechanism to track which configuration was used to create a cluster at a given point in time. You can also choose to commit these artifacts to your version control system to improve reproducibility and auditability. If successful, the command outputs the CloudFormation stack ID, an ARN of the following form:
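```text
arn:aws:cloudformation:<region>:<account-id>:stack/<stack-name>/<unique-id>
```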
Monitoring the HyperPod cluster creation process
You can list the existing CloudFormation stacks by running the following command:

```bash
hyp list cluster-stack --region <region>
```
You can optionally filter the output by stack status by adding the following flag: --status "['CREATE_COMPLETE', 'UPDATE_COMPLETE']".
The output lists the matching stacks along with details such as their status.
Depending on the configuration in config.yaml, several nested stacks are created that cover different aspects of the HyperPod cluster setup, such as the EKSClusterStack, FsxStack, and VPCStack.
You can use the describe command to view details about any of the individual stacks:

```bash
hyp describe cluster-stack <stack-name> --region <region>
```
For example, you can describe the S3EndpointStack nested stack to inspect its status and details.
If any of the stacks show a CREATE_FAILED, ROLLBACK_*, or DELETE_* status, open the CloudFormation page in the console to investigate the root cause. Failed cluster creation stacks are often caused by insufficient service quotas for the cluster itself, the instance groups, or network components such as VPCs or NAT gateways. See the SageMaker HyperPod quotas documentation to learn more about the required quotas.
Connecting to a cluster
After the cluster stack has successfully created the required resources and the status has changed to CREATE_COMPLETE, you can configure the CLI and your local Kubernetes environment to interact with the HyperPod cluster.
Run the following command:

```bash
hyp set-cluster-context --cluster-name <cluster-name> --region <region>
```
The --cluster-name option specifies the name of the HyperPod cluster to connect to, and the --region option specifies the Region where the cluster has been created. Optionally, a specific namespace can be configured using the --namespace parameter. The command updates your local Kubernetes config in ~/.kube/config, so that you can use both the HyperPod CLI and Kubernetes utilities such as kubectl to manage the resources in your HyperPod cluster.
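For example, assuming kubectl is installed in your local environment, you can confirm that the new context works by listing the cluster's nodes:

```bash
# kubectl picks up the context set by hyp set-cluster-context
kubectl get nodes
```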
See our companion blog post for further information about how to use the CLI to submit training jobs and inference deployments to your newly created HyperPod cluster: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.
Modifying an existing HyperPod cluster
The hyp update cluster command modifies the instance groups and node recovery mode of an existing HyperPod cluster. This can be useful if you need to scale your cluster by adding or removing worker nodes, or if you want to change the instance types used by the node groups.
To update the instance groups, run a command similar to the following sketch, adapted with your cluster name and desired instance group settings (the instance group fields shown are illustrative assumptions; see the CLI reference documentation for the exact parameters):
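```bash
# Illustrative sketch -- the --instance-groups fields are assumptions;
# --node-recovery accepts Automatic or None (see below)
hyp update cluster \
  --cluster-name <cluster-name> \
  --instance-groups '[{"instance_group_name": "worker-group-1", "instance_type": "ml.g5.8xlarge", "instance_count": 4}]' \
  --node-recovery Automatic \
  --region <region>
```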
Note that all of the fields in the preceding command are required, even if, for example, only the instance count is modified. You can list the current cluster and instance group configurations to obtain the required values by running the hyp describe cluster <cluster-name> --region <region> command.
The update command returns a confirmation of the submitted changes.
The --node-recovery option lets you configure the node recovery behavior, which can be set to either Automatic or None. For information about the SageMaker HyperPod automatic node recovery feature, see Automatic node recovery.
Deleting an existing HyperPod cluster
To delete an existing HyperPod cluster, run the following command. Note that this action is not reversible:
```bash
hyp delete cluster-stack <stack-name> --region <region>
```
This command removes the specified CloudFormation stack and the associated AWS resources. You can use the optional --retain-resources flag to specify a comma-separated list of logical resource IDs to retain during the deletion process. It’s important to carefully consider which resources you need to retain, because the delete operation cannot be undone.
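For example, a hypothetical invocation that deletes the stack while retaining the FSx nested stack (using the logical resource IDs mentioned earlier) could look like the following:

```bash
# Hypothetical example -- logical resource IDs must match your template
hyp delete cluster-stack <stack-name> --region <region> --retain-resources FsxStack
```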
Before removing any resources, the command asks you to confirm the deletion.
SageMaker HyperPod SDK
SageMaker HyperPod also includes a Python SDK for programmatic access to the features described earlier. The Python SDK is used by the CLI commands and is installed when you install the sagemaker-hyperpod Python package as described in the beginning of this post. The HyperPod CLI is best suited for users who prefer a streamlined, interactive experience for common HyperPod management tasks like creating and monitoring clusters, training jobs, and inference endpoints. It’s particularly helpful for quick prototyping, experimentation, and automating repetitive HyperPod workflows through scripts or continuous integration and delivery (CI/CD) pipelines. In contrast, the HyperPod SDK provides more programmatic control and flexibility, making it the preferred choice when you need to embed HyperPod functionality directly into your application, integrate with other AWS or third-party services, or build complex, customized HyperPod management workflows. Consider the complexity of your use case, the need for automation and integration, and your team’s familiarity with programming languages when deciding whether to use the HyperPod CLI or SDK.
The SageMaker HyperPod CLI GitHub repository shows examples of how cluster creation and management can be implemented using the Python SDK.
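As a rough illustration of the programmatic workflow, the following minimal sketch mirrors the CLI flow shown earlier. The module, class, and method names are hypothetical assumptions, not the published API; refer to the repository's examples for working code:

```python
# Hypothetical sketch -- names below are illustrative assumptions,
# not the published SDK API; see the SageMaker HyperPod CLI GitHub
# repository for real, working examples.
from sagemaker.hyperpod import ClusterStack  # assumed import path

stack = ClusterStack(
    resource_name_prefix="hyp-eks-stack",  # same role as in config.yaml
    kubernetes_version="1.33",
)
stack.validate()                  # schema validation, as with hyp validate
stack.create(region="us-west-2")  # submit the CloudFormation stack
```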
Conclusion
The SageMaker HyperPod CLI and SDK simplify cluster creation and management. With the examples in this post, we’ve demonstrated how these tools provide value through:
- Simplified lifecycle management – From initial configuration to cluster updates and cleanup, the CLI aligns with how teams manage long-running training and inference environments and abstracts away unnecessary complexity.
- Declarative control when needed – The SDK exposes the underlying configuration model, so that teams can codify cluster specifications, instance groups, storage filesystems, and more.
- Integrated observability – Visibility into CloudFormation stacks is available without switching tools, supporting smooth iteration during development and operation.
Getting started with these tools is as straightforward as installing the SageMaker HyperPod package. The SageMaker HyperPod CLI and SDK provide the right level of abstraction for both data scientists looking to quickly experiment with distributed training and ML engineers building production systems.
If you’re interested in how to use the HyperPod CLI and SDK for submitting training jobs and deploying models to your new cluster, make sure to check our companion blog post: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.