Amazon SageMaker HyperPod features

Reduce time to train foundation models by up to 40% and scale across more than a thousand AI accelerators efficiently

Automatic cluster health check and repair

To detect faulty hardware, SageMaker HyperPod regularly runs an array of health checks for GPU and network integrity. If any instances become defective during a training workload, SageMaker HyperPod automatically detects the failure and swaps the faulty nodes for healthy ones.
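The per-node status behind these health checks is also visible through the SageMaker API. Below is a minimal sketch that inspects it with boto3's ListClusterNodes call; the cluster name is hypothetical, and the exact response fields should be verified against the current API reference.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Minimal sketch: list the nodes of a HyperPod cluster and print the health
# status that SageMaker reports for each one. "my-hyperpod-cluster" is a
# hypothetical name; field names reflect the documented response shape and
# should be checked against the API reference.
response = sagemaker.list_cluster_nodes(ClusterName="my-hyperpod-cluster")
for node in response["ClusterNodeSummaries"]:
    status = node["InstanceStatus"]["Status"]  # e.g. Running, Pending, or a failure state
    print(f'{node["InstanceGroupName"]}/{node["InstanceId"]}: {status}')
```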

High-performing distributed training libraries

With SageMaker’s distributed training libraries, you can run highly scalable and cost-effective custom data parallel and model parallel deep learning training jobs. SageMaker HyperPod is preconfigured with SageMaker distributed libraries. With only a few lines of code, you can enable data parallelism in your training scripts. SageMaker HyperPod makes it faster to perform distributed training by automatically splitting your models and training datasets across AWS GPU instances.
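As a rough illustration of "a few lines of code", the sketch below shows the pattern for enabling the SageMaker data parallel library in a PyTorch training script, assuming the smdistributed.dataparallel package that ships with HyperPod's training images. The LOCAL_RANK variable is set by the distributed launcher, and build_model() is a placeholder for your own model code.

```python
import os

import torch
import torch.distributed as dist

# Importing the SageMaker data parallel library registers it as a
# PyTorch distributed backend.
import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

def main():
    # Initialize the process group with the SMDDP backend instead of NCCL.
    dist.init_process_group(backend="smddp")

    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher
    torch.cuda.set_device(local_rank)

    model = build_model().to(local_rank)  # build_model() is a placeholder
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
    # ... the rest of the training loop is unchanged ...
```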


Advanced observability for improved performance

You can use built-in ML tools in SageMaker HyperPod to improve model performance. For example, Amazon SageMaker with TensorBoard helps you save development time by visualizing the model architecture to identify and remediate convergence issues, while Amazon SageMaker Debugger captures metrics and profiles training jobs in real time. The integration with Amazon CloudWatch Container Insights provides deeper insights into cluster performance, health, and utilization.
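A minimal sketch of the training-script side of this: write standard TensorBoard summaries to a local log directory that you configure SageMaker's TensorBoard output to collect. The path below and the train_loader/training_step names are placeholders, not a prescribed layout.

```python
from torch.utils.tensorboard import SummaryWriter

# Assumed local output path; use whatever directory you point SageMaker's
# TensorBoard output configuration at.
writer = SummaryWriter(log_dir="/opt/ml/output/tensorboard")

for step, batch in enumerate(train_loader):   # train_loader is a placeholder
    loss = training_step(batch)               # training_step() is a placeholder
    writer.add_scalar("train/loss", loss, global_step=step)

writer.close()
```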

Workload scheduling and orchestration

SageMaker HyperPod lets you schedule and orchestrate workloads with Slurm or Amazon EKS, and the environment is highly customizable: you can select and install any frameworks or tools you need. Clusters are provisioned with the instance type and count you choose, and they are retained for your use across workloads.
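For illustration, the sketch below provisions a HyperPod cluster with a chosen instance type and count through boto3's CreateCluster call. The cluster name, S3 URI, and IAM role ARN are hypothetical, and the request shape should be checked against the current API reference; lifecycle scripts typically set up Slurm, or the cluster can instead be orchestrated through Amazon EKS.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Minimal sketch: create a HyperPod cluster with one instance group.
# All names, ARNs, and the S3 URI below are hypothetical placeholders.
sagemaker.create_cluster(
    ClusterName="my-hyperpod-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 16,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/MyHyperPodRole",
        }
    ],
)
```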

Scalability and optimized resource utilization

You can manage and operate SageMaker HyperPod clusters with a consistent Kubernetes-based administrator experience. This lets you efficiently run and scale FM workloads, from training and fine-tuning to experimentation and inference. You can easily share compute capacity and switch between Slurm and EKS for different types of workloads.
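As one example of that Kubernetes-based administrator view, the sketch below uses the standard Kubernetes Python client to list the nodes of the EKS cluster a HyperPod cluster is attached to and print each node's instance type and readiness. It assumes your kubeconfig already points at that cluster.

```python
from kubernetes import client, config

# Minimal sketch: a Kubernetes-based admin view of cluster capacity.
config.load_kube_config()  # assumes kubeconfig targets the attached EKS cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    # node.kubernetes.io/instance-type is the standard well-known label.
    instance_type = node.metadata.labels.get("node.kubernetes.io/instance-type", "unknown")
    ready = next((c.status for c in node.status.conditions if c.type == "Ready"), "Unknown")
    print(f"{node.metadata.name}\t{instance_type}\tReady={ready}")
```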