Amazon SageMaker HyperPod

Scale and accelerate generative AI model development across thousands of AI accelerators

What is SageMaker HyperPod?

Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building generative AI models. It helps you quickly scale model development tasks such as training, fine-tuning, and inference across clusters of hundreds or thousands of AI accelerators. SageMaker HyperPod also provides centralized governance across all your model development tasks, giving you full visibility and control over how tasks are prioritized and how compute resources are allocated to each one. This helps you maximize GPU and AWS Trainium utilization across the cluster and accelerate innovation.
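As an illustrative sketch, a HyperPod cluster is created through the SageMaker `CreateCluster` API. Every name, ARN, instance count, and S3 path below is a placeholder, not a value from this page; consult the AWS CLI reference for the current field set.

```shell
# Sketch: creating a HyperPod cluster with the AWS CLI.
# All names, ARNs, and S3 URIs are placeholders.
aws sagemaker create-cluster \
    --cluster-name ml-cluster \
    --instance-groups '[{
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p5.48xlarge",
        "InstanceCount": 16,
        "LifeCycleConfig": {
            "SourceS3Uri": "s3://my-bucket/lifecycle-scripts/",
            "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole"
    }]'
```

The lifecycle script referenced in `LifeCycleConfig` runs on each instance at provisioning time, which is where cluster software such as the orchestrator agents is typically set up.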

Purpose built for distributed training at scale

With SageMaker HyperPod, you can efficiently distribute and parallelize your training workload across all accelerators in the cluster. SageMaker HyperPod automatically applies the best training configurations for popular publicly available models to help you quickly achieve optimal performance. It also continually monitors your cluster for infrastructure faults, automatically repairs them, and recovers your workloads without human intervention, which can save up to 40% of training time.

Benefits of SageMaker HyperPod

SageMaker HyperPod provides a resilient environment for model development by automatically detecting, diagnosing, and recovering from infrastructure faults, allowing you to run model development workloads continually for months without disruption. Checkpointless training on SageMaker HyperPod removes the need for checkpoint-based, job-level restarts and keeps training progressing despite failures, saving on idle compute costs during recovery and accelerating time to market by weeks.

SageMaker HyperPod task governance gives you full visibility and control over compute resource allocation across model development tasks, including training, fine-tuning, experimentation, and inference. SageMaker HyperPod automatically manages task queues, ensuring the most critical tasks are prioritized and completed on time and within budget, while using compute resources more efficiently to reduce model development costs by up to 40%. In addition, SageMaker HyperPod provides advanced observability, with unified visibility across AI model development tasks and compute resources.

With SageMaker HyperPod recipes, data scientists and developers of all skill levels get state-of-the-art performance and can start training and fine-tuning publicly available foundation models in minutes. You can also use the recipes to customize Amazon Nova models, including Nova Micro, Nova Lite, and Nova Pro, for your business-specific use cases, improving the accuracy of your generative AI applications while maintaining industry-leading price performance and low latency. Amazon Nova Forge is a first-of-its-kind program that offers organizations the easiest and most cost-effective way to build their own frontier models using Nova.
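A recipe run might look like the sketch below. The repository URL is the public SageMaker HyperPod recipes repo, but the recipe key and launcher arguments are assumptions for illustration; check the repository's README for the current layout and recipe names.

```shell
# Illustrative only: the recipe key and launcher arguments below are
# assumptions -- see https://github.com/aws/sagemaker-hyperpod-recipes
# for the actual layout and available recipes.
git clone https://github.com/aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip install -r requirements.txt

# Launch a pre-tuned fine-tuning configuration for a public model
# on a HyperPod cluster (recipe name is a placeholder).
python3 main.py \
    recipes=fine-tuning/llama/hf_llama3_8b_lora \
    base_results_dir=./results
```

The value of the recipe approach is that parallelism strategy, batch sizes, and checkpointing cadence ship pre-tuned for each model, so those settings do not have to be rediscovered per cluster.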

With SageMaker HyperPod, you can automatically split your models and training datasets across AWS cluster instances to scale training workloads efficiently. It helps you optimize your training job for AWS network infrastructure and cluster topology. It also streamlines model checkpointing through recipes by tuning how often checkpoints are saved, keeping checkpointing overhead during training to a minimum.
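The recovery pattern behind periodic checkpointing can be sketched in a few lines. This is a minimal, generic illustration of save-every-N-steps plus resume-from-latest, not HyperPod's actual implementation; all function and path names are invented for the example.

```python
# Minimal sketch (not HyperPod's implementation): periodic checkpointing
# with automatic resume from the latest checkpoint after a failure.
import json
import os
import tempfile

def save_checkpoint(ckpt_dir, step, state):
    """Write a checkpoint atomically so a crash mid-write never counts."""
    path = os.path.join(ckpt_dir, f"step_{step:08d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, path)  # atomic rename

def latest_checkpoint(ckpt_dir):
    """Return the path of the newest checkpoint, or None if none exist."""
    ckpts = sorted(p for p in os.listdir(ckpt_dir) if p.endswith(".json"))
    return os.path.join(ckpt_dir, ckpts[-1]) if ckpts else None

def train(ckpt_dir, total_steps, save_every, fail_at=None):
    """Run (or resume) a toy training loop, checkpointing every N steps."""
    step, state = 0, 0
    latest = latest_checkpoint(ckpt_dir)
    if latest:  # resume from the last saved checkpoint
        with open(latest) as f:
            data = json.load(f)
        step, state = data["step"], data["state"]
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated infrastructure fault")
        state += 1  # stand-in for one optimizer step
        step += 1
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step, state
```

With `save_every=5`, a fault at step 7 loses only the two steps since the last checkpoint; a restart resumes from step 5 rather than step 0. The trade-off the recipes tune is exactly this: more frequent checkpoints mean less lost work per fault but more I/O overhead per step.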

SageMaker HyperPod helps accelerate open-weights model deployments from SageMaker JumpStart and fine-tuned models from Amazon Simple Storage Service (Amazon S3) and Amazon FSx. You can streamline model deployment tasks with automatic provisioning, compute resource management through task governance, real-time performance monitoring, and enhanced observability.

Introducing checkpointless training in Amazon SageMaker HyperPod

Automatic recovery from infrastructure faults in minutes, even across thousands of AI accelerators.