What is SageMaker HyperPod?
Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure. It is pre-configured with SageMaker’s distributed training libraries, which automatically split training workloads across more than a thousand AI accelerators so they can be processed in parallel for improved training performance. SageMaker HyperPod keeps your foundation model (FM) training uninterrupted by periodically saving checkpoints. When a hardware failure occurs, it automatically detects the fault, repairs or replaces the faulty instance, and resumes training from the last saved checkpoint, removing the need for you to manage this process manually. This resilient environment lets you train models for weeks or months in a distributed setting without disruption, reducing training time by up to 40%. SageMaker HyperPod is also highly customizable, allowing you to efficiently run and scale FM workloads and easily share compute capacity between different workloads, from large-scale training to inference.
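The checkpoint-and-resume behavior described above can be sketched in a few lines of Python. This is an illustrative toy only: the names `save_checkpoint`, `load_latest_checkpoint`, and `SimulatedHardwareFault` are hypothetical, and in HyperPod the detection, repair, and resume happen automatically rather than in your training script.

```python
# Toy sketch of checkpoint-and-resume resiliency (illustrative only).
import json
import os
import tempfile

CKPT_DIR = tempfile.mkdtemp()

class SimulatedHardwareFault(Exception):
    """Stand-in for a GPU or instance failure mid-training."""

def save_checkpoint(step, state):
    # Periodically persist training state so a failure loses little work.
    path = os.path.join(CKPT_DIR, f"ckpt_{step:06d}.json")
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)

def load_latest_checkpoint():
    # Resume from the most recent saved step, or start fresh.
    ckpts = sorted(os.listdir(CKPT_DIR))
    if not ckpts:
        return 0, 0.0
    with open(os.path.join(CKPT_DIR, ckpts[-1])) as f:
        data = json.load(f)
    return data["step"] + 1, data["state"]

def train(total_steps, fail_at=None):
    step, loss = load_latest_checkpoint()
    while step < total_steps:
        if step == fail_at:
            raise SimulatedHardwareFault(f"node failed at step {step}")
        loss = 1.0 / (step + 1)  # stand-in for a real training step
        save_checkpoint(step, loss)
        step += 1
    return step, loss

# First run "fails" partway; the retry resumes from the last checkpoint
# instead of restarting from step 0.
try:
    train(total_steps=10, fail_at=6)
except SimulatedHardwareFault:
    pass
step, loss = train(total_steps=10)  # resumes at step 6
```

The point of the sketch is the second `train` call: because steps 0–5 were checkpointed, only the remaining steps are re-run after the simulated failure.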
Benefits of SageMaker HyperPod
Automatic cluster health check and repair
If any instances become defective during a training workload, SageMaker HyperPod automatically detects and swaps faulty nodes with healthy ones. To detect faulty hardware, SageMaker HyperPod regularly runs an array of health checks for GPU and network integrity.
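The detect-and-swap cycle can be sketched as follows. The function `run_health_checks` and the node dictionaries are hypothetical stand-ins for the GPU and network integrity checks that HyperPod actually runs on your behalf:

```python
# Illustrative sketch of detecting faulty nodes and swapping in spares.

def run_health_checks(node):
    # Stand-in for the real GPU and network integrity checks.
    return node["healthy"]

def repair_cluster(nodes, spare_pool):
    # Replace any node that fails its health checks with a healthy spare.
    for i, node in enumerate(nodes):
        if not run_health_checks(node) and spare_pool:
            nodes[i] = spare_pool.pop()
    return nodes

cluster = [
    {"id": "n0", "healthy": True},
    {"id": "n1", "healthy": False},  # simulated GPU fault
    {"id": "n2", "healthy": True},
]
spares = [{"id": "spare0", "healthy": True}]

cluster = repair_cluster(cluster, spares)
# n1 has been swapped for spare0; every node now passes its checks
```

In HyperPod this loop runs continuously and automatically, so a training workload sees a repaired cluster rather than a failed job.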
High-performing distributed training libraries
With SageMaker’s distributed training libraries, you can run highly scalable and cost-effective custom data parallel and model parallel deep learning training jobs. SageMaker HyperPod is preconfigured with SageMaker distributed libraries. With only a few lines of code, you can enable data parallelism in your training scripts. SageMaker HyperPod makes it faster to perform distributed training by automatically splitting your models and training datasets across AWS GPU instances.
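As a framework-agnostic illustration of the data-parallel idea (not the SageMaker library API), the sketch below shards a batch across simulated workers, has each worker compute gradients on its own shard, and then averages the per-worker gradients, mimicking an all-reduce so every replica applies the same update:

```python
# Conceptual sketch of data parallelism on a 1-D least-squares problem.
# The SageMaker distributed data parallel library performs the sharding
# and gradient all-reduce across GPU instances for you.

def gradient(w, x, y):
    # Gradient of the squared error 0.5 * (w*x - y)**2 with respect to w.
    return (w * x - y) * x

def data_parallel_step(w, batch, num_workers, lr=0.1):
    # Shard the batch across workers.
    shards = [batch[i::num_workers] for i in range(num_workers)]
    # Each worker averages gradients over its own shard.
    worker_grads = [
        sum(gradient(w, x, y) for x, y in shard) / len(shard)
        for shard in shards
        if shard
    ]
    # "All-reduce": average the per-worker gradients, then update.
    g = sum(worker_grads) / len(worker_grads)
    return w - lr * g

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # data from y = 2x
w = 0.0
for _ in range(100):
    w = data_parallel_step(w, batch, num_workers=2)
# w converges toward the true slope 2.0
```

Because each worker only touches its shard, adding workers shrinks the per-worker batch while the averaged update stays equivalent to training on the full batch, which is what makes the approach scale.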
Advanced observability for improved performance
You can use built-in ML tools in SageMaker HyperPod to improve model performance. For example, Amazon SageMaker with TensorBoard helps you save development time by visualizing model architectures to identify and remediate convergence issues, while Amazon SageMaker Debugger captures metrics and profiles training jobs in real time. Integration with Amazon CloudWatch Container Insights provides deeper insight into cluster performance, health, and utilization.
Workload scheduling and orchestration
SageMaker HyperPod is highly customizable through its support for Slurm or Amazon EKS as the workload orchestrator, and you can select and install any frameworks or tools you need. All clusters are provisioned with the instance type and count you choose, and they are retained for your use across workloads.
Scalability and optimized resource utilization
You can manage and operate SageMaker HyperPod clusters with a consistent Kubernetes-based administrator experience. This allows you to efficiently run and scale FM workloads, from training and fine-tuning to experimentation and inference. You can easily share compute capacity and switch between Slurm and Amazon EKS for different types of workloads.