What is SageMaker HyperPod?
Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing machine learning (ML) infrastructure. It is preconfigured with SageMaker’s distributed training libraries, which automatically split training workloads across more than a thousand AI accelerators so that they can be processed in parallel for improved model performance. SageMaker HyperPod keeps your foundation model (FM) training running with minimal interruption by periodically saving checkpoints. When a hardware failure occurs, it automatically detects the failure, repairs or replaces the faulty instance, and resumes training from the last saved checkpoint, so you do not have to manage this process manually. This resilient environment allows you to train models for weeks or months in a distributed setting without disruption, saving up to 40% of training time. SageMaker HyperPod is also highly customizable, allowing you to efficiently run and scale FM workloads and easily share compute capacity between different workloads, from large-scale training to inference.
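
The checkpoint-and-resume pattern described above can be sketched generically in plain Python. This is only an illustration of the idea and does not use HyperPod or any SageMaker API: the function names (`save_checkpoint`, `load_checkpoint`, `train`), the JSON checkpoint file, the `fail_at` parameter that simulates a hardware fault, and the toy "weight" update are all assumptions made for the example.

```python
import json
import os


def save_checkpoint(path, state):
    """Persist training state; write to a temp file, then rename atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    # os.replace is atomic, so a crash mid-write cannot corrupt the last checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "weight": 0.0}


def train(path, total_steps, checkpoint_every, fail_at=None):
    """Toy training loop that resumes from the last checkpoint on restart.

    fail_at is a hypothetical knob that simulates a hardware failure
    just before the given step is executed.
    """
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware failure")
        state["weight"] += 0.5  # stand-in for one real optimizer step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final checkpoint at the end of training
    return state
```

Calling `train` once with `fail_at` set and then again without it mimics a failure followed by a restart: the second call starts from the most recent checkpoint instead of step 0, so only the work done since that checkpoint is repeated. In a managed environment, detecting the failure, replacing the instance, and restarting the job are handled for you; the sketch covers only the save-and-resume logic inside the training script.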