Posted On: Nov 29, 2023

Today, AWS announces the general availability of Amazon SageMaker HyperPod, which reduces time to train foundation models (FMs) by up to 40% by providing purpose-built infrastructure for distributed training at scale. 

Many organizations want to train their own FMs at low cost using instances based on graphics processing units (GPUs) or Trainium. However, the volume of data, the size of the models, and the time required for training FMs have exponentially increased the complexity of training a model. Customers often need to split their FM training across potentially hundreds or thousands of accelerators. They then run trillions of data computations in parallel for weeks or months at a time, which is time-consuming and requires specialized ML expertise. Because the number of accelerators and the training time increase substantially compared to training task-specific models, the likelihood of rare, small errors, such as a single accelerator failure, compounds.
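To illustrate why rare failures compound at scale, the following minimal sketch estimates the chance that at least one accelerator fails during a training run. It assumes failures are independent and uses a hypothetical per-accelerator, per-day failure probability of 0.01%; neither the figure nor the function name comes from AWS, and real failure rates vary by hardware and workload.

```python
def prob_any_failure(p_per_accel_day: float, accelerators: int, days: int) -> float:
    """Probability that at least one accelerator fails during the run.

    Assumes every accelerator-day fails independently with the same probability.
    """
    return 1.0 - (1.0 - p_per_accel_day) ** (accelerators * days)


# Hypothetical rate: a 0.01% chance that a given accelerator fails on a given day.
P_FAIL = 1e-4

# A small, task-specific job: 8 accelerators for 3 days.
small_job = prob_any_failure(P_FAIL, accelerators=8, days=3)

# An FM-scale job: 2,048 accelerators for 60 days.
large_job = prob_any_failure(P_FAIL, accelerators=2048, days=60)

print(f"small job: {small_job:.4%} chance of at least one failure")
print(f"large job: {large_job:.4%} chance of at least one failure")
```

Under these assumptions, the small job is very unlikely to see a failure, while the large job is almost certain to see at least one, which is why automated recovery matters for long-running distributed training.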

SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing ML infrastructure for training FMs. SageMaker HyperPod is pre-configured with SageMaker’s distributed training libraries that enable customers to automatically split training workloads across thousands of accelerators, so workloads can be processed in parallel for improved model performance. SageMaker HyperPod also ensures customers can continue FM training uninterrupted by periodically saving checkpoints. When a hardware failure occurs during training, SageMaker HyperPod automatically detects the failure, repairs or replaces the faulty instance, and resumes the training from the last saved checkpoint. This removes the need for customers to manually manage the process and helps them train for weeks or months in a distributed setting without disruption.
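The checkpoint-and-resume pattern described above can be sketched in general terms as follows. This is not the SageMaker HyperPod API or its implementation; it is a self-contained illustration using only the Python standard library, and every name in it (`save_checkpoint`, `load_checkpoint`, `train`, `run_with_recovery`, `CHECKPOINT_EVERY`) is hypothetical. A counter stands in for model state, and a raised exception stands in for a hardware failure.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 10  # hypothetical checkpoint interval, in training steps


def save_checkpoint(path, step, state):
    # Write to a temporary file first, then rename it into place, so a crash
    # mid-write never leaves a partial checkpoint behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    # Start from scratch when no checkpoint exists yet.
    if not os.path.exists(path):
        return 0, 0.0
    with open(path) as f:
        data = json.load(f)
    return data["step"], data["state"]


def train(path, total_steps, fail_at=None):
    # Always begin from the most recent checkpoint, if there is one.
    step, state = load_checkpoint(path)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated accelerator failure")
        state += 1.0  # stand-in for one real training step
        step += 1
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, step, state)
    return step, state


def run_with_recovery(path, total_steps, fail_at):
    try:
        return train(path, total_steps, fail_at=fail_at)
    except RuntimeError:
        # Stand-in for "repair or replace the faulty instance": simply
        # restart training, which resumes from the last saved checkpoint.
        return train(path, total_steps)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "checkpoint.json")
        print(run_with_recovery(ckpt, total_steps=50, fail_at=27))
```

In this sketch, a failure at step 27 loses only the work done since the checkpoint at step 20, and the run still completes all 50 steps. The checkpoint interval is the main design trade-off: more frequent checkpoints reduce lost work after a failure but add write overhead during training.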

SageMaker HyperPod is generally available, and you can use it in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm).

To learn more, see the following list of resources: