Amazon SageMaker HyperPod features
Scale and accelerate generative AI model development across thousands of AI accelerators

Task governance
SageMaker HyperPod provides full visibility and control over compute resource allocation across generative AI model development tasks, such as training and inference. SageMaker HyperPod automatically manages task queues, ensuring the most critical tasks are prioritized while compute resources are used efficiently to reduce model development costs. In a few short steps, administrators can define priorities for different tasks and set limits on how many compute resources each team or project can use. Data scientists and developers then create tasks (for example, a training run, fine-tuning a particular model, or making predictions on a trained model) that SageMaker HyperPod runs automatically, adhering to the compute resource limits and priorities the administrator set. When a high-priority task must be completed immediately but all compute resources are in use, SageMaker HyperPod automatically frees up compute resources from lower-priority tasks. It also automatically uses idle compute resources to accelerate waiting tasks. SageMaker HyperPod provides a dashboard where administrators can monitor and audit tasks that are running or waiting for compute resources.
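To make the scheduling behavior concrete, here is a minimal, self-contained sketch of priority-based preemption with per-team limits. It is an illustrative model only, not the HyperPod API; all class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Task:
    name: str
    team: str
    priority: int       # higher value = more important
    accelerators: int   # accelerators the task needs

class MiniScheduler:
    """Toy model of priority scheduling with preemption and per-team limits.
    Illustrative only; HyperPod's actual scheduler is not shown here."""

    def __init__(self, total_accelerators: int, team_limits: Dict[str, int]):
        self.capacity = total_accelerators
        self.limits = team_limits
        self.running: List[Task] = []
        self.waiting: List[Task] = []

    def _used(self, team: Optional[str] = None) -> int:
        return sum(t.accelerators for t in self.running
                   if team is None or t.team == team)

    def submit(self, task: Task) -> None:
        # Enforce the team's resource limit first.
        if self._used(task.team) + task.accelerators > self.limits[task.team]:
            self.waiting.append(task)
            return
        # Preempt lower-priority tasks until the new task fits.
        while self._used() + task.accelerators > self.capacity:
            victim = min(self.running, key=lambda t: t.priority, default=None)
            if victim is None or victim.priority >= task.priority:
                self.waiting.append(task)  # nothing preemptible; queue it
                return
            self.running.remove(victim)
            self.waiting.append(victim)    # a real system would checkpoint it

        self.running.append(task)

sched = MiniScheduler(total_accelerators=16,
                      team_limits={"research": 16, "prod": 16})
sched.submit(Task("pretrain", "research", priority=1, accelerators=16))
sched.submit(Task("prod-inference", "prod", priority=10, accelerators=8))
print([t.name for t in sched.running])  # ['prod-inference']
print([t.name for t in sched.waiting])  # ['pretrain']
```

In the example, the high-priority production task preempts the lower-priority pretraining run, which is requeued, mirroring how HyperPod frees capacity from lower-priority work when critical tasks arrive.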
Flexible training plans
To meet your training timelines and budgets, SageMaker HyperPod helps you create the most cost-efficient training plans that draw on compute resources from multiple blocks of compute capacity. Once you approve a training plan, SageMaker HyperPod automatically provisions the infrastructure and runs the training jobs on these compute resources without requiring any manual intervention, saving you weeks of effort otherwise spent aligning jobs with compute availability.
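As a sketch of what this can look like in code, the snippet below searches for training plan offerings and creates a plan from one of them. It assumes the boto3 SageMaker client's search_training_plan_offerings and create_training_plan calls (the SearchTrainingPlanOfferings and CreateTrainingPlan APIs); the field names follow my reading of the AWS API reference and should be verified against current SDK documentation, and all resource names and sizes are hypothetical.

```python
from datetime import datetime, timedelta, timezone

import boto3

sm = boto3.client("sagemaker")

# Search for capacity offerings that fit the timeline and scale constraints.
# Parameter names are assumptions based on the AWS API reference.
now = datetime.now(timezone.utc)
offerings = sm.search_training_plan_offerings(
    InstanceType="ml.p5.48xlarge",          # hypothetical instance choice
    InstanceCount=16,
    StartTimeAfter=now,
    EndTimeBefore=now + timedelta(days=30),
    DurationHours=240,
    TargetResources=["hyperpod-cluster"],
)

# Pick the first offering (real code would compare price and duration) and
# turn it into a plan; HyperPod then provisions and runs jobs against it.
offering = offerings["TrainingPlanOfferings"][0]
sm.create_training_plan(
    TrainingPlanName="llm-pretrain-plan",   # hypothetical name
    TrainingPlanOfferingId=offering["TrainingPlanOfferingId"],
)
```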
Optimized recipes
SageMaker HyperPod recipes help data scientists and developers of all skill sets benefit from state-of-the-art performance while quickly getting started training and fine-tuning publicly available generative AI models, including Llama 3.1 405B, Mixtral 8x22B, and Mistral 7B. Each recipe includes a training stack that has been tested by AWS, removing weeks of tedious work spent testing different model configurations. You can switch between GPU-based and AWS Trainium-based instances with a one-line recipe change, enable automated model checkpointing for improved training resiliency, and run workloads in production on SageMaker HyperPod.
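The one-line switch is easiest to see on a recipe's launch configuration. The sketch below edits a recipe file to move a run from GPU to Trainium capacity; the file name and keys are hypothetical stand-ins, not the exact schema of the published recipes.

```python
import yaml  # pip install pyyaml

# Hypothetical recipe file; real recipes live in the
# sagemaker-hyperpod-recipes repository and may use a different schema.
with open("llama-3-1-405b-pretrain.yaml") as f:
    recipe = yaml.safe_load(f)

# The one-line change: GPU run used "ml.p5.48xlarge"; switch to Trainium.
recipe["cluster"]["instance_type"] = "ml.trn1.32xlarge"

with open("llama-3-1-405b-pretrain-trn.yaml", "w") as f:
    yaml.safe_dump(recipe, f)
```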
High-performing distributed training
SageMaker HyperPod accelerates distributed training by automatically splitting your models and training datasets across AWS accelerators. It optimizes your training job for AWS network infrastructure and cluster topology, and streamlines model checkpointing by tuning how often checkpoints are saved, keeping checkpoint overhead during training to a minimum.
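Choosing a checkpoint frequency is a trade-off between time spent writing checkpoints and work lost to failures. HyperPod's exact policy is not described here, but the classical Young/Daly approximation below illustrates the optimization being made; the cost and failure-rate numbers in the example are hypothetical.

```python
import math

def young_daly_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Classical Young/Daly approximation for the checkpoint interval that
    minimizes expected lost work plus checkpoint overhead."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Example: 60 s to write a checkpoint, one failure every 24 h on average.
interval = young_daly_interval(60, 24 * 3600)
print(f"checkpoint every {interval / 60:.1f} minutes")  # ~53.7 minutes
```

Checkpointing too often wastes accelerator time on I/O; too rarely, a single failure can erase hours of training, which is why tuning this frequency matters at cluster scale.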