What is SageMaker Model Training?
Amazon SageMaker Model Training reduces the time and cost to train and tune machine learning (ML) models at scale, without the need to manage infrastructure. You can take advantage of the highest-performing ML compute infrastructure currently available, and SageMaker can automatically scale infrastructure up or down, from one to thousands of GPUs. Because you pay only for what you use, you can manage your training costs more effectively. To train deep learning models faster, SageMaker helps you select and refine datasets in real time. SageMaker distributed training libraries can automatically split large models and training datasets across AWS GPU instances, or you can use third-party libraries such as DeepSpeed, Horovod, or Megatron. Train foundation models (FMs) for weeks and months without disruption through automatic monitoring and repair of training clusters.
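As a rough sketch of what launching such a managed, scalable training job can look like with the SageMaker Python SDK, the example below uses the PyTorch estimator with SageMaker's distributed data parallel library enabled. The script name, IAM role, S3 URI, instance type and count, and framework versions are illustrative placeholders, not values taken from this page.

    # Minimal sketch: launch a managed SageMaker training job (values are placeholders).
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",              # your training script (hypothetical name)
        role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder IAM role
        framework_version="2.1",             # example framework version
        py_version="py310",
        instance_count=4,                    # scale out by raising this number
        instance_type="ml.p4d.24xlarge",     # example GPU instance type
        # Enable SageMaker's distributed data parallel library (optional):
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )

    # SageMaker provisions the cluster, runs the job, and tears the cluster down.
    estimator.fit({"training": "s3://my-bucket/training-data/"})  # hypothetical S3 URI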
How it works

Benefits of cost-effective training
Train models at scale
- Fully managed infrastructure
- SageMaker HyperPod
- High-performance distributed training
- Smart data sifting
Fully managed infrastructure at scale
Efficiently manage system resources with a wide choice of GPUs and CPUs, including NVIDIA A100 and H100 GPUs as well as AWS accelerators such as AWS Trainium and AWS Inferentia. SageMaker automatically scales infrastructure up or down, from one to thousands of GPUs.
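As an illustration of that choice of hardware (a sketch, not a verified configuration), the snippet below shows how the same estimator targets different instance families; the instance types, counts, script name, and role are assumptions.

    # Sketch: switching hardware is a configuration change, not a cluster rebuild.
    # Instance types/counts, script, and role are illustrative placeholders; check
    # regional availability and quotas in your own account.
    from sagemaker.pytorch import PyTorch

    common = dict(
        entry_point="train.py",                                  # hypothetical script
        role="arn:aws:iam::123456789012:role/MySageMakerRole",   # placeholder role
        framework_version="2.1",
        py_version="py310",
    )

    # Small single-GPU experiment
    small_job = PyTorch(instance_type="ml.g5.xlarge", instance_count=1, **common)

    # Scale out to 128 NVIDIA A100 GPUs (16 x ml.p4d.24xlarge, 8 GPUs each)
    large_job = PyTorch(instance_type="ml.p4d.24xlarge", instance_count=16, **common)

    # AWS Trainium (for example ml.trn1.32xlarge) is addressed the same way, but
    # typically with a Neuron-compatible training image and configuration.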
Amazon SageMaker HyperPod
SageMaker HyperPod removes the undifferentiated heavy lifting involved in building and optimizing ML infrastructure for training FMs, reducing training time by up to 40%. SageMaker HyperPod is pre-configured with SageMaker's distributed training libraries, which let you automatically split training workloads across thousands of accelerators so they can be processed in parallel for improved model performance. When a hardware failure occurs, SageMaker HyperPod automatically detects the failure, repairs or replaces the faulty instance, and resumes training from the last saved checkpoint, allowing you to train for weeks or months in a distributed setting without disruption.
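Resuming from the last saved checkpoint assumes the training loop writes checkpoints it can restart from. The sketch below shows that general pattern in PyTorch; the checkpoint directory, file name, and save interval are arbitrary assumptions, not HyperPod-specific APIs.

    # General checkpoint-and-resume pattern that automatic recovery relies on.
    # Paths and the save interval are hypothetical placeholders.
    import os
    import torch

    CKPT_DIR = "/opt/ml/checkpoints"                 # example checkpoint location
    CKPT_PATH = os.path.join(CKPT_DIR, "latest.pt")

    def save_checkpoint(model, optimizer, step):
        os.makedirs(CKPT_DIR, exist_ok=True)
        torch.save(
            {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
            CKPT_PATH,
        )

    def load_checkpoint(model, optimizer):
        """Return the step to resume from (0 if no checkpoint exists yet)."""
        if not os.path.exists(CKPT_PATH):
            return 0
        state = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"] + 1

    # In the training loop: resume, then checkpoint every N steps so a repaired or
    # replaced node picks up close to where the failed one stopped.
    # start_step = load_checkpoint(model, optimizer)
    # for step in range(start_step, total_steps):
    #     ...train one step...
    #     if step % 500 == 0:
    #         save_checkpoint(model, optimizer, step)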
High-performance distributed training