Amazon SageMaker Model Training

Train ML models quickly and cost effectively with Amazon SageMaker

What is SageMaker Model Training?

Amazon SageMaker Model Training reduces the time and cost to train and tune machine learning (ML) models at scale without the need to manage infrastructure. You can take advantage of the highest-performing ML compute infrastructure currently available, and SageMaker can automatically scale infrastructure up or down, from one to thousands of GPUs. Since you pay only for what you use, you can manage your training costs more effectively. To train deep learning models faster, SageMaker helps you select and refine datasets in real time. SageMaker distributed training libraries can automatically split large models and training datasets across AWS GPU instances, or you can use third-party libraries such as DeepSpeed, Horovod, or Megatron. Train foundation models (FMs) for weeks and months without disruption by automatically monitoring and repairing training clusters.
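
As an illustration, here is a minimal sketch of launching a managed training job with the SageMaker Python SDK; the IAM role ARN, S3 path, script name, and framework versions are placeholders for your own setup, and the distributed training options mentioned above are configured through the same estimator interface.

```python
import sagemaker
from sagemaker.pytorch import PyTorch

# Placeholder role ARN; use the execution role from your own account.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"

estimator = PyTorch(
    entry_point="train.py",           # your training script
    role=role,
    instance_count=1,                 # scale up to many instances for distributed jobs
    instance_type="ml.p4d.24xlarge",  # GPU instance; CPU and AWS Trainium types also work
    framework_version="2.1",
    py_version="py310",
    hyperparameters={"epochs": 10, "batch-size": 64},
)

# SageMaker provisions the instances, runs the script, and releases the
# infrastructure when the job finishes, so you pay only for training time.
estimator.fit({"training": "s3://your-bucket/path/to/training-data"})
```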

Amazon SageMaker Model Training Overview

How it works

Train and tune ML models at scale with state-of-the-art ML tools and the highest-performing ML compute infrastructure.
How SageMaker Model Training works

Benefits of cost-effective training

Amazon SageMaker offers a broad choice of GPUs and CPUs, as well as AWS accelerators such as AWS Trainium and AWS Inferentia, to enable large-scale model training. You can automatically scale infrastructure up or down, from one to thousands of GPUs. Amazon SageMaker HyperPod is purpose-built for large-scale distributed training, allowing you to train foundation models (FMs) faster.
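
A minimal sketch of creating a SageMaker HyperPod cluster with the AWS SDK for Python (boto3); the cluster name, instance group, instance count, lifecycle script location, and role ARN are assumptions used only to illustrate the request shape.

```python
import boto3

sm = boto3.client("sagemaker")

# All names, counts, and ARNs below are placeholders for your own setup.
sm.create_cluster(
    ClusterName="fm-training-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "gpu-workers",
            "InstanceType": "ml.p4d.24xlarge",  # GPU; AWS Trainium types such as ml.trn1.32xlarge are another option
            "InstanceCount": 16,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://your-bucket/hyperpod-lifecycle/",
                "OnCreate": "on_create.sh",     # bootstrap script run on each node
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerClusterRole",
        }
    ],
)
```
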
With only a few lines of code, you can add either data parallelism or model parallelism to your training scripts. SageMaker makes it faster to perform distributed training by automatically splitting your models and training datasets across AWS GPU instances.
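
For example, with the SageMaker Python SDK the data parallelism library can be enabled through the estimator's distribution argument; the role, script, versions, and instance settings below are placeholders.

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    instance_count=4,                 # the dataset is sharded across all instances
    instance_type="ml.p4d.24xlarge",
    framework_version="2.1",
    py_version="py310",
    # Enable the SageMaker distributed data parallelism library.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit({"training": "s3://your-bucket/path/to/training-data"})
```
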
SageMaker can automatically tune your model by adjusting thousands of algorithm parameter combinations to arrive at the most accurate predictions. Use debugging and profiling tools to quickly correct performance issues and optimize training performance.
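
A sketch of automatic model tuning with the SageMaker Python SDK; the objective metric, log regex, parameter ranges, and job counts are illustrative assumptions, and estimator is any SageMaker estimator such as the ones sketched above.

```python
from sagemaker.tuner import (
    HyperparameterTuner,
    ContinuousParameter,
    IntegerParameter,
)

tuner = HyperparameterTuner(
    estimator=estimator,                          # any SageMaker estimator
    objective_metric_name="validation:accuracy",
    objective_type="Maximize",
    hyperparameter_ranges={
        "learning-rate": ContinuousParameter(1e-5, 1e-2),
        "batch-size": IntegerParameter(32, 256),
    },
    # Regex that extracts the metric from the training logs (illustrative).
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "val_accuracy=([0-9\\.]+)"}
    ],
    max_jobs=20,          # total training jobs the tuner may launch
    max_parallel_jobs=4,  # jobs run concurrently
)

tuner.fit({"training": "s3://your-bucket/path/to/training-data"})
```
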
SageMaker enables efficient ML experiments to help you more easily track ML model iterations. Improve model training performance by visualizing the model architecture to identify and remediate convergence issues.
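
A sketch of tracking one training iteration with SageMaker Experiments through the SageMaker Python SDK's Run API; the experiment name, run name, and logged values are illustrative.

```python
from sagemaker.experiments.run import Run

with Run(experiment_name="my-experiment", run_name="baseline-run") as run:
    # Log the configuration of this iteration.
    run.log_parameter("learning_rate", 0.001)
    run.log_parameter("batch_size", 64)

    for epoch in range(3):
        # In a real script these values come from your training loop.
        run.log_metric(name="train:loss", value=1.0 / (epoch + 1), step=epoch)
```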

Train models at scale

  • Fully Managed Infrastructure
  • SageMaker HyperPod