Amazon SageMaker Model Training

Train and fine-tune ML and generative AI models

What is Amazon SageMaker Model Training?

Amazon SageMaker Model Training reduces the time and cost to train and tune machine learning (ML) models at scale without the need to manage infrastructure. You can take advantage of the highest-performing ML compute infrastructure currently available, and SageMaker AI can automatically scale infrastructure up or down, from one to thousands of GPUs. To train deep learning models faster, SageMaker AI helps you select and refine datasets in real time. SageMaker distributed training libraries can automatically split large models and training datasets across AWS GPU instances, or you can use third-party libraries, such as DeepSpeed, Horovod, or Megatron. Train foundation models (FMs) for weeks and months without disruption by automatically monitoring and repairing training clusters.

Benefits of cost-effective training

SageMaker AI offers a broad choice of GPUs and CPUs, as well as AWS accelerators such as AWS Trainium and AWS Inferentia, to enable large-scale model training. You can automatically scale infrastructure up or down, from one to thousands of GPUs.
SageMaker AI allows you to automatically split your models and training datasets across AWS cluster instances to help you efficiently scale training workloads. It helps you to optimize your training job for AWS network infrastructure and cluster topology. It also streamlines model checkpointing via the recipes by optimizing the frequency of saving checkpoints, ensuring minimum overhead during training. You can also use optimized recipes to benefit from state-of-the-art performance and quickly get started training and fine-tuning publicly available generative AI models in minutes.
SageMaker AI can automatically tune your model by adjusting thousands of algorithm parameter combinations to arrive at the most accurate predictions. Use debugging and profiling tools to quickly correct performance issues and optimize training performance.
SageMaker AI enables efficient ML experiments to help you more easily track ML model iterations. Improve model training performance by visualizing the model architecture to identify and remediate convergence issues.

Train models at scale

Fully managed training jobs

Amazon SageMaker training jobs offer a fully managed user experience for large distributed FM training, removing the undifferentiated heavy lifting around infrastructure management. SageMaker training jobs automatically spin up a resilient distributed training cluster, monitor the infrastructure, and auto-recover from faults to ensure a smooth training experience. Once training is complete, SageMaker spins down the cluster and you are billed for the net training time. With SageMaker training jobs, you also have the flexibility to choose the instance type that best fits an individual workload (for example, pre-training an LLM on a P5 cluster or fine-tuning an open source LLM on P4d instances) to further optimize your training budget. In addition, training jobs offer a consistent user experience across ML teams with varying levels of technical expertise and different workload types.
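
As a rough illustration of what a managed training job asks you to specify, the sketch below assembles a request in the shape accepted by the SageMaker CreateTrainingJob API. It only builds a dictionary and makes no AWS call. The helper name build_training_job_request, the job name, image URI, role ARN, S3 path, and volume size are hypothetical placeholders chosen for this example.

```python
def build_training_job_request(job_name, image_uri, role_arn, output_s3_uri,
                               instance_type="ml.p4d.24xlarge",
                               instance_count=2,
                               max_runtime_seconds=24 * 3600):
    """Assemble a CreateTrainingJob-style request (hypothetical helper)."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,      # container holding the training code
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # The instance type and count are the main levers on the training budget.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 500,           # placeholder value
        },
        # Upper bound on how long the job may run.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


# All argument values below are made-up placeholders.
request = build_training_job_request(
    job_name="example-fine-tune-job",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-training:latest",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    output_s3_uri="s3://example-bucket/output/",
)
```

With AWS credentials configured, a dictionary like this could be passed as keyword arguments to the boto3 SageMaker client's create_training_job call; the cluster lifecycle described above is then handled by the service rather than by your code.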

Amazon SageMaker HyperPod

Amazon SageMaker HyperPod is purpose-built infrastructure for efficiently managing compute clusters to scale foundation model (FM) development. It enables advanced model training techniques, infrastructure control, performance optimization, and enhanced model observability. SageMaker HyperPod is preconfigured with Amazon SageMaker distributed training libraries, allowing you to automatically split models and training datasets across AWS cluster instances to help efficiently utilize the cluster’s compute and network infrastructure. It provides a more resilient environment by automatically detecting, diagnosing, and recovering from hardware faults, allowing you to continually train FMs for months without disruption and reducing training time by up to 40%.

High-performance distributed training

SageMaker AI makes it faster to perform distributed training by automatically splitting your models and training datasets across AWS accelerators. It helps you to optimize your training job for AWS network infrastructure and cluster topology. It also streamlines model checkpointing via the recipes by optimizing the frequency of saving checkpoints, ensuring minimum overhead during training. With recipes, data scientists and developers of all skill sets benefit from state-of-the-art performance while quickly getting started training and fine-tuning publicly available generative AI models, including Llama 3.1 405B, Mixtral 8x22B, and Mistral 7B. The recipes include a training stack that has been tested by AWS, removing weeks of tedious work testing different model configurations. You can switch between GPU-based and AWS Trainium-based instances with a one-line recipe change and enable automated model checkpointing for improved training resiliency. In addition, run workloads in production on the SageMaker training feature of your choice.
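
To make the idea of splitting a training dataset across accelerators concrete, here is a minimal data-parallel sharding sketch in plain Python. It is not how the SageMaker distributed training libraries are implemented; the function name shard_indices and the strided assignment scheme are assumptions chosen only to illustrate the concept.

```python
def shard_indices(num_samples, world_size, rank):
    """Return the sample indices that worker `rank` would process."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Strided assignment: worker r takes samples r, r + world_size, ...
    return list(range(rank, num_samples, world_size))


# Split 10 samples across 4 workers (for example, 4 accelerators).
shards = [shard_indices(10, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(f"worker {rank}: {shard}")
```

Each sample lands on exactly one worker, and shard sizes differ by at most one, which keeps the per-step work roughly balanced across the cluster.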

Built-in tools for the highest accuracy and lowest cost

Automatic model tuning

SageMaker AI can automatically tune your model by adjusting thousands of algorithm parameter combinations to arrive at the most accurate predictions, saving weeks of effort. It helps you to find the best version of a model by running many training jobs on your dataset.
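
The idea of running many training jobs and keeping the best one can be sketched with a simple random search. This is a toy illustration, not the tuning strategy SageMaker AI uses: toy_validation_loss stands in for a real training job, and the hyperparameter names and ranges are assumptions made for the example.

```python
import random


def toy_validation_loss(learning_rate, batch_size):
    """Stand-in for a training job; lowest near lr=0.01 and batch_size=64."""
    return (learning_rate - 0.01) ** 2 * 1e4 + ((batch_size - 64) / 64) ** 2


def random_search(num_trials, seed=0):
    """Sample hyperparameters at random and keep the lowest-loss trial."""
    rng = random.Random(seed)  # seeded so the search is reproducible
    best = None
    for _ in range(num_trials):
        params = {
            # Sample the learning rate on a log scale between 1e-4 and 1e-1.
            "learning_rate": 10 ** rng.uniform(-4, -1),
            "batch_size": rng.choice([16, 32, 64, 128, 256]),
        }
        loss = toy_validation_loss(**params)
        if best is None or loss < best["loss"]:
            best = {"params": params, "loss": loss}
    return best


best = random_search(num_trials=50)
print(best["params"], best["loss"])
```

In a managed tuning job, each trial would be a separate training job running on your dataset, and the search strategy would typically be more sample-efficient than uniform random sampling.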

Managed Spot training

SageMaker AI helps reduce training costs by up to 90 percent by automatically running training jobs when compute capacity becomes available. These training jobs are also resilient to interruptions caused by changes in capacity.
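
The sketch below shows the Spot-related fields of a CreateTrainingJob-style request and how the savings for a finished job can be estimated from its training time and billable time. It builds plain dictionaries and makes no AWS call; the helper names, the S3 URI, and the example durations are hypothetical.

```python
def spot_settings(max_runtime_seconds, max_wait_seconds, checkpoint_s3_uri):
    """Return the Spot-related request fields (hypothetical helper)."""
    # The wait limit covers both waiting for capacity and running,
    # so it cannot be shorter than the runtime limit.
    if max_wait_seconds < max_runtime_seconds:
        raise ValueError("max_wait_seconds must be >= max_runtime_seconds")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_seconds,
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
        # Checkpoints let an interrupted job resume instead of restarting.
        "CheckpointConfig": {"S3Uri": checkpoint_s3_uri},
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved relative to paying for the full training time."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return 100.0 * (training_seconds - billable_seconds) / training_seconds


settings = spot_settings(3600, 7200, "s3://example-bucket/checkpoints/")
# Made-up numbers: a one-hour job billed for 18 minutes.
savings = spot_savings_percent(training_seconds=3600, billable_seconds=1080)
print(f"Estimated savings: {savings:.1f}%")
```

The two durations correspond to the TrainingTimeInSeconds and BillableTimeInSeconds values that SageMaker reports when you describe a completed training job.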

Debugging

Amazon SageMaker Debugger captures metrics and profiles training jobs in real time, so you can quickly correct performance issues before deploying the model to production. You can also remotely connect to the model training environment in Amazon SageMaker for debugging with access to the underlying training container.

Profiler

Amazon SageMaker Profiler helps you optimize training performance with granular hardware profiling insights including aggregated GPU and CPU utilization metrics, high resolution GPU/CPU trace plots, custom annotations, and visibility into mixed precision utilization.