Amazon SageMaker Model Training

Train ML models quickly and cost-effectively with Amazon SageMaker

Amazon SageMaker reduces the time and cost of training and tuning machine learning (ML) models, without the need to manage infrastructure. Built-in tools let you manage and track training experiments, automatically choose optimal hyperparameters, debug training jobs, and monitor the utilization of system resources such as GPUs, CPUs, and network bandwidth. You can automate every training step and integrate it into your complete ML workflow, which makes it easier to scale to thousands of training experiments.

SageMaker also offers the highest-performing ML compute infrastructure currently available, including Amazon EC2 P4d instances, which can reduce ML training costs by up to 60% compared with previous-generation instances. SageMaker can automatically scale infrastructure up or down based on your training job requirements, from one GPU to thousands, or from terabytes to petabytes of storage. Because you pay only for what you use, you can manage your training costs more effectively.

To train deep learning (DL) models faster, you can use Amazon SageMaker Training Compiler, which accelerates training by up to 50% through graph- and kernel-level optimizations that make more efficient use of GPUs. You can also add either data parallelism or model parallelism to your training script with a few lines of code; the SageMaker distributed training libraries then automatically split models and training datasets across GPU instances to help you complete distributed training faster.

Key features

Experiment management and automatic model tuning

Amazon SageMaker Experiments captures input parameters, configurations, and results, and stores them as experiments to help you track ML model iterations. SageMaker can also automatically tune your model by adjusting thousands of algorithm parameter combinations to arrive at the most accurate predictions, saving weeks of effort.
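As a rough illustration of what automatic model tuning takes off your hands, the sketch below runs a plain random search over a small hyperparameter space and keeps a record of every trial. It is not SageMaker's tuning algorithm or API: `SEARCH_SPACE`, `toy_validation_loss`, and `random_search` are hypothetical names, and the objective is a stand-in for a real training job.

```python
import random

# Hypothetical search space: name -> (low, high) bounds for a continuous value.
SEARCH_SPACE = {"learning_rate": (0.001, 0.3), "dropout": (0.0, 0.5)}


def toy_validation_loss(params):
    """Stand-in for a training job; lowest near learning_rate=0.1, dropout=0.2."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["dropout"] - 0.2) ** 2


def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials configurations; return (best_params, best_loss, trials)."""
    rng = random.Random(seed)
    trials = []  # every (params, loss) pair, kept as a simple experiment log
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        trials.append((params, objective(params)))
    best_params, best_loss = min(trials, key=lambda t: t[1])
    return best_params, best_loss, trials


best_params, best_loss, trials = random_search(
    toy_validation_loss, SEARCH_SPACE, n_trials=50
)
print(f"best loss {best_loss:.5f} with {best_params}")
```

The `trials` list plays the role that experiment tracking plays at scale: each entry pairs the input parameters with the result, so iterations can be compared afterward.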

Debug and profile training runs

Amazon SageMaker Debugger captures metrics and profiles training jobs in real time so you can correct performance issues quickly before deploying the model to production.
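To make the idea of catching a training problem from captured metrics concrete, here is a minimal sketch of one kind of check: flagging the step at which the loss stops improving. The function name, the `window` and `min_improvement` parameters, and the sample loss curves are all assumptions for illustration; this is not Debugger's implementation or API.

```python
def loss_not_decreasing(losses, window=5, min_improvement=0.01):
    """Return the first step whose loss has not improved by at least
    min_improvement (relative) compared with `window` steps earlier,
    or None if the loss keeps improving."""
    for step in range(window, len(losses)):
        past = losses[step - window]
        current = losses[step]
        if past - current < min_improvement * abs(past):
            return step
    return None


# Hypothetical loss curves, as they might be collected one value per step.
healthy = [0.9 ** i for i in range(20)]                 # keeps decreasing
stalled = [1.0, 0.8, 0.6, 0.5, 0.45] + [0.45] * 10      # plateaus at 0.45

print(loss_not_decreasing(healthy))  # None: no issue detected
print(loss_not_decreasing(stalled))  # 9: first step with no gain over 5 steps
```

Running a check like this while the job is still training is what lets a stalled run be stopped or corrected early instead of being discovered after the fact.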

Distributed training

With only a few lines of code, you can add either data parallelism or model parallelism to your TensorFlow or PyTorch training scripts. SageMaker makes it faster to perform distributed training by automatically splitting DL models and training datasets across AWS GPU instances.
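The data parallelism idea can be sketched in plain Python: the dataset is split into shards, each worker computes a gradient on its own shard, and the gradients are averaged before the shared weights are updated. This is a conceptual sketch only, with no GPUs and no SageMaker library calls; the linear model and the names `gradient`, `shard`, and `data_parallel_step` are assumptions made for the example.

```python
def gradient(w, b, batch):
    """Gradient of mean squared error for the model y = w*x + b on one batch."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def shard(data, num_workers):
    """Split data into num_workers interleaved shards."""
    return [data[i::num_workers] for i in range(num_workers)]


def data_parallel_step(w, b, shards, lr):
    """Each worker computes a gradient on its shard; the gradients are then
    averaged (the role an all-reduce plays on real hardware) and applied."""
    grads = [gradient(w, b, s) for s in shards]  # would run concurrently, one per GPU
    gw = sum(g[0] for g in grads) / len(grads)
    gb = sum(g[1] for g in grads) / len(grads)
    return w - lr * gw, b - lr * gb


data = [(float(x), 3.0 * x + 1.0) for x in range(8)]  # points on y = 3x + 1
shards = shard(data, num_workers=4)                   # 4 workers, 2 points each

w, b = 0.0, 0.0
for _ in range(2000):
    w, b = data_parallel_step(w, b, shards, lr=0.01)
print(f"w={w:.3f}, b={b:.3f}")
```

Because the shards here are equal in size, the averaged gradient matches the gradient computed on the full dataset, so the parallel version follows the same optimization path as single-worker training while each worker touches only a fraction of the data.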

Training Compiler

SageMaker Training Compiler can accelerate training by up to 50% through graph- and kernel-level optimizations that use GPUs more efficiently. SageMaker Training Compiler is integrated with versions of TensorFlow and PyTorch in SageMaker, so you can speed up training in these popular frameworks with minimal code changes.

Managed Spot Training

Amazon SageMaker Managed Spot Training helps reduce training costs by up to 90%. Training jobs are automatically run when compute capacity becomes available, and they’re resilient to interruptions caused by changes in capacity.
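A minimal sketch of how a spot-enabled job might be configured and how the resulting savings can be computed follows. The dictionary keys are the spot-related fields of the SageMaker `CreateTrainingJob` request as I understand them; the helper names, the bucket name, and the timing numbers are hypothetical, and nothing here calls AWS.

```python
def spot_training_settings(max_runtime_s, max_wait_s, checkpoint_s3_uri):
    """Build only the spot-related fields of a training job request.

    The wait time covers both the run time and any time spent waiting for
    spot capacity, so it must not be shorter than the run time.
    """
    if max_wait_s < max_runtime_s:
        raise ValueError("max_wait_s must be >= max_runtime_s")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            "MaxRuntimeInSeconds": max_runtime_s,
            "MaxWaitTimeInSeconds": max_wait_s,
        },
        # Checkpoints let an interrupted job resume instead of starting over.
        "CheckpointConfig": {"S3Uri": checkpoint_s3_uri},
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved: the share of training time that was not billed."""
    return (training_seconds - billable_seconds) / training_seconds * 100


settings = spot_training_settings(
    max_runtime_s=3600,
    max_wait_s=7200,
    checkpoint_s3_uri="s3://example-bucket/checkpoints/",  # hypothetical bucket
)
savings = spot_savings_percent(training_seconds=3600, billable_seconds=1080)
print(settings["StoppingCondition"], f"savings: {savings:.0f}%")
```

Pairing the spot flag with a checkpoint location is what makes a job resilient to interruptions: when capacity is reclaimed, training can pick up from the last checkpoint once capacity returns.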


Hyundai Motor Company
“We use computer vision models to do scene segmentation, which is important for scene understanding. It used to take 57 minutes to train the model for one epoch, which slowed us down. Using Amazon SageMaker’s data parallelism library and with the help of the Amazon ML Solutions Lab, we were able to train in six minutes with optimized training code on 5 ml.p3.16xlarge instances. With the 10x reduction in training time, we can spend more time preparing data during the development cycle.”

Jinwook Choi, Senior Research Engineer, Hyundai Motor Company

Latent Space
“At Latent Space we're building a neural rendered game engine where anyone can create at the speed of thought. Driven by advances in language modeling, we're working to incorporate semantic understanding of both text and images to determine what to generate. Our current focus is on utilizing information retrieval to augment large-scale model training, for which we have sophisticated ML pipelines. This setup presents a challenge on top of distributed training since there are multiple data sources and models being trained at the same time. As such, we're leveraging the new distributed training capabilities in Amazon SageMaker to efficiently scale training for large generative models.”

Sarah Jane Hong, Cofounder & Chief Science Officer, Latent Space

Guidewire Software
“One of Guidewire’s services is to help customers develop cutting-edge natural language processing (NLP) models for applications like risk assessment and claims operations. Amazon SageMaker Training Compiler is compelling because it offers time and cost-savings to our customers while developing these NLP models. We expect it to help us reduce training time by more than 20 percent through more efficient use of GPU resources. We are excited to implement SageMaker Training Compiler in our NLP workloads, helping us to accelerate the transformation of data to insight for our customers.”

Matt Pearson, Principal Product Manager, Analytics and Data Services, Guidewire Software

Musixmatch

“Musixmatch uses Amazon SageMaker to build natural language processing (NLP) and audio processing models, and is experimenting with Hugging Face with Amazon SageMaker. We choose Amazon SageMaker because it allows data scientists to iteratively build, train, and tune models quickly without having to worry about managing the underlying infrastructure, which means data scientists can work more quickly and independently. As the company has grown, so too have our requirements to train and tune larger and more complex NLP models. We are always looking for ways to accelerate training time while also lowering training costs, which is why we are excited about Amazon SageMaker Training Compiler. SageMaker Training Compiler provides more efficient ways to use GPUs during the training process and, with the seamless integration between SageMaker Training Compiler, PyTorch, and high-level libraries like Hugging Face, we have seen a significant improvement in training time of our transformer-based models going from weeks to days, as well as lower training costs.”

Loreto Parisi, AI Engineering Director, Musixmatch
