Amazon SageMaker Model Training
Train ML models quickly and cost effectively with Amazon SageMaker
Amazon SageMaker Model Training reduces the time and cost to train and tune machine learning (ML) models at scale without the need to manage infrastructure. You can take advantage of the highest-performing ML compute infrastructure currently available, and SageMaker can automatically scale infrastructure up or down, from one to thousands of GPUs. Since you pay only for what you use, you can manage your training costs more effectively. To train deep learning models faster, SageMaker distributed training libraries can automatically split large models and training datasets across AWS GPU instances, or you can use third-party libraries, such as DeepSpeed, Horovod, or Megatron.
Fully managed infrastructure at scale
Broad choice of hardware
Efficiently manage system resources with a wide choice of GPUs and CPUs, including ml.p4d.24xlarge instances, which are among the fastest training instances currently available in the cloud.
Easy setup and scale
Specify the location of data, indicate the type of SageMaker instances, and get started with a single click. SageMaker can automatically scale infrastructure up or down, from one to thousands of GPUs.
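The setup described above can be sketched with the SageMaker Python SDK. This is a minimal example, not a definitive recipe; the entry-point script, IAM role, and S3 paths are placeholders you would replace with your own.

```python
# Minimal SageMaker training job: point at data, pick an instance type, start.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # your training script (placeholder)
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.p4d.24xlarge",
    framework_version="2.0",
    py_version="py310",
)

# Launch the managed training job against data in S3.
estimator.fit({"training": "s3://my-bucket/train"})
```

Scaling up is a matter of raising `instance_count`; SageMaker provisions and tears down the cluster for you.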
High-performance distributed training
Distributed training libraries
With only a few lines of code, you can add either data parallelism or model parallelism to your training scripts. SageMaker makes it faster to perform distributed training by automatically splitting your models and training datasets across AWS GPU instances.
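As a sketch of those "few lines of code," enabling the SageMaker data parallelism library is a single extra argument on the estimator (script name, role, and S3 paths below are placeholders):

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=4,  # scale out across multiple GPU instances
    instance_type="ml.p4d.24xlarge",
    framework_version="1.13",
    py_version="py39",
    # One extra argument enables the SageMaker data parallelism library;
    # model parallelism is enabled the same way via "modelparallel".
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit({"training": "s3://my-bucket/train"})
```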
Amazon SageMaker Training Compiler can accelerate training by up to 50 percent through graph- and kernel-level optimizations that use GPUs more efficiently.
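Training Compiler is likewise enabled through a single estimator argument. A hedged sketch using the Hugging Face estimator (script, role, and version numbers are placeholders; check the SDK documentation for supported framework versions):

```python
from sagemaker.huggingface import HuggingFace, TrainingCompilerConfig

estimator = HuggingFace(
    entry_point="train.py",  # placeholder training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    transformers_version="4.21",
    pytorch_version="1.11",
    py_version="py38",
    # Turn on SageMaker Training Compiler with its default settings.
    compiler_config=TrainingCompilerConfig(),
)
```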
Built-in tools for the highest accuracy and lowest cost
Automatic model tuning
SageMaker can automatically tune your model by evaluating thousands of hyperparameter combinations to arrive at the most accurate predictions, saving weeks of effort.
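A tuning job can be sketched with the SDK's `HyperparameterTuner`. The metric name, regex, and ranges below are illustrative assumptions; they must match what your training script actually logs.

```python
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# Placeholder estimator; any SageMaker estimator works here.
estimator = PyTorch(
    entry_point="train.py",
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.0",
    py_version="py310",
)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:accuracy",
    objective_type="Maximize",
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(1e-5, 1e-2),
        "batch_size": IntegerParameter(16, 128),
    },
    metric_definitions=[
        {"Name": "validation:accuracy", "Regex": "val_accuracy=([0-9.]+)"}
    ],
    max_jobs=20,          # total trials
    max_parallel_jobs=4,  # trials run concurrently
)
tuner.fit({"training": "s3://my-bucket/train"})
```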
Managed Spot training
SageMaker helps reduce training costs by up to 90 percent by automatically running training jobs when compute capacity becomes available. These training jobs are also resilient to interruptions caused by changes in capacity.
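Managed Spot training is configured on the estimator itself. In this sketch (placeholder script, role, and bucket), checkpointing to S3 is what makes the job resilient to Spot interruptions:

```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.0",
    py_version="py310",
    use_spot_instances=True,  # run on Spot capacity at a discount
    max_run=3600,             # max training time, in seconds
    max_wait=7200,            # total wait for Spot capacity (must be >= max_run)
    # Checkpoints let an interrupted job resume where it left off.
    checkpoint_s3_uri="s3://my-bucket/checkpoints",
)
```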
Built-in tools for interactivity and monitoring
Debugger and profiler
Amazon SageMaker Debugger captures metrics and profiles training jobs in real time, so you can quickly correct performance issues before deploying the model to production.
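Debugger rules and the profiler attach to a training job through estimator arguments. A rough sketch (placeholder script and role; the chosen rules and intervals are illustrative):

```python
from sagemaker.debugger import FrameworkProfile, ProfilerConfig, Rule, rule_configs
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.13",
    py_version="py39",
    # Built-in Debugger rules flag common training problems in real time.
    rules=[
        Rule.sagemaker(rule_configs.loss_not_decreasing()),
        Rule.sagemaker(rule_configs.vanishing_gradient()),
    ],
    # Profile system and framework metrics while the job runs.
    profiler_config=ProfilerConfig(
        system_monitor_interval_millis=500,
        framework_profile_params=FrameworkProfile(),
    ),
)
```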
Amazon SageMaker Experiments captures input parameters, configurations, and results, and it stores them as experiments to help you track ML model iterations.
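Tracking an iteration with SageMaker Experiments can be sketched as below, assuming a recent SageMaker Python SDK; the experiment name, parameter, and metric value are illustrative placeholders:

```python
from sagemaker.experiments.run import Run

# Track one training iteration as a run within a named experiment.
with Run(experiment_name="image-classifier", run_name="baseline-lr-1e-3") as run:
    run.log_parameter("learning_rate", 1e-3)
    # ... training loop (placeholder) ...
    run.log_metric(name="val_accuracy", value=0.91)  # illustrative value
```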
SageMaker comes with built-in libraries and tools to make model training easier and faster. SageMaker works with popular open-source ML models, such as GPT, BERT, and DALL·E; ML frameworks, such as PyTorch and TensorFlow; and libraries such as Hugging Face Transformers. With SageMaker, you can also use popular open-source tools, such as DeepSpeed, Megatron, Horovod, Ray Tune, and TensorBoard, based on your needs.
Automated ML training workflows
Automating training workflows helps you create a repeatable process to orchestrate model development steps for rapid experimentation and model retraining. You can automate the entire model build workflow, including data preparation, feature engineering, model training, model tuning, and model validation, using Amazon SageMaker Pipelines. You can configure SageMaker Pipelines to run automatically at regular intervals or when certain events are initiated, or you can run them manually as needed.
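A minimal Pipelines sketch, assuming placeholder script, role, and pipeline names; real pipelines would chain processing, training, tuning, and evaluation steps:

```python
from sagemaker.pytorch import PyTorch
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

estimator = PyTorch(
    entry_point="train.py",  # placeholder
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.0",
    py_version="py310",
)

# A one-step pipeline wrapping the training job above.
step_train = TrainingStep(name="TrainModel", estimator=estimator)
pipeline = Pipeline(name="ModelBuildPipeline", steps=[step_train])

pipeline.upsert(role_arn="arn:aws:iam::111122223333:role/SageMakerRole")
pipeline.start()  # run on demand; a schedule or event can trigger it instead
```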
"Aurora’s advanced machine learning and simulation at scale are foundational to developing our technology safely and quickly, and AWS delivers the high performance we need to maintain our progress. With its virtually unlimited scale, AWS supports millions of virtual tests to validate the capabilities of the Aurora Driver so that it can safely navigate the countless edge cases of real-world driving."
Chris Urmson, CEO, Aurora
"We use computer vision models to do scene segmentation, which is important for scene understanding. It used to take 57 minutes to train the model for one epoch, which slowed us down. Using Amazon SageMaker’s data parallelism library and with the help of the Amazon ML Solutions Lab, we were able to train in 6 minutes with optimized training code on five ml.p3.16xlarge instances. With the 10x reduction in training time, we can spend more time preparing data during the development cycle."
Jinwook Choi, Senior Research Engineer, Hyundai Motor Company
“At Latent Space, we're building a neural-rendered game engine where anyone can create at the speed of thought. Driven by advances in language modeling, we're working to incorporate semantic understanding of both text and images to determine what to generate. Our current focus is on utilizing information retrieval to augment large-scale model training, for which we have sophisticated ML pipelines. This setup presents a challenge on top of distributed training since there are multiple data sources and models being trained at the same time. As such, we're leveraging the new distributed training capabilities in Amazon SageMaker to efficiently scale training for large generative models.”
Sarah Jane Hong, Cofounder/Chief Science Officer, Latent Space
"One of Guidewire’s services is to help customers develop cutting-edge natural language processing (NLP) models for applications like risk assessment and claims operations. Amazon SageMaker Training Compiler is compelling because it offers time and cost savings to our customers while developing these NLP models. We expect it to help us reduce training time by more than 20 percent through more efficient use of GPU resources. We are excited to implement SageMaker Training Compiler in our NLP workloads, helping us to accelerate the transformation of data to insight for our customers."
Matt Pearson, Principal Product Manager—Analytics and Data Services, Guidewire Software
“Musixmatch uses Amazon SageMaker to build natural language processing (NLP) and audio processing models and is experimenting with Hugging Face with Amazon SageMaker. We choose Amazon SageMaker because it allows data scientists to iteratively build, train, and tune models quickly without having to worry about managing the underlying infrastructure, which means data scientists can work more quickly and independently. As the company has grown, so too have our requirements to train and tune larger and more complex NLP models. We are always looking for ways to accelerate training time while also lowering training costs, which is why we are excited about Amazon SageMaker Training Compiler. SageMaker Training Compiler provides more efficient ways to use GPUs during the training process and, with the seamless integration between SageMaker Training Compiler, PyTorch, and high-level libraries like Hugging Face, we have seen a significant improvement in training time of our transformer-based models going from weeks to days, as well as lower training costs.”
Loreto Parisi, Artificial Intelligence Engineering Director, Musixmatch
Stay up to date with the latest SageMaker model training announcements.
Train 175+ billion parameter NLP models with model parallel additions and Hugging Face on Amazon SageMaker.
AWS re:Invent 2021 - Train ML models at scale with Amazon SageMaker, featuring Aurora.
Download SageMaker model training and tuning code samples from the GitHub repository.
View our latest benchmarks for the most popular machine learning models.
Choose the best data source for your Amazon SageMaker training job.
Follow the step-by-step tutorial to learn how to train a model using SageMaker.
In this hands-on lab, learn how to use SageMaker to build, train, and deploy an ML model.
Get started building with SageMaker in the AWS Management Console.