Fully managed infrastructure at scale

Broad choice of hardware

Efficiently manage system resources with a wide choice of GPUs and CPUs, including P4d.24xlarge instances, which are the fastest training instances currently available in the cloud.

Easy setup and scale

Specify the location of data, indicate the type of SageMaker instances, and get started with a single click. SageMaker can automatically scale infrastructure up or down, from one to thousands of GPUs.
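As a rough illustration of what "specify the location of data and indicate the type of instances" amounts to, the sketch below assembles a request dictionary shaped like the input to the SageMaker CreateTrainingJob API. The helper name build_training_request, the bucket paths, the role ARN, and the image URI placeholder are all hypothetical; check the field names against the API reference before relying on them.

```python
def build_training_request(job_name, image_uri, role_arn, train_s3_uri,
                           output_s3_uri, instance_type="ml.p3.16xlarge",
                           instance_count=1):
    """Assemble a dict shaped like a CreateTrainingJob request (illustrative)."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "TrainingJobName": job_name,
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # Where the training data lives.
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        # Where model artifacts are written.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # Which instances to use, and how many.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 50,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    }


# All values below are placeholders, not real resources.
request = build_training_request(
    job_name="example-training-job",
    image_uri="<your-training-image-uri>",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
    instance_count=4,
)
```

Scaling from one instance to many is then a change to instance_count. With AWS credentials configured, a dictionary like this could be passed to boto3 as client("sagemaker").create_training_job(**request).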

High-performance distributed training

Distributed training libraries

With only a few lines of code, you can add either data parallelism or model parallelism to your training scripts. SageMaker makes it faster to perform distributed training by automatically splitting your models and training datasets across AWS GPU instances.
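To make the idea of data parallelism concrete, here is a minimal, framework-free sketch. It is not the SageMaker distributed training library; it only illustrates the general technique of splitting a dataset across workers, computing a gradient on each shard, and averaging the results before updating shared weights. The toy model (fit y = w * x) and all function names are assumptions made for the example.

```python
def gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def shard(data, num_workers):
    """Round-robin split: worker i gets examples i, i + num_workers, ..."""
    return [data[i::num_workers] for i in range(num_workers)]


def data_parallel_step(w, data, num_workers, lr=0.01):
    """One training step: local gradients per shard, then a weighted average."""
    shards = shard(data, num_workers)
    local_grads = [gradient(w, s) for s in shards]  # one gradient per worker
    total = sum(len(s) for s in shards)
    # Stand-in for an all-reduce: average gradients, weighted by shard size.
    avg_grad = sum(g * len(s) for g, s in zip(local_grads, shards)) / total
    return w - lr * avg_grad


# Toy dataset where the true weight is 3.0.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, data, num_workers=4)
```

Because the averaged gradient matches the gradient over the full dataset, the update with four workers is the same as with one; the benefit in practice is that each worker processes only its own shard. The sketch assumes there are at least as many examples as workers.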

Training Compiler

Amazon SageMaker Training Compiler can accelerate training by up to 50 percent through graph- and kernel-level optimizations that use GPUs more efficiently.

Built-in tools for the highest accuracy and lowest cost

Automatic model tuning

SageMaker can automatically tune your model by testing thousands of combinations of algorithm parameters to arrive at the most accurate predictions, saving weeks of effort.
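The sketch below shows the basic shape of automated tuning using plain random search. It is an illustration of the general idea only, not the search strategy SageMaker uses, and validation_error is a toy stand-in for the metric a real training job would report. The parameter names and ranges are assumptions.

```python
import random


def validation_error(learning_rate, depth):
    """Toy stand-in for a training job's validation metric (lower is better)."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2


def random_search(ranges, objective, trials, seed=0):
    """Sample parameter combinations at random and keep the best one seen."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    best_params, best_score = None, float("inf")
    for _ in range(trials):
        params = {
            "learning_rate": rng.uniform(*ranges["learning_rate"]),
            "depth": rng.randint(*ranges["depth"]),
        }
        score = objective(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


ranges = {"learning_rate": (0.001, 0.5), "depth": (2, 12)}
best_params, best_score = random_search(ranges, validation_error, trials=200)
print(best_params, best_score)
```

In a real tuning job each trial would be a full training run, which is why running the search automatically and in parallel saves so much time.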

Managed Spot training

SageMaker helps reduce training costs by up to 90 percent by automatically running training jobs on spare compute capacity (Amazon EC2 Spot Instances) as it becomes available. These training jobs are also resilient to interruptions caused by changes in capacity.
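As a worked example of how a savings figure like this can be computed, the sketch below compares the time a job was billed for with the total time it took to train. The function name and the sample numbers are hypothetical, and the formula reflects our understanding of how Managed Spot savings are reported; treat it as an assumption to verify.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved when only part of the training time is billed."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    if billable_seconds < 0:
        raise ValueError("billable_seconds must not be negative")
    return (1 - billable_seconds / training_seconds) * 100


# Hypothetical job: one hour of training, billed for 18 minutes.
savings = spot_savings_percent(training_seconds=3600, billable_seconds=1080)
print(f"Managed Spot savings: {savings:.0f}%")
```

With these sample numbers the savings come to roughly 70 percent. Actual savings depend on the instance type and on Spot capacity at the time the job runs.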

Built-in tools for interactivity and monitoring

Debugger and profiler

Amazon SageMaker Debugger captures metrics and profiles training jobs in real time, so you can quickly correct performance issues before deploying the model to production.

Experiment management

Amazon SageMaker Experiments captures input parameters, configurations, and results, and it stores them as experiments to help you track ML model iterations.

Full customization

SageMaker comes with built-in libraries and tools to make model training easier and faster. SageMaker works with popular open-source ML models, such as GPT, BERT, and DALL·E; ML frameworks, such as PyTorch and TensorFlow; and transformer libraries, such as Hugging Face. You can also use popular open-source libraries and tools, such as DeepSpeed, Megatron, Horovod, Ray Tune, and TensorBoard, based on your needs.

Automated ML training workflows

Automating training workflows helps you create a repeatable process to orchestrate model development steps for rapid experimentation and model retraining. You can automate the entire model build workflow, including data preparation, feature engineering, model training, model tuning, and model validation, using Amazon SageMaker Pipelines. You can configure SageMaker Pipelines to run automatically at regular intervals or when certain events are initiated, or you can run them manually as needed.
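The workflow steps listed above can be pictured as a small dependency graph that always runs in the same order. The sketch below uses only the Python standard library and is not the SageMaker Pipelines SDK; the step names mirror the text, and the no-op actions are placeholders for real work.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
steps = {
    "data_preparation": set(),
    "feature_engineering": {"data_preparation"},
    "model_training": {"feature_engineering"},
    "model_tuning": {"model_training"},
    "model_validation": {"model_tuning"},
}


def run_pipeline(steps, actions):
    """Run every step after its dependencies and return the execution order."""
    executed = []
    for name in TopologicalSorter(steps).static_order():
        actions[name]()  # placeholder for the step's real work
        executed.append(name)
    return executed


# Placeholder actions; a real pipeline would launch jobs here.
actions = {name: (lambda: None) for name in steps}
order = run_pipeline(steps, actions)
print(" -> ".join(order))
```

Because the order is derived from the declared dependencies rather than written by hand, the same definition can be rerun on a schedule, in response to an event, or manually, and it will execute the steps identically each time.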


Customer success


"Aurora’s advanced machine learning and simulation at scale are foundational to developing our technology safely and quickly, and AWS delivers the high performance we need to maintain our progress. With its virtually unlimited scale, AWS supports millions of virtual tests to validate the capabilities of the Aurora Driver so that it can safely navigate the countless edge cases of real-world driving." 

Chris Urmson, CEO, Aurora



"We use computer vision models to do scene segmentation, which is important for scene understanding. It used to take 57 minutes to train the model for one epoch, which slowed us down. Using Amazon SageMaker’s data parallelism library and with the help of the Amazon ML Solutions Lab, we were able to train in 6 minutes with optimized training code on 5 ml.p3.16xlarge instances. With the 10x reduction in training time, we can spend more time preparing data during the development cycle."

Jinwook Choi, Senior Research Engineer, Hyundai Motor Company


Latent Space

“At Latent Space, we're building a neural-rendered game engine where anyone can create at the speed of thought. Driven by advances in language modeling, we're working to incorporate semantic understanding of both text and images to determine what to generate. Our current focus is on utilizing information retrieval to augment large-scale model training, for which we have sophisticated ML pipelines. This setup presents a challenge on top of distributed training since there are multiple data sources and models being trained at the same time. As such, we're leveraging the new distributed training capabilities in Amazon SageMaker to efficiently scale training for large generative models.”

Sarah Jane Hong, Cofounder/Chief Science Officer, Latent Space



"One of Guidewire’s services is to help customers develop cutting-edge natural language processing (NLP) models for applications like risk assessment and claims operations. Amazon SageMaker Training Compiler is compelling because it offers time and cost savings to our customers while developing these NLP models. We expect it to help us reduce training time by more than 20 percent through more efficient use of GPU resources. We are excited to implement SageMaker Training Compiler in our NLP workloads, helping us to accelerate the transformation of data to insight for our customers."

Matt Pearson, Principal Product Manager—Analytics and Data Services, Guidewire Software


“Musixmatch uses Amazon SageMaker to build natural language processing (NLP) and audio processing models and is experimenting with Hugging Face with Amazon SageMaker. We choose Amazon SageMaker because it allows data scientists to iteratively build, train, and tune models quickly without having to worry about managing the underlying infrastructure, which means data scientists can work more quickly and independently. As the company has grown, so too have our requirements to train and tune larger and more complex NLP models. We are always looking for ways to accelerate training time while also lowering training costs, which is why we are excited about Amazon SageMaker Training Compiler. SageMaker Training Compiler provides more efficient ways to use GPUs during the training process and, with the seamless integration between SageMaker Training Compiler, PyTorch, and high-level libraries like Hugging Face, we have seen a significant improvement in training time of our transformer-based models going from weeks to days, as well as lower training costs.”

Loreto Parisi, Artificial Intelligence Engineering Director, Musixmatch


What's New

Stay up to date with the latest SageMaker model training announcements.

Blog post

Train 175+ billion parameter NLP models with model parallel additions and Hugging Face on Amazon SageMaker.


AWS re:Invent 2021 - Train ML models at scale with Amazon SageMaker, featuring Aurora.

Example notebooks

Download SageMaker model training and tuning code samples from the GitHub repository.


Blog post

Choose the best data source for your Amazon SageMaker training job.

Get started with a tutorial

Follow the step-by-step tutorial to learn how to train a model using SageMaker.

Try a self-paced workshop

In this hands-on lab, learn how to use SageMaker to build, train, and deploy an ML model.

Start building in the console

Get started building with SageMaker in the AWS Management Console.
