AWS Trainium is the second-generation machine learning (ML) accelerator that AWS purpose-built for deep learning training of 100B+ parameter models. Each Amazon Elastic Compute Cloud (EC2) Trn1 instance deploys up to 16 AWS Trainium accelerators to deliver a high-performance, low-cost solution for deep learning (DL) training in the cloud. Although the use of deep learning is accelerating, many development teams are limited by fixed budgets, which puts a cap on the scope and frequency of training needed to improve their models and applications. Trainium-based EC2 Trn1 instances solve this challenge by delivering faster time to train while offering up to 50% cost-to-train savings over comparable Amazon EC2 instances. Trainium has been optimized for training natural language processing, computer vision, and recommender models used in a broad set of applications, such as text summarization, code generation, question answering, image and video generation, recommendation, and fraud detection.
Purpose built for high-performance deep learning training
Each Trainium accelerator includes two second-generation NeuronCores that are purpose-built for deep learning algorithms. To support efficient data and model parallelism, each Trainium accelerator has 32 GB of high-bandwidth memory, delivers up to 190 TFLOPS of FP16/BF16 compute power, and features NeuronLink, an intra-instance, ultra-high-speed nonblocking interconnect technology.
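Combining the per-accelerator figures above with the up-to-16 accelerators per Trn1 instance gives a rough sense of aggregate instance capacity. The short sketch below simply multiplies out the quoted numbers (it assumes the maximum accelerator count; smaller Trn1 sizes carry fewer accelerators):

```python
# Per-accelerator figures quoted above
ACCELERATORS_PER_INSTANCE = 16   # maximum for a Trn1 instance
HBM_GB_PER_ACCELERATOR = 32      # high-bandwidth memory per accelerator
TFLOPS_PER_ACCELERATOR = 190     # peak FP16/BF16 compute per accelerator

# Aggregate capacity of a fully populated instance
total_hbm_gb = ACCELERATORS_PER_INSTANCE * HBM_GB_PER_ACCELERATOR
total_pflops = ACCELERATORS_PER_INSTANCE * TFLOPS_PER_ACCELERATOR / 1000

print(f"Accelerator memory: {total_hbm_gb} GB")   # 512 GB
print(f"Peak FP16/BF16:     {total_pflops} PFLOPS")  # 3.04 PFLOPS
```

That 512 GB of pooled accelerator memory, reachable over NeuronLink without leaving the instance, is what makes model-parallel sharding of very large models practical on a single Trn1 instance.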
Optimized for state-of-the-art models
Trainium has native support for a wide range of data types (FP32, TF32, BF16, FP16, UINT8, and configurable FP8). It supports hardware-accelerated stochastic rounding to deliver high performance and higher accuracy as compared to legacy rounding modes. Trainium also provides full-stack support for dynamic tensor shapes, control flow, and custom operators written in C++ to deliver flexible, future-proofed infrastructure for your training needs.
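To illustrate why hardware-accelerated stochastic rounding improves accuracy at low precision, here is a minimal pure-Python sketch of the technique (integer granularity for simplicity; Trainium applies the same idea at the floating-point level in hardware). A value is rounded up or down with probability proportional to its fractional part, so the result is unbiased in expectation, whereas round-to-nearest silently discards small updates:

```python
import random

def stochastic_round(x: float) -> int:
    """Round x to an adjacent integer with probability proportional to
    proximity, so the expected result equals x (unbiased)."""
    floor_x = int(x // 1)
    frac = x - floor_x
    return floor_x + (1 if random.random() < frac else 0)

# Accumulating many small increments, as gradient updates do:
# round-to-nearest loses every one of them, stochastic rounding
# preserves the total in expectation.
random.seed(0)
n, inc = 10_000, 0.3
nearest_sum = sum(round(inc) for _ in range(n))            # 0.3 -> 0 every time
stochastic_sum = sum(stochastic_round(inc) for _ in range(n))

print(nearest_sum)     # 0
print(stochastic_sum)  # close to n * inc = 3000
```

This is the failure mode legacy rounding modes hit when small gradient contributions repeatedly fall below half the representable step; stochastic rounding keeps training numerically faithful at FP16/BF16 precision.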
Native support for ML frameworks and libraries
The AWS Neuron SDK, which supports Trainium, is natively integrated with PyTorch and TensorFlow. This ensures that you can continue using your existing workflows in these popular frameworks and get started with Trainium with only a few lines of code changes. For distributed model training, the Neuron SDK supports libraries such as Megatron-LM and PyTorch Fully Sharded Data Parallel (FSDP). To get started quickly with Trainium-powered EC2 Trn1 instances, see popular model examples in the Neuron documentation.
AWS Neuron SDK
AWS Neuron is an SDK consisting of a compiler, runtime, and profiling tools that you can use to run high-performance training on AWS Trainium-powered Amazon EC2 Trn1 instances. By using Neuron, you can use your existing workflows in popular frameworks, such as TensorFlow and PyTorch, and train optimally on EC2 Trn1 instances with minimal code changes. Neuron comes preconfigured in AWS Deep Learning AMIs (DLAMI) and AWS Deep Learning Containers, making it easy to get started with Trn1 instances.
AWS Inferentia is an AWS-designed ML inference accelerator that delivers high-performance, low-cost ML inference in the cloud. Amazon EC2 Inf1 instances that are based on AWS Inferentia accelerators deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances.