AWS Trainium
Purpose-built AI accelerator to deliver the best economics for AI workloads at scale
Why Trainium?
AWS Trainium is a purpose-built AI accelerator designed for one goal: the best economics for AI workloads at scale. Today’s AI builders need to train and serve models at the speed and cost their users demand. Behind every foundation model, every agent, and every real-time AI experience is infrastructure that determines both. The system that delivers the best training and token economics wins, and delivering it means solving for speed, cost, scale, fault tolerance, and developer velocity together. Trainium sits at the center of a fully integrated system of chip, server, network, software, and services, purpose-built at every layer and co-designed to work as one.
Benefits
Every dollar saved in training is reinvested in the next iteration. Every token served cheaper means more users, more interactions, more value. Trainium delivers the best cost-per-token at production scale — because every layer of the system was designed to minimize waste.
Trainium is purpose-built, and so is every layer around it: server, network, software, and services. Graviton serves as the host CPU, Nitro secures the system, Elastic Fabric Adapter (EFA) scales the network, the Neuron SDK makes the hardware accessible, Amazon EKS orchestrates, and SageMaker HyperPod manages large-scale AI computing. Every layer is designed with the others in mind: a straight line from transistor to token.
Models are getting bigger. Training them faster demands more compute than any single chip can deliver. Trainium scales from a single chip to hundreds of thousands of chips. The data center is the new AI accelerator.
PyTorch, vLLM, HuggingFace, Ray — they work without modification. Two lines of code to get started. NKI for bare-metal kernel access when you need it. The tools you already use, on purpose-built silicon. No porting, no friction.
At scale, failure is a constant, not an exception. Redundant NeuronLink lanes, hot-swap maintenance, and automatic failure detection within seconds let the system route around failures without stopping the job. That is the difference between losing minutes and losing days.
Features
Trainium3 contains eight large cores, four specialized engines each — Tensor, Vector, Scalar, GPSIMD — running simultaneously. Up to 2x FP4 and FP8 compute throughput compared to Trainium2. Tensor Dereferencing enables native Mixture of Experts routing in hardware. Optimized for the mathematical primitives that underlie all AI workloads.
144 GB HBM3e per chip, 4.9 TB/s bandwidth — 70% more than Trainium2. Hardware-accelerated W4A8 quantization doubles effective weight-loading rate with zero software overhead. Two-level on-chip hierarchy keeps data close to compute. The memory system never becomes the bottleneck.
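To see why halving weight precision matters, here is a back-of-the-envelope sketch of how long one full pass over a model's weights takes at the 4.9 TB/s quoted above. It assumes the doubling is measured against 8-bit weights, uses a hypothetical 70-billion-parameter model, and ignores activations, KV cache, and any real-world overhead. The function name is illustrative only.

```python
# Per-chip HBM bandwidth quoted in the text, in terabytes per second.
HBM_BANDWIDTH_TBPS = 4.9


def weight_stream_time_ms(num_params, bits_per_weight,
                          bandwidth_tbps=HBM_BANDWIDTH_TBPS):
    """Ideal time to read every weight once from HBM, in milliseconds.

    Assumes perfectly efficient streaming; real workloads will be slower.
    """
    total_bytes = num_params * bits_per_weight / 8
    seconds = total_bytes / (bandwidth_tbps * 1e12)
    return seconds * 1e3


params = 70e9  # hypothetical model size, not taken from the text

t8 = weight_stream_time_ms(params, bits_per_weight=8)
t4 = weight_stream_time_ms(params, bits_per_weight=4)

print(f"8-bit weights: {t8:.2f} ms per full pass")
print(f"4-bit weights: {t4:.2f} ms per full pass")
print(f"effective weight-loading speedup: {t8 / t4:.1f}x")
```

Under these assumptions the bandwidth stays fixed while the bytes per weight halve, which is where the 2x effective weight-loading rate comes from.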
Dozens of communication cores physically separate from compute. Zero contention between compute and communication. On-chip traffic prioritization ensures time-sensitive data moves first. Eliminates the straggler effect — chips that aren’t waiting are chips that are computing.
Purpose-built networking that scales Trainium to hundreds of thousands of chips in a single non-blocking, petabit-scale network. Automatic traffic steering and instant rerouting on failure. Clusters reconfigure dynamically — failures are detected in seconds and the system routes around them without stopping the job.
Trainium3 UltraServers scale up to 144 Trainium chips, delivering up to 362 MXFP8 PFLOPs, 20.7 TB of HBM3e, and 706 TB/s of aggregate memory bandwidth. NeuronSwitch provides an all-to-all fabric that doubles interchip interconnect bandwidth over Trainium2 UltraServers. Available in UltraClusters 3.0 to scale to hundreds of thousands of chips.
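The aggregate UltraServer figures follow from the per-chip numbers. The sketch below is a quick sanity check, assuming decimal units (1 TB = 1,000 GB) and the per-chip memory figures quoted earlier on this page. The per-chip MXFP8 value is simply the stated total divided by 144, so treat it as an implied figure rather than a published one.

```python
# Figures quoted in the text.
CHIPS_PER_ULTRASERVER = 144
HBM_GB_PER_CHIP = 144
HBM_TBPS_PER_CHIP = 4.9
ULTRASERVER_MXFP8_PFLOPS = 362

# Aggregate memory capacity, assuming 1 TB = 1,000 GB.
total_hbm_tb = CHIPS_PER_ULTRASERVER * HBM_GB_PER_CHIP / 1000

# Aggregate memory bandwidth across all chips.
total_bandwidth_tbps = CHIPS_PER_ULTRASERVER * HBM_TBPS_PER_CHIP

# Implied per-chip compute (derived by division, not stated in the text).
implied_pflops_per_chip = ULTRASERVER_MXFP8_PFLOPS / CHIPS_PER_ULTRASERVER

print(f"Total HBM3e:         {total_hbm_tb:.1f} TB")
print(f"Aggregate bandwidth: {total_bandwidth_tbps:.0f} TB/s")
print(f"Implied MXFP8/chip:  {implied_pflops_per_chip:.2f} PFLOPs")
```

The first two results round to the 20.7 TB and 706 TB/s quoted for the UltraServer.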
Native PyTorch — one line to switch. vLLM, HuggingFace, TorchTitan, TRL integration. NKI for instruction-level kernel development. Neuron Explorer for AI-powered profiling. Ray, Slurm, Amazon EKS for orchestration. Software forward-compatible across chip generations.
Neuron SDK: Built for How You Work
The Neuron SDK meets developers where they are — no rewrites, no workarounds, no friction.
Deploy models to production without becoming a hardware expert. vLLM, HuggingFace Transformers, TorchTitan, and TRL run natively on Trainium — no custom code, no porting effort. Ray, Amazon EKS, and AWS Batch handle orchestration. Your existing stack works out of the box with better economics from day one.
Switch from GPU to Trainium by changing one line of code: .to('neuron'). Native PyTorch eager mode, PyTorch Lightning, FSDP, and TorchTitan support rapid experimentation and debugging. Your ideas flow directly to silicon, with no fighting the infrastructure and no workflow changes, just faster iteration on better models.
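Here is a minimal sketch of what that one-line switch could look like in practice. The 'neuron' device string comes from the text above. Whether it is available depends on the Neuron SDK's PyTorch integration being installed, so the hypothetical helper below probes for it and falls back to CPU. The names resolve_device and place are illustrative and not part of any SDK.

```python
def resolve_device(preferred="neuron", fallback="cpu"):
    """Return `preferred` if PyTorch can allocate a tensor on it, else `fallback`.

    PyTorch is imported lazily so the helper also degrades gracefully on
    machines where it is not installed.
    """
    try:
        import torch
        torch.empty(1, device=preferred)  # raises if the device is unknown
    except Exception:
        return fallback
    return preferred


def place(model, device=None):
    """Move any object exposing a PyTorch-style .to() onto the chosen device."""
    return model.to(device or resolve_device())


if __name__ == "__main__":
    # On a Trainium instance with the Neuron SDK set up this is expected to
    # report 'neuron'; elsewhere it reports the CPU fallback.
    print("selected device:", resolve_device())
```

With a real model, the change from a GPU script would be the single call the text describes, for example place(model) in place of model.to('cuda').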
Get instruction-set-level access with the Neuron Kernel Interface (NKI). NKI.isa delivers direct hardware control; NKI.lang provides NumPy-like semantics for rapid kernel development. The open-source Neuron Kernel Library offers production-ready optimized kernels. Neuron Explorer profiles execution from source code to hardware — with AI-powered recommendations to pinpoint bottlenecks instantly.
Deploy and manage infrastructure with the tools you already know — Ray, Slurm, Amazon EKS, Amazon ECS, and SageMaker HyperPod. Neuron Monitor delivers real-time health and utilization metrics. Hot-swap capability and redundant NeuronLink lanes mean zero-downtime maintenance. Deterministic compilation ensures reproducible deployments across environments.
Customers
Customers such as Anthropic, Databricks, Decart, OpenAI, Ricoh, SplashMusic, Uber, and others are realizing the performance and cost benefits of Trainium instances and UltraServers.
Early adopters of Trainium3 are achieving new levels of efficiency and scalability for the next generation of large-scale generative AI models.