

AWS EC2 Trn3 Instances

Purpose-built to deliver the best token economics for next-generation agentic, reasoning, and video generation applications.

Why Amazon EC2 Trn3 UltraServers?

Today’s frontier models are shifting toward trillion-parameter, multimodal architectures with context lengths beyond 1M tokens, and they require the next generation of scale-up, high-performance compute. Amazon EC2 Trn3 UltraServers and the AWS Neuron developer stack are purpose-built for these demands, delivering the performance, cost efficiency, and energy efficiency required to train and serve the next generation of agentic and reasoning systems at scale.

Amazon EC2 Trn3 UltraServers are powered by Trainium3, our fourth-generation AI chip and the first AWS AI chip built on a 3nm process, purpose-built to deliver the best token economics for next-generation agentic, reasoning, and video generation applications.

Trn3 UltraServers deliver up to 4.4x higher performance, 3.9x higher memory bandwidth, and over 4x better performance per watt compared to our Trn2 UltraServers, providing the best price-performance for training and serving frontier-scale models, including reinforcement learning, Mixture-of-Experts (MoE), reasoning, and long-context architectures. Trn3 UltraServers continue the Trainium family’s leadership in price-performance and scalability, helping you train faster and deploy the next generation of foundation models with higher performance and lower cost.

Trn3 UltraServers can scale up to 144 Trainium3 chips (up to 362 FP8 PFLOPs) and are available in EC2 UltraClusters 3.0 to scale to hundreds of thousands of chips. The next-generation Trn3 UltraServer features NeuronSwitch-v1, an all-to-all fabric using NeuronLink-v4 with 2 TB/s of bandwidth per chip.

You can get started easily with native support for PyTorch, JAX, Hugging Face Optimum Neuron, and other libraries, along with full compatibility across Amazon SageMaker, EKS, ECS, AWS Batch, and ParallelCluster.


Benefits

Trn3 UltraServers, powered by AWS Trainium3 chips, deliver up to 4.4x higher performance, 3.9x higher memory bandwidth, and 4x better performance per watt compared to our Trn2 UltraServers. On Amazon Bedrock, Trainium3 is the fastest accelerator, delivering up to 3x faster performance than Trainium2. This remarkable performance uplift also translates into significantly higher throughput for models like GPT-OSS served at scale compared to Trainium2-based instances, while maintaining low per-user latency.

Each Trn3 UltraServer scales up to 144 Trainium3 chips, and the new racks deliver over 2x the chip density compared to Trn2, increasing compute per rack and improving data center efficiency. Trn3 UltraServers are built on the AWS Nitro System and Elastic Fabric Adapter (EFA), and are deployed in non-blocking, multi-petabit scale EC2 UltraClusters 3.0, allowing you to scale to hundreds of thousands of Trainium chips for distributed training and serving.

Continuing Trainium’s legacy of performance leadership, Trn3 instances offer better price-performance than legacy AI accelerators, allowing you to drive down cost per token and cost per experiment. Higher throughput on workloads such as GPT-OSS and other frontier-scale LLMs lowers inference costs and reduces training times for your most demanding models.

AWS Trainium3 chips, our first 3nm AI chips, are optimized to deliver the best token economics for next-generation agentic, reasoning, and video generation applications. Trn3 UltraServers deliver over 4x better energy efficiency than Trn2 UltraServers. In real-world serving on Amazon Bedrock, Trn3 achieves over 5x higher output tokens per megawatt than Trn2 UltraServers while maintaining similar latency per user, helping you meet sustainability objectives without compromising performance.

Trn3 UltraServers are powered by AWS Neuron, the developer stack for AWS Trainium and AWS Inferentia, so you can run existing PyTorch and JAX code without code changes.

Neuron supports popular ML libraries such as vLLM, Hugging Face Optimum Neuron, PyTorch Lightning, and TorchTitan, and integrates with services including Amazon SageMaker, Amazon SageMaker HyperPod, Amazon EKS, Amazon ECS, AWS Batch, and AWS ParallelCluster.
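
As a rough illustration of how existing PyTorch code maps onto Trainium, the sketch below runs a training step on the torch-xla device that Neuron builds on. The model, data, and hyperparameters are placeholders rather than an official AWS sample, and package details may vary by Neuron SDK release.

```python
# Illustrative only: a tiny PyTorch training step on the XLA device that
# AWS Neuron exposes for Trainium. Model and data are placeholders.
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # resolves to a NeuronCore when torch-neuronx is installed

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for step in range(10):
    x = torch.randn(8, 1024, device=device)
    y = torch.randn(8, 1024, device=device)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    xm.mark_step()  # executes the accumulated XLA graph on the device
```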

Features

Each AWS Trainium3 chip delivers 2.52 FP8 PFLOPs of compute, and Trn3 UltraServers scale up to 144 Trainium3 chips, providing up to 362 PFLOPs of total FP8 compute in a single UltraServer. This high-density compute envelope is designed for training and serving frontier-scale transformers, Mixture-of-Experts models, and long-context architectures.

AWS Trainium3 delivers greater memory capacity and higher memory bandwidth than the previous generation, with each chip offering 144 GB of HBM3e and 4.9 TB/s of memory bandwidth. A Trn3 UltraServer delivers up to 20.7 TB of HBM3e and 706 TB/s of aggregate memory bandwidth, enabling larger batch sizes, extended context windows, and higher utilization for ultra-large multimodal, video, and reasoning models.
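
The UltraServer-level figures above follow directly from the per-chip specifications in this section; a quick back-of-the-envelope check, using the numbers quoted on this page and decimal units, looks like this:

```python
# Aggregating the per-chip figures quoted above into UltraServer totals.
CHIPS_PER_ULTRASERVER = 144

FP8_PFLOPS_PER_CHIP = 2.52    # FP8 PFLOPs per Trainium3 chip
HBM_GB_PER_CHIP = 144         # GB of HBM3e per chip
HBM_BW_TBPS_PER_CHIP = 4.9    # TB/s of memory bandwidth per chip

total_pflops = CHIPS_PER_ULTRASERVER * FP8_PFLOPS_PER_CHIP     # ~362.9 FP8 PFLOPs
total_hbm_tb = CHIPS_PER_ULTRASERVER * HBM_GB_PER_CHIP / 1000  # ~20.7 TB of HBM3e
total_bw_tbps = CHIPS_PER_ULTRASERVER * HBM_BW_TBPS_PER_CHIP   # ~705.6 TB/s aggregate

print(f"{total_pflops:.1f} FP8 PFLOPs, {total_hbm_tb:.1f} TB HBM3e, "
      f"{total_bw_tbps:.0f} TB/s aggregate bandwidth")
```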

Trn3 UltraServers introduce NeuronSwitch-v1, an all-to-all fabric that doubles interchip interconnect bandwidth over Trn2 UltraServers, improving model-parallel efficiency and reducing communication overhead for MoE and tensor-parallel training. Trn3 UltraServers support up to 144 chips per UltraServer, over 2x more than Trn2 UltraServers. For large-scale distributed training, we deploy Trn3 UltraServers in EC2 UltraClusters 3.0 with hundreds of thousands of Trainium3 chips in a single non-blocking, petabit-scale network.
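
To see why per-chip interconnect bandwidth matters for model-parallel training, here is a rough estimate using a textbook ring all-reduce cost model. The 2 TB/s per-chip figure comes from this page; the parameter count, data type, and the cost model itself are illustrative assumptions, not measured Neuron results.

```python
# Generic ring all-reduce cost model: time ≈ 2 * (n - 1) / n * bytes / bandwidth.
# Illustrative assumptions only; not a measured Trainium3 result.
def allreduce_seconds(params: float, bytes_per_param: int,
                      n_chips: int, bw_bytes_per_s: float) -> float:
    payload = params * bytes_per_param
    return 2 * (n_chips - 1) / n_chips * payload / bw_bytes_per_s

# Example: 70B parameters in BF16 (2 bytes each) across 144 chips at 2 TB/s per chip.
t = allreduce_seconds(70e9, 2, 144, 2e12)
print(f"~{t * 1000:.0f} ms per full-gradient all-reduce, ignoring latency and compute overlap")
```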

Trainium3 supports FP32, BF16, MXFP8, and MXFP4 precision modes, allowing you to balance accuracy and efficiency across dense and expert-parallel workloads. Built-in collective communication engines accelerate synchronization and reduce training overhead for large transformer, diffusion, and Mixture-of-Experts models, improving end-to-end training throughput at scale.
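
At the framework level, reduced precision is typically exercised through standard autocast-style APIs. The sketch below shows plain PyTorch BF16 autocast on CPU purely as an illustration; how MXFP8 and MXFP4 paths are selected on Trainium3 is a Neuron compiler and runtime concern and is not shown here.

```python
# Plain-PyTorch illustration of BF16 autocast; device_type="cpu" keeps the
# snippet runnable anywhere. MXFP8/MXFP4 selection on Trainium3 is handled
# by the Neuron toolchain and is not shown here.
import torch
import torch.nn as nn

model = nn.Linear(4096, 4096)
x = torch.randn(16, 4096)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)                    # matmul runs in bfloat16 under autocast
    loss = out.float().pow(2).mean()  # accumulate the loss in float32

loss.backward()
print(out.dtype, loss.item())
```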

Trn3 UltraServers are programmed using the AWS Neuron SDK, which provides the compiler, runtime, training and inference libraries, and developer tools for AWS Trainium and AWS Inferentia. The Neuron Kernel Interface (NKI) offers low-level access to the Trainium instruction set, memory, and execution scheduling so performance engineers can build custom kernels and push performance beyond standard frameworks. Neuron Explorer delivers a unified profiling and debugging environment, tracing execution from PyTorch and JAX code down to hardware operations and providing actionable insights for sharding strategies, kernel optimizations, and large-scale distributed runs.
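
For a sense of what an NKI kernel looks like, here is a minimal element-wise add, modeled on the getting-started example in the public NKI documentation; module paths and decorators may differ across Neuron SDK releases.

```python
# Minimal NKI-style kernel (element-wise add), modeled on the public
# Neuron Kernel Interface getting-started example; APIs may vary by release.
from neuronxcc import nki
import neuronxcc.nki.language as nl

@nki.jit
def tensor_add_kernel(a_input, b_input):
    # Allocate the output tensor in device HBM.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)
    # Load input tiles on-chip, add them, and store the result back to HBM.
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)
    nl.store(c_output, value=a_tile + b_tile)
    return c_output
```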
