Amazon EC2 Trn1 Instances
High-performance, cost-effective training of generative AI models
Amazon Elastic Compute Cloud (EC2) Trn1 instances, powered by AWS Trainium accelerators, are purpose built for high-performance deep learning (DL) training of generative AI models, including large language models (LLMs) and latent diffusion models. Trn1 instances offer up to 50% cost-to-train savings over other comparable Amazon EC2 instances. You can use Trn1 instances to train 100B+ parameter DL and generative AI models across a broad set of applications, such as text summarization, code generation, question answering, image and video generation, recommendation, and fraud detection.
The AWS Neuron SDK helps developers train models on AWS Trainium and deploy models on AWS Inferentia accelerators. It integrates natively with frameworks such as PyTorch and TensorFlow, so that you can continue using your existing code and workflows to train models on Trn1 instances. To learn about the current Neuron support for machine learning (ML) frameworks and libraries, model architectures, and hardware optimizations, see the Neuron documentation.
Trn1n instances are now available
Trn1n instances double the network bandwidth compared to Trn1 instances, to 1600 Gbps of second-generation Elastic Fabric Adapter (EFAv2) bandwidth. The increased bandwidth delivers up to 20% faster time-to-train relative to Trn1 for training network-intensive generative AI models, such as large language models (LLMs) and mixture-of-experts (MoE) models.
Reduce training times for 100B+ parameter models
Trn1 instances are purpose built for high-performance DL and reduce training times from months to weeks, or even days. With reduced training times, you can iterate faster, build more innovative models, and increase productivity. Trn1n instances deliver up to 20% faster time-to-train than Trn1 instances for models that benefit from increased network bandwidth.
Lower your fine-tuning and pre-training costs
Trn1 instances deliver high performance while offering up to 50% cost-to-train savings over other comparable Amazon EC2 instances.
Use your existing ML frameworks and libraries
Use the AWS Neuron SDK to extract the full performance of Trn1 instances. With Neuron, you can use popular ML frameworks like PyTorch and TensorFlow and continue to use your existing code and workflows to train models on Trn1 instances. To quickly get started with Trn1 instances, see popular model examples in the Neuron documentation.
Scale up to 6 exaflops with EC2 UltraClusters
Trn1 instances support up to 800 Gbps of second-generation Elastic Fabric Adapter (EFAv2) network bandwidth. Trn1n instances support up to 1600 Gbps of EFAv2 network bandwidth to deliver even higher performance for network-intensive models. Both instances are deployed in EC2 UltraClusters that enable scaling up to 30,000 Trainium accelerators, which are interconnected with a nonblocking petabit-scale network to provide 6 exaflops of compute performance.
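The cluster-scale figure above follows from the per-instance numbers. A minimal sketch of the arithmetic, assuming (as stated elsewhere on this page) that a 16-accelerator Trn1 instance delivers 3 petaflops of FP16/BF16 compute:

```python
# Back-of-envelope check of the UltraCluster compute figure.
# Assumptions from this page: 16 Trainium accelerators per trn1.32xlarge
# instance, delivering 3 petaflops of FP16/BF16 compute in aggregate.
PFLOPS_PER_INSTANCE = 3.0
ACCELERATORS_PER_INSTANCE = 16
CLUSTER_ACCELERATORS = 30_000

pflops_per_accelerator = PFLOPS_PER_INSTANCE / ACCELERATORS_PER_INSTANCE
cluster_pflops = CLUSTER_ACCELERATORS * pflops_per_accelerator
cluster_exaflops = cluster_pflops / 1_000

print(f"{pflops_per_accelerator * 1000:.1f} TFLOPS per accelerator")
print(f"~{cluster_exaflops:.2f} exaflops across the cluster")
```

This works out to roughly 5.6 exaflops, consistent with the "up to 6 exaflops" figure quoted above.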
How it works
Up to 3 petaflops with AWS Trainium
Trn1 instances are powered by up to 16 AWS Trainium accelerators purpose built to accelerate DL training and deliver up to 3 petaflops of FP16/BF16 compute power. Each accelerator includes two second-generation NeuronCores.
Up to 512 GB high-bandwidth accelerator memory
To support efficient data and model parallelism, each Trn1 instance has 512 GB of shared accelerator memory (HBM) with 9.8 TB/s of total memory bandwidth.
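For a sense of what 512 GB of accelerator memory holds, here is a rough sizing sketch. The 2-bytes-per-parameter and ~16-bytes-per-parameter figures are common rules of thumb (BF16 weights alone, versus weights plus FP32 Adam optimizer state), not numbers from this page:

```python
# Rough sizing: parameters that fit in 512 GB of shared accelerator memory.
# Assumption: BF16 weights at 2 bytes per parameter, counting weights only
# (activations and optimizer state add a large multiple in practice).
HBM_BYTES = 512 * 10**9
BYTES_PER_PARAM_BF16 = 2
max_params_weights_only = HBM_BYTES // BYTES_PER_PARAM_BF16  # 256B params

# A common rule of thumb for training with Adam in FP32 (master weights
# plus two moment tensors) is ~16 bytes per parameter overall:
max_params_with_adam = HBM_BYTES // 16  # 32B params

print(f"Weights only (BF16): ~{max_params_weights_only / 1e9:.0f}B parameters")
print(f"With Adam state:     ~{max_params_with_adam / 1e9:.0f}B parameters")
```

Models beyond these per-instance limits are what the data- and model-parallelism support mentioned above is for.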
High-performance networking and storage
To support training of network-intensive models, such as Mixture of Experts (MoE) and Generative Pre-Trained Transformers (GPT), each Trn1n instance delivers up to 1600 Gbps of EFAv2 networking bandwidth. Each Trn1 instance supports up to 800 Gbps of EFAv2 bandwidth. EFAv2 speeds up distributed training by delivering up to 50% improvement in collective communications performance over first-generation EFA. These instances also support up to 80 Gbps of Amazon Elastic Block Store (EBS) bandwidth and up to 8 TB of local NVMe solid state drive (SSD) storage for fast workload access to large datasets.
For fast connectivity between accelerators and streamlined collective communications, Trn1 instances support up to 768 GB/s of NeuronLink, a high-speed, nonblocking interconnect.
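To see why inter-node bandwidth dominates for network-intensive models, consider an idealized gradient all-reduce. This is a hedged back-of-envelope estimate, not a measured figure; the 2x payload factor is the standard ring all-reduce cost, and all software and latency overheads are ignored:

```python
# Idealized time for one full gradient all-reduce of a 100B-parameter
# BF16 model over 800 Gbps of EFAv2 bandwidth (Trn1). A ring all-reduce
# moves roughly 2x the payload across the slowest link.
params = 100e9
grad_bytes = params * 2               # BF16 gradients: 200 GB
bandwidth_bytes_per_s = 800e9 / 8     # 800 Gbps = 100 GB/s

ideal_allreduce_s = 2 * grad_bytes / bandwidth_bytes_per_s
print(f"~{ideal_allreduce_s:.0f} s per full gradient all-reduce (ideal)")
```

Doubling the bandwidth (Trn1n's 1600 Gbps) halves this ideal communication time, which is why network-bound models see the time-to-train gains described above.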
Optimized for novel data types
To deliver high performance while meeting accuracy goals, Trn1 instances are optimized for FP32, TF32, BF16, FP16, UINT8, and the new configurable FP8 (cFP8) data type.
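As an illustration of one of the listed data types: BF16 keeps FP32's 8 exponent bits but only 7 mantissa bits, so it is effectively FP32 with the low 16 bits dropped. The sketch below demonstrates that truncation in portable Python; it is an illustration of the format, not Trainium's hardware rounding, and cFP8's Trainium-specific layout is not sketched here:

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate a float32 value to bfloat16 precision (round toward zero)."""
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer,
    # zero the low 16 bits (the mantissa bits BF16 discards), and convert back.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

print(to_bf16(3.14159265))  # loses the low mantissa bits: 3.140625
print(to_bf16(1.0))         # exactly representable: 1.0
```

Because BF16 preserves FP32's exponent range, converting between the two rarely overflows or underflows, which is a large part of why it is the default training precision on many accelerators.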
State-of-the-art DL optimizations
To support the fast pace of DL innovation and generative AI, Trn1 instances include several features that make them flexible and extensible for training constantly evolving DL models. They have hardware optimizations and software support for dynamic input shapes. To allow support for operators that don't yet exist, they support custom operators written in C++. They also support stochastic rounding, a method of rounding probabilistically that achieves both high performance and higher accuracy compared to legacy rounding modes.
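The idea behind stochastic rounding can be shown in a few lines. This is a toy software sketch on a fixed-point grid, purely to illustrate the unbiasedness property (Trainium implements the technique in hardware for its floating-point formats):

```python
import random

def stochastic_round(x: float, step: float = 1.0) -> float:
    """Round x to a multiple of `step`, rounding up with probability equal
    to the fractional distance, so the result is unbiased in expectation."""
    lower = (x // step) * step
    frac = (x - lower) / step
    return lower + step if random.random() < frac else lower

# Why it matters: a small update of 0.1 always rounds to 0 under
# round-to-nearest on an integer grid, silently losing the update.
# Stochastically rounded, 10,000 such updates average back to ~0.1.
random.seed(0)
mean = sum(stochastic_round(0.1) for _ in range(10_000)) / 10_000
print(mean)  # close to 0.1
```

In low-precision training, this prevents many small gradient updates from being rounded away to zero, which is the accuracy benefit the paragraph above refers to.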
“At HeliXon, we build next-generation AI solutions to protein-based therapeutics. We aim to develop AI tools that empower scientists to decipher protein function and interaction, interrogate large-scale genomic datasets for target identification, and design therapeutics such as antibodies and cell therapies. Today, we use training distribution libraries like FSDP to parallelize model training over many GPU-based servers, but this still takes us weeks to train a single model. We are excited to utilize Amazon EC2 Trn1 instances, featuring the highest networking bandwidth (800 Gbps) available in AWS to improve the performance of our distributed training jobs and reduce our model training times, while also reducing our training costs.”
Jian Peng, CEO, HeliXon
Money Forward, Inc. serves businesses and individuals with an open and fair financial platform.
“We launched a large-scale AI chatbot service on Amazon EC2 Inf1 instances and reduced our inference latency by 97% over comparable GPU-based instances while also reducing costs. As we keep fine-tuning tailored NLP models periodically, reducing model training times and costs is also important. Based on our experience from the successful migration of our inference workload to Inf1 instances and our initial work on AWS Trainium-based EC2 Trn1 instances, we expect Trn1 instances will provide additional value in improving end-to-end ML performance and cost.”
Takuya Nakade, CTO, Money Forward, Inc.
Magic is an integrated product and research company developing AI that feels like a colleague to make the world more productive.
“Training large autoregressive Transformer-based models is an essential component of our work. AWS Trainium-powered Trn1 instances are designed specifically for these workloads, offering near infinite scalability, fast inter-node networking, and advanced support for 16- and 8-bit data types. Trn1 instances will help us train large models faster, at a lower cost. We are particularly excited about the native support for BF16 stochastic rounding in Trainium, increasing performance while numerical accuracy is indistinguishable from full precision.”
Eric Steinberger, Cofounder and CEO, Magic
CACTUS has a suite of products and solutions for researchers and organizations that improve how research gets funded, published, communicated, and discovered.
“At Cactus Labs, we harness the power of AI, with research focused on natural language processing, ranking and recommendation, conversational AI, large language models, computer vision, AR/VR, and XAI. In line with our quest to enable faster training of machine learning models as well as enable our researchers to run more experiments while managing the infrastructure cost, we were delighted to evaluate AWS Trainium. AWS Trainium’s out-of-the-box features like XLA optimization, multi-worker data-parallel training, and graph caching are really useful for reducing our training times and help us run more experiments faster and cheaper.”
Nishchay Shah, CTO and Head of Emerging Products, Cactus Communications
Watashiha offers an innovative and interactive AI chatbot service, “OGIRI AI,” which incorporates humor to provide a funny answer on the spot for a question.
“We use large language models to incorporate humor and offer a more relevant and conversational experience to our customers on our AI services. This requires us to pre-train and fine-tune these models frequently. We pre-trained a GPT-based Japanese model on the EC2 trn1.32xlarge instance, leveraging tensor and data parallelism. The training was completed within 28 days at a 33% cost reduction over our previous GPU-based infrastructure. As our models rapidly continue to grow in complexity, we are looking forward to Trn1n instances, which have double the network bandwidth of Trn1, to speed up training of larger models.”
Yohei Kobashi, CTO, Watashiha, K.K.
"At PyTorch, we accelerate taking machine learning from research prototyping to production ready for customers. We have collaborated extensively with the AWS team to provide native PyTorch support for the new AWS Trainium powered Amazon EC2 Trn1 instances that are purpose built for training deep learning models. Developers building PyTorch models can start training on Trn1 instances with minimal code changes. Additionally, we have worked with the OpenXLA community to enable PyTorch Distributed libraries for easy model migration from GPU-based instances to Trn1 instances. We are excited about the innovation that Trn1 instances bring to the PyTorch community, including more efficient data types, dynamic shapes, custom operators, hardware-optimized stochastic rounding, and eager debug mode. All of this makes Trn1 well suited for wide adoption by PyTorch developers, and we look forward to future joint contributions to PyTorch to further optimize training performance."
Geeta Chauhan, Applied AI, Engineering Manager, PyTorch
Amazon services using Trn1 instances
Amazon’s product search engine indexes billions of products, serves billions of customer queries daily, and is one of the most heavily used services in the world.
“We are training large language models (LLMs) that are multi-modal (text + image), multilingual, multi-locale, pre-trained on multiple tasks, and span multiple entities (products, queries, brands, reviews, etc.) to improve the customer shopping experience. Trn1 instances provide a more sustainable way to train LLMs by delivering the best performance per watt compared to other accelerated machine-learning solutions, offering us high performance at the lowest cost. We plan to explore the new configurable FP8 data type and hardware-accelerated stochastic rounding to further increase our training efficiency and development velocity.”
Trishul Chilimbi, VP, Amazon Search
Using Amazon SageMaker
You can easily train models on Trn1 instances by using Amazon SageMaker, which significantly reduces the time and cost to train and tune ML models without the need to manage infrastructure. With SageMaker, you can use built-in tools to manage and track training experiments, automatically choose optimal hyperparameters, debug training jobs, and monitor the use of system resources.
Using the AWS Deep Learning AMIs
Using AWS Deep Learning Containers
Price per Hour
|Instance Size|Trainium Accelerators|Accelerator Memory (GB)|vCPUs|Instance Memory (GiB)|Local NVMe Storage (TB)|Network Bandwidth (Gbps)|EFA and RDMA Support|EBS Bandwidth (Gbps)|On-Demand Price/Hr|1-Yr Reserved Effective Hourly|3-Yr Reserved Effective Hourly|
|---|---|---|---|---|---|---|---|---|---|---|---|
|trn1.2xlarge|1|32|8|32|0.5|Up to 12.5|No|Up to 20|$1.34|$0.79|$0.4744|
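Reading the trn1.2xlarge row, the reserved-instance rates translate into the following effective savings versus On-Demand. A small sketch of the arithmetic, using only the prices listed above (USD per hour):

```python
# Effective savings of reserved-instance pricing versus On-Demand,
# from the trn1.2xlarge row above.
on_demand, ri_1yr, ri_3yr = 1.34, 0.79, 0.4744

savings_1yr = 1 - ri_1yr / on_demand
savings_3yr = 1 - ri_3yr / on_demand

print(f"1-yr reserved: {savings_1yr:.0%} below On-Demand")  # ~41%
print(f"3-yr reserved: {savings_3yr:.0%} below On-Demand")  # ~65%
```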