Amazon EC2 Trn1 instances
Best price performance for training deep learning models in the cloud
Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances will deliver the best price performance for training deep learning models in the cloud for use cases such as natural language processing (NLP), computer vision, search, recommendation, and ranking. Trn1 instances are powered by AWS Trainium, the second machine learning (ML) chip built by AWS, purpose-built for high-performance deep learning training.
Trn1 instances support up to 16 AWS Trainium accelerators, up to 800 Gbps of Elastic Fabric Adapter (EFA) networking bandwidth, and 768 GB/s of ultra-high-speed NeuronLink connectivity.
Trn1 instances are deployed in Amazon EC2 UltraClusters consisting of tens of thousands of Trainium accelerators to rapidly train even the most complex deep learning models with trillions of parameters.
Developers can get started quickly on Trn1 instances using the AWS Neuron SDK and easily train models using leading ML frameworks.
Best price performance for model training
Trn1 instances are powered by AWS Trainium accelerators that are purpose built for ML training to deliver the best price performance for training deep learning models in the cloud.
Reduce model training from months to days
Deploy Trn1 instances in EC2 UltraClusters to scale model training to 10,000+ accelerators interconnected with petabit-scale networking for the fastest ML training in Amazon EC2.
Ease of use
You can get started easily with Trn1 instances using the AWS Neuron SDK, which is integrated with leading ML frameworks such as PyTorch and TensorFlow, so you can continue using your existing ML workflows with minimal code changes.
Maximized resource efficiency
Trn1 instances are built on the AWS Nitro System, a combination of dedicated hardware and a lightweight hypervisor that provides a rich collection of flexible building blocks for assembling the compute, storage, memory, and networking resources you need, for better overall performance and security.
AWS Trainium accelerators
Trn1 instances are powered by up to 16 AWS Trainium accelerators with dedicated math engines for deep learning (DL) algorithms, making the accelerators more efficient than general-purpose GPUs for training deep learning models. Each accelerator delivers up to 210 trillion operations per second (TOPS) of compute power, supports 32 GB of high bandwidth memory (HBM2e), and features NeuronLink, an ultra-high-speed, nonblocking intra-instance interconnect delivering 768 GB/s.
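Using the per-accelerator figures above, a rough back-of-envelope for a fully populated 16-accelerator Trn1 instance looks like this (the aggregate numbers are simple derived arithmetic, not published specifications, and actual achievable throughput depends on workload and precision):

```python
# Per-accelerator figures from the section above (AWS Trainium).
TOPS_PER_ACCELERATOR = 210      # trillion operations per second
HBM_GB_PER_ACCELERATOR = 32     # GB of HBM2e
ACCELERATORS_PER_INSTANCE = 16  # maximum accelerators per Trn1 instance

# Derived per-instance aggregates (simple sums, theoretical peak only).
total_tops = TOPS_PER_ACCELERATOR * ACCELERATORS_PER_INSTANCE
total_hbm_gb = HBM_GB_PER_ACCELERATOR * ACCELERATORS_PER_INSTANCE

print(f"Aggregate compute: {total_tops} TOPS")  # 3360 TOPS
print(f"Aggregate HBM:     {total_hbm_gb} GB")  # 512 GB
```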
High-performance networking and storage
Trn1 instances deliver up to 800 Gbps of high-performance networking. They also support Elastic Fabric Adapter (EFA), a custom network interface designed by AWS to improve scaling efficiency and deliver low latencies for faster training. Each Trn1 instance also supports up to 8 TB of local nonvolatile memory express solid-state drive (NVMe SSD) storage for fast workload access to large datasets.
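To put the headline numbers above in perspective, a quick conversion shows the theoretical ceiling of the 800 Gbps network against the 8 TB of local NVMe storage (this is illustrative arithmetic only; real-world throughput is lower due to protocol and encoding overhead):

```python
# Headline figures from the section above.
NETWORK_GBPS = 800   # EFA networking bandwidth, gigabits per second
LOCAL_NVME_TB = 8    # maximum local NVMe SSD storage, terabytes

# 8 bits per byte gives the theoretical byte-rate ceiling.
network_gb_per_s = NETWORK_GBPS / 8                              # 100 GB/s
seconds_to_move_nvme = LOCAL_NVME_TB * 1000 / network_gb_per_s   # 80 s

print(f"Theoretical network throughput: {network_gb_per_s:.0f} GB/s")
print(f"Time to move {LOCAL_NVME_TB} TB at that rate: {seconds_to_move_nvme:.0f} s")
```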
Amazon EC2 UltraClusters
Trn1 instances are deployed in EC2 UltraClusters consisting of tens of thousands of Trainium accelerators interconnected with fully nonblocking, petabit-scale networking. Developers can access petabyte-scale, high-throughput, low-latency storage with Amazon FSx for Lustre.
AWS Neuron SDK
Get started with Amazon EC2 Trn1 instances easily with the AWS Neuron SDK. The Neuron SDK consists of a compiler, framework extensions, a runtime library, and developer tools, natively integrated with ML frameworks such as TensorFlow and PyTorch. You can use distributed training libraries, such as Megatron-LM and DeepSpeed, for efficient distributed model training. The Neuron SDK supports a large number of operators for state-of-the-art natural language processing and computer vision models. Advanced developers can implement custom operators with C++.
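As an illustration of the "minimal code changes" claim, a PyTorch training step on Trainium typically differs from a GPU script mainly in device selection and step marking. The sketch below uses names from the public PyTorch/XLA API that the Neuron PyTorch integration builds on; it assumes the torch-neuronx and torch-xla packages are installed and will only run on a Trn1 (or other XLA-capable) instance:

```python
import torch
import torch_xla.core.xla_model as xm  # provided by torch-xla / torch-neuronx

device = xm.xla_device()  # selects the Trainium (XLA) device instead of "cuda"
model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 1024).to(device)
y = torch.randn(8, 1024).to(device)

# Standard PyTorch training step; only the device and mark_step() differ.
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()
xm.mark_step()  # triggers XLA graph compilation and execution on the accelerator
```

The rest of an existing PyTorch workflow (data loading, loss functions, optimizers) carries over unchanged, which is the point the paragraph above is making.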
Built on the AWS Nitro System
Trn1 instances are built on the AWS Nitro System, which offloads many of the traditional virtualization functions to dedicated hardware and software to deliver high performance, high availability, and high security while reducing virtualization overhead.
"At Anthropic we build reliable, interpretable, and steerable AI systems that will have many opportunities to create value commercially and for public benefit. Our research interests span multiple areas including natural language, human feedback, scaling laws, reinforcement learning, code generation, and interpretability. A major key to our success is access to modern infrastructure that allows us to spin up very large fleets of high performance deep learning accelerators. We are looking forward to using AWS Trainium, as its unprecedented ability to scale to tens of thousands of nodes and higher network bandwidth will enable us to iterate faster while keeping our costs under control."
Tom Brown, Co-founder at Anthropic
"Sprinklr's natural language processing and computer vision ML models analyze different data formats sourced from publicly available social media posts, blog posts, video content, and other content available on public domains across more than 30 channels. Based on the value we have seen from using AWS Inferentia, we are eager to try AWS Trainium to improve time to train and lower training costs for our models. We look forward to developing our models on these high-performance, low-cost training instances."
Vasant Srinivasan, Senior Vice President of Product Engineering at Sprinklr