Announcing Amazon EC2 Trn1 instances for high-performance, cost-effective deep learning training

Posted on: Oct 10, 2022

AWS announces the general availability of Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances. Trn1 instances are powered by AWS Trainium chips, which are purpose-built for high-performance ML training in the cloud. Trn1 instances deliver the highest performance on deep learning (DL) training of popular natural language processing (NLP) models on AWS while offering up to 50% cost savings over comparable GPU-based EC2 instances. You can get started with Trn1 instances by using popular ML frameworks, such as PyTorch and TensorFlow, which helps you lower training costs, reduce training times, iterate faster to build more innovative models, and increase productivity. You can use Trn1 instances to train NLP, computer vision, and recommender models across a broad set of applications, such as speech recognition, recommendation, fraud detection, image and video classification, and forecasting.

Trn1 instances feature up to 16 AWS Trainium chips. Trainium is the second-generation ML chip built by AWS, after AWS Inferentia. Trn1 instances are the first EC2 instances with up to 800 Gbps of Elastic Fabric Adapter (EFA) network bandwidth. For efficient data and model parallelism, each Trn1 instance has up to 512 GB of high-bandwidth memory, delivers up to 3.4 petaflops of FP16/BF16 compute power, and features NeuronLink, an intra-instance, high-bandwidth, nonblocking interconnect. To support large-scale deep learning models, Trn1 instances are deployed in EC2 UltraClusters. You will be able to use the UltraClusters to scale up to 30,000 Trainium accelerators, which are interconnected with a nonblocking petabit-scale network, and get on-demand access to a supercomputer with 6.3 exaflops of compute. Trn1 instances have native support for a wide range of data types, including the new configurable FP8, as well as dynamic input shapes, control flow, C++ custom operators, and stochastic rounding. The AWS Neuron SDK unlocks these advanced features and adds support for just-in-time (JIT) compilation and the eager debug mode. AWS Neuron is integrated with leading ML frameworks and libraries, such as PyTorch, TensorFlow, Megatron-LM, Hugging Face, and PyTorch FSDP, so you can continue using your existing frameworks and run your application with minimal code changes.
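As a rough sanity check on how the quoted figures relate to one another, the short Python sketch below derives per-chip and UltraCluster-level numbers from the instance-level specifications. It assumes the largest instance size (16 chips), an even split of memory and compute across chips, and theoretical peak throughput; the function names are illustrative and not part of any AWS tool.

```python
import math

# Figures quoted in the announcement for a fully populated Trn1 instance.
CHIPS_PER_INSTANCE = 16       # AWS Trainium chips
INSTANCE_PFLOPS = 3.4         # peak FP16/BF16 petaflops
INSTANCE_HBM_GB = 512         # high-bandwidth memory
ULTRACLUSTER_CHIPS = 30_000   # maximum accelerators in an UltraCluster


def per_chip(instance_total, chips=CHIPS_PER_INSTANCE):
    """Split an instance-level figure evenly across chips (assumption)."""
    return instance_total / chips


def cluster_exaflops(chips=ULTRACLUSTER_CHIPS):
    """Aggregate peak compute in exaflops (1 exaflop = 1,000 petaflops)."""
    return chips * per_chip(INSTANCE_PFLOPS) / 1000


def instances_needed(chips=ULTRACLUSTER_CHIPS):
    """Number of 16-chip instances required to reach a given chip count."""
    return math.ceil(chips / CHIPS_PER_INSTANCE)


if __name__ == "__main__":
    print(f"HBM per chip:      {per_chip(INSTANCE_HBM_GB):.0f} GB")
    print(f"Peak per chip:     {per_chip(INSTANCE_PFLOPS) * 1000:.1f} teraflops")
    print(f"Instances needed:  {instances_needed()}")
    print(f"Cluster peak:      {cluster_exaflops():.3f} exaflops")
```

Under these assumptions the aggregate works out to about 6.4 exaflops, in line with the roughly 6.3 exaflops quoted above; the small gap likely reflects rounding in the published per-instance figure.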

Developers can run DL training workloads on Trn1 instances using AWS Deep Learning AMIs, AWS Deep Learning Containers, or managed services such as Amazon Elastic Container Service (Amazon ECS) and AWS ParallelCluster. Support for Amazon Elastic Kubernetes Service (Amazon EKS), Amazon SageMaker, and AWS Batch is coming soon.

Amazon EC2 Trn1 instances are available in two sizes: trn1.2xlarge, for experimenting with a single accelerator and training small models cost-effectively, and trn1.32xlarge, for training large-scale models. They are available as On-Demand Instances, Reserved Instances, and Spot Instances, or as part of a Savings Plan, in the following AWS Regions: US East (N. Virginia) and US West (Oregon).

To learn more about Trn1 instances, see Amazon EC2 Trn1 instances.
