Posted On: Apr 13, 2023
Today, AWS announces the general availability of Amazon Elastic Compute Cloud (Amazon EC2) Trn1n instances, which are powered by AWS Trainium accelerators. Building on the capabilities of Trainium-powered Trn1 instances, Trn1n instances double the network bandwidth to 1600 Gbps using second-generation Elastic Fabric Adapter (EFAv2). With this increased bandwidth, Trn1n instances deliver up to 20% faster time-to-train for network-intensive generative AI models such as large language models (LLMs) and mixture of experts (MoE) models. Like Trn1 instances, Trn1n instances offer up to 50% savings on training costs over comparable Amazon EC2 instances.
To support large-scale deep learning (DL) models, Trn1n instances are deployed in EC2 UltraClusters with high-speed EFAv2 networking. EFAv2 speeds up distributed training by delivering up to 50% better collective communications performance than first-generation EFA. You can use UltraClusters to scale up to 30,000 Trainium accelerators and get on-demand access to a supercomputer with 6.3 exaflops of compute performance.
Like Trn1, each Trn1n instance has up to 512 GB of high-bandwidth memory, delivers up to 3.4 petaflops of FP16/BF16 compute power, and features NeuronLink, an intra-instance, high-bandwidth, nonblocking interconnect. The AWS Neuron SDK integrates natively with popular machine learning (ML) frameworks, such as PyTorch and TensorFlow, so you can continue using your existing frameworks and application code to train DL models on Trn1n. Developers can run DL training workloads on Trn1n instances using AWS Deep Learning AMIs, AWS Deep Learning Containers, or managed services such as Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), AWS ParallelCluster, Amazon SageMaker, and AWS Batch.
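The per-instance and UltraCluster figures quoted above can be cross-checked with some quick arithmetic. The Python sketch below is illustrative only. It assumes 16 Trainium accelerators per instance (the largest Trn1n size, a figure not stated in this announcement) and treats the "up to" numbers as exact. The function and constant names are made up for this example.

```python
# Figures quoted in the announcement ("up to" values, treated as exact here).
NETWORK_GBPS = 1600               # EFAv2 network bandwidth per Trn1n instance
HBM_GB_PER_INSTANCE = 512         # high-bandwidth memory per instance
PFLOPS_PER_INSTANCE = 3.4         # FP16/BF16 compute per instance
ULTRACLUSTER_ACCELERATORS = 30_000

# Assumption, not stated in the announcement: 16 accelerators per instance.
ACCELERATORS_PER_INSTANCE = 16


def per_accelerator(instance_total: float, accelerators: int) -> float:
    """Split a per-instance figure evenly across its accelerators."""
    return instance_total / accelerators


def cluster_exaflops(accelerator_count: int, tflops_each: float) -> float:
    """Aggregate peak compute in exaflops (1 exaflop = 1,000,000 teraflops)."""
    return accelerator_count * tflops_each / 1_000_000


def gbps_to_gigabytes_per_second(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


tflops_each = per_accelerator(PFLOPS_PER_INSTANCE, ACCELERATORS_PER_INSTANCE) * 1000
hbm_each_gb = per_accelerator(HBM_GB_PER_INSTANCE, ACCELERATORS_PER_INSTANCE)
cluster_ef = cluster_exaflops(ULTRACLUSTER_ACCELERATORS, tflops_each)
network_gbytes = gbps_to_gigabytes_per_second(NETWORK_GBPS)
trn1_network_gbps = NETWORK_GBPS / 2  # Trn1n "doubles" the Trn1 bandwidth

if __name__ == "__main__":
    print(f"Compute per accelerator: {tflops_each:.1f} TFLOPS")
    print(f"HBM per accelerator:     {hbm_each_gb:.0f} GB")
    print(f"UltraCluster aggregate:  {cluster_ef:.2f} exaflops")
    print(f"Network per instance:    {network_gbytes:.0f} GB/s")
    print(f"Implied Trn1 bandwidth:  {trn1_network_gbps:.0f} Gbps")
```

Under that assumption, the sketch gives roughly 212.5 teraflops and 32 GB of memory per accelerator, and about 6.4 exaflops for 30,000 accelerators. That is in line with the quoted 6.3 exaflops; the small gap is likely down to rounding in the published figures.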