Posted On: Oct 10, 2022
AWS Neuron adds support for AWS Trainium-powered Amazon EC2 Trn1 instances to unlock high-performance, cost-effective deep learning training at scale. The Neuron SDK includes a compiler, runtime libraries, and profiling tools that integrate with popular ML frameworks such as PyTorch and TensorFlow. With this first release of Neuron 2.x, developers can now run deep learning training workloads on Trn1 instances and reduce training costs by up to 50% over comparable GPU-based EC2 instances, while getting the highest training performance in the AWS cloud for popular NLP models.
Neuron adds support for training deep learning models, starting with language models, to be followed by additional model families including vision models (as outlined in the Neuron roadmap). Among language models, this release of Neuron supports Transformer encoder/autoencoder and Transformer decoder/autoregressive model architectures such as GPT. To help speed up developer workflows and provide better insight into training workloads, Neuron now supports seamless Just-In-Time compilation, step-by-step execution with Eager Debug mode, and tools that provide performance and utilization insights.
To help developers capitalize on Trainium innovations and maximize their performance and cost benefits, Neuron unlocks various hardware optimizations. It supports the FP32, TF32, FP16, and BF16 data types, as well as automatic casting from FP32 to TF32, BF16, and FP16. It also adds support for hardware-accelerated stochastic rounding, which enables training at BF16 speeds with FP32 range of accuracy when auto-casting from FP32 to BF16.
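To illustrate why stochastic rounding matters when casting FP32 to BF16, here is a minimal software sketch in Python. It is not the Trainium hardware implementation and does not use any Neuron API; the helper names (`to_bf16_nearest`, `to_bf16_stochastic`, `accumulate`) are hypothetical and chosen for this example only. It assumes the usual view of BF16 as the top 16 bits of an FP32 value, and compares round-to-nearest against stochastic rounding when many small updates are added to a weight.

```python
import random
import struct


def f32_bits(x):
    """Return the 32-bit pattern of x after conversion to IEEE 754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits):
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bf16_nearest(x):
    """Round to BF16 precision with round-to-nearest-even (low 16 bits cleared)."""
    bits = f32_bits(x)
    lsb = (bits >> 16) & 1  # lowest bit that BF16 keeps, used to break ties
    return bits_f32((bits + 0x7FFF + lsb) & 0xFFFF0000)


def to_bf16_stochastic(x, rng):
    """Round to BF16 precision, rounding away from zero with probability
    proportional to the discarded low 16 bits."""
    bits = f32_bits(x)
    noise = rng.randrange(1 << 16)  # uniform integer in [0, 65535]
    return bits_f32((bits + noise) & 0xFFFF0000)


def accumulate(steps=1000, update=1e-3, seed=0):
    """Add a small update to a weight `steps` times, rounding to BF16 each step."""
    rng = random.Random(seed)
    w_nearest = 1.0
    w_stochastic = 1.0
    for _ in range(steps):
        w_nearest = to_bf16_nearest(w_nearest + update)
        w_stochastic = to_bf16_stochastic(w_stochastic + update, rng)
    return w_nearest, w_stochastic


if __name__ == "__main__":
    nearest, stochastic = accumulate()
    print("round-to-nearest:", nearest)
    print("stochastic      :", stochastic)
```

Near 1.0, adjacent BF16 values are 2^-7 apart, so an update of 0.001 is less than half a step and round-to-nearest discards it every time: the weight never moves. Stochastic rounding occasionally rounds up instead, so the accumulated value tracks the FP32 result (about 2.0 here) on average, at the cost of some noise.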
To support distributed training of large-scale models across accelerators in a Trn1 UltraCluster, Neuron adds support for various collective compute operations and 800 Gbps of EFA networking, the highest networking bandwidth currently offered in the AWS cloud. Neuron also provides support for distributed training libraries such as Megatron-LM in a public GitHub repository.
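The announcement does not list which collective operations are supported. As an illustration of what a collective compute operation does, the sketch below simulates a ring all-reduce, a collective commonly used in data-parallel training to sum gradients across workers. It is a pure-Python, single-process simulation under that assumption, not Neuron or EFA code, and the function name `ring_all_reduce` is hypothetical.

```python
def ring_all_reduce(worker_grads):
    """Simulate a ring all-reduce (sum) over equal-length gradient lists.

    worker_grads[r] is the local gradient of worker r. Returns one list per
    worker; after the call every worker holds the element-wise sum.
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    data = [list(g) for g in worker_grads]  # copy so inputs are not modified

    def chunk(c):
        # Indices belonging to chunk c (strided split into n chunks).
        return range(c % n, length, n)

    # Phase 1, reduce-scatter: in each step, worker r sends one chunk to its
    # right neighbour, which adds it to its own copy. After n - 1 steps,
    # worker r holds the complete sum for chunk (r + 1) % n.
    for step in range(n - 1):
        sent = [[data[r][i] for i in chunk(r - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, value in zip(chunk(r - step), sent[r]):
                data[dst][i] += value

    # Phase 2, all-gather: the completed chunks travel around the ring,
    # overwriting the partial sums held by the other workers.
    for step in range(n - 1):
        sent = [[data[r][i] for i in chunk(r + 1 - step)] for r in range(n)]
        for r in range(n):
            dst = (r + 1) % n
            for i, value in zip(chunk(r + 1 - step), sent[r]):
                data[dst][i] = value

    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
    for rank, result in enumerate(ring_all_reduce(grads)):
        print(rank, result)
```

Each worker sends and receives only one chunk per step, which is why this pattern benefits from high link bandwidth between neighbours. Dividing the result by the number of workers gives the averaged gradient used in data-parallel training.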
Developers can run DL training workloads on Trn1 instances using AWS Deep Learning AMIs, AWS Deep Learning Containers, or managed services such as Amazon Elastic Container Service (Amazon ECS) and AWS ParallelCluster, with support for Amazon Elastic Kubernetes Service (Amazon EKS), Amazon SageMaker, and AWS Batch coming soon. To help developers get started, this release provides examples for pre-training and fine-tuning of Hugging Face BERT-large, and for pre-training of the Megatron-LM GPT3 (6.7B) model.
Trn1 instances are available in the following AWS Regions as On-Demand Instances, Reserved Instances, and Spot Instances, or as part of a Savings Plan: US East (N. Virginia) and US West (Oregon). To get started on Trn1 instances, please refer to the Neuron documentation. For a full list of features, enhancements, and changes in this release, please view the release notes. For insight into upcoming capabilities, please see the Neuron roadmap.