AWS Neuron is an SDK with a compiler, runtime, and profiling tools that delivers high-performance, cost-effective deep learning (DL) acceleration. It supports high-performance training on AWS Trainium-based Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances. For model deployment, it supports high-performance and low-latency inference on AWS Inferentia-based Amazon EC2 Inf1 instances and AWS Inferentia2-based Amazon EC2 Inf2 instances. With Neuron, you can use popular frameworks, such as TensorFlow and PyTorch, and optimally train and deploy machine learning (ML) models on Amazon EC2 Trn1, Inf1, and Inf2 instances with minimal code changes and without lock-in to vendor-specific solutions.
Build with native support for ML frameworks and libraries
The AWS Neuron SDK, which supports the Inferentia and Trainium accelerators, is natively integrated with PyTorch and TensorFlow. This integration ensures that you can continue using your existing workflows in these popular frameworks and get started with only a few lines of code changes. For distributed model training, the Neuron SDK supports libraries such as Megatron-LM and PyTorch Fully Sharded Data Parallel (FSDP).
Optimize performance for training and inference
The AWS Neuron SDK enables efficient programming and runtime access to the Trainium and Inferentia accelerators. It supports a wide range of data types, new rounding modes, control flow, and custom operators to help you choose the optimal configuration for your DL workloads. For distributed training, Neuron enables efficient use of Trn1 UltraClusters with tightly coupled support for collective compute operations over Elastic Fabric Adapter (EFA) networking.
Get enhanced debugging and monitoring
Neuron offers just-in-time (JIT) compilation to speed up developer workflows. Its debugging and profiling tools integrate with TensorBoard through a dedicated plugin. Neuron supports eager debug mode, which you can use to step through code and evaluate operators one by one. You can use the Neuron helper tools to follow best practices for model onboarding and performance optimization. Neuron also includes tools that provide performance and utilization insights.
Integrate easily with other AWS services
AWS Deep Learning AMIs and AWS Deep Learning Containers come preconfigured with AWS Neuron. If you are using containerized applications, you can deploy Neuron by using Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), or your preferred native container engine. Neuron also supports Amazon SageMaker, which data scientists and developers can use to build, train, and deploy ML models.
Smart partitioning
To increase overall performance, AWS Neuron automatically partitions neural network compute, running compute-intensive operations on the Trainium and Inferentia accelerators and other operations on the CPU.
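The partitioning idea can be sketched in plain Python. This is an illustrative sketch only, not Neuron's actual compiler logic; the op names and the supported-op set are hypothetical:

```python
# Compiler-style partitioning sketch: ops the accelerator supports are
# grouped into contiguous accelerator subgraphs; everything else falls
# back to the CPU. SUPPORTED and the op names are hypothetical.
SUPPORTED = {"matmul", "conv2d", "relu", "softmax"}

def partition(graph):
    """Split a linear list of ops into (device, ops) segments."""
    segments = []
    for op in graph:
        device = "accelerator" if op in SUPPORTED else "cpu"
        if segments and segments[-1][0] == device:
            segments[-1][1].append(op)
        else:
            segments.append((device, [op]))
    return segments

graph = ["conv2d", "relu", "custom_postprocess", "matmul", "softmax"]
print(partition(graph))
# [('accelerator', ['conv2d', 'relu']), ('cpu', ['custom_postprocess']),
#  ('accelerator', ['matmul', 'softmax'])]
```

Grouping adjacent same-device ops into one segment matters in practice: each device switch implies a data transfer, so fewer, larger segments keep more work on the accelerator between transfers.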
Wide range of ML data types
AWS Neuron supports FP32, TF32, BF16, FP16, INT8, and the new configurable FP8. Using the right data types for your workloads helps you optimize performance while meeting accuracy goals.
AWS Neuron takes high-precision FP32 models and automatically casts them to lower-precision data types, while optimizing accuracy and performance. Autocasting reduces time to market by removing the need for lower-precision retraining.
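To see what casting FP32 down to a 16-bit type means numerically, the sketch below truncates a float to bfloat16 precision (the top 16 bits of its float32 representation). This is an illustration of the format, not Neuron's actual casting path, which applies proper rounding:

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate a float to bfloat16 precision by keeping the top 16 bits
    of its float32 representation (sign, 8 exponent bits, 7 mantissa
    bits). Round-to-nearest is omitted here for brevity."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

x = 3.14159265
print(to_bf16(x))  # 3.140625: same dynamic range as FP32, fewer mantissa bits
```

Because BF16 keeps FP32's 8-bit exponent, values rarely overflow or underflow when autocasting; only mantissa precision is reduced, which is why accuracy often stays near FP32 levels.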
Native support for stochastic rounding
AWS Neuron enables hardware-accelerated stochastic rounding. Stochastic rounding enables training at BF16 speeds, with near-FP32 accuracy when autocasting from FP32 to BF16.
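The value of stochastic rounding is that the *expected* result equals the true value, so tiny updates are not systematically lost the way they are with round-to-nearest. The sketch below illustrates the idea in pure Python with a coarse quantization step standing in for a low-precision format; it is not the hardware implementation:

```python
import random

def stochastic_round(x: float, step: float = 1.0) -> float:
    """Round x to a multiple of `step`, rounding up with probability
    equal to the fractional remainder. On average the result equals x,
    so repeated small updates accumulate correctly."""
    lo = (x // step) * step
    frac = (x - lo) / step
    return lo + step if random.random() < frac else lo

random.seed(0)
n = 100_000
avg = sum(stochastic_round(0.3) for _ in range(n)) / n
print(avg)  # ~0.3 on average, even though each individual result is 0.0 or 1.0
```

Round-to-nearest would map 0.3 to 0.0 every time, silently discarding the update; stochastic rounding preserves it in expectation, which is what allows BF16-speed training with near-FP32 accuracy.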
NeuronCore pipeline
The NeuronCore pipeline enables high-throughput model parallelism for latency-sensitive applications such as natural language processing. It does this by sharding a compute graph across multiple NeuronCores and caching the model parameters in each core's on-chip memory. It then streams training and inference workloads across the cores in a pipelined manner.
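The sharding and hand-off order can be sketched as follows. This toy version is sequential for clarity; a real pipeline overlaps multiple in-flight inputs across cores, and the layer functions and core count here are illustrative:

```python
# Toy sketch of pipelined model parallelism: the layer list is split
# into contiguous shards, one per "core", and each core keeps its shard
# resident (akin to caching parameters in on-chip memory). Inputs then
# flow through the shards core by core.
def shard(layers, num_cores):
    """Split layers into contiguous shards, one per core."""
    size = -(-len(layers) // num_cores)  # ceiling division
    return [layers[i:i + size] for i in range(0, len(layers), size)]

def run_pipeline(shards, x):
    """Pass an input through each core's shard in order."""
    for shard_layers in shards:
        for layer in shard_layers:
            x = layer(x)
    return x

layers = [lambda x, k=k: x * 2 + k for k in range(4)]  # stand-in layers
shards = shard(layers, num_cores=2)
print(len(shards), run_pipeline(shards, 1))  # 2 27
```

Because each shard's parameters stay resident on its core, no weights move between cores at run time; only the (much smaller) activations are handed from one core to the next, which is what makes the pipeline latency-friendly.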
Collective communication operations
AWS Neuron supports running various collective communication and compute operations in parallel on dedicated hardware. Doing so delivers lower latency and higher overall performance on distributed workloads.
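The benefit of running collectives on dedicated hardware is that communication for one microbatch can overlap with compute for the next. The sketch below simulates that overlap with a background thread standing in for the collective-communication hardware; the gradient math and worker counts are purely illustrative:

```python
import threading

def all_reduce(grads_per_worker):
    """Toy all-reduce: every worker ends up with the sum of all gradients."""
    total = sum(grads_per_worker)
    return [total] * len(grads_per_worker)

def compute_grads(batch, num_workers):
    """Stand-in per-worker gradient computation."""
    return [batch * (w + 1) for w in range(num_workers)]

def train(batches, num_workers=2):
    """Overlap: while batch i+1's gradients are computed, batch i's
    all-reduce runs on a separate thread (the 'dedicated hardware')."""
    reduced, pending = [], None
    for batch in batches:
        grads = compute_grads(batch, num_workers)   # compute step
        if pending is not None:
            pending.join()                          # previous reduce done
        result = []
        pending = threading.Thread(
            target=lambda g=grads, r=result: r.extend(all_reduce(g)))
        pending.start()                             # reduce overlaps next compute
        reduced.append(result)
    pending.join()
    return [r[0] for r in reduced]

print(train([1, 2, 3]))  # [3, 6, 9]: each step's gradients summed across workers
```

Without the overlap, each step would pay compute time plus communication time in sequence; with it, the communication largely hides behind the next step's compute, which is where the lower latency on distributed workloads comes from.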
Custom operators
AWS Neuron supports custom operators. You can write new custom operators in C++, and Neuron runs them on the embedded single instruction, multiple data (SIMD) cores of Trainium and Inferentia2.
Eager debug mode
AWS Neuron supports eager debug mode, which evaluates operators immediately as they are called. You can use it to step through your code and inspect intermediate results one operator at a time.
AWS machine learning accelerators
AWS Trainium accelerators
AWS Trainium is an ML training accelerator that AWS purpose-built for high-performance, low-cost DL training. Each AWS Trainium accelerator has two second-generation NeuronCores and supports FP32, TF32, BF16, FP16, and INT8 data types, as well as configurable FP8 (cFP8), which you can use to achieve the right balance between range and precision. To support efficient data and model parallelism, each Trainium accelerator has 32 GB of high-bandwidth memory, delivers up to 210 TFLOPS of FP16/BF16 compute power, and features NeuronLink, an intra-instance, ultra-high-speed nonblocking interconnect technology.
AWS Inferentia accelerators
AWS Inferentia and AWS Inferentia2 are machine learning inference accelerators that AWS designed and built to deliver high-performance, low-cost inference. Each AWS Inferentia accelerator has four first-generation NeuronCores and supports FP16, BF16, and INT8 data types. Each AWS Inferentia2 accelerator has two second-generation NeuronCores and further adds support for FP32, TF32, and the new configurable FP8 (cFP8) data types.
Amazon EC2 ML instances
Amazon EC2 Trn1 instances
Amazon EC2 Trn1 instances, powered by the AWS Trainium accelerator, are purpose-built for high-performance DL training. They offer up to 50% cost-to-train savings over comparable Amazon EC2 instances. Trn1 instances feature up to 16 AWS Trainium accelerators and support up to 1600 Gbps (Trn1n) of second-generation Elastic Fabric Adapter (EFA) network bandwidth.
Amazon EC2 Inf2 instances
Amazon EC2 Inf2 instances are powered by up to 12 AWS Inferentia2 accelerators and deliver up to 4x higher throughput and up to 10x lower latency compared to Inf1 instances. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators.
Amazon EC2 Inf1 instances
Amazon EC2 Inf1 instances are powered by up to 16 AWS Inferentia accelerators. These instances deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances.