Inference SDK - AWS Neuron

AWS Neuron is an SDK with a compiler, runtime, and profiling tools for high-performance, cost-effective deep learning (DL) acceleration. It supports high-performance training on AWS Trainium-based Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances. For model deployment, it supports high-performance, low-latency inference on AWS Inferentia-based Amazon EC2 Inf1 instances and AWS Inferentia2-based Amazon EC2 Inf2 instances. With Neuron, you can use popular frameworks, such as TensorFlow and PyTorch, to train and deploy machine learning (ML) models on Amazon EC2 Trn1, Inf1, and Inf2 instances. Neuron is designed to minimize both code changes and tie-in to vendor-specific solutions.

Benefits

Build with native support for ML frameworks and libraries

The AWS Neuron SDK, which supports Inferentia and Trainium accelerators, is natively integrated with PyTorch and TensorFlow. With this integration, you can keep your existing workflows in these frameworks and get started with only a few lines of code changes. For distributed model training, the Neuron SDK supports libraries such as Megatron-LM and PyTorch Fully Sharded Data Parallel (FSDP).

Optimize performance for training and inference

The AWS Neuron SDK enables efficient programming and runtime access to the Trainium and Inferentia accelerators. It supports a wide range of data types, new rounding modes, control flow, and custom operators to help you choose the optimal configuration for your DL workloads. For distributed training, Neuron enables efficient use of Trn1 UltraClusters with tightly coupled support for collective compute operations over Elastic Fabric Adapter (EFA) networking.

Get enhanced debugging and monitoring

Neuron offers just-in-time (JIT) compilation to speed up developer workflows. It provides debugging and profiling tools, supported by a TensorBoard plugin. Neuron supports eager debug mode, which you can use to step through the code and evaluate operators one by one. The Neuron helper tools guide you toward best practices for model onboarding and performance optimization. Neuron also includes tools that provide performance and utilization insights.

Integrate easily with other AWS services

AWS Deep Learning AMIs and AWS Deep Learning Containers come preconfigured with AWS Neuron. If you are using containerized applications, you can deploy Neuron by using Amazon Elastic Container Service (ECS), Amazon Elastic Kubernetes Service (EKS), or your preferred native container engine. Neuron also supports Amazon SageMaker, which data scientists and developers can use to build, train, and deploy ML models.

Features

Smart partitioning

To increase overall performance, AWS Neuron automatically optimizes neural-net compute to run compute-intensive tasks on Trainium and Inferentia accelerators and other tasks on the CPU.
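As a rough illustration of the partitioning idea described above, the sketch below splits a linear sequence of operators into contiguous segments, sending supported operators to the accelerator and everything else to the CPU. The operator names and the `ACCELERATOR_OPS` set are hypothetical and chosen only for the example; this is not the Neuron compiler's actual algorithm or support list.

```python
# Hypothetical set of operators the accelerator can run. The real list
# depends on the Neuron compiler and is NOT reproduced here.
ACCELERATOR_OPS = {"matmul", "conv2d", "relu", "softmax", "layernorm"}


def partition(ops):
    """Group a linear op sequence into contiguous (device, [ops]) segments."""
    segments = []
    for op in ops:
        device = "accelerator" if op in ACCELERATOR_OPS else "cpu"
        if segments and segments[-1][0] == device:
            # Same device as the previous op: extend the current segment.
            segments[-1][1].append(op)
        else:
            # Device changed: start a new segment.
            segments.append((device, [op]))
    return segments


# Illustrative graph: pre- and post-processing fall back to the CPU.
graph = ["decode_jpeg", "conv2d", "relu", "matmul", "softmax", "top_k"]
for device, ops in partition(graph):
    print(f"{device:11s} {ops}")
```

Keeping segments contiguous matters in practice because every switch between devices implies a data transfer, so fewer, larger accelerator segments are generally preferable.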

Wide range of ML data types

AWS Neuron supports FP32, TF32, BF16, FP16, INT8, and the new configurable FP8. Using the right data types for your workloads helps you optimize performance while meeting accuracy goals.
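To make the range-versus-precision trade-off concrete, the following sketch derives the machine epsilon and largest finite value of the floating-point formats listed above from their bit layouts. It assumes the standard IEEE-style layouts (exponent bits, explicit mantissa bits) for FP32, TF32, BF16, and FP16; INT8 and the configurable FP8 are left out because their parameters are not given here. The `FloatFormat` class is a hypothetical helper, not part of the Neuron SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    name: str
    exponent_bits: int
    mantissa_bits: int  # explicit (stored) mantissa bits

    @property
    def epsilon(self) -> float:
        """Spacing between 1.0 and the next representable value."""
        return 2.0 ** -self.mantissa_bits

    @property
    def max_finite(self) -> float:
        """Largest finite value, assuming the all-ones exponent is reserved."""
        max_exponent = 2 ** (self.exponent_bits - 1) - 1
        return (2.0 - self.epsilon) * 2.0 ** max_exponent


FORMATS = [
    FloatFormat("FP32", 8, 23),
    FloatFormat("TF32", 8, 10),
    FloatFormat("BF16", 8, 7),
    FloatFormat("FP16", 5, 10),
]

for fmt in FORMATS:
    print(f"{fmt.name}: epsilon={fmt.epsilon:.3e}, max={fmt.max_finite:.3e}")
```

The output shows why BF16 is attractive for training: it keeps the FP32 exponent range while giving up mantissa precision, whereas FP16 keeps more precision but has a much smaller range.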

FP32 autocasting

AWS Neuron takes high-precision FP32 models and automatically casts them to lower-precision data types, while optimizing accuracy and performance. Autocasting reduces time to market by removing the need for lower-precision retraining.
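The core of a cast from FP32 to a lower-precision type can be sketched in a few lines. The example below rounds FP32 values to BF16 with round-to-nearest-even by manipulating the bit pattern directly. It is a software illustration of the numeric effect only, not how Neuron performs autocasting; the function names are hypothetical, and NaN inputs are deliberately not handled.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to FP32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(bits: int) -> float:
    """Interpret a 32-bit pattern as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bf16_nearest(x: float) -> float:
    """Round x to BF16 (round-to-nearest-even) and return it as a float.

    BF16 keeps the top 16 bits of the FP32 pattern. NaN is not handled.
    """
    bits = fp32_bits(x)
    # Adding 0x7FFF plus the lowest kept bit implements ties-to-even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return bits_to_fp32((bits + rounding_bias) & 0xFFFF0000)


# Illustrative "weights": compare each value with its BF16 approximation.
for w in [0.1, 1.0, 3.14159, 1e-3]:
    w32 = bits_to_fp32(fp32_bits(w))
    w16 = to_bf16_nearest(w)
    print(f"fp32={w32!r} bf16={w16!r} rel_err={abs(w16 - w32) / abs(w32):.2e}")
```

For finite, normal inputs the relative error of this cast is bounded by 2^-8, which is the kind of accuracy loss an autocasting scheme has to weigh against the performance gain.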

Native support for stochastic rounding

AWS Neuron enables hardware-accelerated stochastic rounding. Stochastic rounding enables training at BF16 speeds, with near-FP32 accuracy when autocasting from FP32 to BF16.
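Why stochastic rounding helps can be seen in a small software experiment. The sketch below repeatedly adds a small increment to a BF16 accumulator: with round-to-nearest the increment is always rounded away, while stochastic rounding rounds up with probability proportional to the discarded fraction, so the sum is correct on average. This is a pure-Python illustration with hypothetical function names, not a description of the hardware implementation, and it does not handle NaN.

```python
import random
import struct


def fp32_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_fp32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bf16_nearest(x: float) -> float:
    """Round-to-nearest-even cast from FP32 to BF16."""
    bits = fp32_bits(x)
    bias = 0x7FFF + ((bits >> 16) & 1)
    return bits_to_fp32((bits + bias) & 0xFFFF0000)


def to_bf16_stochastic(x: float, rng: random.Random) -> float:
    """Stochastic cast: round away from zero with probability low_bits / 2**16."""
    bits = fp32_bits(x)
    # A uniform 16-bit offset carries into the kept bits with exactly
    # that probability; the low 16 bits are then truncated.
    return bits_to_fp32((bits + rng.getrandbits(16)) & 0xFFFF0000)


def accumulate(round_fn, steps: int, increment: float) -> float:
    """Add `increment` to 1.0 `steps` times, rounding to BF16 after each add."""
    total = 1.0
    for _ in range(steps):
        total = round_fn(total + increment)
    return total


rng = random.Random(0)  # fixed seed so the run is repeatable
nearest = accumulate(to_bf16_nearest, 1000, 0.001)
stochastic = accumulate(lambda v: to_bf16_stochastic(v, rng), 1000, 0.001)
print(f"exact sum: 2.0, nearest: {nearest}, stochastic: {stochastic}")
```

The round-to-nearest accumulator never leaves 1.0 because the increment is smaller than half a BF16 step at that magnitude, which mirrors how small gradient updates can vanish in low-precision training.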

NeuronCore pipeline

NeuronCore pipeline enables high-throughput model parallelism for latency-sensitive applications such as natural language processing. The pipeline does this by sharding a compute graph across multiple NeuronCores, caching the model parameters in each core’s on-chip memory. It then streams training and inference workloads across the cores in a pipelined manner.
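The two steps described above, sharding the graph across cores and then streaming inputs through them, can be sketched as follows. This is an idealized model: layers are split into contiguous, near-equal stages, and every stage is assumed to take one time step. The function names and the layer names are hypothetical and do not correspond to any Neuron API.

```python
def shard_layers(layers, num_cores):
    """Split `layers` into `num_cores` contiguous stages of near-equal size."""
    base, extra = divmod(len(layers), num_cores)
    stages, start = [], 0
    for core in range(num_cores):
        size = base + (1 if core < extra else 0)
        stages.append(layers[start:start + size])
        start += size
    return stages


def pipeline_schedule(num_stages, num_inputs):
    """For each time step, list the (stage, input_index) pairs that are busy.

    Assumes every stage takes exactly one time step (an idealization).
    """
    schedule = []
    for t in range(num_stages + num_inputs - 1):
        active = [(s, t - s) for s in range(num_stages) if 0 <= t - s < num_inputs]
        schedule.append(active)
    return schedule


layers = [f"layer{i}" for i in range(10)]  # hypothetical 10-layer model
stages = shard_layers(layers, num_cores=4)
schedule = pipeline_schedule(num_stages=len(stages), num_inputs=8)

for core, stage in enumerate(stages):
    print(f"core {core}: {stage}")
print(f"pipelined steps: {len(schedule)}, unpipelined: {len(stages) * 8}")
```

In this model, once the pipeline is full every core works on a different input at the same time, so total time grows as stages + inputs - 1 rather than stages × inputs.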

Collective communication operations

AWS Neuron supports running various collective communication and compute operations in parallel on dedicated hardware. Doing so delivers lower latency and higher overall performance on distributed workloads.
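For readers unfamiliar with collective operations, the sketch below simulates one common collective, all-reduce, using the well-known ring algorithm: a reduce-scatter phase followed by an all-gather phase. It is shown purely to illustrate what such an operation computes; the text above does not say which algorithms Neuron uses, and the function name is hypothetical. For simplicity, each worker holds exactly one value per chunk.

```python
def ring_all_reduce(data):
    """Simulate a ring all-reduce (sum) over n workers.

    `data[i]` is worker i's vector and must have length n (one element
    per chunk). Returns each worker's vector after the operation.
    """
    n = len(data)
    if any(len(row) != n for row in data):
        raise ValueError("each worker needs exactly n elements")
    buf = [list(row) for row in data]

    # Phase 1, reduce-scatter: in step s, worker i sends chunk (i - s) % n
    # to its right neighbor, which adds it to its own copy.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, buf[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            buf[(src + 1) % n][chunk] += value

    # Worker i now holds the fully reduced chunk (i + 1) % n.
    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, buf[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            buf[(src + 1) % n][chunk] = value
    return buf


gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]  # illustrative values
for worker, row in enumerate(ring_all_reduce(gradients)):
    print(f"worker {worker}: {row}")
```

Each worker sends and receives only one chunk per step, so the per-worker traffic stays roughly constant as workers are added, which is why ring-style collectives are popular for gradient synchronization.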

Custom operators

AWS Neuron supports custom operators. You can write new custom operators in C++, and Neuron runs them on the inline single instruction, multiple data (SIMD) cores of Trainium and Inferentia2.

Eager debug mode

AWS Neuron supports eager debug mode, which you can use to easily step through the code and evaluate operators one by one.

How it works

[Diagram: using AWS Neuron to build your model, train and test the model, and then deploy it on any hardware platform.]

AWS machine learning accelerators

AWS Trainium accelerators

AWS Trainium is an ML training accelerator that AWS purpose built for high-performance, low-cost DL training. Each AWS Trainium accelerator has two second-generation NeuronCores and supports the FP32, TF32, BF16, FP16, and INT8 data types, as well as configurable FP8 (cFP8), which you can use to achieve the right balance between range and precision. To support efficient data and model parallelism, each Trainium accelerator has 32 GB of high-bandwidth memory, delivers up to 210 TFLOPS of FP16/BF16 compute power, and features NeuronLink, an intra-instance, ultra-high-speed nonblocking interconnect technology.

AWS Inferentia accelerators

AWS Inferentia and AWS Inferentia2 are machine learning inference accelerators that AWS designed and built to deliver high-performance, low-cost inference. Each AWS Inferentia accelerator has four first-generation NeuronCores and supports FP16, BF16, and INT8 data types. Each AWS Inferentia2 accelerator has two second-generation NeuronCores and further adds support for FP32, TF32, and the new configurable FP8 (cFP8) data types.

Amazon EC2 ML instances

Amazon EC2 Trn1 instances

Amazon EC2 Trn1 instances, powered by the AWS Trainium accelerator, are purpose built for high-performance DL training. They offer up to 50% cost-to-train savings over comparable Amazon EC2 instances. Trn1 instances feature up to 16 AWS Trainium accelerators and support up to 1600 Gbps (Trn1n) of second-generation Elastic Fabric Adapter (EFA) network bandwidth.

Amazon EC2 Inf2 instances

Amazon EC2 Inf2 instances are powered by up to 12 AWS Inferentia2 accelerators and deliver up to 4x higher throughput and up to 10x lower latency compared to Inf1 instances. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators.

Amazon EC2 Inf1 instances

Amazon EC2 Inf1 instances are powered by up to 16 AWS Inferentia accelerators. These instances deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances.

Getting started

Refer to the documentation for tutorials, how-to guides, application notes, and the roadmap.

For further assistance, visit the developers' forum, which is also available through the AWS Management Console.
