AWS Inferentia

Get high performance at the lowest cost in Amazon EC2 for deep learning and generative AI inference

Get started with Inferentia accelerators using AWS Neuron

Why Inferentia?

AWS Inferentia accelerators are designed by AWS to deliver high performance at the lowest cost in Amazon EC2 for your deep learning (DL) and generative AI inference applications.

The first-generation AWS Inferentia accelerator powers Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances, which deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances. Many customers, including Finch AI, Sprinklr, Money Forward, and Amazon Alexa, have adopted Inf1 instances and realized their performance and cost benefits.

The AWS Inferentia2 accelerator delivers up to 4x higher throughput and up to 10x lower latency compared to Inferentia. Inferentia2-based Amazon EC2 Inf2 instances are optimized to deploy increasingly complex models, such as large language models (LLMs) and latent diffusion models, at scale. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators. Many customers, including Leonardo.ai, Deutsche Telekom, and Qualtrics, have adopted Inf2 instances for their DL and generative AI applications.

AWS Neuron SDK helps developers deploy models on the AWS Inferentia accelerators (and train them on AWS Trainium accelerators). It integrates natively with popular frameworks, such as PyTorch and TensorFlow, so that you can continue to use your existing code and workflows and run on Inferentia accelerators.

Benefits of Inferentia

Optimized for high throughput and low latency

Each first-generation Inferentia accelerator has four first-generation NeuronCores, and each EC2 Inf1 instance has up to 16 Inferentia accelerators. Each Inferentia2 accelerator has two second-generation NeuronCores, and each EC2 Inf2 instance has up to 12 Inferentia2 accelerators. Each Inferentia2 accelerator supports up to 190 tera floating-point operations per second (TFLOPS) of FP16 performance. The first-generation Inferentia has 8 GB of DDR4 memory per accelerator and also features a large amount of on-chip memory. Inferentia2 offers 32 GB of HBM per accelerator, increasing the total memory by 4x and memory bandwidth by 10x over Inferentia.
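The per-accelerator figures above can be multiplied out to get per-instance upper bounds. The following is a minimal sketch of that arithmetic only; the class and function names are illustrative, and the totals apply to the largest instance sizes (smaller sizes have fewer accelerators).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcceleratorSpec:
    """Per-accelerator figures quoted in the text (names are illustrative)."""
    name: str
    cores_per_accelerator: int
    max_accelerators_per_instance: int
    memory_gb_per_accelerator: int


# Values taken directly from the paragraph above.
INFERENTIA = AcceleratorSpec("Inferentia", 4, 16, 8)
INFERENTIA2 = AcceleratorSpec("Inferentia2", 2, 12, 32)
INFERENTIA2_FP16_TFLOPS = 190  # per accelerator, "up to"


def max_cores(spec: AcceleratorSpec) -> int:
    """Upper bound on NeuronCores in one instance."""
    return spec.cores_per_accelerator * spec.max_accelerators_per_instance


def max_memory_gb(spec: AcceleratorSpec) -> int:
    """Upper bound on accelerator memory in one instance, in GB."""
    return spec.memory_gb_per_accelerator * spec.max_accelerators_per_instance


for spec in (INFERENTIA, INFERENTIA2):
    print(spec.name, max_cores(spec), "NeuronCores,", max_memory_gb(spec), "GB")

# Peak FP16 throughput of a fully populated Inf2 instance (upper bound).
inf2_peak_tflops = (
    INFERENTIA2_FP16_TFLOPS * INFERENTIA2.max_accelerators_per_instance
)
print("Inf2 peak FP16 TFLOPS:", inf2_peak_tflops)
```

Note that the 4x memory increase quoted in the text is a per-accelerator ratio (32 GB versus 8 GB).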

Native support for ML frameworks

AWS Neuron SDK integrates natively with popular ML frameworks such as PyTorch and TensorFlow. With AWS Neuron, you can use these frameworks to optimally deploy DL models on AWS Inferentia accelerators, and Neuron is designed to minimize code changes and tie-in to vendor-specific solutions. Neuron helps you run your inference applications for natural language processing (NLP)/understanding, language translation, text summarization, video and image generation, speech recognition, personalization, fraud detection, and more on Inferentia accelerators.

Wide range of data types with automatic casting

The first-generation Inferentia supports FP16, BF16, and INT8 data types. Inferentia2 adds support for FP32, TF32, and the new configurable FP8 (cFP8) data type to give developers more flexibility to optimize performance and accuracy. AWS Neuron takes high-precision FP32 models and automatically casts them to lower-precision data types while optimizing accuracy and performance. Autocasting reduces time to market by removing the need for lower-precision retraining.
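To make the precision trade-off of casting concrete, here is a minimal sketch of rounding an FP32 value to BF16, one of the lower-precision types listed above. It is not how AWS Neuron implements autocasting; it only illustrates what is lost when the low 16 bits of an FP32 value are rounded away. The function names are illustrative, and NaN and infinity handling is deliberately omitted.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def round_to_bf16(x: float) -> float:
    """Round x to the nearest BF16 value (ties to even), returned as a float.

    BF16 keeps the sign, the 8 exponent bits, and the top 7 mantissa bits
    of FP32, so the cast amounts to rounding away the low 16 bits.
    """
    bits = fp32_bits(x)
    # Round-to-nearest-even: add 0x7FFF plus the lowest bit that is kept.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded & 0xFFFFFFFF))[0]


value = 3.14159
as_bf16 = round_to_bf16(value)
print(f"{value} -> {as_bf16} (relative error {abs(as_bf16 - value) / value:.2e})")
```

BF16 preserves the full FP32 exponent range, so the cost of the cast is mantissa precision (about 2 to 3 significant decimal digits remain) and not dynamic range.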

State-of-the-art DL capabilities

Inferentia2 adds hardware optimizations for dynamic input sizes and custom operators written in C++. It also supports stochastic rounding, a way of rounding probabilistically that enables high performance and higher accuracy compared to legacy rounding modes.
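Stochastic rounding can be sketched in a few lines. The example below is a software illustration under simplifying assumptions (rounding to a fixed grid step, with hypothetical function names), not a description of the Inferentia2 hardware. It shows why probabilistic rounding helps: when many small values are accumulated, round-to-nearest can discard every one of them, while stochastic rounding preserves the sum on average.

```python
import math
import random


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of step.

    Rounds up with probability equal to the fractional distance from the
    lower grid point, so the expected value of the result equals x.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * step


def nearest_round(x: float, step: float) -> float:
    """Conventional round-to-nearest onto the same grid."""
    return round(x / step) * step


rng = random.Random(0)  # fixed seed so the run is repeatable
increment, step, n = 0.3, 1.0, 1000

acc_nearest = 0.0
acc_stochastic = 0.0
for _ in range(n):
    # Round after every addition, as a low-precision accumulator would.
    acc_nearest = nearest_round(acc_nearest + increment, step)
    acc_stochastic = stochastic_round(acc_stochastic + increment, step, rng)

print("exact sum:       ", increment * n)
print("round-to-nearest:", acc_nearest)
print("stochastic:      ", acc_stochastic)
```

With round-to-nearest, each increment of 0.3 is below half a grid step and is lost, so the accumulator never leaves zero; the stochastic accumulator lands close to the exact sum of 300.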

Built for sustainability

Inf2 instances offer up to 50% better performance per watt than comparable Amazon EC2 instances because both the instances and the underlying Inferentia2 accelerators are purpose-built to run DL models at scale. Inf2 instances help you meet your sustainability goals when deploying ultra-large models.

Videos

Behind the scenes look at Generative AI infrastructure at Amazon

Introducing Amazon EC2 Inf2 instances powered by AWS Inferentia2

How four AWS customers reduced ML costs and drove innovation with AWS Inferentia

Resources

Fine-tune and deploy Llama 2 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium

Fine-tune Llama 2 using QLoRA and Deploy it on Amazon SageMaker with AWS Inferentia2

Maximize Stable Diffusion performance and lower inference costs with AWS Inferentia2

Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker

ByteDance saves up to 60% on inference costs while reducing latency and increasing throughput using AWS Inferentia

How Amazon Search reduced ML inference costs by 85% with AWS Inferentia

Additional resources

Use AWS Neuron and get started with AWS Inferentia from within TensorFlow, PyTorch, or MXNet

AWS Neuron feature roadmap

Get started with Inferentia

Start building in the console

Inference Samples/Tutorials (Inf2/Trn1)