AWS Inferentia
AWS Inferentia accelerators are designed by AWS to deliver high performance at the lowest cost for your deep learning (DL) inference applications.
The first-generation AWS Inferentia accelerator powers Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances, which deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances. Many customers, including Airbnb, Snap, Sprinklr, Money Forward, and Amazon Alexa, have adopted Inf1 instances and realized their performance and cost benefits.
The AWS Inferentia2 accelerator delivers a major leap in performance and capabilities over first-generation AWS Inferentia. Inferentia2 delivers up to 4x higher throughput and up to 10x lower latency compared to Inferentia. Inferentia2-based Amazon EC2 Inf2 instances are designed to deliver high performance at the lowest cost in Amazon EC2 for your DL inference and generative artificial intelligence (AI) applications. They are optimized to deploy increasingly complex models, such as large language models (LLMs) and vision transformers, at scale. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators. You can now efficiently and cost-effectively deploy models with hundreds of billions of parameters across multiple accelerators on Inf2 instances.
AWS Neuron is the SDK that helps developers deploy models on both AWS Inferentia accelerators and run their inference applications for natural language processing (NLP) and understanding, language translation, text summarization, video and image generation, speech recognition, personalization, fraud detection, and more. It integrates natively with popular machine learning (ML) frameworks, such as PyTorch and TensorFlow, so that you can continue to use your existing code and workflows and run on Inferentia accelerators.
Benefits
High performance and throughput
Each first-generation Inferentia accelerator has four first-generation NeuronCores, with up to 16 Inferentia accelerators per EC2 Inf1 instance. Each Inferentia2 accelerator has two second-generation NeuronCores, with up to 12 Inferentia2 accelerators per EC2 Inf2 instance. Inferentia2 offers up to 4x higher throughput and 3x higher compute performance than Inferentia. Each Inferentia2 accelerator supports up to 190 tera floating-point operations per second (TFLOPS) of FP16 performance.
Low latency with high-bandwidth memory
The first-generation Inferentia has 8 GB of DDR4 memory per accelerator and also features a large amount of on-chip memory. Inferentia2 offers 32 GB of HBM per accelerator, increasing the total memory by 4x and memory bandwidth by 10x over Inferentia.
Native support for ML frameworks
The AWS Neuron SDK integrates natively with popular ML frameworks such as PyTorch and TensorFlow. With AWS Neuron, you can use these frameworks to optimally deploy DL models on both AWS Inferentia accelerators with minimal code changes and without being tied to vendor-specific solutions.
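As an illustration of that minimal-code-change workflow, the sketch below compiles a PyTorch BERT model for NeuronCores using the torch-neuronx package (the Inf2/Trn1 flavor of the Neuron SDK). The model name, sequence length, and file names are placeholders; consult the Neuron documentation for the exact API in your SDK release.

    # Minimal sketch, assuming torch-neuronx and transformers are installed on an Inf2 (or Trn1) instance.
    import torch
    import torch_neuronx
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    # torchscript=True makes the model return plain tuples, which keeps tracing simple.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", torchscript=True
    ).eval()

    # Trace with a fixed-shape example input; this is the step where Neuron compiles the model.
    encoded = tokenizer(
        "Neuron keeps the familiar PyTorch workflow.",
        padding="max_length", max_length=128, return_tensors="pt",
    )
    example_inputs = (encoded["input_ids"], encoded["attention_mask"])
    neuron_model = torch_neuronx.trace(model, example_inputs)

    # The compiled model behaves like a TorchScript module: save, reload, and call it as usual.
    torch.jit.save(neuron_model, "bert_neuron.pt")
    restored = torch.jit.load("bert_neuron.pt")
    with torch.no_grad():
        logits = restored(*example_inputs)[0]

Only the tracing call is Neuron-specific; saving, reloading, and invoking the compiled module follow the standard PyTorch workflow.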
Wide range of data types with automatic casting
The first-generation Inferentia supports FP16, BF16, and INT8 data types. Inferentia2 adds support for FP32, TF32, and the new configurable FP8 (cFP8) data type to give developers more flexibility to optimize performance and accuracy. AWS Neuron takes high-precision FP32 models and automatically casts them to lower-precision data types while optimizing accuracy and performance. Autocasting reduces time to market by removing the need for lower-precision retraining.
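As a rough illustration of how that casting is typically requested, the sketch below passes compiler arguments through torch-neuronx when tracing a plain FP32 model. The --auto-cast and --auto-cast-type flags mirror neuronx-cc compiler options, but treat the exact flag names and accepted values as assumptions to verify against the Neuron documentation for your release.

    # Sketch: asking the Neuron compiler to downcast FP32 matrix multiplies to BF16.
    # The compiler_args values below are assumptions based on neuronx-cc options.
    import torch
    import torch_neuronx

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    ).eval()
    example_inputs = (torch.rand(8, 1024),)  # FP32 weights and inputs, no retraining required

    # Cast matrix-multiply operands to BF16 while leaving the rest of the graph at higher precision.
    neuron_model = torch_neuronx.trace(
        model,
        example_inputs,
        compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
    )

    with torch.no_grad():
        output = neuron_model(*example_inputs)

Re-running the same trace with different auto-cast settings is a quick way to compare the accuracy and throughput trade-offs that autocasting exposes.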
State-of-the-art DL capabilities
Inferentia2 adds hardware optimizations for dynamic input sizes and custom operators written in C++. It also supports stochastic rounding, a way of rounding probabilistically that enables high performance and higher accuracy compared to legacy rounding modes.
Built for sustainability
Inf2 instances offer up to 50% better performance per watt over comparable Amazon EC2 instances because they and the underlying Inferentia2 accelerators are purpose built to run DL models at scale. Inf2 instances help you meet your sustainability goals when deploying ultra-large models.
AWS Neuron SDK
AWS Neuron is the SDK that helps developers deploy models on both AWS Inferentia accelerators and train them on AWS Trainium accelerators. It integrates natively with popular ML frameworks, such as PyTorch and TensorFlow, so you can continue to use your existing workflows and run on Inferentia accelerators with only a few lines of code.
AWS Trainium
AWS Trainium is an AWS-designed DL training accelerator that delivers high-performance, cost-effective DL training on AWS. Amazon EC2 Trn1 instances, powered by AWS Trainium, deliver the highest performance on DL training of popular NLP models on AWS. Trn1 instances offer up to 50% cost-to-train savings over comparable Amazon EC2 instances.
Customer testimonials

Qualtrics designs and develops experience management software.
"At Qualtrics, our focus is building technology that closes experience gaps for customers, employees, brands, and products. To achieve that, we are developing complex multi-task, multi-modal DL models to launch new features, such as text classification, sequence tagging, discourse analysis, key-phrase extraction, topic extraction, clustering, and end-to-end conversation understanding. As we utilize these more complex models in more applications, the volume of unstructured data grows, and we need more performant inference-optimized solutions that can meet these demands, such as Inf2 instances, to deliver the best experiences to our customers. We are excited about the new Inf2 instances, because it will not only allow us to achieve higher throughputs, while dramatically cutting latency, but also introduces features like distributed inference and enhanced dynamic input shape support, which will help us scale to meet the deployment needs as we push towards larger, more complex large models."
Aaron Colak, Head of Core Machine Learning, Qualtrics

Finch Computing is a natural language technology company providing artificial intelligence applications for government, financial services, and data integrator clients.
"To meet our customers’ needs for real-time NLP, we develop state-of-the-art DL models that scale to large production workloads. We have to provide low-latency transactions and achieve high throughputs to process global data feeds. We already migrated many production workloads to Inf1 instances and achieved an 80% reduction in cost over GPUs. Now, we are developing larger, more complex models that enable deeper, more insightful meaning from written text. A lot of our customers need access to these insights in real time, and the performance on Inf2 instances will help us deliver lower latency and higher throughput over Inf1 instances. With the Inf2 performance improvements and new Inf2 features, such as support for dynamic input sizes, we are improving our cost-efficiency, elevating the real-time customer experience, and helping our customers glean new insights from their data."
Franz Weckesser, Chief Architect, Finch Computing

"We alert on many types of events all over the world in many languages, in different formats (images, video, audio, text sensors, combinations of all these types) from hundreds of thousands of sources. Optimizing for speed and cost given that scale is absolutely critical for our business. With AWS Inferentia, we have lowered model latency and achieved up to 9x better throughput per dollar. This has allowed us to increase model accuracy and grow our platform's capabilities by deploying more sophisticated DL models and processing 5x more data volume while keeping our costs under control."
Alex Jaimes, Chief Scientist and Senior Vice President of AI, Dataminr

Founded in 2008, San Francisco–based Airbnb is a community marketplace with over 4 million hosts who have welcomed more than 900 million guest arrivals in almost every country across the globe.
"Airbnb’s Community Support Platform enables intelligent, scalable, and exceptional service experiences to our community of millions of guests and hosts around the world. We are constantly looking for ways to improve the performance of our NLP models that our support chatbot applications use. With Amazon EC2 Inf1 instances powered by AWS Inferentia, we see a 2x improvement in throughput out of the box over GPU-based instances for our PyTorch-based BERT models. We look forward to leveraging Inf1 instances for other models and use cases in the future."
Bo Zeng, Engineering Manager, Airbnb

"We incorporate ML into many aspects of Snapchat, and exploring innovation in this field is a key priority. Once we heard about Inferentia, we started collaborating with AWS to adopt Inf1/Inferentia instances to help us with ML deployment, including around performance and cost. We started with our recommendation models and look forward to adopting more models with the Inf1 instances in the future."
Nima Khajehnouri, VP Engineering, Snap Inc.

"Sprinklr's AI-driven unified customer experience management (Unified-CXM) platform enables companies to gather and translate real-time customer feedback across multiple channels into actionable insights—resulting in proactive issue resolution, enhanced product development, improved content marketing, better customer service, and more. Using Amazon EC2 Inf1, we were able to significantly improve the performance of one of our NLP models and improve the performance of one of our computer vision models. We're looking forward to continuing to use Amazon EC2 Inf1 to better serve our global customers."
Vasant Srinivasan, Senior Vice President of Product Engineering, Sprinklr

"Autodesk is advancing the cognitive technology of our AI-powered virtual assistant, Autodesk Virtual Agent (AVA), by using Inferentia. AVA answers over 100,000 customer questions per month by applying natural language understanding (NLU) and DL techniques to extract the context, intent, and meaning behind inquiries. Piloting Inferentia, we are able to obtain a 4.9x higher throughput over G4dn for our NLU models, and look forward to running more workloads on the Inferentia-based Inf1 instances."
Binghui Ouyang, Sr. Data Scientist, Autodesk
Amazon services using AWS Inferentia

Amazon Advertising helps businesses of all sizes connect with customers at every stage of their shopping journey. Millions of ads, including text and images, are moderated, classified, and served for the optimal customer experience every single day.
“For our text ad processing, we deploy PyTorch-based BERT models globally on AWS Inferentia-based Inf1 instances. By moving to Inferentia from GPUs, we were able to lower our cost by 69% with comparable performance. Compiling and testing our models for AWS Inferentia took less than three weeks. Using Amazon SageMaker to deploy our models to Inf1 instances ensured our deployment was scalable and easy to manage. When I first analyzed the compiled models, the performance with AWS Inferentia was so impressive that I actually had to re-run the benchmarks to make sure they were correct! Going forward, we plan to migrate our image ad processing models to Inferentia. We have already benchmarked 30% lower latency and 71% cost savings over comparable GPU-based instances for these models."
Yashal Kanungo, Applied Scientist, Amazon Advertising

“Amazon Alexa’s AI- and ML-based intelligence, powered by AWS, is available on more than 100 million devices today—and our promise to customers is that Alexa is always becoming smarter, more conversational, more proactive, and even more delightful. Delivering on that promise requires continuous improvements in response times and ML infrastructure costs, which is why we are excited to use Amazon EC2 Inf1 to lower inference latency and cost per inference on Alexa text-to-speech. With Amazon EC2 Inf1, we’ll be able to make the service even better for the tens of millions of customers who use Alexa each month."
Tom Taylor, Senior Vice President, Amazon Alexa
"We are constantly innovating to further improve our customer experience and to drive down our infrastructure costs. Moving our web-based question answering (WBQA) workloads from GPU-based P3 instances to AWS Inferentia-based Inf1 instances not only helped us reduce inference costs by 60%, but also improved the end-to-end latency by more than 40%, helping enhance customer Q&A experience with Alexa. Using Amazon SageMaker for our TensorFlow-based model made the process of switching to Inf1 instances straightforward and easy to manage. We are now using Inf1 instances globally to run these WBQA workloads and are optimizing their performance for AWS Inferentia to further reduce cost and latency."
Eric Lind, Software Development Engineer, Alexa AI

“Amazon Prime Video uses computer vision ML models to analyze the video quality of live events to ensure an optimal viewer experience for Prime Video members. We deployed our image classification ML models on EC2 Inf1 instances and were able to see a 4x improvement in performance and up to 40% savings in cost. We are now looking to leverage these cost savings to innovate and build advanced models that can detect more complex defects, such as synchronization gaps between audio and video files, to deliver a more enhanced viewing experience for Prime Video members."
Victor Antonino, Solutions Architect, Amazon Prime Video

“Amazon Rekognition is a simple and easy image and video analysis application that helps customers identify objects, people, text, and activities. Amazon Rekognition needs high-performance DL infrastructure that can analyze billions of images and videos daily for our customers. With AWS Inferentia-based Inf1 instances, running Amazon Rekognition models such as object classification resulted in 8x lower latency and 2x the throughput compared to running these models on GPUs. Based on these results, we are moving Amazon Rekognition to Inf1, enabling our customers to get accurate results faster."
Rajneesh Singh, Director, SW Engineering, Amazon Rekognition and Video