AWS AI Chips

AWS Inferentia

Get high performance at the lowest cost in Amazon EC2 for deep learning and generative AI inference

Get started with AWS Inferentia chips using AWS Neuron

Why Inferentia?

AWS Inferentia chips are designed by AWS to deliver high performance at the lowest cost in Amazon EC2 for your deep learning (DL) and generative AI inference applications.

The first-generation AWS Inferentia chip powers Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances, which deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable Amazon EC2 instances. Many customers, including Finch AI, Sprinklr, Money Forward, and Amazon Alexa, have adopted Inf1 instances and realized their performance and cost benefits.

The AWS Inferentia2 chip delivers up to 4x higher throughput and up to 10x lower latency compared to the first-generation Inferentia chip. Inferentia2-based Amazon EC2 Inf2 instances are optimized to deploy increasingly complex models, such as large language models (LLMs) and latent diffusion models, at scale. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between chips. Many customers, including Leonardo.ai, Deutsche Telekom, and Qualtrics, have adopted Inf2 instances for their DL and generative AI applications.

AWS Neuron SDK helps developers deploy models on the AWS Inferentia chips (and train them on AWS Trainium chips). It integrates natively with popular frameworks, such as PyTorch and TensorFlow, so that you can continue to use your existing code and workflows and run on Inferentia chips.

Benefits of AWS Inferentia

Optimized for high throughput and low latency

Each first-generation Inferentia chip has four first-generation NeuronCores, and each EC2 Inf1 instance has up to 16 Inferentia chips. Each Inferentia2 chip has two second-generation NeuronCores, and each EC2 Inf2 instance has up to 12 Inferentia2 chips. Each Inferentia2 chip supports up to 190 tera floating-point operations per second (TFLOPS) of FP16 performance. The first-generation Inferentia has 8 GB of DDR4 memory per chip and also features a large amount of on-chip memory. Inferentia2 offers 32 GB of HBM per chip, increasing the total memory by 4x and memory bandwidth by 10x over Inferentia.
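To put the per-chip figures in instance-level terms, the short sketch below multiplies them by the maximum chip count per instance. The numbers are taken from the paragraph above; the names `ChipSpec` and `instance_totals` are illustrative only, and the resulting FP16 figure is a theoretical peak rather than a measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipSpec:
    """Per-chip figures as quoted in the text (illustrative container)."""
    name: str
    neuron_cores: int            # NeuronCores per chip
    memory_gb: int               # device memory per chip (DDR4 or HBM)
    max_chips_per_instance: int  # largest chip count on one EC2 instance


INFERENTIA = ChipSpec("Inferentia", neuron_cores=4, memory_gb=8,
                      max_chips_per_instance=16)
INFERENTIA2 = ChipSpec("Inferentia2", neuron_cores=2, memory_gb=32,
                       max_chips_per_instance=12)

# Peak FP16 performance per Inferentia2 chip, in TFLOPS (from the text).
INFERENTIA2_FP16_TFLOPS = 190


def instance_totals(chip: ChipSpec) -> dict:
    """Scale per-chip figures up to the largest instance size."""
    n = chip.max_chips_per_instance
    return {
        "neuron_cores": n * chip.neuron_cores,
        "memory_gb": n * chip.memory_gb,
    }


inf1 = instance_totals(INFERENTIA)
inf2 = instance_totals(INFERENTIA2)
inf2_peak_fp16_tflops = (INFERENTIA2.max_chips_per_instance
                         * INFERENTIA2_FP16_TFLOPS)
per_chip_memory_ratio = INFERENTIA2.memory_gb / INFERENTIA.memory_gb

print("Inf1 (max):", inf1)
print("Inf2 (max):", inf2)
print("Inf2 peak FP16 TFLOPS (theoretical):", inf2_peak_fp16_tflops)
print("Per-chip memory ratio, Inferentia2 vs. Inferentia:", per_chip_memory_ratio)
```

The per-chip memory ratio reproduces the 4x figure quoted above. Note that the largest Inf2 instance has fewer NeuronCores than the largest Inf1 instance but three times the total device memory.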

Native support for ML frameworks

AWS Neuron SDK integrates natively with popular ML frameworks such as PyTorch and TensorFlow. With AWS Neuron, you can use these frameworks to deploy DL models on both generations of AWS Inferentia chips. Neuron is designed to minimize code changes and tie-in to vendor-specific solutions. It helps you run inference applications for natural language processing (NLP) and understanding, language translation, text summarization, video and image generation, speech recognition, personalization, fraud detection, and more on Inferentia chips.

Wide range of data types with automatic casting

The first-generation Inferentia supports FP16, BF16, and INT8 data types. Inferentia2 adds support for FP32, TF32, and the new configurable FP8 (cFP8) data type to give developers more flexibility to optimize performance and accuracy. AWS Neuron takes high-precision FP32 models and automatically casts them to lower-precision data types while optimizing accuracy and performance. Autocasting reduces time to market by removing the need for lower-precision retraining.
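To give a feel for what casting FP32 values to a 16-bit type costs in accuracy, here is a minimal sketch using only the Python standard library. It does not use AWS Neuron and is not how Neuron performs autocasting; `to_fp16` and `to_bf16` are hypothetical helper names, and the BF16 emulation truncates instead of rounding to keep the code short.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision (FP16) and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16 by keeping only the top 16 bits of the FP32 encoding.

    Assumption: truncation (round toward zero) instead of round-to-nearest,
    which is a simplification for illustration.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


# Hypothetical model weights, chosen only to show different magnitudes.
weights = [0.1, 1.0 / 3.0, 2.5, 1000.123]

for w in weights:
    fp16, bf16 = to_fp16(w), to_bf16(w)
    print(f"{w:>12.6f}  fp16 error={abs(fp16 - w):.2e}  "
          f"bf16 error={abs(bf16 - w):.2e}")
```

FP16 keeps more mantissa bits (10 versus 7), so its per-value error is smaller. BF16 keeps the 8-bit exponent of FP32, so it covers the same dynamic range. Which trade-off suits a given model is an empirical question.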

State-of-the-art DL capabilities

Inferentia2 adds hardware optimizations for dynamic input sizes and custom operators written in C++. It also supports stochastic rounding, a way of rounding probabilistically that enables high performance and higher accuracy compared to legacy rounding modes.
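The idea behind stochastic rounding can be sketched in a few lines of Python. This is a generic illustration of the technique, not a description of the Inferentia2 hardware implementation; the grid spacing `step`, the `update` size, and the function names are assumptions chosen to make the effect visible.

```python
import math
import random


def round_nearest(x: float, step: float) -> float:
    """Round x to the nearest multiple of step (deterministic)."""
    return round(x / step) * step


def round_stochastic(x: float, step: float, rng: random.Random) -> float:
    """Round x to one of the two neighbouring multiples of step.

    The upper neighbour is chosen with probability equal to the fractional
    distance of x from the lower neighbour, so the result is unbiased on
    average.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    return (lower + (1 if rng.random() < frac else 0)) * step


rng = random.Random(0)  # seeded for reproducibility
step = 0.01             # hypothetical spacing of the low-precision grid
update = 0.001          # each update is smaller than half a step
n_updates = 10_000

acc_nearest = 0.0
acc_stochastic = 0.0
for _ in range(n_updates):
    acc_nearest = round_nearest(acc_nearest + update, step)
    acc_stochastic = round_stochastic(acc_stochastic + update, step, rng)

exact = n_updates * update
print(f"exact={exact:.2f}  nearest={acc_nearest:.2f}  "
      f"stochastic={acc_stochastic:.2f}")
```

With round-to-nearest, every small update is rounded away and the accumulator never leaves zero. With stochastic rounding, the accumulator tracks the exact sum on average, at the cost of some random noise.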

Built for sustainability

Inf2 instances offer up to 50% better performance per watt than comparable Amazon EC2 instances because the instances and the underlying Inferentia2 chips are purpose-built to run DL models at scale. Inf2 instances help you meet your sustainability goals when deploying ultra-large models.

Karakuri

Learn how Karakuri delivers high-performance AI while controlling costs using AWS Inferentia

Watch the video

Metagenomi

Learn how Metagenomi reduced large-scale protein design costs by up to 56% using AWS Inferentia

Read the blog

NetoAI

Learn how NetoAI achieved 300–600 ms inference latency using AWS Inferentia2

Read the testimonial

Tomofun

Learn how Tomofun cut BLIP inference deployment costs by 83% by migrating to AWS Inferentia

Read the testimonial

SplashMusic

Learn how SplashMusic reduced inference latency by up to 10x using AWS Inferentia

Read the testimonial

Leonardo.ai

Our team at Leonardo leverages generative AI to enable creative professionals and enthusiasts to produce visual assets with unmatched quality, speed, and style consistency. Utilizing AWS Inferentia2 we are able to reduce our costs by 80%, without sacrificing performance, fundamentally changing the value proposition we can offer customers, enabling our most advanced features at a more accessible price point. It also alleviates concerns around cost and capacity availability for our ancillary AI services, which are increasingly important as we grow and scale. It is a key enabling technology for us as we continue to push the envelope on what’s possible with generative AI, enabling a new era of creativity and expressive power for our users.

Pete Werner, Head of AI, Leonardo.ai

Qualtrics

Qualtrics designs and develops experience management software.

At Qualtrics, our focus is building technology that closes experience gaps for customers, employees, brands, and products. To achieve that, we are developing complex multi-task, multi-modal DL models to launch new features, such as text classification, sequence tagging, discourse analysis, key-phrase extraction, topic extraction, clustering, and end-to-end conversation understanding. As we utilize these more complex models in more applications, the volume of unstructured data grows, and we need more performant inference-optimized solutions that can meet these demands, such as Inf2 instances, to deliver the best experiences to our customers. We are excited about the new Inf2 instances, because they will not only allow us to achieve higher throughput while dramatically cutting latency, but also introduce features like distributed inference and enhanced dynamic input shape support, which will help us scale to meet our deployment needs as we push towards larger, more complex models.

Aaron Colak, Head of Core Machine Learning, Qualtrics

Finch Computing

Finch Computing is a natural language technology company providing artificial intelligence applications for government, financial services, and data integrator clients.

To meet our customers’ needs for real-time NLP, we develop state-of-the-art DL models that scale to large production workloads. We have to provide low-latency transactions and achieve high throughputs to process global data feeds. We already migrated many production workloads to Inf1 instances and achieved an 80% reduction in cost over GPUs. Now, we are developing larger, more complex models that enable deeper, more insightful meaning from written text. A lot of our customers need access to these insights in real time, and the performance on Inf2 instances will help us deliver lower latency and higher throughput over Inf1 instances. With the Inf2 performance improvements and new Inf2 features, such as support for dynamic input sizes, we are improving our cost-efficiency, elevating the real-time customer experience, and helping our customers glean new insights from their data.

Franz Weckesser, Chief Architect, Finch Computing

Dataminr

We alert on many types of events all over the world in many languages, in different formats (images, video, audio, text, sensors, combinations of all these types) from hundreds of thousands of sources. Optimizing for speed and cost given that scale is absolutely critical for our business. With AWS Inferentia, we have lowered model latency and achieved up to 9x better throughput per dollar. This has allowed us to increase model accuracy and grow our platform's capabilities by deploying more sophisticated DL models and processing 5x more data volume while keeping our costs under control.

Alex Jaimes, Chief Scientist and Senior Vice President of AI, Dataminr

Snap Inc.

We incorporate ML into many aspects of Snapchat, and exploring innovation in this field is a key priority. Once we heard about Inferentia, we started collaborating with AWS to adopt Inf1/Inferentia instances to help us with ML deployment, including around performance and cost. We started with our recommendation models and look forward to adopting more models with the Inf1 instances in the future.

Nima Khajehnouri, VP Engineering, Snap Inc.

Sprinklr

Sprinklr's AI-driven unified customer experience management (Unified-CXM) platform enables companies to gather and translate real-time customer feedback across multiple channels into actionable insights—resulting in proactive issue resolution, enhanced product development, improved content marketing, better customer service, and more. Using Amazon EC2 Inf1, we were able to significantly improve the performance of one of our NLP models and improve the performance of one of our computer vision models. We're looking forward to continuing to use Amazon EC2 Inf1 to better serve our global customers.

Vasant Srinivasan, Senior Vice President of Product Engineering, Sprinklr

Autodesk

Autodesk is advancing the cognitive technology of our AI-powered virtual assistant, Autodesk Virtual Agent (AVA), by using Inferentia. AVA answers over 100,000 customer questions per month by applying natural language understanding (NLU) and DL techniques to extract the context, intent, and meaning behind inquiries. Piloting Inferentia, we are able to obtain a 4.9x higher throughput over G4dn for our NLU models, and look forward to running more workloads on the Inferentia-based Inf1 instances.

Binghui Ouyang, Sr. Data Scientist, Autodesk

Screening Eagle Technologies

The use of ground-penetrating radar and detection of visual defects is typically the domain of expert surveyors. An AWS microservices-based architecture enables us to process videos captured by automated inspection vehicles and inspectors. By migrating our in-house–built models from traditional GPU-based instances to Inferentia, we were able to reduce costs by 50%. Moreover, we were able to see performance gains when comparing the times with a G4dn GPU instance. Our team is looking forward to running more workloads on the Inferentia-based Inf1 instances.

Jesús Hormigo, Chief of Cloud and AI Officer, Screening Eagle Technologies

NTT PC Communications Inc.

NTT PC Communications, a network service and communication solution provider in Japan, is a telco leader in introducing new innovative products in the information and communication technology market.

NTT PC developed AnyMotion, a motion analysis API platform service based on advanced posture estimation ML models. We deployed our AnyMotion platform on Amazon EC2 Inf1 instances using Amazon ECS for a fully managed container orchestration service. By deploying our AnyMotion containers on Amazon EC2 Inf1, we saw 4.5x higher throughput, 25% lower inference latency, and 90% lower cost compared to current-generation GPU-based EC2 instances. These superior results will help to improve the quality of the AnyMotion service at scale.

Toshiki Yanagisawa, Software Engineer, NTT PC Communications Inc.

Anthem

Anthem is one of the nation's leading health benefits companies, serving the healthcare needs of 40+ million members across dozens of states.

The market of digital health platforms is growing at a remarkable rate. Gathering intelligence on this market is a challenging task due to the vast amounts of customer opinions data and its unstructured nature. Our application automates the generation of actionable insights from customer opinions via DL natural language models (Transformers). Our application is computationally intensive and needs to be deployed in a highly performant manner. We seamlessly deployed our DL inferencing workload onto Amazon EC2 Inf1 instances powered by the AWS Inferentia processor. The new Inf1 instances provide 2x higher throughput than GPU-based instances and allowed us to streamline our inference workloads.

Numan Laanait and Miro Mihaylov, PhDs, Principal AI/Data Scientists, Anthem

Videos

Behind the scenes look at Generative AI infrastructure at Amazon

Introducing Amazon EC2 Inf2 instances powered by AWS Inferentia2

How four AWS customers reduced ML costs and drove innovation with AWS Inferentia

Resources

Blog

Fine-tune and deploy Llama 2 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium

Blog

Fine-tune Llama 2 using QLoRA and Deploy it on Amazon SageMaker with AWS Inferentia2

Blog

Maximize Stable Diffusion performance and lower inference costs with AWS Inferentia2

Blog

Achieve high performance with lowest cost for generative AI inference using AWS Inferentia2 and AWS Trainium on Amazon SageMaker

Blog

ByteDance saves up to 60% on inference costs while reducing latency and increasing throughput using AWS Inferentia

Blog

How Amazon Search reduced ML inference costs by 85% with AWS Inferentia

Additional resources

Use AWS Neuron and get started with AWS Inferentia from within TensorFlow, PyTorch, or MXNet

Get started with inference on AWS Inferentia using these easy tutorials
