AWS Inferentia

High-performance machine learning inference chip, custom designed by AWS

The demand for deep learning acceleration is growing at a rapid pace and across a wide range of applications. Applications such as personalized search recommendations, dynamic pricing, and automated customer support are growing in sophistication and becoming more expensive to run in production. As more applications embed machine learning capabilities, a higher percentage of workloads need acceleration, including ones that require low-latency, real-time performance. These applications benefit from infrastructure optimized to execute machine learning algorithms.

AWS’s vision is to make deep learning pervasive for everyday developers and to democratize access to cutting-edge hardware through a low-cost, pay-as-you-go usage model. AWS Inferentia is a significant step toward delivering on this vision. It is designed to provide high inference performance in the cloud, drive down the total cost of inference, and make it easy for you to integrate machine learning into your standard application features and capabilities.

High performance

Each AWS Inferentia chip delivers up to 128 TOPS (trillions of operations per second) at low power, enabling multiple chips per EC2 instance. AWS Inferentia supports the FP16, BF16, and INT8 data types. Furthermore, Inferentia can take a model trained in 32-bit precision and run it at the speed of a 16-bit model using BFloat16.
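
For illustration, here is a minimal sketch of compiling a 32-bit trained model for Inferentia, assuming the torch-neuron package from the AWS Neuron SDK and a standard torchvision ResNet-50; the model and file name are illustrative, not official sample code. By default, the Neuron compiler auto-casts FP32 operations to BFloat16.

    # Minimal sketch: compiling an FP32-trained PyTorch model for Inferentia.
    # Assumes the torch-neuron package (AWS Neuron SDK) is installed; the
    # compiler auto-casts FP32 operations to BFloat16 by default.
    import torch
    import torch_neuron  # registers the torch.neuron namespace
    import torchvision.models as models

    model = models.resnet50(pretrained=True).eval()  # trained in 32-bit FP32
    example = torch.rand(1, 3, 224, 224)             # example input for tracing

    # Compile for Inferentia; the FP32 -> BF16 cast happens at compile time.
    model_neuron = torch.neuron.trace(model, example_inputs=[example])
    model_neuron.save('resnet50_neuron.pt')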

Low latency

AWS Inferentia features a large amount of on-chip memory, which can be used to cache large models and remove the need to store them off-chip. This significantly lowers inference latency, because Inferentia's processing cores, called NeuronCores, have high-speed access to models and are not limited by the chip's off-chip memory bandwidth.
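
One way this shows up in practice is NeuronCore Pipeline mode, in which the compiler shards a model across several NeuronCores and keeps its parameters resident in on-chip memory. The sketch below assumes the torch-neuron package and its --neuroncore-pipeline-cores compiler flag; the core count of four is illustrative.

    # Rough sketch: requesting NeuronCore Pipeline mode at compile time so
    # model weights stay cached on-chip across cores (core count illustrative).
    import torch
    import torch_neuron
    import torchvision.models as models

    model = models.resnet50(pretrained=True).eval()
    example = torch.rand(1, 3, 224, 224)

    model_neuron = torch.neuron.trace(
        model,
        example_inputs=[example],
        compiler_args=['--neuroncore-pipeline-cores', '4'],
    )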

Ease of use

AWS Inferentia comes with the AWS Neuron software development kit (SDK), which enables complex neural network models, created and trained in popular frameworks, to be executed on AWS Inferentia based EC2 Inf1 instances. Neuron consists of a compiler, a runtime, and profiling tools, and is pre-integrated into popular machine learning frameworks, including TensorFlow, PyTorch, and MXNet, to deliver optimal performance on EC2 Inf1 instances.
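
In practice, the workflow is compile once, then deploy: the compiled artifact loads like an ordinary TorchScript model, and the Neuron runtime dispatches it to the Inferentia chip. A minimal sketch, assuming the 'resnet50_neuron.pt' artifact from the example above and an Inf1 instance with torch-neuron installed:

    # Minimal sketch: running a compiled model on an EC2 Inf1 instance.
    # Assumes 'resnet50_neuron.pt' was produced by torch.neuron.trace and that
    # the Neuron runtime and torch-neuron are installed on the instance.
    import torch
    import torch_neuron  # importing this makes the Neuron runtime available

    model_neuron = torch.jit.load('resnet50_neuron.pt')  # TorchScript artifact
    output = model_neuron(torch.rand(1, 3, 224, 224))    # executes on Inferentia
    print(output.shape)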
