AWS Inferentia

High performance machine learning inference chip, custom designed by AWS

Demand for deep learning acceleration is growing rapidly across a wide range of applications. Applications such as personalized search recommendations, dynamic pricing, and automated customer support are growing in sophistication and becoming more expensive to run in production. As more applications embed machine learning capabilities, a higher percentage of workloads need acceleration, including those that require low-latency, real-time performance. These applications benefit from infrastructure optimized to execute machine learning algorithms.

AWS’s vision is to make deep learning pervasive for everyday developers and to democratize access to cutting-edge hardware through a low-cost, pay-as-you-go usage model. AWS Inferentia is a major step toward delivering on this vision. It is designed to provide high inference performance in the cloud, drive down the total cost of inference, and make it easy for you to integrate machine learning into your standard application features and capabilities. AWS Inferentia comes with the AWS Neuron software development kit (SDK), which consists of a compiler, runtime, and profiling tools. The SDK enables complex neural network models, created and trained in popular frameworks such as TensorFlow, PyTorch, and MXNet, to be executed on AWS Inferentia-based Amazon EC2 Inf1 instances.

High performance

Each AWS Inferentia chip delivers up to 128 TOPS (trillions of operations per second) at low power, which makes it practical to place multiple chips in a single EC2 instance. AWS Inferentia supports the FP16, BF16, and INT8 data types. In addition, Inferentia can take a model trained in 32-bit precision and run it at the speed of a 16-bit model by using BFloat16.
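To illustrate why a 32-bit model can map onto BFloat16 so directly, the sketch below shows the relationship between the two formats: BF16 keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits of an FP32 value. This is only an illustration in plain Python. The function names are hypothetical, and simple truncation is assumed here for clarity; how Inferentia or the Neuron compiler actually rounds values is not described on this page.

```python
import struct


def float32_to_bfloat16_bits(x: float) -> int:
    """Return a 16-bit BF16 pattern for x (assumption: plain truncation)."""
    # Encode x as IEEE 754 single precision and read it back as a 32-bit integer.
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    # BF16 is the upper 16 bits of FP32: 1 sign, 8 exponent, 7 mantissa bits.
    return bits32 >> 16


def bfloat16_bits_to_float32(bits16: int) -> float:
    """Expand a 16-bit BF16 pattern back to a float by zero-filling the low bits."""
    (x,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return x


if __name__ == "__main__":
    for value in (1.0, 3.14159, -0.1):
        bits = float32_to_bfloat16_bits(value)
        print(f"{value!r} -> 0x{bits:04X} -> {bfloat16_bits_to_float32(bits)!r}")
```

Because the exponent field is unchanged, BF16 covers the same numeric range as FP32 and gives up only mantissa precision, which is why no rescaling of the trained weights is needed in this sketch.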

Low latency

AWS Inferentia features a large amount of on-chip memory that can be used to cache large models, removing the need to store them off-chip. This significantly lowers inference latency, because Inferentia’s processing cores, called Neuron Cores, have high-speed access to the model and are not limited by the chip’s off-chip memory bandwidth.

Ease of use

Trained machine learning models can be deployed to AWS Inferentia-based Amazon EC2 Inf1 instances with minimal code changes. To get started quickly, you can use Amazon SageMaker, a fully managed service for building, training, and deploying machine learning models. Developers who prefer to manage their own workflows for building and deploying models can use the AWS Neuron SDK directly; it is natively integrated with popular frameworks including TensorFlow, PyTorch, and MXNet. AWS Neuron is also pre-installed in AWS Deep Learning AMIs and can be installed in your custom environment without a framework.

Amazon EC2 Inf1 Instances Powered by AWS Inferentia

Amazon EC2 Inf1 instances deliver high performance and the lowest cost machine learning inference in the cloud. With Inf1 instances, customers can run large-scale machine learning inference applications such as image recognition, speech recognition, natural language processing, personalization, and fraud detection.

