Amazon EC2 Inf1 Instances

High performance and the lowest cost machine learning inference in the cloud

Amazon EC2 Inf1 instances deliver high performance and the lowest cost machine learning inference in the cloud. Inf1 instances are built from the ground up to support machine learning inference applications. These instances feature up to 16 AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS. We have also coupled the Inferentia chips with the latest custom 2nd generation Intel® Xeon® Scalable processors and up to 100 Gbps networking to enable high throughput inference. This configuration enables Inf1 instances to deliver up to 3x higher throughput and up to 40% lower cost per inference than Amazon EC2 G4 instances, which were previously the lowest cost instances for machine learning inference in the cloud. Using Inf1 instances, customers can run large-scale machine learning inference applications such as image recognition, speech recognition, natural language processing, personalization, and fraud detection at the lowest cost in the cloud.

Customers across a diverse set of industries are turning to machine learning to address common use cases for applications like providing personalized shopping recommendations, improving safety and security through online content moderation, and improving customer engagement with chatbots. Customers want more performance for their machine learning applications in order to deliver the best possible end user experience.

To get started with machine learning inference on Inf1 instances, take your trained machine learning model and compile it to run on AWS Inferentia chips using AWS Neuron. AWS Neuron is a software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimizes machine learning inference performance on Inferentia chips. It is integrated with popular machine learning frameworks such as TensorFlow, PyTorch, and MXNet, comes pre-installed in the AWS Deep Learning AMIs, and can also be installed in your custom environment without a framework. The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service that enables developers to build, train, and deploy machine learning models quickly. Amazon SageMaker supports Inf1 instances and AWS Neuron to provide one-click deployment of machine learning models onto auto-scaling Inf1 instances across multiple Availability Zones for high redundancy.




Up to 40% lower cost per inference

The high throughput of Inf1 instances enables the lowest cost per inference in the cloud: up to 40% lower cost per inference than Amazon EC2 G4 instances, which were previously the lowest cost instances for machine learning inference in the cloud. With inference representing up to 90% of the overall operational cost of running machine learning workloads, this results in significant savings.
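To see why throughput drives cost per inference, consider a back-of-envelope comparison. The hourly prices and throughput figures below are hypothetical placeholders chosen only to illustrate the arithmetic, not published AWS pricing:

```python
# Illustrative cost-per-inference comparison. All numbers are
# hypothetical placeholders, not actual AWS pricing or benchmarks.

def cost_per_million_inferences(hourly_price_usd, inferences_per_second):
    """Cost (USD) to serve one million inferences at a given price and throughput."""
    inferences_per_hour = inferences_per_second * 3600
    return hourly_price_usd / inferences_per_hour * 1_000_000

# Hypothetical instances: a G4-class baseline vs. an Inf1-class instance
# with roughly 3x the throughput at a somewhat higher hourly price.
g4_cost = cost_per_million_inferences(hourly_price_usd=0.75, inferences_per_second=1000)
inf1_cost = cost_per_million_inferences(hourly_price_usd=1.30, inferences_per_second=3000)

savings = 1 - inf1_cost / g4_cost
print(f"G4-class:   ${g4_cost:.3f} per 1M inferences")
print(f"Inf1-class: ${inf1_cost:.3f} per 1M inferences")
print(f"Savings: {savings:.0%}")
```

The sketch shows how a higher hourly price can still yield a much lower cost per inference once throughput is factored in.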

Up to 3x higher throughput

Inf1 instances deliver high throughput for batch inference applications, up to 3x higher throughput than Amazon EC2 G4 instances. Batch inference applications, such as photo tagging, are sensitive to inference throughput, or how many inferences can be processed per second. With 1 to 16 AWS Inferentia chips per instance, Inf1 instances can scale in performance up to 2,000 Tera Operations per Second (TOPS).
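A quick sketch of how aggregate compute scales with chip count. The per-chip figure of 128 TOPS is an assumption inferred from the quoted "up to 2,000 TOPS" at 16 chips, not an official specification:

```python
# Back-of-envelope scaling of aggregate compute with chip count.
# TOPS_PER_CHIP is an assumption inferred from ~2,000 TOPS at 16 chips,
# not a published per-chip spec.

TOPS_PER_CHIP = 128  # assumed peak per Inferentia chip

def aggregate_tops(num_chips):
    """Aggregate peak TOPS for an Inf1-style instance with 1-16 chips."""
    if not 1 <= num_chips <= 16:
        raise ValueError("Inf1 instances ship with 1 to 16 Inferentia chips")
    return num_chips * TOPS_PER_CHIP

for chips in (1, 4, 16):
    print(f"{chips:2d} chip(s): {aggregate_tops(chips)} TOPS")
```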

Extremely low latency

Inf1 instances deliver extremely low latency for real-time applications. Real-time inference applications, such as speech recognition, need to make inferences in response to a user’s input quickly and are sensitive to inference latency. The large on-chip memory on AWS Inferentia chips used in Inf1 instances allows caching of machine learning models directly on the chip. This eliminates the need to access outside memory resources during inference, enabling low latency without impacting bandwidth.

Ease of use

Inf1 instances are easy to use, requiring little, if any, code change to deploy models trained using the most popular machine learning frameworks, including TensorFlow, PyTorch, and MXNet. The easiest and fastest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service that enables developers to build, train, and deploy machine learning models quickly.

Flexibility for different machine learning models

Using AWS Neuron, Inf1 instances support many commonly used machine learning models such as single shot detector (SSD) and ResNet for image recognition/classification as well as Transformer and BERT for natural language processing and translation.

Support for multiple data types

Inf1 instances support multiple data types including INT8, BF16, and FP16 with mixed precision to support a wide range of models and performance needs.
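To make the idea of reduced-precision inference concrete, here is a minimal from-scratch sketch of symmetric INT8 quantization, the kind of precision reduction that lets hardware run models with INT8 arithmetic. This is an illustration only, not the Neuron compiler's actual quantization scheme:

```python
# Minimal sketch of symmetric INT8 quantization. Illustrative only;
# not the AWS Neuron compiler's actual quantization implementation.

def quantize_int8(values):
    """Map floats onto int8 range [-127, 127] using one symmetric scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from quantized integers."""
    return [x * scale for x in q]

weights = [0.51, -1.27, 0.003, 0.89]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"quantized: {q}, scale: {scale:.5f}, max error: {max_err:.5f}")
```

The round trip shows the trade-off: 8-bit storage and arithmetic in exchange for a small, bounded rounding error.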

Amazon SageMaker

Amazon SageMaker makes it easy to compile and deploy your trained machine learning model in production on Amazon EC2 Inf1 instances so that you can start generating real-time predictions with low latency. Amazon SageMaker is a fully managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning models quickly. Amazon SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high quality models, tune them to optimize performance, and deploy them into production faster. AWS Neuron, the compiler for AWS Inferentia, is integrated with Amazon SageMaker Neo, enabling you to compile your trained machine learning models to run optimally on Inf1 instances. With Amazon SageMaker you can easily choose to run your models on auto-scaling clusters of Inf1 instances that are spread across multiple Availability Zones to deliver both high performance and high availability real-time inference.


How it works

How to use Inf1 and AWS Inferentia

AWS Inferentia chips

AWS Inferentia is a machine learning inference chip designed and built by AWS to deliver high performance at low cost. Each AWS Inferentia chip has four NeuronCores and supports the FP16, BF16, and INT8 data types. AWS Inferentia chips feature a large amount of on-chip memory, which can be used to cache large models; this is especially beneficial for models that require frequent memory access. AWS Inferentia comes with the AWS Neuron software development kit (SDK), consisting of a compiler, runtime, and profiling tools. It enables complex neural network models, created and trained in popular frameworks such as TensorFlow, PyTorch, and MXNet, to be executed on AWS Inferentia-based Amazon EC2 Inf1 instances. AWS Neuron can also split large models for execution across multiple Inferentia chips using a high-speed physical chip-to-chip interconnect, delivering high inference throughput and lower inference cost.
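The idea of splitting a large model across chips can be sketched with a toy partitioner. The heuristic below (a contiguous split balanced by parameter count) is a simplification for illustration, not Neuron's actual placement algorithm, and the layer sizes are hypothetical:

```python
# Toy sketch of pipelining a model's layers across multiple chips, in the
# spirit of splitting a large model across Inferentia chips. The greedy
# contiguous-split heuristic here is illustrative, not Neuron's algorithm.

def split_layers(layer_sizes, num_chips):
    """Assign contiguous runs of layers to chips, capping each chip near
    an equal share of the total parameter count."""
    target = sum(layer_sizes) / num_chips
    partitions, current, current_size = [], [], 0
    for i, size in enumerate(layer_sizes):
        # Start a new chip when this one would overflow its share,
        # as long as chips remain to receive the rest.
        if current and current_size + size > target and len(partitions) < num_chips - 1:
            partitions.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    partitions.append(current)
    return partitions

# Hypothetical per-layer parameter counts (in millions) for a large model.
layers = [10, 60, 60, 40, 5, 80, 30, 15]
print(split_layers(layers, num_chips=4))
```

Each sublist is the set of layer indices placed on one chip; at inference time, activations would flow chip-to-chip over the interconnect in pipeline fashion.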


Use cases


Recommendation engines

Machine learning is increasingly used to improve customer engagement by powering personalized product and content recommendations, tailored search results, and targeted marketing promotions.


Forecasting

Companies today use everything from simple spreadsheets to complex financial planning software to forecast future business outcomes such as product demand, resource needs, or financial performance. These tools build forecasts from a historical series of data, called time series data. Companies are increasingly using machine learning to combine time series data with additional variables to build forecasts.

Image & video analysis

Machine learning is used today to identify objects, people, text, scenes, and activities, as well as to detect inappropriate content in images or video. In addition, facial analysis and facial recognition on images and video can detect, analyze, and compare faces for a wide variety of user verification, people counting, and public safety use cases.

Advanced text analytics

Machine learning is particularly good at accurately identifying specific items of interest inside vast swathes of text (such as finding company names in analyst reports), and can learn the sentiment hidden inside language (identifying negative reviews, or positive customer interactions with customer service agents), at almost limitless scale.

Document analysis

Machine learning can be used to instantly “read” virtually any type of document to accurately extract text and data without the need for any manual effort or custom code. You can quickly automate document workflows, enabling you to process millions of document pages in hours.


Text-to-speech

Businesses can use machine learning to turn text into lifelike speech, allowing them to create applications that talk and to build entirely new categories of speech-enabled products. Text-to-speech (TTS) services use advanced deep learning technologies to synthesize speech that sounds like a human voice.

Conversational agents

AI is playing a growing role in improving the customer experience in call centers, including engagement through chatbots: intelligent, natural-language virtual assistants. These chatbots can recognize human speech and understand the caller's intent without requiring the caller to speak in specific phrases. Callers can perform tasks such as changing a password, requesting an account balance, or scheduling an appointment without needing to speak to an agent.


Translation

Companies can use machine learning-based translation to deliver more accurate and more natural-sounding translations than traditional statistical and rule-based translation algorithms. Companies can localize content, such as websites and applications, for international users and efficiently translate large volumes of text.


Transcription

Machine learning transcription can be used for many common applications, including transcribing customer service calls and generating subtitles for audio and video content. Transcription services can place a time stamp on every word so that you can easily locate audio in the original source by searching for the text.

Fraud detection

Fraud detection using machine learning identifies potentially fraudulent activity and flags it for review. Fraud detection is typically used in the financial services industry to classify transactions as legitimate or fraudulent using a model that scores each transaction based on its amount, location, merchant, or time.


Healthcare

Machine learning in healthcare enables physicians to treat patients more quickly, cutting costs while also improving outcomes. Hospitals are improving traditional imaging technologies such as X-rays, ultrasounds, and CT scans by incorporating a variety of data sets (patient-reported data, sensor data, and numerous other sources) into the scan process, and machine learning algorithms can recognize the difference between normal and abnormal results.


Getting Started

To compile and deploy a trained machine learning model to Inf1, you can either use Amazon SageMaker or the AWS Neuron SDK.

• Get started with AWS Neuron on GitHub
• Get support on the AWS Neuron developer forum
• Learn how to deploy to Inf1 using Amazon SageMaker with the Amazon SageMaker examples on GitHub