Amazon EC2 Inf1 Instances

High performance and the lowest cost machine learning inference in the cloud

Amazon EC2 Inf1 instances are built from the ground up to support machine learning inference applications. Inf1 instances feature up to 16 AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS. In addition, we’ve coupled the Inferentia chips with the latest custom 2nd generation Intel® Xeon® Scalable processors and up to 100 Gbps networking to enable high-throughput inference. This powerful configuration enables Inf1 instances to deliver up to 3x higher throughput and up to 40% lower cost per inference than Amazon EC2 G4 instances, which were already the lowest-cost instances for machine learning inference available in the cloud. Using Inf1 instances, customers can run large-scale machine learning inference applications such as image recognition, speech recognition, natural language processing, personalization, and fraud detection at the lowest cost in the cloud.



Customers across a diverse set of industries are turning to machine learning for common use cases such as providing personalized shopping recommendations, improving safety and security through online content moderation, and improving customer engagement with chatbots. Customers want more performance from their machine learning applications in order to deliver the best possible end-user experience.

Amazon EC2 Inf1 instances deliver high performance and the lowest cost machine learning inference in the cloud. You can start your machine learning workflow by building your model in one of the popular machine learning frameworks such as TensorFlow, PyTorch, or MXNet and using GPU instances such as P3 or P3dn to train it. Once your machine learning model is trained to meet your requirements, you can deploy it on Inf1 instances by using AWS Neuron, a specialized software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimizes the machine learning inference performance of Inferentia chips. Neuron is pre-installed in AWS Deep Learning AMIs and can also be installed in your custom environment without a framework. In addition, Neuron will be pre-installed in AWS Deep Learning Containers and Amazon SageMaker, the easiest way to be successful with machine learning.


Up to 40% lower cost per inference

The high throughput of Inf1 instances enables the lowest cost per inference in the cloud: up to 40% lower cost per inference than Amazon EC2 G4 instances, which were already the lowest-cost instances for machine learning inference available in the cloud. Because machine learning inference can represent up to 90% of the overall operational costs of running machine learning workloads, this results in significant cost savings.
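To make the cost-per-inference comparison concrete, the following minimal sketch computes the cost of one million inferences from an hourly instance price and a sustained throughput. The prices and throughput figures are hypothetical placeholders chosen for illustration, not AWS pricing or benchmark results; only the formula is the point.

```python
def cost_per_million_inferences(hourly_price_usd: float,
                                inferences_per_second: float) -> float:
    """Return the cost in USD of one million inferences at full utilization."""
    if inferences_per_second <= 0:
        raise ValueError("throughput must be positive")
    inferences_per_hour = inferences_per_second * 3600
    return hourly_price_usd / inferences_per_hour * 1_000_000


# Hypothetical numbers for illustration only (not real prices or benchmarks).
baseline = cost_per_million_inferences(hourly_price_usd=1.00,
                                       inferences_per_second=1000)
candidate = cost_per_million_inferences(hourly_price_usd=1.80,
                                        inferences_per_second=3000)

savings = 1 - candidate / baseline
print(f"baseline:  ${baseline:.4f} per million inferences")
print(f"candidate: ${candidate:.4f} per million inferences")
print(f"savings:   {savings:.0%}")
```

With these placeholder inputs, an instance that costs 1.8x as much per hour but sustains 3x the throughput works out to 40% less per inference. The calculation assumes the instance is kept fully utilized, so real savings depend on the workload.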

Up to 3x higher throughput

Inf1 instances deliver high throughput for batch inference applications, up to 3x higher throughput than Amazon EC2 G4 instances. Batch inference applications, such as photo tagging, are sensitive to inference throughput, that is, how many inferences can be processed per second. With 1 to 16 AWS Inferentia chips per instance, Inf1 instances can scale in performance up to 2000 Tera Operations per Second (TOPS).
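As a rough illustration of how the chip count relates to the quoted peak figure, the sketch below assumes that peak performance scales linearly with the number of chips. The per-chip value is derived only from the numbers in the text (2000 TOPS across 16 chips), and the chip counts in the loop are example values.

```python
MAX_CHIPS = 16    # from the text: 1 to 16 Inferentia chips per instance
MAX_TOPS = 2000   # from the text: up to 2000 TOPS at the largest size

# Assumption: linear scaling, so the per-chip figure is the quoted maximum
# divided by the maximum chip count.
TOPS_PER_CHIP = MAX_TOPS / MAX_CHIPS


def peak_tops(chips: int) -> float:
    """Estimate peak TOPS for a given chip count under linear scaling."""
    if not 1 <= chips <= MAX_CHIPS:
        raise ValueError(f"chips must be between 1 and {MAX_CHIPS}")
    return chips * TOPS_PER_CHIP


for chips in (1, 4, 16):  # example chip counts
    print(f"{chips:2d} chip(s): up to about {peak_tops(chips):.0f} TOPS")
```

These are peak estimates under the linear-scaling assumption; the throughput a model actually achieves depends on the model and how it is compiled.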

Extremely low latency

Inf1 instances deliver extremely low latency for real-time applications. Real-time inference applications, such as speech recognition, need to make inferences in response to a user’s input quickly and are sensitive to inference latency. The large on-chip memory on AWS Inferentia chips used in Inf1 instances allows caching of machine learning models directly on the chip. This eliminates the need to access outside memory resources during inference, enabling low latency without impacting bandwidth.

Ease of use

Inf1 instances are easy to use, requiring little, if any, code changes to support models trained using the most popular machine learning frameworks, including TensorFlow, PyTorch, and MXNet.

Flexibility for different machine learning models

Using AWS Neuron, Inf1 instances support many commonly used machine learning models such as single shot detector (SSD) and ResNet for image recognition/classification as well as Transformer and BERT for natural language processing and translation.

Support for multiple data types

Inf1 instances support multiple data types including INT8, BF16, and FP16 with mixed precision to support a wide range of models and performance needs.
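To show how the two 16-bit floating-point types trade precision for range, here is a minimal sketch that uses only the Python standard library. It emulates BF16 by keeping the top 16 bits of a 32-bit float (simple truncation, which is an assumption made for brevity; hardware may round differently) and round-trips FP16 through the `struct` module's half-precision format. The helper names `to_bf16` and `to_fp16` are hypothetical and are not part of AWS Neuron.

```python
import struct


def to_bf16(x: float) -> float:
    """Emulate BF16 by truncating the FP32 encoding of x to its top 16 bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (FP16)."""
    return struct.unpack(">e", struct.pack(">e", x))[0]


# BF16 keeps the 8-bit exponent of FP32 but only 7 mantissa bits.
print("bf16(3.14159) =", to_bf16(3.14159))
# FP16 has 10 mantissa bits but only a 5-bit exponent.
print("fp16(3.14159) =", to_fp16(3.14159))

# Large values still fit in BF16, at reduced precision.
print("bf16(70000.0) =", to_bf16(70000.0))
try:
    to_fp16(70000.0)
except (OverflowError, struct.error):
    print("70000.0 is out of range for FP16")
```

BF16 covers the same range as FP32 with less precision, while FP16 offers more precision over a narrower range, which is one reason support for several data types is useful across different models.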

Amazon SageMaker (support for Inf1 instances coming soon)

Amazon SageMaker will make it easy to deploy your trained model in production on Amazon EC2 Inf1 instances with a single click so that you can start generating predictions for real-time or batch data. Amazon SageMaker is a fully-managed service that covers the entire machine learning workflow to label and prepare your data, choose an algorithm, train the model, tune and optimize it for deployment, make predictions, and take action. Your models get to production faster with much less effort and lower cost. Your model will run on auto-scaling clusters of Amazon SageMaker Inf1 instances that are spread across multiple availability zones to deliver both high performance and high availability.


How it works

How to use Inf1 and AWS Inferentia

AWS Inferentia chips

AWS Inferentia is a machine learning inference chip designed and built by AWS to deliver high performance at low cost. Each AWS Inferentia chip has 4 Neuron Cores and supports FP16, BF16, and INT8 data types. AWS Inferentia chips feature a large amount of on-chip memory, which can be used for caching large models, removing the need to store them off-chip. In addition, the AWS Neuron SDK can split large models across multiple Inferentia chips using a high-speed interconnect, creating a powerful inference processing pipeline.
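The idea of splitting a large model across several chips to form a pipeline can be sketched as follows. This is a simplified greedy illustration, not the algorithm the Neuron compiler uses; the layer sizes, the per-chip capacity, and the function name `partition_layers` are all hypothetical.

```python
from typing import List


def partition_layers(layer_sizes_mb: List[float],
                     chip_capacity_mb: float) -> List[List[int]]:
    """Greedily assign consecutive layers to pipeline stages (one per chip)
    so that no stage exceeds the given on-chip capacity."""
    stages: List[List[int]] = [[]]
    used = 0.0
    for index, size in enumerate(layer_sizes_mb):
        if size > chip_capacity_mb:
            raise ValueError(f"layer {index} does not fit on a single chip")
        # Start a new stage when the current one would overflow.
        if used + size > chip_capacity_mb and stages[-1]:
            stages.append([])
            used = 0.0
        stages[-1].append(index)
        used += size
    return stages


# Hypothetical layer sizes and capacity, for illustration only.
layers = [30.0, 30.0, 50.0, 20.0, 40.0, 10.0]
stages = partition_layers(layers, chip_capacity_mb=64.0)
for chip, layer_ids in enumerate(stages):
    print(f"chip {chip}: layers {layer_ids}")
```

Keeping the layers in each stage contiguous means activations flow from one chip to the next in order, which is what makes the arrangement a pipeline.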


AWS Neuron SDK

AWS Neuron is a specialized SDK that optimizes the machine learning inference performance of AWS Inferentia chips. It consists of a compiler, runtime, and profiling tools that enable developers to run high-performance, low-latency inference workloads on Inferentia-based EC2 Inf1 instances.


Use cases


Recommendations

Machine learning is increasingly used to improve customer engagement by powering personalized product and content recommendations, tailored search results, and targeted marketing promotions.


Forecasting

Companies today use everything from simple spreadsheets to complex financial planning software to attempt to accurately forecast future business outcomes such as product demand, resource needs, or financial performance. These tools build forecasts by looking at a historical series of data, which is called time series data. Companies are increasingly using machine learning to combine time series data with additional variables to build forecasts.

Image & video analysis

Machine learning is being used today to identify the objects, people, text, scenes, and activities, as well as detect any inappropriate content contained in images or video. In addition, facial analysis and facial recognition on images and video can detect, analyze, and compare faces for a wide variety of user verification, people counting, and public safety use cases.

Advanced text analytics

Machine learning is particularly good at accurately identifying specific items of interest inside vast swathes of text (such as finding company names in analyst reports), and can learn the sentiment hidden inside language (identifying negative reviews, or positive customer interactions with customer service agents), at almost limitless scale.

Document analysis

Machine learning can be used to instantly “read” virtually any type of document to accurately extract text and data without the need for any manual effort or custom code. You can quickly automate document workflows, enabling you to process millions of document pages in hours.


Text-to-speech

Businesses can use machine learning to turn text into lifelike speech, making it possible to create applications that talk and to build entirely new categories of speech-enabled products. Text-to-Speech (TTS) services can use advanced deep learning technologies to synthesize speech that sounds like a human voice.

Conversational agents

AI is improving the customer experience in call centers by enabling engagement through chatbots, which are intelligent, natural language virtual assistants. These chatbots are able to recognize human speech and understand the caller’s intent without requiring the caller to speak in specific phrases. Callers can perform tasks such as changing a password, requesting a balance on an account, or scheduling an appointment, without the need to speak to an agent.


Translation

Companies can use machine learning-based translation to deliver more accurate and more natural-sounding translation than traditional statistical and rule-based translation algorithms. Companies can localize content, such as websites and applications, for international users, and translate large volumes of text efficiently.


Transcription

Machine learning transcription can be used for many common applications, including transcribing customer service calls and generating subtitles for audio and video content. Transcription services can place time stamps on every word so that you can easily locate the audio in the original source by searching for the text.

Fraud detection

Fraud detection using machine learning identifies potentially fraudulent activity and flags that activity for review. It is typically used in the financial services industry to classify transactions as legitimate or fraudulent using a model that scores a transaction based on attributes such as the amount, location, merchant, or time.


Healthcare

Machine learning in healthcare enables physicians to treat patients more quickly, which both cuts costs and improves outcomes. Hospitals are improving traditional imaging technologies like ultrasounds and CT scans by incorporating a variety of data sets (patient-reported data, sensor data, and numerous other sources) into the scan process, and machine learning algorithms are able to recognize the difference between normal and abnormal results.