Amazon EC2 Inf1 Instances

High performance and the lowest cost machine learning inference in the cloud

Businesses across a diverse set of industries are turning to machine learning to address use cases such as providing personalized shopping recommendations, improving online content moderation, and enhancing customer engagement with context-aware chatbots. However, as machine learning models become more capable, they are also becoming more complex. This drives up the need for compute, which leads to increased costs. In many cases, up to 90% of the infrastructure spend for developing and running an ML application is on inference, making high-performance, cost-effective ML inference infrastructure critical.

Amazon EC2 Inf1 instances deliver up to 30% higher throughput and up to 45% lower cost per inference than Amazon EC2 G4 instances, which were already the lowest cost instances for machine learning inference in the cloud. Inf1 instances are built from the ground up to support machine learning inference applications. These instances feature up to 16 AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS. Additionally, Inf1 instances include the latest 2nd generation Intel® Xeon® Scalable processors and up to 100 Gbps networking to enable high throughput inference. Using Inf1 instances, customers can run large scale machine learning inference applications such as search, recommendation, computer vision, speech recognition, natural language processing, personalization, and fraud detection, at the lowest cost in the cloud.

Developers can deploy their machine learning models to Inf1 instances using the AWS Neuron SDK, which is integrated with popular machine learning frameworks such as TensorFlow, PyTorch, and MXNet. It consists of a compiler, a runtime, and profiling tools to optimize inference performance on AWS Inferentia. The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service that enables developers to build, train, and deploy machine learning models quickly. Developers who prefer to manage their own machine learning application development platforms can get started either by launching Inf1 instances with AWS Deep Learning AMIs, which include the Neuron SDK, or by using Inf1 instances via Amazon Elastic Kubernetes Service (EKS) or Amazon Elastic Container Service (ECS) for containerized ML applications.




Up to 45% lower cost per inference

The high throughput of Inf1 instances enables the lowest cost per inference in the cloud, up to 45% lower cost-per-inference than Amazon EC2 G4 instances, which were already the lowest cost instances for machine learning inference in the cloud. With machine learning inference representing up to 90% of overall operational costs for running machine learning workloads, this results in significant cost savings.
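To see how instance price and throughput combine into cost per inference, here is a quick back-of-the-envelope calculation. The hourly prices and throughput figures below are made-up placeholders for illustration, not published AWS pricing or benchmark numbers:

```python
def cost_per_million_inferences(hourly_price_usd, inferences_per_second):
    """Cost to serve one million inferences on a single instance
    running at a sustained throughput."""
    seconds_needed = 1_000_000 / inferences_per_second
    return hourly_price_usd * seconds_needed / 3600

# Hypothetical figures for illustration only.
g4_cost = cost_per_million_inferences(hourly_price_usd=0.526,
                                      inferences_per_second=1000)
inf1_cost = cost_per_million_inferences(hourly_price_usd=0.368,
                                        inferences_per_second=1300)

savings = 1 - inf1_cost / g4_cost
print(f"G4:   ${g4_cost:.4f} per million inferences")
print(f"Inf1: ${inf1_cost:.4f} per million inferences")
print(f"Savings: {savings:.0%}")
```

The point of the calculation is that a lower hourly price and a higher throughput compound: both the numerator and the denominator of cost-per-inference move in Inf1's favor.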

Up to 30% higher throughput

Inf1 instances deliver high throughput for batch inference applications, up to 30% higher throughput than Amazon EC2 G4 instances. Batch inference applications, such as photo tagging, are sensitive to inference throughput, or how many inferences can be processed per second. Inf1 instances are also optimized to provide high performance for small batches, which is critical for applications that have strict response time requirements. With 1 to 16 AWS Inferentia chips per instance, Inf1 instances can scale in performance to up to 2,000 tera operations per second (TOPS).
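The headline TOPS figure follows directly from the per-chip rating: each Inferentia chip delivers up to 128 TOPS, so a 16-chip instance aggregates to roughly 2,000 TOPS. The chip counts per instance size below are taken from AWS's published Inf1 specifications (reproduced here as an assumption; verify against the current instance documentation):

```python
TOPS_PER_CHIP = 128  # per-chip rating of AWS Inferentia

# Inferentia chips per Inf1 instance size (assumed from AWS docs).
INF1_CHIPS = {
    "inf1.xlarge": 1,
    "inf1.2xlarge": 1,
    "inf1.6xlarge": 4,
    "inf1.24xlarge": 16,
}

def peak_tops(instance_type):
    """Aggregate peak TOPS for a given Inf1 instance size."""
    return INF1_CHIPS[instance_type] * TOPS_PER_CHIP

for size in INF1_CHIPS:
    print(f"{size:>14}: {peak_tops(size)} TOPS")
```

The largest size works out to 16 × 128 = 2,048 TOPS, which is the "up to 2,000 TOPS" figure quoted above.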

Extremely low latency

Inf1 instances deliver low latency for real-time applications. Real-time inference applications, such as speech generation and search, need to make inferences in response to a user’s input quickly and are sensitive to inference latency. The large on-chip memory on AWS Inferentia chips used in Inf1 instances allows caching of machine learning models directly on the chip. This eliminates the need to access outside memory resources during inference, enabling low latency without impacting bandwidth.

Machine learning inference for a broad range of use cases

Developers can leverage the high performance, low latency, and low cost of inference on Inf1 instances for a broad range of machine learning applications across diverse business verticals, including image and video analysis, conversational agents, fraud detection, financial forecasting, healthcare automation, recommendation engines, text analytics, and transcription.

Ease of use and code portability

Since the Neuron SDK is integrated with common machine learning frameworks such as TensorFlow and PyTorch, developers can deploy their existing models to EC2 Inf1 instances with minimal code changes. This gives them the freedom to continue using the ML framework of their choice, to choose the compute platform that best meets their price-performance requirements, and to take advantage of the latest technologies without being tied to vendor-specific software libraries.

Support for different machine learning models and data types

Using AWS Neuron, Inf1 instances support many commonly used machine learning models such as single shot detector (SSD) and ResNet for image recognition/classification, as well as Transformer and BERT for natural language processing and translation. Multiple data types, including INT8, BF16, and FP16 with mixed precision, are also supported to cover a wide range of models and performance needs.


Powered By AWS Inferentia

AWS Inferentia is a machine learning chip custom built by AWS to deliver high performance inference at low cost. Each AWS Inferentia chip provides up to 128 TOPS (tera operations per second) of performance and supports the FP16, BF16, and INT8 data types. AWS Inferentia chips also feature a large amount of on-chip memory that can be used for caching large models, which is especially beneficial for models that require frequent memory access.

The AWS Neuron software development kit (SDK) consists of a compiler, runtime, and profiling tools. It enables complex neural network models, created and trained in popular frameworks such as TensorFlow, PyTorch, and MXNet, to be executed on Inf1 instances. AWS Neuron can also split large models for execution across multiple Inferentia chips using a high-speed physical chip-to-chip interconnect, delivering higher inference throughput and lower inference costs.

High performance networking and storage

Inf1 instances offer up to 100 Gbps of networking throughput for applications that require access to high speed networking. Next generation Elastic Network Adapter (ENA) and NVM Express (NVMe) technology provide Inf1 instances with high throughput, low latency interfaces for networking and Amazon Elastic Block Store (Amazon EBS).

Built on AWS Nitro System

The AWS Nitro System is a rich collection of building blocks that offloads many of the traditional virtualization functions to dedicated hardware and software to deliver high performance, high availability, and high security while also reducing virtualization overhead.


Customer Testimonials

Anthem is one of the nation's leading health benefits companies, serving the health care needs of 40+ million members across dozens of states. "The market of digital health platforms is growing at a remarkable rate. Gathering intelligence on this market is a challenging task due to the vast amounts of customer opinion data and its unstructured nature. Our application automates the generation of actionable insights from customer opinions via deep learning natural language models (Transformers). Our application is computationally intensive and needs to be deployed in a highly performant manner. We seamlessly deployed our deep learning inference workload onto Amazon EC2 Inf1 instances powered by the AWS Inferentia processor. The new Inf1 instances provide 2X higher throughput compared to GPU-based instances and allowed us to streamline our inference workloads."

Numan Laanait, PhD, Principal AI/Data Scientist & Miro Mihaylov, PhD, Principal AI/Data Scientist

Condé Nast
"Condé Nast's global portfolio encompasses over 20 leading media brands, including Wired, Vogue, and Vanity Fair. Within a few weeks, our team was able to integrate our recommendation engine with AWS Inferentia chips. This union enables multiple runtime optimizations for state-of-the-art natural language models on SageMaker's Inf1 instances. As a result, we observed a 72% reduction in cost compared to the previously deployed GPU instances."

Paul Fryzel, Principal Engineer, AI Infrastructure

CS Disco
“CS Disco is reinventing legal technology as a leading provider of AI solutions for e-discovery developed by lawyers for lawyers. Disco AI accelerates the thankless task of combing through terabytes of data, speeding up review times and improving review accuracy by leveraging complex natural language processing models, which are computationally expensive and cost-prohibitive. Disco has found that AWS Inferentia-based Inf1 instances reduce the cost of inference in Disco AI by at least 35% as compared with today's GPU instances. Based on this positive experience with Inf1 instances, CS Disco will explore opportunities to migrate additional workloads to Inferentia.”

Alan Lockett, Sr. Director of Research at CS Disco

Digital Media Professionals (DMP)
Digital Media Professionals (DMP) visualizes the future with its real-time ZIA platform based on AI (artificial intelligence). DMP’s efficient computer vision classification technologies are used to build insight on large amounts of real-time image data for applications such as condition observation, crime prevention, and accident prevention. “We are actively evaluating Inf1 instances over alternative options, as we believe Inferentia will give us the performance and cost structure we need to deploy our AI applications at scale.”

Hiroyuki Umeda - Director & General Manager, Sales & Marketing Group, Digital Media Professionals

Their platform empowers non-designers to create attractive graphics and helps professional designers automate rote tasks. "Since machine learning is core to our strategy, we were excited to try AWS Inferentia-based Inf1 instances. We found the Inf1 instances easy to integrate into our research and development pipeline. Most importantly, we observed impressive performance gains compared to the G4dn GPU-based instances. With our first model, the Inf1 instances yielded about 45% higher throughput and decreased cost per inference by almost 50%. We intend to work closely with the AWS team to port other models and shift most of our ML inference infrastructure to AWS Inferentia."

Clarence Hu, Founder,

"INGA’s mission is to create advanced text summarization solutions based on artificial intelligence and deep learning technologies which can be easily integrated into current business pipelines. We believe that text summarization will be critical in helping businesses derive meaningful insights from data. We quickly ramped up on AWS Inferentia-based Amazon EC2 Inf1 instances and integrated them into our development pipeline. The impact was immediate and significant. The Inf1 instances provide high performance, which enables us to improve the efficiency and effectiveness of our inference model pipelines. Out of the box, we have experienced 4X higher throughput and 30% lower overall pipeline costs compared to our previous GPU-based pipeline."

Yaroslav Shakula, Chief Business Development Officer, INGA Technologies

"SkyWatch processes hundreds of trillions of pixels of Earth observation data, captured from space every day. Adopting the new AWS Inferentia-based Inf1 instances using Amazon SageMaker for real-time cloud detection and image quality scoring was quick and easy. It was all a matter of switching the instance type in our deployment configuration. By switching instance types to Inferentia-based Inf1, we improved performance by 40% and decreased overall costs by 23%. This is a big win. It has enabled us to lower our overall operational costs while continuing to deliver high quality satellite imagery to our customers, with minimal engineering overhead. We are looking forward to transitioning all of our inference endpoints and batch ML processes to use Inf1 instances to further improve our data reliability and customer experience."

Adler Santos, Engineering Manager, SkyWatch

Amazon Services Using Amazon EC2 Inf1 instances

Amazon Alexa

Over 100 million Alexa devices have been sold globally, and customers have also left over 400,000 5-star reviews for Echo devices on Amazon. “Amazon Alexa’s AI and ML-based intelligence, powered by Amazon Web Services, is available on more than 100 million devices today – and our promise to customers is that Alexa is always becoming smarter, more conversational, more proactive, and even more delightful,” said Tom Taylor, Senior Vice President, Amazon Alexa. “Delivering on that promise requires continuous improvements in response times and machine learning infrastructure costs, which is why we are excited to use Amazon EC2 Inf1 to lower inference latency and cost-per-inference on Alexa text-to-speech. With Amazon EC2 Inf1, we’ll be able to make the service even better for the tens of millions of customers who use Alexa each month.”


* Prices shown are for US East (Northern Virginia) AWS Region. Prices for 1-year and 3-year reserved instances are for "Partial Upfront" payment options or "No Upfront" for instances without the Partial Upfront option.

Amazon EC2 Inf1 instances are available in the US East (N. Virginia) and US West (Oregon) AWS Regions as On-Demand, Reserved, or Spot Instances.

Getting Started

Using Amazon SageMaker

Amazon SageMaker makes it easy to compile and deploy your trained machine learning model in production on Amazon EC2 Inf1 instances so that you can start generating real-time predictions with low latency. AWS Neuron, the compiler for AWS Inferentia, is integrated with Amazon SageMaker Neo, enabling you to compile your trained machine learning models to run optimally on Inf1 instances. With Amazon SageMaker, you can easily run your models on auto-scaling clusters of Inf1 instances that are spread across multiple Availability Zones to deliver both high performance and highly available real-time inference. Learn how to deploy to Inf1 using Amazon SageMaker with examples on GitHub.
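As a concrete sketch of what "running your model on Inf1" looks like in the SageMaker API, the helper below builds the request body for boto3's `create_endpoint_config` call, which is how an endpoint gets pinned to an Inf1 instance type. The model and config names are hypothetical placeholders; `ml.inf1.xlarge` is the smallest Inf1 size offered by SageMaker:

```python
def inf1_endpoint_config(config_name, model_name,
                         instance_type="ml.inf1.xlarge",
                         instance_count=1):
    """Build the keyword arguments for boto3's
    sagemaker_client.create_endpoint_config(**kwargs), pointing a
    single production variant at an Inf1 instance type."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,  # a model already compiled with Neuron
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
            }
        ],
    }

# Hypothetical names for illustration.
cfg = inf1_endpoint_config("my-inf1-config", "my-neuron-compiled-model")

# To actually create the endpoint config (requires AWS credentials):
#   import boto3
#   boto3.client("sagemaker").create_endpoint_config(**cfg)
```

From there, `create_endpoint` with this config name stands up the auto-scaled, Inf1-backed endpoint described above.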

Using AWS Deep Learning AMI

The AWS Deep Learning AMIs (DLAMI) provide machine learning practitioners and researchers with the infrastructure and tools to accelerate deep learning in the cloud, at any scale. The AWS Neuron SDK comes pre-installed in AWS Deep Learning AMIs to compile and run your machine learning models optimally on Inf1 instances. To help guide you through the getting started process, visit the AMI selection guide and more deep learning resources. Refer to the AWS DLAMI Getting Started guide to learn how to use the DLAMI with Neuron.

Using Deep Learning Containers

Developers can now deploy Inf1 instances in Amazon Elastic Kubernetes Service (EKS), which is a fully managed Kubernetes service, as well as in Amazon Elastic Container Service (ECS), which is a fully managed container orchestration service from Amazon. Learn more about getting started with Inf1 on Amazon EKS in this blog. More details about running containers on Inf1 instances are available on the Neuron container tools tutorial page. Inf1 support for AWS DL Containers is coming soon.