Amazon EC2 Inf2 Instances

High performance at the lowest cost in Amazon EC2 for generative AI inference

Get started with Inf2 instances using AWS Neuron

Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances are purpose built for deep learning (DL) inference. They deliver high performance at the lowest cost in Amazon EC2 for generative artificial intelligence (AI) models, including large language models (LLMs) and vision transformers. You can use Inf2 instances to run your inference applications for text summarization, code generation, video and image generation, speech recognition, personalization, fraud detection, and more.

Inf2 instances are powered by AWS Inferentia2, the second-generation AWS Inferentia chip. Inf2 instances raise the performance of Inf1 by delivering 3x higher compute performance, 4x larger total accelerator memory, up to 4x higher throughput, and up to 10x lower latency. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between Inferentia chips. You can now efficiently and cost-effectively deploy models with hundreds of billions of parameters across multiple chips on Inf2 instances.

The AWS Neuron SDK helps developers deploy models on the AWS Inferentia chips (and train them on AWS Trainium chips). It integrates natively with frameworks, such as PyTorch and TensorFlow, so you can continue using your existing workflows and application code and run on Inf2 instances.

How it works

Using AWS DLAMI
The diagram shows the workflow for deploying Amazon EC2 Inf2 instances using AWS Deep Learning AMIs (DLAMI).

The first column includes two sections stacked vertically. The first section on top includes the following user applications grouped in a box: AWS Command Line Interface (CLI), AWS Tools and SDKs, and AWS Cloud Control API. The section below includes the AWS Management Console.

The first section in this first column has an arrow pointing to a rocket launching with the following text under it: "Launch DLAMI automatically using AWS CLI, SDK, or API." The second section in that first column has an arrow pointing to a rocket launching with the following text: "Launch DLAMI via the console."

Both rocket icons have a shared arrow pointing to a box representing Amazon EC2 Inf2 instances.

To the right of the Inf2 instance box, there is a box representing DLAMI. This DLAMI box is grouped using a box around the following text: "Local terminal," "EC2 remote terminal," and "Application script." These three items include an arrow pointing back to the DLAMI box. The DLAMI box then has an arrow pointing back to the Inf2 instances box.
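As a rough illustration of the "launch DLAMI automatically using AWS CLI, SDK, or API" path, the sketch below assembles the parameters for an EC2 RunInstances call and passes them to the AWS SDK for Python (boto3). The AMI ID, key pair name, and helper function names are placeholders introduced for this example, not values from this page; look up the current Neuron DLAMI ID for your Region before launching anything.

```python
def build_run_instances_params(ami_id, key_name, instance_type="inf2.xlarge"):
    """Assemble keyword arguments for the EC2 RunInstances API."""
    return {
        "ImageId": ami_id,              # DLAMI to boot from
        "InstanceType": instance_type,  # any Inf2 size from the product table
        "KeyName": key_name,            # key pair used for SSH access
        "MinCount": 1,
        "MaxCount": 1,
    }


def launch_inf2_instance(params, region_name):
    """Launch the instance. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.run_instances(**params)
    return response["Instances"][0]["InstanceId"]


params = build_run_instances_params(
    ami_id="ami-EXAMPLE",    # placeholder, not a real AMI ID
    key_name="my-key-pair",  # placeholder key pair name
)
print(params)
```

Calling `launch_inf2_instance(params, region_name)` would start a billable instance, so the example only prints the request parameters.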

Using Amazon EKS
The diagram shows the workflow for creating Kubernetes clusters, deploying Amazon EC2 Inf2 instances for your clusters, and running your inference applications on Kubernetes.

The first box represents Amazon Elastic Kubernetes Service (Amazon EKS) and includes the following text: "Create Kubernetes clusters (powered by Amazon EKS Distro)."

An arrow points from the first box to the second box for Amazon EC2 Inf2 instances. This box includes the following text: "Deploy Inf2 worker nodes for your EKS cluster."

An arrow points from this second box to the last item with the following text: "Run your inference applications on Kubernetes."
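The last step, running an inference application on Kubernetes, could look roughly like the pod manifest built below. This is a sketch under assumptions: it presumes the Neuron device plugin is installed on the cluster and exposes the `aws.amazon.com/neuron` resource (a name taken from the Neuron documentation, not from this page), and the pod name and image URI are placeholders.

```python
import json


def build_neuron_pod_manifest(name, image, neuron_devices=1):
    """Build a Kubernetes Pod manifest that requests Neuron devices."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Assumed resource name exposed by the Neuron device plugin
                    "resources": {
                        "limits": {"aws.amazon.com/neuron": neuron_devices}
                    },
                }
            ]
        },
    }


manifest = build_neuron_pod_manifest(
    name="inf2-inference",      # placeholder pod name
    image="EXAMPLE_IMAGE_URI",  # placeholder container image
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts manifests in JSON as well as YAML.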

Using Amazon ECS
The diagram shows the workflow for deploying Amazon EC2 Inf2 instances using AWS Deep Learning Containers with Amazon Elastic Container Service (Amazon ECS).

The first box represents Amazon Elastic Container Registry (Amazon ECR). It includes the following text: "Build images and store using ECR or any other repository."

An arrow points from this box to a box for Amazon ECS.

An arrow points from this box to an item that includes the following text: "Select the Deep Learning Container image for your application."

An arrow points from this information to a box for Amazon EC2 Inf2 instances. This box includes the following text: "Deploy inference workload on Inf2."

An arrow points from this box to an item that includes the following text: "Manage containers using Amazon ECS."

Using Amazon SageMaker
The diagram shows the workflow for using model artifacts stored in an Amazon Simple Storage Service (Amazon S3) bucket and an Amazon ECR container image with Amazon SageMaker to deploy inference on Inf2 instances.

The first group includes two boxes stacked vertically. The first box on top is for Amazon S3 and includes the following text: "Model artifacts stored in S3 bucket." The second box below it is for Amazon Elastic Container Registry (Amazon ECR) and includes the following text: "Container image."

This first group has an arrow pointing to Amazon SageMaker. This item is grouped with a box that includes the following workflow information:

First is the following text: "Create a SageMaker model." An arrow points from this item to a box for Amazon EC2 Inf2 instances with the following text: "Choose Inf2 as your SageMaker inference option (ml.inf2)." The next arrow points from this box to the following text: "Configure, create, and invoke a SageMaker endpoint to get inference."
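The three SageMaker steps described above (create a model, choose an ml.inf2 instance type, create an endpoint) map onto the `create_model`, `create_endpoint_config`, and `create_endpoint` calls of the boto3 SageMaker client. The sketch below builds the request payloads; every name, URI, and ARN in it is a placeholder, and `ml.inf2.xlarge` is assumed here as one example of the ml.inf2 family.

```python
def build_sagemaker_requests(name, image_uri, model_data_url, role_arn,
                             instance_type="ml.inf2.xlarge"):
    """Build request payloads for the three SageMaker deployment calls."""
    model = {
        "ModelName": name,
        "PrimaryContainer": {
            "Image": image_uri,              # container image stored in ECR
            "ModelDataUrl": model_data_url,  # model artifacts stored in S3
        },
        "ExecutionRoleArn": role_arn,
    }
    endpoint_config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": name,
                "InstanceType": instance_type,  # Inf2 as the inference option
                "InitialInstanceCount": 1,
            }
        ],
    }
    endpoint = {
        "EndpointName": f"{name}-endpoint",
        "EndpointConfigName": f"{name}-config",
    }
    return model, endpoint_config, endpoint


def deploy(model, endpoint_config, endpoint):
    """Create the resources. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    sagemaker = boto3.client("sagemaker")
    sagemaker.create_model(**model)
    sagemaker.create_endpoint_config(**endpoint_config)
    sagemaker.create_endpoint(**endpoint)


model, endpoint_config, endpoint = build_sagemaker_requests(
    name="example-model",                          # placeholder
    image_uri="EXAMPLE_ECR_IMAGE_URI",             # placeholder
    model_data_url="s3://EXAMPLE_BUCKET/model.tar.gz",  # placeholder
    role_arn="EXAMPLE_ROLE_ARN",                   # placeholder
)
print(endpoint_config["ProductionVariants"][0]["InstanceType"])
```

Once the endpoint is in service, inference requests go through the separate `sagemaker-runtime` client and its `invoke_endpoint` call.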


Benefits

Deploy 100B+ parameter, generative AI models at scale

Inf2 instances are the first inference-optimized instances in Amazon EC2 to support distributed inference at scale. You can now efficiently deploy models with hundreds of billions of parameters across multiple Inferentia chips on Inf2 instances, using the ultra-high-speed connectivity between the chips.

Increase performance while significantly lowering inference costs

Inf2 instances are designed to deliver high performance at the lowest cost in Amazon EC2 for your DL deployments. They offer up to 4x higher throughput and up to 10x lower latency than Amazon EC2 Inf1 instances. Inf2 instances deliver up to 40% better price performance than other comparable Amazon EC2 instances.

Use your existing ML frameworks and libraries

Use the AWS Neuron SDK to extract the full performance of Inf2 instances. With Neuron, you can use your existing frameworks like PyTorch and TensorFlow and get optimized out-of-the-box performance for models in popular repositories like Hugging Face. Neuron supports runtime integrations with serving tools like TorchServe and TensorFlow Serving. It also helps optimize performance with built-in profile and debugging tools like Neuron-Top and integrates into popular visualization tools like TensorBoard.

Meet your sustainability goals with an energy-efficient solution

Inf2 instances deliver up to 50% better performance/watt over other comparable Amazon EC2 instances. These instances and the underlying Inferentia2 chips use advanced silicon processes and hardware and software optimizations to deliver high energy efficiency when running DL models at scale. Use Inf2 instances to help meet your sustainability goals when deploying ultra-large models.

Features

Up to 2.3 petaflops with AWS Inferentia2

Inf2 instances are powered by up to 12 AWS Inferentia2 chips connected with ultra-high-speed NeuronLink for streamlined collective communications. They offer up to 2.3 petaflops of compute and up to 4x higher throughput and 10x lower latency than Inf1 instances.

Up to 384 GB high-bandwidth accelerator memory

To accommodate large DL models, Inf2 instances offer up to 384 GB of shared accelerator memory (32 GB HBM in every Inferentia2 chip, 4x larger than first-generation Inferentia) with 9.8 TB/s of total memory bandwidth (10x faster than first-generation Inferentia).
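To make the memory figures concrete, the following back-of-the-envelope sketch estimates how many Inferentia2 chips are needed just to hold a model's weights. It assumes 1 GB = 10^9 bytes and 2 bytes per parameter (BF16 or FP16), and it ignores activations, caches, and runtime overhead, so real deployments will need more headroom. The function names are illustrative.

```python
import math

HBM_PER_CHIP_GB = 32  # from the text: 32 GB HBM in every Inferentia2 chip
MAX_CHIPS = 12        # largest Inf2 size (inf2.48xlarge)


def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed for the weights alone, in GB (10^9 bytes assumed)."""
    return num_params * bytes_per_param / 1e9


def min_chips_for_weights(num_params, bytes_per_param):
    """Smallest chip count whose combined HBM can hold the weights."""
    needed_gb = weight_memory_gb(num_params, bytes_per_param)
    return math.ceil(needed_gb / HBM_PER_CHIP_GB)


for label, num_params in [("7B", 7_000_000_000), ("100B", 100_000_000_000)]:
    gb = weight_memory_gb(num_params, 2)        # 2 bytes per parameter
    chips = min_chips_for_weights(num_params, 2)
    fits = chips <= MAX_CHIPS
    print(f"{label}: {gb:.0f} GB of weights, at least {chips} chip(s), "
          f"fits on one instance: {fits}")
```

Under these assumptions a 100B-parameter model needs about 200 GB for weights, which exceeds a single chip's 32 GB but fits within the 384 GB of the largest instance.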

NeuronLink interconnect

For fast communication between Inferentia2 chips, Inf2 instances support 192 GB/s of NeuronLink, a high-speed, nonblocking interconnect. Inf2 is the only inference-optimized instance to offer this interconnect, a feature otherwise available only in more expensive training instances. For ultra-large models that do not fit into a single chip, data flows directly between chips with NeuronLink, bypassing the CPU completely. With NeuronLink, Inf2 supports faster distributed inference and improves throughput and latency.
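One generic way to spread a model that does not fit on a single chip is to split each weight matrix by columns, let every chip compute its slice of the output, and then gather the slices. The sketch below shows that idea in plain Python; it is a conceptual illustration only, not a description of how Neuron partitions models, and the gather step stands in for the chip-to-chip traffic an interconnect would carry.

```python
def matvec(x, w):
    """Multiply row vector x (length k) by matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Split matrix w into column blocks, one per (hypothetical) chip."""
    n = len(w[0])
    step = -(-n // num_shards)  # ceiling division
    return [[row[start:start + step] for row in w]
            for start in range(0, n, step)]


def sharded_matvec(x, w, num_shards):
    """Compute x @ w shard by shard, then concatenate the partial outputs."""
    outputs = []
    for shard in split_columns(w, num_shards):
        outputs.extend(matvec(x, shard))  # each shard yields an output slice
    return outputs


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matvec(x, w)
sharded = sharded_matvec(x, w, num_shards=2)
print(full, sharded)
```

Both computations give the same result; the sharded version only needs each shard to hold half of the columns of `w`.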

Optimized for novel data types with automatic casting

Inferentia2 supports FP32, TF32, BF16, FP16, UINT8, and the new configurable FP8 (cFP8) data type. AWS Neuron can take high-precision FP32 and FP16 models and autocast them to lower-precision data types while optimizing accuracy and performance. Autocasting reduces time to market by removing the need for lower-precision retraining and enabling higher-performance inference with smaller data types.
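To give a feel for what casting to a lower-precision type means, the sketch below rounds FP32 values to BF16, one of the data types listed above, using only the Python standard library. BF16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of an FP32 value. This is an illustration of the number format, not of how Neuron's autocasting is implemented, and it does not handle NaN or values outside the FP32 range.

```python
import struct


def float32_to_bf16_bits(value):
    """Round an FP32 value to BF16 (round-to-nearest-even); return 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    # Adding 0x7FFF plus the lowest kept bit implements round-to-nearest-even
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_float(bits16):
    """Expand 16 BF16 bits back to a Python float by zero-filling the rest."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


def cast_to_bf16(value):
    """Return the BF16-representable value nearest to the given number."""
    return bf16_bits_to_float(float32_to_bf16_bits(value))


for v in [1.0, 3.14159, 0.1]:
    print(v, "->", cast_to_bf16(v))
```

The range of representable magnitudes stays the same as FP32 because the exponent width is unchanged; only the mantissa precision drops.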

State-of-the-art DL optimizations

To support the fast pace of DL innovation, Inf2 instances have several innovations that make them flexible and extendable to deploy constantly evolving DL models. Inf2 instances have hardware optimizations and software support for dynamic input shapes. To allow support for new operators in the future, they support custom operators written in C++. They also support stochastic rounding, a method for rounding probabilistically to achieve high performance and higher accuracy compared to legacy rounding modes.
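Stochastic rounding, mentioned above, rounds a value up or down at random with probabilities proportional to its distance from each neighbor, so the result is correct on average. The software sketch below illustrates the idea on a coarse grid; the function name and the grid step are choices made for this example, and the hardware implementation on Inferentia2 is not described on this page.

```python
import math
import random


def stochastic_round(value, step=1.0, rng=random):
    """Round value to a multiple of step, rounding up with probability
    equal to the fractional distance above the lower neighbor."""
    scaled = value / step
    lower = math.floor(scaled)
    fraction = scaled - lower
    rounded = lower + (1 if rng.random() < fraction else 0)
    return rounded * step


rng = random.Random(0)  # fixed seed so runs are repeatable
samples = [stochastic_round(0.3, rng=rng) for _ in range(10_000)]
mean = sum(samples) / len(samples)

print("round-to-nearest result:", float(round(0.3)))
print("mean of stochastic results:", mean)
```

Round-to-nearest always maps 0.3 to 0, losing the value entirely, while the stochastic results average out close to 0.3. That property is why the technique helps when many small values are accumulated in low precision.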

Product details

Instance Size	Inferentia2 Chips	Accelerator Memory (GB)	vCPU	Memory (GiB)	Local Storage	Inter-Chip Interconnect	Network Bandwidth (Gbps)	EBS Bandwidth (Gbps)	On-Demand Price	1-Year Reserved Instance	3-Year Reserved Instance
inf2.xlarge	1	32	4	16	EBS Only	N/A	Up to 15	Up to 10	$0.76	$0.45	$0.30
inf2.8xlarge	1	32	32	128	EBS Only	N/A	Up to 25	10	$1.97	$1.18	$0.79
inf2.24xlarge	6	192	96	384	EBS Only	Yes	50	30	$6.49	$3.89	$2.60
inf2.48xlarge	12	384	192	768	EBS Only	Yes	100	60	$12.98	$7.79	$5.19
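A quick way to compare the sizes in the table is to normalize the On-Demand price by chip count and by accelerator memory. The sketch below does that with the values from the table; the table does not state the billing unit, so the output is expressed in the table's own price unit, and the helper name is illustrative.

```python
# (instance size, Inferentia2 chips, accelerator memory in GB, On-Demand price in USD)
INF2_SIZES = [
    ("inf2.xlarge", 1, 32, 0.76),
    ("inf2.8xlarge", 1, 32, 1.97),
    ("inf2.24xlarge", 6, 192, 6.49),
    ("inf2.48xlarge", 12, 384, 12.98),
]


def price_per_chip(chips, on_demand_price):
    """On-Demand price divided by the number of Inferentia2 chips."""
    return on_demand_price / chips


for name, chips, memory_gb, price in INF2_SIZES:
    per_chip = price_per_chip(chips, price)
    gb_per_dollar = memory_gb / price
    print(f"{name}: ${per_chip:.3f} per chip, "
          f"{gb_per_dollar:.1f} GB accelerator memory per dollar")
```

The two multi-chip sizes come out at the same per-chip price, while the single-chip sizes differ mainly in how much vCPU and host memory accompany the chip.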

Customer testimonials

"Our team at Leonardo leverages generative AI to enable creative professionals and enthusiasts to produce visual assets with unmatched quality, speed, and style consistency. Utilizing AWS Inf2, we are able to reduce our costs by 80% without sacrificing performance, fundamentally changing the value proposition we can offer customers and enabling our most advanced features at a more accessible price point. It also alleviates concerns around cost and capacity availability for our ancillary AI services, which are increasingly important as we grow and scale. It is a key enabling technology for us as we continue to push the envelope on what’s possible with generative AI, enabling a new era of creativity and expressive power for our users."

Pete Werner, Head of AI, Leonardo.ai

"At Runway, our suite of AI Magic Tools enables our users to generate and edit content like never before. We are constantly pushing the boundaries of what is possible with AI-powered content creation, and as our AI models become more complex, the underlying infrastructure costs to run these models at scale can become expensive. Through our collaboration with Amazon EC2 Inf2 instances powered by AWS Inferentia, we’re able to run some of our models with up to 2x higher throughput than comparable GPU-based instances. This high-performance, low-cost inference enables us to introduce more features, deploy more complex models, and ultimately deliver a better experience for the millions of creators using Runway."

Cristóbal Valenzuela, Cofounder and CEO, Runway

Qualtrics designs and develops experience management software.

"At Qualtrics, our focus is building technology that closes experience gaps for customers, employees, brands, and products. To achieve that, we are developing complex multi-task, multi-modal DL models to launch new features, such as text classification, sequence tagging, discourse analysis, key-phrase extraction, topic extraction, clustering, and end-to-end conversation understanding. As we utilize these more complex models in more applications, the volume of unstructured data grows, and we need more performant inference-optimized solutions that can meet these demands, such as Inf2 instances, to deliver the best experiences to our customers. We are excited about the new Inf2 instances because it will not only allow us to achieve higher throughputs, while dramatically cutting latency, but also introduces features like distributed inference and enhanced dynamic input shape support, which will help us scale to meet the deployment needs as we push toward larger, more complex large models."

Aaron Colak, Head of Core Machine Learning, Qualtrics

Finch Computing is a natural language technology company providing artificial intelligence applications for government, financial services, and data integrator clients.

"To meet our customers’ needs for real-time natural language processing, we develop state-of-the-art DL models that scale to large production workloads. We have to provide low-latency transactions and achieve high throughputs to process global data feeds. We already migrated many production workloads to Inf1 instances and achieved an 80% reduction in cost over GPUs. Now, we are developing larger, more complex models that enable deeper, more insightful meaning from written text. A lot of our customers need access to these insights in real time, and the performance on Inf2 instances will help us deliver lower latency and higher throughput over Inf1 instances. With the Inf2 performance improvements and new Inf2 features, such as support for dynamic input sizes, we are improving our cost-efficiency, elevating the real-time customer experience, and helping our customers glean new insights from their data."

Franz Weckesser, Chief Architect, Finch Computing

Money Forward Inc. serves businesses and individuals with an open and fair financial platform. As part of this platform, HiTTO Inc., a Money Forward group company, offers an AI chatbot service, which uses tailored natural language processing (NLP) models to address the diverse needs of their corporate customers.

"We launched a large-scale AI chatbot service on the Amazon EC2 Inf1 instances and reduced our inference latency by 97% over comparable GPU-based instances while also reducing costs. We were very pleased to see further performance improvements in our initial test results on Amazon EC2 Inf2 instances. Using the same custom NLP model, AWS Inf2 was able to further reduce the latency by 10x over Inf1. As we move to larger multibillion parameter models, Inf2 gives us the confidence that we can continue to provide our customers with a superior end-to-end user experience."

Takuya Nakade, CTO, Money Forward Inc.

"At Fileread.ai, we are building solutions to make interacting with your docs as easy as asking them questions, enabling users to find what they are looking for across all their docs and get the right information faster. Since switching to the new Inf2 EC2 instance, we've seen a significant improvement in our NLP inference capabilities. The cost savings alone have been a game-changer for us, allowing us to allocate resources more efficiently without sacrificing quality. We reduced our inferencing latency by 33% while increasing throughput by 50%, delighting our customers with faster turnarounds. Our team has been blown away by the speed and performance of Inf2 compared to the older G5 instances, and it's clear that this is the future of deploying NLP models."

Daniel Hu, CEO, Fileread

"At Yaraku, our mission is to build the infrastructure that helps people communicate across language barriers. Our flagship product, YarakuZen, enables anyone, from professional translators to monolingual individuals, to confidently translate and post-edit texts and documents. To support this process, we offer a wide range of sophisticated tools based on DL models, covering tasks such as translation, bitext word alignment, sentence segmentation, language modeling, and many others. By using Inf1 instances, we have been able to speed up our services to meet the increasing demand while reducing the inference cost by more than 50% compared to GPU-based instances. We are now moving into the development of next-generation larger models that will require the enhanced capabilities of Inf2 instances to meet demand while maintaining low latency. With Inf2, we will be able to scale up our models by 10x while maintaining similar throughput, allowing us to deliver even higher levels of quality to our customers."

Giovanni Giacomo, NLP Lead, Yaraku

AWS Partner testimonials

"Hugging Face’s mission is to democratize good ML to help ML developers around the world solve real-world problems. And key to that is ensuring the latest and greatest models run as fast and efficiently as possible on the best ML chips in the cloud. We are incredibly excited about the potential for Inferentia2 to become the new standard way to deploy generative AI models at scale. With Inf1, we saw up to 70% lower cost than traditional GPU-based instances, and with Inf2 we have seen up to 8x lower latency for BERT-like Transformers compared to Inferentia1. With Inferentia2, our community will be able to easily scale this performance to LLMs at the 100B+ parameters scale, and to the latest diffusion and computer vision models as well."

"PyTorch accelerates the path from research prototyping to production deployments for ML developers. We have collaborated with the AWS team to provide native PyTorch support for the new AWS Inferentia2 powered Amazon EC2 Inf2 instances. As more members of our community look to deploy large generative AI models, we are excited to partner with the AWS team to optimize distributed inference on Inf2 instances with high-speed NeuronLink connectivity between chips. With Inf2, developers using PyTorch can now easily deploy ultra-large LLMs and vision transformer models. Additionally, Inf2 instances bring other innovative capabilities to PyTorch developers, including efficient data types, dynamic shapes, custom operators, and hardware-optimized stochastic rounding, making them well-suited for wide adoption by the PyTorch community."

"Weights & Biases (W&B) provides developer tools for ML engineers and data scientists to build better models faster. The W&B platform provides ML practitioners a wide variety of insights to improve the performance of models, including the utilization of the underlying compute infrastructure. We have collaborated with the AWS team to add support for AWS Trainium and Inferentia2 to our system metrics dashboard, providing valuable data much needed during model experimentation and training. This enables ML practitioners to optimize their models to take full advantage of AWS’s purpose-built hardware to train their models faster and at lower cost."

Phil Gurbacki, VP of Product, Weights & Biases

"OctoML helps developers reduce costs and build scalable AI applications by packaging their DL models to run on high-performance hardware. We have spent the last several years building expertise on the best software and hardware solutions and integrating them into our platform. Our roots as chip designers and system hackers make AWS Trainium and Inferentia even more exciting for us. We see these chips as a key driving factor for the future of AI innovation on the cloud. The GA launch of Inf2 instances is especially timely, as we are seeing the emergence of popular LLM as a key building block of next-generation AI applications. We are excited to make these instances available in our platform to help developers easily take advantage of their high performance and cost-saving benefits."

Jared Roesch, CTO and Cofounder, OctoML

"The historic challenge with LLMs, and more broadly with enterprise-level generative AI applications, are the costs associated with training and running high-performance DL models. Along with AWS Trainium, AWS Inferentia2 removes the financial compromises our customers make when they require high-performance training. Now, our customers looking for advantages in training and inference can achieve better results for less money. Trainium and Inferentia accelerate scale to meet even the most demanding DL requirements for today’s largest enterprises. Many Nextira customers running large AI workloads will benefit directly with these new chipsets, increasing efficiencies in cost savings and performance and leading to faster results in their market."

Jason Cutrer, founder and CEO, Nextira

Amazon services using Amazon EC2 Inf2 instances

Amazon CodeWhisperer is an AI coding companion that generates real-time single-line or full-function code recommendations in your integrated development environment (IDE) to help you quickly build software.

"With CodeWhisperer, we're improving software developer productivity by providing code recommendations using generative AI models. To develop highly effective code recommendations, we scaled our DL network to billions of parameters. Our customers need code recommendations in real time as they type, so low-latency responses are critical. Large generative AI models require high-performance compute to deliver response times in a fraction of a second. With Inf2, we're delivering the same latency as running CodeWhisperer on training optimized GPU instances for large input and output sequences. Thus, Inf2 instances are helping us save cost and power while delivering the best possible experience for developers."

Doug Seven, General Manager, Amazon CodeWhisperer

Amazon's product search engine indexes billions of products, serves billions of customer queries daily, and is one of the most heavily used services in the world.

"I am super excited at the Inf2 GA launch. The superior performance of Inf2, coupled with its ability to handle larger models with billions of parameters, makes it the perfect choice for our services and enables us to unlock new possibilities in terms of model complexity and accuracy. With the significant speedup and cost-efficiency offered by Inf2, integrating them into Amazon Search serving infrastructure can help us meet the growing demands of our customers. We are planning to power our new shopping experiences using generative LLMs using Inf2."

Trishul Chilimbi, VP, Amazon Search

Getting started

Using Amazon SageMaker

Deploy models on Inf2 instances more easily using Amazon SageMaker and significantly reduce the costs to deploy ML models and increase performance without the need to manage infrastructure. SageMaker is a fully managed service and integrates with MLOps tools. Therefore, you can scale your model deployment, manage models more effectively in production, and reduce operational burden.

Using the AWS Deep Learning AMIs

The AWS Deep Learning AMIs (DLAMI) provide DL practitioners and researchers with the infrastructure and tools to accelerate DL in the cloud, at any scale. AWS Neuron drivers come preconfigured in the DLAMI to deploy your DL models optimally on Inf2 instances.

Using AWS Deep Learning Containers

You can now deploy Inf2 instances in Amazon Elastic Kubernetes Service (Amazon EKS), a fully managed Kubernetes service, and in Amazon Elastic Container Service (Amazon ECS), a fully managed container orchestration service. Neuron is also available preinstalled in AWS Deep Learning Containers. To learn more about running containers on Inf2 instances, see the Neuron containers tutorials.
