Amazon EC2 Inf2 Instances
High performance at the lowest cost in Amazon EC2 for the most demanding inference workloads
Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances are purpose built for deep learning (DL) inference. They are designed to deliver high performance at the lowest cost in Amazon EC2 for your most demanding DL applications. You can use Inf2 instances to run your inference applications for natural language understanding, language translation, video and image generation, speech recognition, personalization, fraud detection, and more.
Inf2 instances are powered by AWS Inferentia2, the second-generation AWS Inferentia accelerator. Compared to Inf1 instances, Inf2 instances deliver 3x higher compute performance, 4x higher accelerator memory, up to 4x higher throughput, and up to 10x lower latency. Inf2 instances are optimized to deploy increasingly complex models such as large language models (LLMs) and vision transformers at scale. Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference with ultra-high-speed connectivity between accelerators. You can now efficiently deploy a 175B-parameter model for inference across multiple accelerators on a single Inf2 instance. Inf2 instances also deliver better price performance than Inf1 for smaller models.
AWS Neuron is an SDK that helps developers train models on AWS Trainium and deploy models on the AWS Inferentia accelerators. It integrates natively with frameworks, such as PyTorch and TensorFlow, so you can continue to use your existing workflows and run on Inf2 instances with only a few lines of code.
Deploy 100B+ parameter models at scale
Inf2 instances are the first inference-optimized instances in Amazon EC2 to support scale-out distributed inference and provide ultra-high-speed connectivity between accelerators. You can now efficiently deploy a 175B-parameter model for inference across multiple accelerators on a single Inf2 instance.
Increase performance while significantly lowering inference costs
Inf2 instances are designed to deliver high performance at the lowest cost in Amazon EC2 for your DL deployments. They offer up to 4x higher throughput and up to 10x lower latency than Amazon EC2 Inf1 instances.
Enjoy native support for ML frameworks and libraries
AWS Neuron SDK makes it easy for you to extract the full performance of Inf2 instances with only a few lines of code. By using the Neuron SDK, you can run your applications on Inf2 instances and continue to use your existing workflows in PyTorch and TensorFlow.
Meet your sustainability goals with an energy-efficient solution
Inf2 instances offer up to 50% better performance/watt compared to GPU-based instances in Amazon EC2 because they and the underlying Inferentia2 accelerators are purpose built to run DL models at scale. Inf2 instances help you meet your sustainability goals when deploying ultra-large models.
Up to 2.3 petaflops with AWS Inferentia2
Inf2 instances are powered by up to 12 AWS Inferentia2 accelerators connected with ultra-high-speed NeuronLink for streamlined collective communications. They offer up to 2.3 petaflops of compute and up to 4x higher throughput and 10x lower latency than Inf1 instances.
Up to 384 GB high-bandwidth accelerator memory
To accommodate large DL models, Inf2 instances offer up to 384 GB of shared accelerator memory (32 GB HBM2e in every Inferentia2 accelerator) with 9.8 TB/s of total memory bandwidth.
NeuronLink intra-instance interconnect
For fast communication between accelerators, Inf2 instances support NeuronLink, an intra-instance ultra-high-speed, nonblocking interconnect.
Support for 6 data types with automatic casting
Inf2 instances have full-stack support for FP32, TF32, BF16, FP16, UINT8, and the new configurable FP8 (cFP8) data type. AWS Neuron takes high-precision FP32 models and autocasts them to lower-precision data types while optimizing accuracy and performance. Autocasting reduces time to market by removing the need for lower-precision retraining.
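To see what a cast like FP32-to-BF16 does to a value, here is a minimal host-side sketch in pure Python. It emulates the cast by truncating the low mantissa bits of the IEEE-754 single-precision encoding (the simplest form of the cast; real autocasting rounds rather than truncates, and this is an illustration of the format, not of the Neuron implementation):

```python
import struct

def fp32_to_bf16(x: float) -> float:
    """Emulate an FP32 -> BF16 cast by keeping only the top 16 bits of
    the IEEE-754 single-precision encoding. BF16 shares FP32's 8-bit
    exponent (so it keeps the same dynamic range) but has only 7
    mantissa bits (so it keeps less precision)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    truncated = bits & 0xFFFF0000  # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", truncated))[0]

# The cast preserves magnitude but loses fine precision:
# 3.14159265 becomes 3.140625 in BF16.
pi_bf16 = fp32_to_bf16(3.14159265)
```

Because BF16 keeps FP32's exponent width, very large and very small magnitudes survive the cast unchanged in scale, which is why many models tolerate it without retraining.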
State-of-the-art deep learning optimizations
Inf2 instances have hardware optimizations and software support for dynamic input sizes and custom operators written in C++. They also support stochastic rounding, a method for rounding probabilistically that enables high performance and higher accuracy compared to legacy rounding modes.
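The idea behind stochastic rounding can be sketched in a few lines of pure Python. This is a conceptual illustration on plain integers, not the Inferentia2 hardware implementation: the upper neighbor is chosen with probability equal to the fractional part, so the expected result equals the input and rounding errors cancel out on average instead of accumulating as a bias:

```python
import math
import random

def stochastic_round(x: float) -> int:
    """Round x to an adjacent integer, picking the upper neighbor with
    probability equal to the fractional part. Unlike truncation, the
    result is unbiased: E[stochastic_round(x)] == x."""
    lower = math.floor(x)
    frac = x - lower
    return lower + (1 if random.random() < frac else 0)

# Truncating 0.3 always yields 0, a systematic bias of -0.3.
# Stochastic rounding yields 0 about 70% of the time and 1 about 30%,
# so the average converges to 0.3.
random.seed(0)
avg = sum(stochastic_round(0.3) for _ in range(100_000)) / 100_000
```

This unbiasedness is why stochastic rounding preserves accuracy better than legacy rounding modes when many low-precision operations are accumulated.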
|Instance Size|Inferentia2 accelerators|Accelerator memory (GB)|vCPUs|Memory (GiB)|Local storage|Inter-accelerator interconnect|Network bandwidth (Gbps)|EBS bandwidth (Gbps)|
|---|---|---|---|---|---|---|---|---|
|inf2.xlarge|1|32|4|16|EBS only|NA|Up to 15|Up to 6.6|
|inf2.8xlarge|1|32|32|128|EBS only|NA|Up to 25|6.6|
Qualtrics designs and develops experience management software.
"At Qualtrics, our focus is building technology that closes experience gaps for customers, employees, brands, and products. To achieve that, we are developing complex multi-task, multi-modal deep learning models to launch new features, such as text classification, sequence tagging, discourse analysis, key-phrase extraction, topic extraction, clustering, and end-to-end conversation understanding. As we utilize these more complex models in more applications, the volume of unstructured data grows, and we need more performant inference-optimized solutions that can meet these demands, such as Inf2 instances, to deliver the best experiences to our customers. We are excited about the new Inf2 instances, because they will not only allow us to achieve higher throughput while dramatically cutting latency, but also introduce features like distributed inference and enhanced dynamic input shape support, which will help us scale to meet deployment needs as we push toward larger, more complex models."
Aaron Colak, Head of Core Machine Learning, Qualtrics
Finch Computing is a natural language technology company providing artificial intelligence applications for government, financial services, and data integrator clients.
"To meet our customers' needs for real-time natural language processing, we develop state-of-the-art deep learning models that scale to large production workloads. We have to provide low-latency transactions and achieve high throughput to process global data feeds. We already migrated many production workloads to Inf1 instances and achieved an 80% reduction in cost over GPUs. Now, we are developing larger, more complex models that enable deeper, more insightful meaning from written text. A lot of our customers need access to these insights in real time, and the performance of Inf2 instances will help us deliver lower latency and higher throughput than Inf1 instances. With the Inf2 performance improvements and new Inf2 features, such as support for dynamic input sizes, we are improving our cost efficiency, elevating the real-time customer experience, and helping our customers glean new insights from their data."
Franz Weckesser, Chief Architect, Finch Computing