Amazon EC2 Inf1 Instances
Businesses across a diverse set of industries are looking at AI-powered transformation to drive business innovation and to improve customer experience and processes. Machine learning models that power AI applications are becoming increasingly complex, resulting in rising underlying compute infrastructure costs. Up to 90% of the infrastructure spend for developing and running ML applications is often on inference. Customers are looking for cost-effective infrastructure solutions for deploying their ML applications in production.
Amazon EC2 Inf1 instances deliver high-performance ML inference at the lowest cost in the cloud. They deliver up to 2.3x higher throughput and up to 70% lower cost per inference than comparable current generation GPU-based Amazon EC2 instances. Inf1 instances are built from the ground up to support machine learning inference applications. They feature up to 16 AWS Inferentia chips, high-performance machine learning inference chips designed and built by AWS. Additionally, Inf1 instances include 2nd generation Intel® Xeon® Scalable processors and up to 100 Gbps networking to deliver high throughput inference.
Customers can use Inf1 instances to run large scale machine learning inference applications such as search, recommendation engines, computer vision, speech recognition, natural language processing, personalization, and fraud detection, at the lowest cost in the cloud.
Developers can deploy their machine learning models to Inf1 instances by using the AWS Neuron SDK, which is integrated with popular machine learning frameworks such as TensorFlow, PyTorch, and MXNet. They can continue using the same ML workflows and seamlessly migrate applications onto Inf1 instances with minimal code changes and with no tie-in to vendor-specific solutions.
Get started easily with Inf1 instances using Amazon SageMaker, AWS Deep Learning AMIs that come pre-configured with Neuron SDK, or using Amazon ECS or Amazon EKS for containerized ML applications.
Up to 70% lower cost per inference
Using Inf1, developers can significantly reduce the cost of their machine learning production deployments with the lowest cost per inference in the cloud. The combination of low instance cost and high throughput of Inf1 instances delivers up to 70% lower cost-per-inference than comparable current generation GPU-based EC2 instances.
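The way instance price and throughput combine into cost per inference can be sketched with simple arithmetic. The example below is illustrative only: the hourly prices and throughput figures are hypothetical placeholders, not AWS pricing or benchmark results. They are chosen so that a lower hourly price combined with 2.3x higher throughput works out to roughly 70% lower cost per inference.

```python
def cost_per_million_inferences(hourly_price_usd, throughput_per_sec):
    """Cost in USD to serve one million inferences at full utilization."""
    if throughput_per_sec <= 0:
        raise ValueError("throughput_per_sec must be positive")
    inferences_per_hour = throughput_per_sec * 3600
    return hourly_price_usd / inferences_per_hour * 1_000_000


def cost_reduction(baseline_cost, new_cost):
    """Fractional saving of new_cost relative to baseline_cost."""
    return 1.0 - new_cost / baseline_cost


# Hypothetical figures for illustration only (not real prices or benchmarks).
GPU_PRICE_USD, GPU_THROUGHPUT = 1.00, 1000.0    # $/hour, inferences/second
INF1_PRICE_USD, INF1_THROUGHPUT = 0.69, 2300.0  # 2.3x the baseline throughput

gpu_cost = cost_per_million_inferences(GPU_PRICE_USD, GPU_THROUGHPUT)
inf1_cost = cost_per_million_inferences(INF1_PRICE_USD, INF1_THROUGHPUT)
saving = cost_reduction(gpu_cost, inf1_cost)

print(f"GPU baseline: ${gpu_cost:.4f} per million inferences")
print(f"Inf1:         ${inf1_cost:.4f} per million inferences")
print(f"Saving:       {saving:.0%}")
```

Actual savings depend on the model, the batch size, the utilization achieved, and the pricing option (On-Demand, Reserved, or Spot), so the same calculation should be repeated with measured throughput and current prices.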
Ease of use and code portability
The Neuron SDK is integrated with common machine learning frameworks such as TensorFlow, PyTorch, and MXNet. Developers can continue using the same ML workflows and seamlessly migrate their applications onto Inf1 instances with minimal code changes. This gives them the freedom to use the machine learning framework of their choice and the compute platform that best meets their requirements, and to leverage the latest technologies without being tied to vendor-specific solutions.
Up to 2.3x higher throughput
Inf1 instances deliver up to 2.3x higher throughput than comparable current generation GPU-based Amazon EC2 instances. AWS Inferentia chips that power Inf1 instances are optimized for inference performance for small batch sizes, enabling real-time applications to maximize throughput and meet latency requirements.
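Batch size matters for real-time applications because larger batches raise throughput but also raise the latency of every request in the batch. The sketch below uses a deliberately simple, hypothetical latency model (a fixed overhead plus a per-item cost) to show how the largest batch that fits a latency budget can be chosen. The numbers are placeholders, not Inferentia measurements.

```python
def batch_latency_ms(batch_size, fixed_ms, per_item_ms):
    """Hypothetical linear model: fixed overhead plus a cost per item."""
    return fixed_ms + per_item_ms * batch_size


def throughput_per_sec(batch_size, latency_ms):
    """Inferences per second when one batch completes every latency_ms."""
    return batch_size * 1000.0 / latency_ms


def largest_batch_within_budget(budget_ms, fixed_ms, per_item_ms, max_batch=64):
    """Largest batch size whose modeled latency stays within the budget.

    Returns 0 if even a batch of one exceeds the budget.
    """
    best = 0
    for batch_size in range(1, max_batch + 1):
        if batch_latency_ms(batch_size, fixed_ms, per_item_ms) <= budget_ms:
            best = batch_size
    return best


# Placeholder values: 2 ms fixed overhead, 1 ms per item, 10 ms latency budget.
batch = largest_batch_within_budget(budget_ms=10.0, fixed_ms=2.0, per_item_ms=1.0)
latency = batch_latency_ms(batch, fixed_ms=2.0, per_item_ms=1.0)
print(batch, latency, throughput_per_sec(batch, latency))
```

Hardware that stays efficient at small batch sizes reduces the fixed overhead in this model, so a tight latency budget costs less throughput.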
Extremely low latency
The AWS Inferentia chips are equipped with large on-chip memory that enables caching of machine learning models directly on the chip itself. You can deploy your models using capabilities such as the NeuronCore Pipeline that eliminate the need to access outside memory resources. With Inf1 instances, you can deploy real-time inference applications at near real-time latencies without impacting bandwidth.
Support for a wide range of machine learning models and data types
Inf1 instances support many commonly used machine learning model architectures, such as SSD, VGG, and ResNeXt for image recognition/classification, as well as Transformer and BERT for natural language processing. Additionally, support for the Hugging Face model repository in Neuron lets customers compile and run inference using pretrained or even fine-tuned models by changing just a single line of code. Multiple data types, including BF16 and FP16 with mixed precision, are also supported for a wide range of models and performance needs.
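To make the BF16 and FP16 data types concrete, the following sketch converts Python floats to each format using only the standard library. BF16 is the upper 16 bits of an IEEE 754 float32 (8 exponent bits, 7 mantissa bits), while FP16 uses 5 exponent bits and 10 mantissa bits. The helper names are hypothetical, and the BF16 conversion truncates rather than rounds for simplicity. This is not how the Neuron compiler performs casting.

```python
import struct


def float32_bits(value):
    """Return the IEEE 754 float32 bit pattern of value as an integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def to_bf16_bits(value):
    """BF16 bit pattern: the upper 16 bits of float32 (truncation, no rounding)."""
    return float32_bits(value) >> 16


def from_bf16_bits(bits):
    """Expand a BF16 bit pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def to_fp16(value):
    """Round-trip value through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


x = 3.14159
print(hex(to_bf16_bits(x)), from_bf16_bits(to_bf16_bits(x)))  # BF16 view of x
print(to_fp16(x))                                             # FP16 view of x

# BF16 keeps the float32 exponent range, so very large values stay finite.
print(from_bf16_bits(to_bf16_bits(1e30)))
```

BF16 trades mantissa precision for the full float32 exponent range, whereas FP16 offers more precision over a narrower range. This trade-off is one reason mixed precision is useful.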
Powered By AWS Inferentia
AWS Inferentia is a machine learning chip custom built by AWS to deliver high performance inference at low cost. Each AWS Inferentia chip provides up to 128 TOPS (trillions of operations per second) of performance, and support for FP16, BF16, and INT8 data types. AWS Inferentia chips also feature a large amount of on-chip memory which can be used for caching large models, which is especially beneficial for models that require frequent memory access.
Deploy with popular ML frameworks using AWS Neuron
The AWS Neuron software development kit (SDK) consists of a compiler, runtime driver, and profiling tools. It enables complex neural net models, created and trained in popular frameworks such as TensorFlow, PyTorch, and MXNet, to be executed on Inf1 instances. With Neuron’s NeuronCore Pipeline, you can split large models for execution across multiple Inferentia chips using a high-speed physical chip-to-chip interconnect, delivering high inference throughput and lower inference costs.
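The idea of splitting a model into pipeline stages can be illustrated with a small sketch. The code below is not the Neuron compiler's partitioning algorithm. It is a simple greedy illustration that assigns consecutive layers to stages so that each stage's weights fit within a per-stage memory capacity. The layer sizes and the capacity are hypothetical values.

```python
def partition_layers(layer_sizes_mb, stage_capacity_mb):
    """Greedily group consecutive layers into stages that fit in capacity.

    Returns a list of stages, each a list of layer indices in model order.
    """
    stages, current, used = [], [], 0.0
    for index, size in enumerate(layer_sizes_mb):
        if size > stage_capacity_mb:
            raise ValueError(f"layer {index} ({size} MB) exceeds stage capacity")
        if used + size > stage_capacity_mb:
            # Current stage is full: close it and start a new one.
            stages.append(current)
            current, used = [], 0.0
        current.append(index)
        used += size
    if current:
        stages.append(current)
    return stages


# Hypothetical per-layer weight sizes (MB) and per-stage capacity (MB).
layers = [4, 3, 5, 2, 6, 1]
stages = partition_layers(layers, stage_capacity_mb=8)
print(stages)  # [[0, 1], [2, 3], [4, 5]]
```

In a pipeline like this, each stage keeps its own layers resident and passes activations to the next stage, so a model too large for one stage's memory can still be served without reloading weights for every request.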
High performance networking and storage
Inf1 instances offer up to 100 Gbps of networking throughput for applications that require access to high speed networking. Next generation Elastic Network Adapter (ENA) and NVM Express (NVMe) technology provide Inf1 instances with high throughput, low latency interfaces for networking and Amazon Elastic Block Store (Amazon EBS).
Built on AWS Nitro System
The AWS Nitro System is a rich collection of building blocks that offloads many of the traditional virtualization functions to dedicated hardware and software to deliver high performance, high availability, and high security while also reducing virtualization overhead.
How it works
Founded in 2008, San Francisco-based Airbnb is a community marketplace with over 4 million Hosts who have welcomed more than 900 million guest arrivals in almost every country across the globe.
"Airbnb’s Community Support Platform enables intelligent, scalable, and exceptional service experiences to our community of millions of guests and hosts around the world. We are constantly looking for ways to improve the performance of our Natural Language Processing models that our support chatbot applications use. With Amazon EC2 Inf1 instances powered by AWS Inferentia, we see a 2x improvement in throughput out of the box, over GPU-based instances for our PyTorch based BERT models. We look forward to leveraging Inf1 instances for other models and use cases in the future."
Bo Zeng, Engineering Manager - Airbnb
"We incorporate machine learning (ML) into many aspects of Snapchat, and exploring innovation in this field is a key priority. Once we heard about Inferentia we started collaborating with AWS to adopt Inf1/Inferentia instances to help us with ML deployment, including around performance and cost. We started with our recommendation models, and look forward to adopting more models with the Inf1 instances in the future."
Nima Khajehnouri, VP Engineering - Snap Inc.
"Sprinklr's AI-driven unified customer experience management (Unified-CXM) platform enables companies to gather and translate real-time customer feedback across multiple channels into actionable insights – resulting in proactive issue resolution, enhanced product development, improved content marketing, better customer service, and more. Using Amazon EC2 Inf1, we were able to significantly improve the performance of one of our natural language processing (NLP) models and improve the performance of one of our computer vision models. We're looking forward to continuing to use Amazon EC2 Inf1 to better serve our global customers."
Vasant Srinivasan, Senior Vice President of Product Engineering - Sprinklr
"At Finch Computing, our state-of-the-art Natural Language Processing (NLP) product, Finch for Text, requires significant computing resources to provide our clients with low-latency enrichments for global data feeds. We are now using Amazon EC2 Inf1 instances in our PyTorch NLP, translation, and entity disambiguation models. With Inf1 instances, we were able to reduce our inference costs by 6x (over comparable GPU-based instances) with minimal optimizations while maintaining our inference speed and performance, something that’s critical for our financial services, data aggregator and public sector customers."
Scott Lightner, Chief Technology Officer - Finch Computing
"Autodesk is advancing the cognitive technology of our AI-powered virtual assistant, Autodesk Virtual Agent (AVA) by using Inferentia. AVA answers over 100,000 customer questions per month by applying natural language understanding (NLU) and deep learning techniques to extract the context, intent, and meaning behind inquiries. Piloting Inferentia, we are able to obtain a 4.9x higher throughput over G4dn for our NLU models, and look forward to running more workloads on the Inferentia-based Inf1 instances."
Binghui Ouyang, Sr Data Scientist - Autodesk
NTTPC Communications is a network service and communication solution provider in Japan and a telco leader in introducing innovative products in the information and communication technology market.
"NTTPC developed “AnyMotion”, a motion analysis API platform service based on advanced posture estimation machine-learning models. NTTPC deployed their AnyMotion platform on Amazon EC2 Inf1 instances using Amazon Elastic Container Service (ECS), a fully managed container orchestration service. By deploying their AnyMotion containers on Amazon EC2 Inf1, NTTPC saw 4.5x higher throughput, 25% lower inference latency, and 90% lower cost compared to current generation GPU-based EC2 instances. These superior results will help to improve the quality of AnyMotion service at scale."
Toshiki Yanagisawa, Software Engineer - NTT PC Communications Incorporated
Anthem is one of the nation's leading health benefits companies, serving the health care needs of 40+ million members across dozens of states.
"The market of digital health platforms is growing at a remarkable rate. Gathering intelligence on this market is a challenging task due to the vast amounts of customer opinions data and its unstructured nature. Our application automates the generation of actionable insights from customer opinions via deep learning natural language models (Transformers). Our application is computationally intensive and needs to be deployed in a highly performant manner. We seamlessly deployed our deep learning inferencing workload onto Amazon EC2 Inf1 instances powered by the AWS Inferentia processor. The new Inf1 instances provide 2X higher throughput than GPU-based instances and allowed us to streamline our inference workloads."
Numan Laanait, PhD, Principal AI/Data Scientist - Anthem
Miro Mihaylov, PhD, Principal AI/Data Scientist - Anthem
"Condé Nast's global portfolio encompasses over 20 leading media brands, including Wired, Vogue, and Vanity Fair. Within a few weeks, our team was able to integrate our recommendation engine with AWS Inferentia chips. This union enables multiple runtime optimizations for state-of-the-art natural language models on SageMaker's Inf1 instances. As a result, we observed a 72% reduction in cost compared to the previously deployed GPU instances."
Paul Fryzel, Principal Engineer, AI Infrastructure - Condé Nast
“Ciao is evolving conventional security cameras into high-performance analysis cameras equivalent to the capability of a human eye. Our application is advancing disaster prevention, monitoring environmental conditions using cloud-based AI camera solutions to alert before it becomes a disaster. Such alert enables reacting to the situation beforehand. Based on the object detection, we can also provide insight by estimating the number of incoming guests without staff from videos in brick and mortar stores. Ciao Camera commercially adopted AWS Inferentia-based Inf1 instances with 40% better price performance than G4dn with YOLOv4. We are looking forward to more of our services with Inf1 leveraging its significant cost efficiency.”
Shinji Matsumoto, Software Engineer - Ciao Inc.
“The Asahi Shimbun is one of the most popular daily newspapers in Japan. Media Lab, established as one of our company's departments, has the mission to research the latest technology, especially AI, and connect the cutting-edge technologies for new businesses. With the launch of AWS Inferentia based Amazon EC2 Inf1 instances in Tokyo, we tested our PyTorch based text summarization AI application on these instances. This application processes a large amount of text and generates headlines and summary sentences trained on articles from the last 30 years. Using Inferentia, we lowered costs by an order of magnitude over CPU-based instances. This dramatic reduction in costs will enable us to deploy our most complex models at scale, which we previously believed was not economically feasible.”
Hideaki Tamori, PhD, Senior Administrator, Media Lab - The Asahi Shimbun Company
“CS Disco is reinventing legal technology as a leading provider of AI solutions for e-discovery developed by lawyers for lawyers. Disco AI accelerates the thankless task of combing through terabytes of data, speeding up review times and improving review accuracy by leveraging complex Natural Language Processing models, which are computationally expensive and cost-prohibitive. Disco has found that AWS Inferentia-based Inf1 instances reduce the cost of inference in Disco AI by at least 35% as compared with today's GPU instances. Based on this positive experience with Inf1 instances CS Disco will explore opportunities for migration into Inferentia.”
Alan Lockett, Sr. Director of Research - CS Disco
“At Talroo, we provide our customers with a data-driven platform that enables them to attract unique job candidates, so they can make hires. We are constantly exploring new technologies to ensure we offer the best products and services to our customers. Using Inferentia we extract insights from a corpus of text data to enhance our AI-powered search-and-match technology. Talroo leverages Amazon EC2 Inf1 instances to create high throughput Natural Language Understanding models with SageMaker. Talroo’s initial testing shows that the Amazon EC2 Inf1 instances deliver 40% lower inference latency and 2X higher throughput compared to G4dn GPU-based instances. Based on these results, Talroo looks forward to using Amazon EC2 Inf1 instances as part of its AWS infrastructure.”
Janet Hu, Software Engineer - Talroo
"Digital Media Professionals (DMP) visualizes the future with a ZIA™ platform based on AI (Artificial Intelligence). DMP’s efficient computer vision classification technologies are used to build insight on large amounts of real-time image data, such as condition observation, crime prevention, and accident prevention. We recognized that our image segmentation models run four times faster on AWS Inferentia based Inf1 instances compared to GPU-based G4 instances. Due to this higher throughput and lower cost, Inferentia enables us to deploy our AI workloads such as applications for car dashcams at scale."
Hiroyuki Umeda, Director & General Manager, Sales & Marketing Group - Digital Media Professionals
Hotpot.ai empowers non-designers to create attractive graphics and helps professional designers to automate rote tasks.
"Since machine learning is core to our strategy, we were excited to try AWS Inferentia-based Inf1 instances. We found the Inf1 instances easy to integrate into our research and development pipeline. Most importantly, we observed impressive performance gains compared to the G4dn GPU-based instances. With our first model, the Inf1 instances yielded about 45% higher throughput and decreased cost per inference by almost 50%. We intend to work closely with the AWS team to port other models and shift most of our ML inference infrastructure to AWS Inferentia."
Clarence Hu, Founder - Hotpot.ai
"SkyWatch processes hundreds of trillions of pixels of Earth observation data, captured from space every day. Adopting the new AWS Inferentia-based Inf1 instances using Amazon SageMaker for real-time cloud detection and image quality scoring was quick and easy. It was all a matter of switching the instance type in our deployment configuration. By switching instance types to Inferentia-based Inf1, we improved performance by 40% and decreased overall costs by 23%. This is a big win. It has enabled us to lower our overall operational costs while continuing to deliver high quality satellite imagery to our customers, with minimal engineering overhead. We are looking forward to transitioning all of our inference endpoints and batch ML processes to use Inf1 instances to further improve our data reliability and customer experience."
Adler Santos, Engineering Manager - SkyWatch
Amazon Services Using Amazon EC2 Inf1 instances
Amazon Advertising helps businesses of all sizes connect with customers at every stage of their shopping journey. Millions of ads, including text and images, are moderated, classified, and served for the optimal customer experience every single day.
“For our text ad processing, we deploy PyTorch based BERT models globally on AWS Inferentia based Inf1 instances. By moving to Inferentia from GPUs, we were able to lower our cost by 69% with comparable performance. Compiling and testing our models for AWS Inferentia took less than three weeks. Using Amazon SageMaker to deploy our models to Inf1 instances ensured our deployment was scalable and easy to manage. When I first analyzed the compiled models, the performance with AWS Inferentia was so impressive that I actually had to re-run the benchmarks to make sure they were correct! Going forward we plan to migrate our image ad processing models to Inferentia. We have already benchmarked 30% lower latency and 71% cost savings over comparable GPU-based instances for these models.”
Yashal Kanungo, Applied Scientist, Amazon Advertising
“Amazon Alexa’s AI and ML-based intelligence, powered by Amazon Web Services, is available on more than 100 million devices today – and our promise to customers is that Alexa is always becoming smarter, more conversational, more proactive, and even more delightful. Delivering on that promise requires continuous improvements in response times and machine learning infrastructure costs, which is why we are excited to use Amazon EC2 Inf1 to lower inference latency and cost-per-inference on Alexa text-to-speech. With Amazon EC2 Inf1, we’ll be able to make the service even better for the tens of millions of customers who use Alexa each month.”
Tom Taylor, Senior Vice President, Amazon Alexa
"We are constantly innovating to further improve our customer experience and to drive down our infrastructure costs. Moving our web-based question answering (WBQA) workloads from GPU-based P3 instances to AWS Inferentia-based Inf1 instances not only helped us reduce inference costs by 60%, but also improved the end-to-end latency by more than 40%, helping enhance customer Q&A experience with Alexa. Using Amazon SageMaker for our TensorFlow-based model made the process of switching to Inf1 instances straightforward and easy to manage. We are now using Inf1 instances globally to run these WBQA workloads and are optimizing their performance for AWS Inferentia to further reduce cost and latency."
Eric Lind, Software Development Engineer, Alexa AI
“Amazon Rekognition is a simple and easy image and video analysis application that helps customers identify objects, people, text, and activities. Amazon Rekognition needs high-performance deep learning infrastructure that can analyze billions of images and videos daily for our customers. With AWS Inferentia-based Inf1 instances, running Rekognition models such as object classification resulted in 8X lower latency and 2X the throughput compared to running these models on GPUs. Based on these results we are moving Rekognition to Inf1, enabling our customers to get accurate results, faster.”
* Prices shown are for US East (Northern Virginia) AWS Region. Prices for 1-year and 3-year reserved instances are for "Partial Upfront" payment options or "No Upfront" for instances without the Partial Upfront option.
Amazon EC2 Inf1 instances are available in the US East (N. Virginia) and US West (Oregon) AWS Regions as On-Demand, Reserved, or Spot Instances.
Using Amazon SageMaker
Amazon SageMaker makes it easy to compile and deploy your trained machine learning model in production on Amazon EC2 Inf1 instances so that you can start generating real-time predictions with low latency. AWS Neuron, the compiler for AWS Inferentia, is integrated with Amazon SageMaker Neo, enabling you to compile your trained machine learning models to run optimally on Inf1 instances. With Amazon SageMaker, you can easily run your models on auto-scaling clusters of Inf1 instances that are spread across multiple Availability Zones to deliver both high-performance and highly available real-time inference. Learn how to deploy to Inf1 using Amazon SageMaker with examples on GitHub.
Using AWS Deep Learning AMI
The AWS Deep Learning AMIs (DLAMI) provide machine learning practitioners and researchers with the infrastructure and tools to accelerate deep learning in the cloud, at any scale. The AWS Neuron SDK comes pre-installed in AWS Deep Learning AMIs to compile and run your machine learning models optimally on Inf1 instances. To help guide you through the getting started process, visit the AMI selection guide and more deep learning resources. Refer to the AWS DLAMI Getting Started guide to learn how to use the DLAMI with Neuron.
Using Deep Learning Containers
Developers can now deploy Inf1 instances in Amazon Elastic Kubernetes Service (EKS), which is a fully managed Kubernetes service, as well as in Amazon Elastic Container Service (ECS), which is a fully managed container orchestration service from Amazon. Learn more about getting started with Inf1 on Amazon EKS or with Amazon ECS. More details about running containers on Inf1 instances are available on the Neuron container tools tutorial page. Neuron is also available pre-installed in AWS DL Containers.