AWS Inferentia Customers

See how customers are using AWS Inferentia to deploy deep learning models.

Customer testimonials

"Our team at Leonardo leverages generative AI to enable creative professionals and enthusiasts to produce visual assets with unmatched quality, speed, and style consistency. Utilizing AWS Inferentia2 we are able to reduce our costs by 80%, without sacrificing performance, fundamentally changing the value proposition we can offer customers, enabling our most advanced features at a more accessible price point. It also alleviates concerns around cost and capacity availability for our ancillary AI services, which are increasingly important as we grow and scale. It is a key enabling technology for us as we continue to push the envelope on what’s possible with generative AI, enabling a new era of creativity and expressive power for our users."

Pete Werner, Head of AI, Leonardo.ai

"At Runway, our suite of AI Magic Tools enables our users to generate and edit content like never before. We are constantly pushing the boundaries of what is possible with AI-powered content creation, and as our AI models become more complex, the underlying infrastructure costs to run these models at scale can become expensive. Through our collaboration with Amazon EC2 Inf2 instances powered by AWS Inferentia, we’re able to run some of our models with up to 2x higher throughput than comparable GPU-based instances. This high-performance, low-cost inference enables us to introduce more features, deploy more complex models, and ultimately deliver a better experience for the millions of creators using Runway."

Cristóbal Valenzuela, Cofounder and CEO, Runway

Qualtrics designs and develops experience management software.

"At Qualtrics, our focus is building technology that closes experience gaps for customers, employees, brands, and products. To achieve that, we are developing complex multi-task, multi-modal DL models to launch new features, such as text classification, sequence tagging, discourse analysis, key-phrase extraction, topic extraction, clustering, and end-to-end conversation understanding. As we utilize these more complex models in more applications, the volume of unstructured data grows, and we need more performant inference-optimized solutions that can meet these demands, such as Inf2 instances, to deliver the best experiences to our customers. We are excited about the new Inf2 instances because they will not only allow us to achieve higher throughput while dramatically cutting latency, but also introduce features like distributed inference and enhanced dynamic input shape support, which will help us scale to meet our deployment needs as we push towards larger, more complex models."

Aaron Colak, Head of Core Machine Learning, Qualtrics

Finch Computing is a natural language technology company providing artificial intelligence applications for government, financial services, and data integrator clients.

"To meet our customers’ needs for real-time NLP, we develop state-of-the-art DL models that scale to large production workloads. We have to provide low-latency transactions and achieve high throughputs to process global data feeds. We already migrated many production workloads to Inf1 instances and achieved an 80% reduction in cost over GPUs. Now, we are developing larger, more complex models that enable deeper, more insightful meaning from written text. A lot of our customers need access to these insights in real time, and the performance on Inf2 instances will help us deliver lower latency and higher throughput over Inf1 instances. With the Inf2 performance improvements and new Inf2 features, such as support for dynamic input sizes, we are improving our cost-efficiency, elevating the real-time customer experience, and helping our customers glean new insights from their data."

Franz Weckesser, Chief Architect, Finch Computing

"We alert on many types of events all over the world in many languages, in different formats (images, video, audio, text sensors, combinations of all these types) from hundreds of thousands of sources. Optimizing for speed and cost given that scale is absolutely critical for our business. With AWS Inferentia, we have lowered model latency and achieved up to 9x better throughput per dollar. This has allowed us to increase model accuracy and grow our platform's capabilities by deploying more sophisticated DL models and processing 5x more data volume while keeping our costs under control."

Alex Jaimes, Chief Scientist and Senior Vice President of AI, Dataminr

"We incorporate ML into many aspects of Snapchat, and exploring innovation in this field is a key priority. Once we heard about Inferentia, we started collaborating with AWS to adopt Inf1/Inferentia instances to help us with ML deployment, including around performance and cost. We started with our recommendation models and look forward to adopting more models with the Inf1 instances in the future."

Nima Khajehnouri, VP Engineering, Snap Inc.

"Sprinklr's AI-driven unified customer experience management (Unified-CXM) platform enables companies to gather and translate real-time customer feedback across multiple channels into actionable insights—resulting in proactive issue resolution, enhanced product development, improved content marketing, better customer service, and more. Using Amazon EC2 Inf1, we were able to significantly improve the performance of one of our NLP models and improve the performance of one of our computer vision models. We're looking forward to continuing to use Amazon EC2 Inf1 to better serve our global customers."

Vasant Srinivasan, Senior Vice President of Product Engineering, Sprinklr

"Autodesk is advancing the cognitive technology of our AI-powered virtual assistant, Autodesk Virtual Agent (AVA), by using Inferentia. AVA answers over 100,000 customer questions per month by applying natural language understanding (NLU) and DL techniques to extract the context, intent, and meaning behind inquiries. Piloting Inferentia, we are able to obtain a 4.9x higher throughput over G4dn for our NLU models, and look forward to running more workloads on the Inferentia-based Inf1 instances."

Binghui Ouyang, Sr. Data Scientist, Autodesk

"The use of ground-penetrating radar and detection of visual defects is typically the domain of expert surveyors. An AWS microservices-based architecture enables us to process videos captured by automated inspection vehicles and inspectors. By migrating our in-house–built models from traditional GPU-based instances to Inferentia, we were able to reduce costs by 50%. Moreover, we were able to see performance gains when comparing the times with a G4dn GPU instance. Our team is looking forward to running more workloads on the Inferentia-based Inf1 instances."

Jesús Hormigo, Chief of Cloud and AI Officer, Screening Eagle Technologies

NTT PC Communications, a network service and communication solution provider in Japan, is a telco leader in introducing new innovative products in the information and communication technology market.

"NTT PC developed AnyMotion, a motion analysis API platform service based on advanced posture estimation ML models. We deployed our AnyMotion platform on Amazon EC2 Inf1 instances using Amazon ECS for a fully managed container orchestration service. By deploying our AnyMotion containers on Amazon EC2 Inf1, we saw 4.5x higher throughput, 25% lower inference latency, and 90% lower cost compared to current-generation GPU-based EC2 instances. These superior results will help to improve the quality of the AnyMotion service at scale."

Toshiki Yanagisawa, Software Engineer, NTT PC Communications Inc.

Anthem is one of the nation's leading health benefits companies, serving the healthcare needs of 40+ million members across dozens of states.

"The market of digital health platforms is growing at a remarkable rate. Gathering intelligence on this market is a challenging task due to the vast amounts of customer opinions data and its unstructured nature. Our application automates the generation of actionable insights from customer opinions via DL natural language models (Transformers). Our application is computationally intensive and needs to be deployed in a highly performant manner. We seamlessly deployed our DL inferencing workload onto Amazon EC2 Inf1 instances powered by the AWS Inferentia processor. The new Inf1 instances provide 2x higher throughput than GPU-based instances and allowed us to streamline our inference workloads."

Numan Laanait and Miro Mihaylov, PhDs, Principal AI/Data Scientists, Anthem

"Condé Nast's global portfolio encompasses over 20 leading media brands, including Wired, Vogue, and Vanity Fair. Within a few weeks, our team was able to integrate our recommendation engine with AWS Inferentia chips. This union enables multiple runtime optimizations for state-of-the-art natural language models on SageMaker's Inf1 instances. As a result, we observed a 72% reduction in cost compared to the previously deployed GPU instances."

Paul Fryzel, Principal Engineer, AI Infrastructure, Condé Nast

"Ciao is evolving conventional security cameras into high-performance analysis cameras equivalent to the capability of a human eye. Our application is advancing disaster prevention, monitoring environmental conditions using cloud-based AI camera solutions to alert before a situation becomes a disaster. Such alerts enable reacting to the situation beforehand. Based on object detection, we can also provide insight by estimating the number of incoming guests, without staff, from videos in brick-and-mortar stores. Ciao Camera commercially adopted AWS Inferentia-based Inf1 instances with 40% better price performance than G4dn with YOLOv4. We are looking forward to running more of our services on Inf1, leveraging its significant cost efficiency."

Shinji Matsumoto, Software Engineer, Ciao Inc.

"The Asahi Shimbun is one of the most popular daily newspapers in Japan. Media Lab, established as one of our company's departments, has the mission to research the latest technology, especially AI, and connect cutting-edge technologies to new businesses. With the launch of AWS Inferentia-based Amazon EC2 Inf1 instances in Tokyo, we tested our PyTorch-based text summarization AI application on these instances. This application processes a large amount of text and generates headlines and summary sentences trained on articles from the last 30 years. Using Inferentia, we lowered costs by an order of magnitude over CPU-based instances. This dramatic reduction in costs will enable us to deploy our most complex models at scale, which we previously believed was not economically feasible."

Hideaki Tamori, PhD, Senior Administrator, Media Lab, The Asahi Shimbun Company

"CS Disco is reinventing legal technology as a leading provider of AI solutions for e-discovery developed by lawyers for lawyers. Disco AI accelerates the thankless task of combing through terabytes of data, speeding up review times and improving review accuracy by leveraging complex NLP models, which are computationally expensive and cost-prohibitive. Disco has found that AWS Inferentia-based Inf1 instances reduce the cost of inference in Disco AI by at least 35% as compared with today's GPU instances. Based on this positive experience with Inf1 instances, CS Disco will explore opportunities for migration into Inferentia."

Alan Lockett, Sr. Director of Research, CS Disco

"At Talroo, we provide our customers with a data-driven platform that enables them to attract unique job candidates so they can make hires. We are constantly exploring new technologies to ensure we offer the best products and services to our customers. Using Inferentia, we extract insights from a corpus of text data to enhance our AI-powered search-and-match technology. Talroo leverages Amazon EC2 Inf1 instances to create high-throughput NLU models with SageMaker. Talroo’s initial testing shows that the Amazon EC2 Inf1 instances deliver 40% lower inference latency and 2x higher throughput compared to G4dn GPU-based instances. Based on these results, Talroo looks forward to using Amazon EC2 Inf1 instances as part of its AWS infrastructure."

Janet Hu, Software Engineer, Talroo

"Digital Media Professionals (DMP) visualizes the future with a ZIA™ platform based on AI. DMP’s efficient computer vision classification technologies are used to build insight on large amounts of real-time image data, such as condition observation, crime prevention, and accident prevention. We recognized that our image segmentation models run four times faster on AWS Inferentia based Inf1 instances compared to GPU-based G4 instances. Due to this higher throughput and lower cost, Inferentia enables us to deploy our AI workloads, such as applications for car dashcams, at scale."

Hiroyuki Umeda, Director & General Manager, Sales & Marketing Group, Digital Media Professionals

Hotpot.ai empowers non-designers to create attractive graphics and helps professional designers automate rote tasks.

"Since ML is core to our strategy, we were excited to try AWS Inferentia-based Inf1 instances. We found the Inf1 instances easy to integrate into our research and development pipeline. Most importantly, we observed impressive performance gains compared to the G4dn GPU-based instances. With our first model, the Inf1 instances yielded about 45% higher throughput and decreased cost per inference by almost 50%. We intend to work closely with the AWS team to port other models and shift most of our ML inference infrastructure to AWS Inferentia."

Clarence Hu, Founder, Hotpot.ai

"SkyWatch processes hundreds of trillions of pixels of earth observation data, captured from space every day. Adopting the new AWS Inferentia-based Inf1 instances using Amazon SageMaker for real-time cloud detection and image quality scoring was quick and easy. It was all a matter of switching the instance type in our deployment configuration. By switching instance types to Inferentia-based Inf1, we improved performance by 40% and decreased overall costs by 23%. This is a big win. It has enabled us to lower our overall operational costs while continuing to deliver high-quality satellite imagery to our customers, with minimal engineering overhead. We are looking forward to transitioning all of our inference endpoints and batch ML processes to use Inf1 instances to further improve our data reliability and customer experience."

Adler Santos, Engineering Manager, SkyWatch

Money Forward Inc. serves businesses and individuals with an open and fair financial platform. As part of this platform, HiTTO Inc., a Money Forward group company, offers an AI chatbot service that uses tailored NLP models to address the diverse needs of their corporate customers.

"Migrating our AI chatbot service to Amazon EC2 Inf1 instances was straightforward. We completed the migration within two months and launched a large-scale service on the Inf1 instances using Amazon ECS. We were able to reduce our inference latency by 97% and our inference costs by over 50% (over comparable GPU-based instances) by serving multiple models per Inf1 instance. We look forward to running more workloads on the Inferentia-based Inf1 instances."

Kento Adachi, Technical lead, CTO office, Money Forward Inc.

AWS Partner testimonials

"Hugging Face’s mission is to democratize good ML to help ML developers around the world solve real-world problems. And key to that is ensuring the latest and greatest models run as fast and efficiently as possible on the best ML accelerators in the cloud. We are incredibly excited about the potential for Inferentia2 to become the new standard way to deploy generative AI models at scale. With Inf1, we saw up to 70% lower cost than traditional GPU-based instances, and with Inf2 we have seen up to 8x lower latency for BERT-like Transformers compared to Inferentia1. With Inferentia2, our community will be able to easily scale this performance to LLMs at the 100B+ parameters scale, and to the latest diffusion and computer vision models as well."

"PyTorch accelerates the path from research prototyping to production deployments for ML developers. We have collaborated with the AWS team to provide native PyTorch support for the new AWS Inferentia2 powered Amazon EC2 Inf2 instances. As more members of our community look to deploy large generative AI models, we are excited to partner with the AWS team to optimize distributed inference on Inf2 instances with high-speed NeuronLink connectivity between accelerators. With Inf2, developers using PyTorch can now easily deploy ultra-large LLMs and vision transformer models. Additionally, Inf2 instances bring other innovative capabilities to PyTorch developers, including efficient data types, dynamic shapes, custom operators, and hardware-optimized stochastic rounding, making them well-suited for wide adoption by the PyTorch community."

"Weights & Biases (W&B) provides developer tools for ML engineers and data scientists to build better models faster. The W&B platform provides ML practitioners a wide variety of insights to improve the performance of models, including the utilization of the underlying compute infrastructure. We have collaborated with the AWS team to add support for AWS Trainium and Inferentia2 to our system metrics dashboard, providing valuable data much needed during model experimentation and training. This enables ML practitioners to optimize their models to take full advantage of AWS’s purpose-built hardware to train their models faster and at lower cost."

Phil Gurbacki, VP of Product, Weights & Biases

"OctoML helps developers reduce costs and build scalable AI applications by packaging their DL models to run on high-performance hardware. We have spent the last several years building expertise on the best software and hardware solutions and integrating them into our platform. Our roots as chip designers and system hackers make AWS Trainium and Inferentia even more exciting for us. We see these accelerators as a key driving factor for the future of AI innovation on the cloud. The GA launch of Inf2 instances is especially timely, as we are seeing the emergence of popular LLMs as a key building block of next-generation AI applications. We are excited to make these instances available in our platform to help developers easily take advantage of their high performance and cost-saving benefits."

Jared Roesch, CTO and Cofounder, OctoML

"The historic challenge with LLMs, and more broadly with enterprise-level generative AI applications, is the cost associated with training and running high-performance DL models. Along with AWS Trainium, AWS Inferentia2 removes the financial compromises our customers make when they require high-performance training. Now, our customers looking for advantages in training and inference can achieve better results for less money. Trainium and Inferentia accelerate scale to meet even the most demanding DL requirements for today’s largest enterprises. Many Nextira customers running large AI workloads will benefit directly from these new chipsets, increasing efficiencies in cost savings and performance and leading to faster results in their market."

Jason Cutrer, founder and CEO, Nextira

Amazon services using AWS Inferentia2

Amazon CodeWhisperer is an AI coding companion that generates real-time single-line or full-function code recommendations in your integrated development environment (IDE) to help you quickly build software.

"With CodeWhisperer, we're improving software developer productivity by providing code recommendations using generative AI models. To develop highly effective code recommendations, we scaled our DL network to billions of parameters. Our customers need code recommendations in real time as they type, so low-latency responses are critical. Large generative AI models require high-performance compute to deliver response times in a fraction of a second. With Inf2, we're delivering the same latency as running CodeWhisperer on training-optimized GPU instances for large input and output sequences. Thus, Inf2 instances are helping us save cost and power while delivering the best possible experience for developers."

Doug Seven, General Manager, Amazon CodeWhisperer

Amazon's product search engine indexes billions of products, serves billions of customer queries daily, and is one of the most heavily used services in the world.

"I am super excited about the Inf2 GA launch. The superior performance of Inf2, coupled with its ability to handle larger models with billions of parameters, makes it the perfect choice for our services and enables us to unlock new possibilities in terms of model complexity and accuracy. With the significant speedup and cost-efficiency offered by Inf2, integrating these instances into Amazon Search serving infrastructure can help us meet the growing demands of our customers. We are planning to power our new shopping experiences with generative LLMs running on Inf2."

Trishul Chilimbi, VP, Amazon Search

Amazon services using AWS Inferentia

Amazon Advertising helps businesses of all sizes connect with customers at every stage of their shopping journey. Millions of ads, including text and images, are moderated, classified, and served for the optimal customer experience every single day.

"For our text ad processing, we deploy PyTorch-based BERT models globally on AWS Inferentia-based Inf1 instances. By moving to Inferentia from GPUs, we were able to lower our cost by 69% with comparable performance. Compiling and testing our models for AWS Inferentia took less than three weeks. Using Amazon SageMaker to deploy our models to Inf1 instances ensured our deployment was scalable and easy to manage. When I first analyzed the compiled models, the performance with AWS Inferentia was so impressive that I actually had to re-run the benchmarks to make sure they were correct! Going forward, we plan to migrate our image ad processing models to Inferentia. We have already benchmarked 30% lower latency and 71% cost savings over comparable GPU-based instances for these models."

Yashal Kanungo, Applied Scientist, Amazon Advertising

"Amazon Alexa’s AI- and ML-based intelligence, powered by AWS, is available on more than 100 million devices today, and our promise to customers is that Alexa is always becoming smarter, more conversational, more proactive, and even more delightful. Delivering on that promise requires continuous improvements in response times and ML infrastructure costs, which is why we are excited to use Amazon EC2 Inf1 to lower inference latency and cost per inference on Alexa text-to-speech. With Amazon EC2 Inf1, we’ll be able to make the service even better for the tens of millions of customers who use Alexa each month."

Tom Taylor, Senior Vice President, Amazon Alexa

"We are constantly innovating to further improve our customer experience and to drive down our infrastructure costs. Moving our web-based question answering (WBQA) workloads from GPU-based P3 instances to AWS Inferentia-based Inf1 instances not only helped us reduce inference costs by 60%, but also improved the end-to-end latency by more than 40%, helping enhance customer Q&A experience with Alexa. Using Amazon SageMaker for our TensorFlow-based model made the process of switching to Inf1 instances straightforward and easy to manage. We are now using Inf1 instances globally to run these WBQA workloads and are optimizing their performance for AWS Inferentia to further reduce cost and latency."

Eric Lind, Software Development Engineer, Alexa AI

"Amazon Prime Video uses computer vision ML models to analyze video quality of live events to ensure an optimal viewer experience for Prime Video members. We deployed our image classification ML models on EC2 Inf1 instances and were able to see a 4x improvement in performance and up to 40% savings in cost. We are now looking to leverage these cost savings to innovate and build advanced models that can detect more complex defects, such as synchronization gaps between audio and video files, to deliver a more enhanced viewing experience for Prime Video members."

Victor Antonino, Solutions Architect, Amazon Prime Video

"Amazon Rekognition is a simple and easy image and video analysis application that helps customers identify objects, people, text, and activities. Amazon Rekognition needs high-performance DL infrastructure that can analyze billions of images and videos daily for our customers. With AWS Inferentia-based Inf1 instances, running Amazon Rekognition models such as object classification resulted in 8x lower latency and 2x the throughput compared to running these models on GPUs. Based on these results, we are moving Amazon Rekognition to Inf1, enabling our customers to get accurate results faster."

Rajneesh Singh, Director, SW Engineering, Amazon Rekognition and Video
