
Hugging Face BERT on AWS Inferentia

Get the highest performance at the lowest cost for Hugging Face BERT inference

Hugging Face is a leading repository for BERT-based NLP models, which are a common foundation for many NLP applications. As more companies deploy Hugging Face BERT models into production, they face cost, performance, and time-to-market challenges. Amazon EC2 Inf1 instances, powered by AWS Inferentia, are purpose-built for deep learning inference and are ideal for BERT models.

  • Inf1 instances deliver up to 70% lower inference costs than comparable GPU-based EC2 instances for many natural language processing applications such as text classification, language translation, sentiment analysis, and conversational AI.

  • With support for Hugging Face models in the Neuron SDK, you can easily compile and run inference using pre-trained or fine-tuned transformer models with just a few lines of code. Inf1 instances support popular ML frameworks such as PyTorch and TensorFlow.

  • Inf1 instances deliver up to 2.3x higher throughput than comparable GPU-based Amazon EC2 instances. Inf1 instances are optimized for inference performance for small batch sizes, enabling real-time applications to maximize throughput and meet latency requirements.

Highest performance and lowest cost inference

BERT-Base numbers derived from the NVIDIA performance page: PyTorch 1.9, sequence length 128, FP16.

Customer Stories

Adevinta found that AWS Inferentia reduced prediction latency by up to 92%, at 75% lower cost, compared with the best alternatives it initially evaluated for deploying Hugging Face BERT models. "It was, in other words, like having the best of GPU power at CPU cost."

Amazon Advertising found that running BERT models on Inferentia decreased latency by 30% and costs by 71%. "...the performance with AWS Inferentia was so impressive that I actually had to re-run the benchmarks to make sure they were correct!"

Sprinklr was able to deploy a BERT model on Inf1 instances in under two weeks. With ample resources and support, it found the migration to Inf1 simple. "The support from AWS helps us boost our customer satisfaction and staff productivity."

Getting started is easy

Hugging Face Webinar and Blog

Accelerate BERT inference (DistilBERT) with AWS Inferentia. Watch the webinar or read the blog.

Amazon SageMaker Tutorial

Bring your own Hugging Face pre-trained BERT container to SageMaker. View the tutorial or notebook.

Pre-trained BERT Tutorial

Compile and deploy a pre-trained bert-base model from Hugging Face. View the tutorial or notebook.

Save inference costs with AWS Inferentia

Speak to Inferentia experts today