AWS Machine Learning Blog

Announcing managed inference for Hugging Face models in Amazon SageMaker

Hugging Face is a technology startup, with an active open-source community, that drove the worldwide adoption of transformer-based models thanks to its eponymous Transformers library. Earlier this year, Hugging Face and AWS collaborated to enable you to train and deploy over 10,000 pre-trained models on Amazon SageMaker. For more information on training Hugging Face models at scale on SageMaker, refer to AWS and Hugging Face collaborate to simplify and accelerate adoption of Natural Language Processing models and the sample notebooks.

In this post, we discuss different methods to create a SageMaker endpoint for a Hugging Face model.


If you’re unfamiliar with transformer-based models and their place in the natural language processing (NLP) landscape, here is an overview. A lot of use cases in NLP can be modeled as supervised learning tasks. The classic supervised learning scenario is based on learning in isolation, where a model is trained on a specific dataset for a specific task. Any change in the dataset or task requires training a new model. This scenario becomes challenging in the absence of sufficient labeled data to train a task-specific model.

Transfer learning alleviates this challenge in two steps: first pre-training (using vast amounts of data to build knowledge in an unsupervised manner), and then fine-tuning (transferring that knowledge, supplemented by a labeled dataset, to adapt to a downstream task). Although transfer learning has been a part of NLP over the past decade, the field had a major breakthrough in 2017 with the transformer architecture (Attention Is All You Need) proposed by Vaswani et al. Since then, adaptations of the transformer architecture in models such as BERT, RoBERTa, GPT-2, and DistilBERT have pushed the boundaries for state-of-the-art NLP models on a wide range of tasks, such as text classification, question answering, summarization, and text generation. Hugging Face enables you to develop NLP applications for such tasks without the need to train state-of-the-art transformer models from scratch, which could be expensive in terms of computation, cost, and time.

The Hugging Face Deep Learning Containers (DLCs) make it easier not only to train Hugging Face transformer models on SageMaker, but also to deploy them, which simplifies the management of inference infrastructure. The Hugging Face Inference Toolkit for SageMaker is an open-source library for serving Hugging Face Transformers models on SageMaker. It uses the SageMaker Inference Toolkit to start the model server, which is responsible for handling inference requests.

You can deploy models with Hugging Face DLCs on SageMaker in the following ways:

  • A fully managed method to deploy the model to a SageMaker endpoint without the need for writing any custom inference functions. These models could either be:
    1. Fine-tuned models based on your use case
    2. Pre-trained models from the Hugging Face Hub
  • A module that provides more customization through an inference script and allows you to override the default methods of the HuggingFaceHandlerService. This module consists of a model_fn() to override the default method for loading the model. After the model is loaded, predictions are obtained by either implementing a transform_fn() or by implementing input_fn(), predict_fn(), or output_fn() to override the default preprocessing, prediction, and post-processing methods, respectively.
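As a rough illustration of the single-function route, the following sketch shows what a transform_fn() could look like. It assumes the toolkit calls it with the loaded model, the raw request body, and the request and response content types, and that the model object is callable like a Transformers pipeline. The JSON-only handling and the "parameters" key are choices made for this sketch, not requirements stated in this post.

```python
import json


def transform_fn(model, input_data, content_type, accept):
    """Handle preprocessing, prediction, and post-processing in one function."""
    # This sketch only accepts JSON requests.
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")

    request = json.loads(input_data)
    inputs = request["inputs"]                    # "inputs" is always required
    parameters = request.get("parameters", {})    # optional keyword arguments (assumed key)

    # `model` is assumed to be callable, like a Transformers pipeline.
    prediction = model(inputs, **parameters)
    return json.dumps(prediction)
```

Because the model is passed in as an argument, the function can be exercised locally with any callable standing in for the real pipeline.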

One of the benefits of using the Hugging Face SDK is that it handles inference containers on your behalf and you don’t need to manage Docker files or Docker registries. For more information, refer to Deep Learning Containers Images.

In the following sections, we walk through the three methods to deploy endpoints.

Create a SageMaker endpoint with a trained model

To deploy a SageMaker-trained Hugging Face model from Amazon Simple Storage Service (Amazon S3), make sure that all required files, including the tokenizer, are saved in a model.tar.gz file, and use model_data to point to your saved model file in Amazon S3. See the following code:

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

# IAM role with permissions to create an endpoint
# (get_execution_role() works inside SageMaker; elsewhere, pass a role ARN)
role = sagemaker.get_execution_role()

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   model_data="s3://bucket/model.tar.gz", # S3 path to your trained SageMaker model
   role=role, # IAM role with permissions to create an endpoint
   transformers_version="4.6", # transformers version used
   pytorch_version="1.7", # pytorch version used
   py_version="py36", # python version of the DLC
)

The sample code is available on GitHub.
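If you need to build the archive yourself, the packaging requirement can be sketched as follows. This is a minimal example, not the method used by the sample code: package_model is a hypothetical helper, and it assumes your model weights, configuration, and tokenizer files all sit in one local directory and should land at the root of the archive.

```python
import tarfile
from pathlib import Path


def package_model(model_dir, output_path="model.tar.gz"):
    """Bundle every file in model_dir into a gzipped tar archive.

    Files are stored at the archive root (no leading directory), so the
    model, config, and tokenizer files are found directly after unpacking.
    """
    model_dir = Path(model_dir)
    output_path = Path(output_path)
    with tarfile.open(str(output_path), "w:gz") as tar:
        for path in sorted(model_dir.iterdir()):
            # arcname drops the local directory prefix from the stored name;
            # subdirectories are added recursively.
            tar.add(str(path), arcname=path.name)
    return output_path
```

After creating the archive, upload it to Amazon S3 and pass its S3 URI as model_data.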

Create a SageMaker endpoint with a model from the Hugging Face Hub

You shouldn’t use this feature in production for loading large models; models over 10 GB aren’t supported with this feature.

To deploy a model directly from the Hub to SageMaker, you need to initialize the following environment variables:

  • HF_MODEL_ID – Defines the model ID, which is automatically loaded from Hugging Face when creating a SageMaker endpoint. The Hub provides over 10,000 models, all available through this environment variable.
  • HF_TASK – Defines the task for the used Transformers pipeline. For a full list of tasks, see Pipelines.
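The two variables are passed to the container as a plain dictionary. The following sketch builds that dictionary and rejects task names outside the set listed in this section. build_hub_env and SUPPORTED_TASKS are hypothetical names introduced for illustration, and the up-front check is a convenience of this sketch rather than something the SDK requires.

```python
# Task names as listed in this post for HF_TASK.
SUPPORTED_TASKS = {
    "feature-extraction", "text-classification", "token-classification",
    "table-question-answering", "question-answering", "fill-mask",
    "summarization", "translation", "text2text-generation",
    "text-generation", "zero-shot-classification", "conversational",
}


def build_hub_env(model_id, task):
    """Return the environment dictionary for loading a model from the Hub."""
    if task not in SUPPORTED_TASKS:
        raise ValueError(f"Unsupported HF_TASK: {task!r}")
    return {"HF_MODEL_ID": model_id, "HF_TASK": task}
```

For example, build_hub_env("distilbert-base-uncased-distilled-squad", "question-answering") produces the same dictionary that the deployment snippet in this section defines by hand.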

The value of HF_TASK can be one from the following list:

"feature-extraction", "text-classification", "token-classification", "table-question-answering", "question-answering", "fill-mask", "summarization", "translation", "text2text-generation", "text-generation", "zero-shot-classification", or "conversational"

The following is a code snippet showing the steps:

from sagemaker.huggingface import HuggingFaceModel
import sagemaker

# IAM role with permissions to create an endpoint
role = sagemaker.get_execution_role()

# Hub model configuration
hub = {
  'HF_MODEL_ID':'distilbert-base-uncased-distilled-squad', # model ID from the Hugging Face Hub
  'HF_TASK':'question-answering' # NLP task you want to use for predictions
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   env=hub, # configuration for loading the model from the Hub
   role=role, # IAM role with permissions to create an endpoint
   transformers_version="4.6", # transformers version used
   pytorch_version="1.7", # pytorch version used
   py_version="py36", # python version of the DLC
)

The sample code is available on GitHub.

Next, you deploy the Hugging Face model to SageMaker and specify the initial instance count and instance type. For more information about the various supported instance types, see Amazon SageMaker Pricing.

deploy returns a Predictor object, which you can use to do inference on the endpoint hosting your Hugging Face model. Each Predictor provides a predict method, which can do inference with NumPy arrays or Python lists. See the following code:

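# deploy model to SageMaker Inference
# (the count and instance type below are example values; pick ones that fit your workload)
predictor = huggingface_model.deploy(
   initial_instance_count=1,
   instance_type="ml.m5.xlarge"
)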

predict returns the result of inference against your model. By default, the Predictor serializes the request to JSON and deserializes the JSON response. See the following code:

# example request, you always need to define "inputs"
data = {"inputs": {
       "question": "Which name is also used to describe the Amazon rainforest in English?",
       "context": "The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America."
} } 
result = predictor.predict(data)
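Applications that don't use the SageMaker Python SDK can send the same JSON payload through the SageMaker runtime API. The following is a minimal sketch, assuming a boto3 "sagemaker-runtime" client is passed in. query_endpoint is a hypothetical helper name, and the endpoint name is whatever your deployment produced.

```python
import json


def query_endpoint(runtime_client, endpoint_name, payload):
    """Send a JSON payload to a SageMaker endpoint and return the parsed response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",  # the request body is JSON
        Accept="application/json",       # ask for a JSON response
        Body=json.dumps(payload),
    )
    # The response body is a stream; read it fully and decode the JSON.
    return json.loads(response["Body"].read())
```

A typical call would be query_endpoint(boto3.client("sagemaker-runtime"), predictor.endpoint_name, data), where data is a dictionary with an "inputs" key as in the preceding request.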

Create a SageMaker endpoint using a custom inference script

The Hugging Face Inference Toolkit allows you to override the default methods of HuggingFaceHandlerService by specifying a custom inference.py with model_fn and optionally input_fn, predict_fn, output_fn, or transform_fn. To do so, create a folder named code/ with an inference.py file in it. For example:

model.tar.gz/
  |- pytorch_model.bin
  |- ....
  |- code/
    |- inference.py
    |- requirements.txt

In this example, pytorch_model.bin is the model file saved from training, inference.py is the custom inference module, and requirements.txt is a requirements file to add additional dependencies. The custom module can override the model_fn, input_fn, predict_fn, output_fn, or transform_fn methods. For more information, see the GitHub repo.
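The following is a minimal sketch of what such a custom module might contain, using the separate preprocessing, prediction, and post-processing functions. It assumes a question-answering model, JSON requests and responses, and the function signatures shown here. The "parameters" key is a choice made for this sketch, so treat the details as illustrative rather than as the toolkit's reference implementation.

```python
import json


def model_fn(model_dir):
    """Load the model from the directory where model.tar.gz was unpacked."""
    # Imported here so the other functions can be tested without transformers.
    from transformers import pipeline
    return pipeline("question-answering", model=model_dir, tokenizer=model_dir)


def input_fn(input_data, content_type):
    """Deserialize the request body into a Python dictionary."""
    if content_type != "application/json":
        raise ValueError(f"Unsupported content type: {content_type}")
    return json.loads(input_data)


def predict_fn(data, model):
    """Run the model on the deserialized request."""
    parameters = data.get("parameters", {})  # optional keyword arguments (assumed key)
    return model(data["inputs"], **parameters)


def output_fn(prediction, accept):
    """Serialize the prediction for the response."""
    if accept != "application/json":
        raise ValueError(f"Unsupported accept type: {accept}")
    return json.dumps(prediction)
```

Keeping the three request-handling functions free of model-loading code makes them easy to test locally with a stand-in callable before packaging.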

Clean up

Make sure you delete the SageMaker endpoints when you no longer need them to avoid unnecessary costs, for example by calling predictor.delete_endpoint().
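A minimal cleanup sketch, assuming predictor is the object returned by deploy. clean_up is a hypothetical helper name, and deleting the model resource as well as the endpoint is an addition in this sketch that goes beyond the endpoint deletion this post asks for.

```python
def clean_up(predictor):
    """Delete the resources created for a deployed model."""
    # Remove the model resource first, while the endpoint configuration
    # that references it still exists.
    predictor.delete_model()
    # Then delete the endpoint, which stops the billed instances.
    predictor.delete_endpoint()
```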


Customer success stories

This integration makes it easier and quicker to deploy advanced NLP models, even if you don’t have a lot of machine learning expertise.

Customers are already using Hugging Face models on SageMaker. For example, SESAMm is in the business of providing a suite of products based on alternative data towards private and public markets investors. Mehdi Nemlaghi, Chief Algorithm Officer and Senior Data Scientist at SESAMm, says, “We use Hugging Face NLP models to do named entity recognition. We are excited by this new feature, which we expect to help us efficiently query our data lake with more than 100 billion sentences in more than 200 languages, and improve our query capabilities at least by a factor of two (given a NER model, % of detected entities in our data lake).”

Prophia Inc. is a data and asset management platform, designed exclusively for commercial real estate owners. Eric Finkel, Data Scientist at Prophia, says, “Using the new HuggingFaceModel() class was super intuitive and reduced the amount of custom code needed to work with the Hugging Face Transformers library. I was able to deploy a pre-trained RoBERTa model to perform question answering as well as a T5 model for extractive summarization in less than 5 minutes.”

Documentation and code samples to get started

You can start using Hugging Face models on SageMaker for managed inference today, in all AWS Regions where SageMaker is available.

Give it a try, and let us know what you think. As always, we’re looking forward to your feedback. You can send it to your usual AWS Support contacts, or in the AWS Forum for SageMaker.

About the Authors

Sai Sharanya Nalla is a Data Scientist at AWS Professional Services. She works with customers to develop and implement AI and ML solutions on AWS. In her spare time, she enjoys listening to podcasts and audiobooks, long walks, and engaging in outreach activities.




Kartik Kannapur is a Data Scientist with AWS Professional Services. He holds a master’s degree in Applied Mathematics and Statistics from Stony Brook University and focuses on using machine learning to solve customer business problems.