AWS Machine Learning Blog

Announcing the launch of new Hugging Face LLM Inference containers on Amazon SageMaker

This post is co-written with Philipp Schmid and Jeff Boudier from Hugging Face.

Today, as part of Amazon Web Services’ partnership with Hugging Face, we are excited to announce the release of a new Hugging Face Deep Learning Container (DLC) for inference with Large Language Models (LLMs). This new Hugging Face LLM DLC is powered by Text Generation Inference (TGI), an open source, purpose-built solution for deploying and serving Large Language Models. TGI enables high-performance text generation using Tensor Parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, StableLM, Llama, and T5.

Large Language Models are growing in popularity but can be difficult to deploy

LLMs have emerged as the leading edge of artificial intelligence, captivating developers and enthusiasts alike with their ability to comprehend and generate human-like text across diverse domains. These powerful models, such as those based on the GPT and T5 architectures, have experienced an unprecedented surge in popularity for a broad set of applications, including language understanding, conversational experiences, and automated writing assistance. As a result, companies across industries are seizing the opportunity to unlock their potential and offer new LLM-powered experiences in their applications.

Hosting LLMs at scale presents a unique set of complex engineering challenges. To provide an ideal user experience, an LLM hosting service should provide adequate response times while scaling to a large number of concurrent users. Given the high resource requirements of large models, general-purpose inference frameworks may not provide the optimizations required to maximize the utilization of available resources and provide the best possible performance.

Some of these optimizations include:

  • Tensor parallelism to distribute the computation across multiple accelerators
  • Model quantization to reduce the memory footprint of the model
  • Dynamic batching of inference requests to improve throughput, and many others.

The Hugging Face LLM DLC provides these optimizations out of the box and makes it easier to host LLM models at scale.

Hugging Face’s Text Generation Inference simplifies LLM deployment

TGI is an open source, purpose-built solution for deploying Large Language Models (LLMs). It incorporates optimizations including tensor parallelism for faster multi-GPU inference, dynamic batching to boost overall throughput, and optimized transformers code using flash-attention for popular model architectures including BLOOM, T5, GPT-NeoX, StarCoder, and LLaMa.

With the new Hugging Face LLM Inference DLCs on Amazon SageMaker, AWS customers can benefit from the same technologies that power highly concurrent, low latency LLM experiences like HuggingChat, OpenAssistant, and Inference API for LLM models on the Hugging Face Hub, while enjoying SageMaker’s managed service capabilities, such as autoscaling, health checks, and model monitoring.

Get started with TGI on SageMaker Hosting

Let’s walk through a code example that deploys a GPT NeoX 20B parameter model on a SageMaker Endpoint. You can find our complete example notebook here.

First, make sure that the latest version of SageMaker SDK is installed:

%pip install sagemaker>=2.161.0

Then, we import the SageMaker Python SDK and instantiate a sagemaker_session to find the current region and execution role.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import time

sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
role = sagemaker.get_execution_role()

Next we retrieve the LLM image URI. We use the helper function get_huggingface_llm_image_uri() to generate the appropriate image URI for the Hugging Face Large Language Model (LLM) inference. The function takes a required parameter backend and several optional parameters. The backend specifies the type of backend to use for the model, the values can be “lmi” and “huggingface”.  “lmi” stands for SageMaker Large Model Inference backend and “huggingface” refers to using Hugging Face TGI backend that is used in this tutorial.

image_uri = get_huggingface_llm_image_uri(
  backend="huggingface", # or lmi

Now that we have the image uri, the next step is to configure the model object. We specify a unique name, the image_uri for the managed TGI container, and the execution role for the endpoint. Additionally, we specify a number of environment variables including the HF_MODEL_ID which corresponds to the model from the HuggingFace Hub that will be deployed, and the HF_TASK which configures the inference task to be performed by the model.

You should also define SM_NUM_GPUS, which specifies the tensor parallelism degree of the model. Tensor parallelism can be used to split the model across multiple GPUs, which is necessary when working with LLMs that are too big for a single GPU. To learn more about tensor parallelism with inference, see our previous blog post. Here, you should set SM_NUM_GPUS to the number of available GPUs on your selected instance type. For example, in this tutorial, we set SM_NUM_GPUS to 4 because our selected instance type ml.g4dn.12xlarge has 4 available GPUs.

Note that you can optionally reduce the memory and computational footprint of the model by setting the HF_MODEL_QUANTIZE environment variable to “true”, but this lower weight precision could affect the quality of the output for some models.

model_name = "gpt-neox-20b-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

hub = {

model = HuggingFaceModel(

Next, we invoke the deploy method to deploy the model.

predictor = model.deploy(

Once the model is deployed, we can invoke it to generate text. We pass an input prompt and run the predict method to generate a text response from the LLM running in the TGI container.

input_data = {
  "inputs": "The diamondback terrapin was the first reptile to",
  "parameters": {
    "do_sample": True,
    "max_new_tokens": 100,
    "temperature": 0.7,
    "watermark": True


We receive the following auto-generated text response:.

[{'generated_text': 'The diamondback terrapin was the first reptile to make the list, followed by the American alligator, the American crocodile, and the American box turtle. The polecat, a ferret-like animal, and the skunk rounded out the list, both having gained their slots because they have proven to be particularly dangerous to humans.\n\nCalifornians also seemed to appreciate the new list, judging by the comments left after the election.\n\n“This is fantastic,” one commenter declared.\n\n“California is a very'}]

To mitigate the risk of potential exploitation of Generative AI capabilities by automated bots, the response is watermarked. Such watermarked responses can be easily detected by algorithms, promoting the responsible use of Generative AI.

Once we are done experimenting, we delete the endpoint and the model resources.


Conclusion and next steps

Deploying Large Language Models using Hugging Face’s Text Generation Inference and SageMaker Hosting is a straightforward solution for hosting open source models like GPT-NeoX, Flan-T5-XXL, StarCoder or LLaMa. The state of the art LLMs are deployed within the secure managed SageMaker environment, and AWS customers can benefit from Large Language Models while keeping full control over their implementation, and without sending their data over to a third-party API.

In the tutorial, we demonstrated the deployment of GPT-NeoX using the new Hugging Face LLM Inference DLC, leveraging the power of 4 GPUs on a SageMaker ml.g4dn.12xlarge  instance. With this approach, users can effortlessly harness the capabilities of state-of-the-art language models, enabling a wide range of applications and advancements in natural language processing.

As a next step, you can learn more about Hugging Face LLM Inference on SageMaker with the following resources:

About the authors

Philipp Schmid
is a Technical Lead at Hugging Face with the mission to democratize good machine learning through open source and open science. Philipp is passionate about productionizing cutting-edge & generative AI machine learning models.

Jeff Boudier builds products at Hugging Face, the #1 open platform for AI builders. Previously Jeff was a co-founder of Stupeflix, acquired by GoPro, where he served as director of Product Management, Product Marketing,  Business Development and Corporate Development.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads deep learning model optimization for applications such as large model inference.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

Xin Yang is a Software Development Engineer at AWS. She has been working on deploying and optimizing deep learning inference systems. Her work spans both the realms of real-time inference and scalable offline inference solutions. In her spare time, Xin enjoys reading and hiking.

Gagan Singh is a Senior Technical Account Manager at AWS helping digital native startups maximize business success. He helps customers with adoption and optimization of real-time, multi-model ML inferencing endpoints using Amazon SageMaker. In his spare time, Gagan enjoys trekking in the Himalayas and listening to music.