AWS Public Sector Blog

Deploy LLMs in AWS GovCloud (US) Regions using Hugging Face Inference Containers

Introduction

Government agencies are increasingly using large language models (LLMs) powered by generative artificial intelligence (AI) to extract valuable insights from their data in the Amazon Web Services (AWS) GovCloud (US) Regions. LLMs are advanced AI systems trained on vast amounts of text data. They can understand and generate human-like language, providing powerful natural language processing (NLP) capabilities for a wide range of applications. In this guide, we walk you through the process of hosting LLMs on Amazon Elastic Compute Cloud (Amazon EC2) instances, using the Hugging Face Text Generation Inference (TGI) container to serve custom LLMs.

Solution overview

AWS GovCloud (US) plays a crucial role in supporting public sector workloads by providing a dedicated cloud environment that meets stringent regulatory and compliance requirements. It specifically caters to US government agencies and other entities with sensitive workloads, enforcing data sovereignty and adherence to various security standards. This enables these organizations to benefit from cloud computing and generative AI while addressing their specific regulatory constraints. AWS GovCloud (US) Regions offer a wide range of machine learning (ML) services, including AWS Deep Learning AMIs (DLAMI), Amazon Textract, and Amazon Comprehend, enabling government agencies and educational institutions to use the power of AI and ML for their mission-critical applications. Additionally, AWS GovCloud (US) offers accelerated computing with the G4dn and P4d instance types, providing high performance computing (HPC) capabilities for ML workloads.

Importantly, compliance and the adoption of generative AI technologies like LLMs are not mutually exclusive in the AWS GovCloud (US) Regions. AWS recognizes the growing importance of generative AI and has made it possible for customers with specific regulatory needs to deploy LLMs in the AWS GovCloud (US) Regions with services such as Amazon Bedrock and Amazon SageMaker. Another way to achieve this is with Hugging Face Inference Containers, which is the approach this post takes.

We’ll utilize Amazon EC2 GPU instances and the Hugging Face Inference Container to host and serve custom LLMs in the AWS GovCloud (US) Regions. The Hugging Face Inference Container offers broad compatibility with many LLM architectures, optimized serving throughput with GPU sharding and batch processing, and traceability with OpenTelemetry distributed metrics.

The Hugging Face Inference Container allows an Amazon EC2 instance, such as one from the G4dn instance class, to host an API for language model serving.

Prerequisites

To follow along, you should have the following prerequisites:

- An AWS account with access to the AWS GovCloud (US) Regions
- Permissions to create Amazon EC2 instances, Amazon S3 buckets, IAM roles and policies, and SageMaker notebook instances
- Sufficient Amazon EC2 service quota for GPU-based instances, such as G4dn, in your chosen Region
- An SSH client and basic familiarity with the command line

Solution walkthrough

This solution walks you through creating an Amazon EC2 instance, downloading and deploying the container image, and hosting an LLM, with an optional step for storing custom model weights in Amazon S3. Follow the prerequisite checklist to make sure that you can properly implement this solution.

This solution was validated in the us-gov-west-1 Region.

Optional: Download Hugging Face model weights to Amazon S3

If you want to host a language model with this solution, first set up proper storage and permissions. If you already have a custom LLM stored in Amazon Simple Storage Service (Amazon S3) with the safetensors or .bin file format, you may skip the Amazon S3 bucket setup steps. However, for the inference server to properly download and deploy the custom model weights, make sure to add the proper read permissions for the Amazon EC2 instance to access and download the model weights from Amazon S3. At the end of this section, there is a pre-built notebook to help you download model weights from Hugging Face as a reference.

Setting up a model artifacts bucket for custom LLMs (this step is optional if you wish to use a pre-trained model available on Hugging Face):

a. On the Amazon S3 console, choose Create bucket, give the bucket a unique name, and note the Region. Leave everything else as the default and choose Create bucket.

b. On the AWS Identity and Access Management (IAM) console, create a new policy that allows read access to the specific bucket. In the navigation pane on the left side of the console, go to Policies and then Create policy. Select the option to edit the policy in JSON format, then paste the following code into the editor, replacing <your-bucket-name> with the name of your bucket. Note that Amazon Resource Names (ARNs) in the AWS GovCloud (US) Regions use the aws-us-gov partition, and the s3:ListBucket permission is needed for the aws s3 sync command used later:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws-us-gov:s3:::<your-bucket-name>/*"
        },
        {
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws-us-gov:s3:::<your-bucket-name>"
        }
    ]
}

c. Choose Next, give the policy a name and description, and then choose Create policy.

d. On the IAM console, create a role that lets the Amazon EC2 instance read the model artifacts from the Amazon S3 bucket you just created. Select Create role, select AWS service as the trusted entity type, and choose EC2 as the service.

e. In the next step, search for the policy you just created and choose the box next to it. Choose Next, then give the role a name and description, and choose Create role.

f. Optional step: Use the following notebook to copy a pre-trained model from Hugging Face to Amazon S3. You can use SageMaker Studio notebooks, SageMaker notebook instances, or a local Jupyter server if proper AWS credentials are added.
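If you don't have the referenced notebook at hand, a minimal sketch along these lines downloads the model weights with the huggingface_hub library and uploads them to Amazon S3. The model ID, bucket name, and prefix are placeholders to replace, and the sketch assumes the huggingface_hub and boto3 packages are installed:

import os

import boto3
from huggingface_hub import snapshot_download

# Placeholders: replace with your model ID, bucket name, and key prefix
model_id = "tiiuae/falcon-7b-instruct"
bucket_name = "<your-bucket-name>"
s3_prefix = "falcon-7b-instruct"

# Download the model weights and tokenizer files to the local disk
local_dir = snapshot_download(
    repo_id=model_id,
    allow_patterns=["*.safetensors", "*.bin", "*.json", "*.model", "*.txt"],
)

# Upload every downloaded file to Amazon S3 under the chosen prefix
s3 = boto3.client("s3")
for root, _, files in os.walk(local_dir):
    for file_name in files:
        local_path = os.path.join(root, file_name)
        s3_key = f"{s3_prefix}/{os.path.relpath(local_path, local_dir)}"
        s3.upload_file(local_path, bucket_name, s3_key)
        print(f"Uploaded s3://{bucket_name}/{s3_key}")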

Creating EC2 instance for LLM hosting

In this section, we configure and create an Amazon EC2 instance to host the LLM. This guide uses the G4dn instance class, but customers looking to serve larger language models may opt for the P4d instance class instead. We will be deploying a 7-billion-parameter model that has a GPU memory requirement of approximately 15 gigabytes (GB). As a rule of thumb, you should have about 1.5 times that GPU memory capacity available to run inference on a given language model (GPU memory specifications can be found in the Amazon ECS Developer Guide).

It is important to note that you can quantize the model. Quantizing a language model reduces the precision of the model weights to a size of your choosing. For example, the LLM we'll be using is Falcon-7b, which by default stores its weights in fp16, or 16-bit floating point. We can convert the model weights to int8 or int4 (8- or 4-bit integers) to shrink the memory footprint of the model to roughly 50 percent and 25 percent of its original size, respectively. Converting the weights to a smaller representation can affect the quality of the LLM output. In this guide, we use the default fp16 representation of Falcon-7b, so we require an instance type with at least 22 GB of GPU memory.
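As a rough back-of-the-envelope check, the following sketch estimates the weight footprint at each precision using the 1.5x rule of thumb mentioned earlier. These figures cover the weights only and ignore activation and key-value cache overhead, so treat them as approximations:

# Rough GPU memory estimate for the model weights at different precisions.
# Weights only: activations, KV cache, and framework overhead are not included.
params_billions = 7           # Falcon-7b
bytes_per_weight = {"fp16": 2, "int8": 1, "int4": 0.5}
headroom = 1.5                # ~1.5x rule of thumb used in this guide

for precision, num_bytes in bytes_per_weight.items():
    weights_gb = params_billions * num_bytes
    print(f"{precision}: ~{weights_gb:.1f} GB weights, "
          f"~{weights_gb * headroom:.1f} GB recommended GPU memory")

# fp16 works out to ~14 GB of weights, which is consistent with the
# approximately 15 GB requirement quoted above.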

Depending on the language model specifications, we also need to add Amazon Elastic Block Store (Amazon EBS) storage to properly store the model weights.

a. On the Amazon EC2 console, choose Launch instance.

b. Use the Deep Learning Proprietary Nvidia Driver AMI (Amazon Linux 2) offered by Amazon in the Quickstart AMIs tab.

Figure 1. Screenshot of an ideal Amazon Machine Image (AMI) for the Amazon EC2 instance. We use the preinstalled Nvidia drivers and Docker from this image. There are also options for OSS Nvidia drivers, which will also work for g4dn instances. Note that if you are using alternative instance classes in the AWS GovCloud (US) Regions, such as p3, you will be required to use the Nvidia proprietary AMI rather than the OSS AMI.

c. For the instance type, select g4dn.12xlarge. Depending on the size of the model, you may increase the GPU memory of the instance by selecting a different instance type (GPU memory per instance type is found at Amazon EC2 Instance Types).

d. Under Configure security group, remove or alter the existing rule to allow only inbound SSH traffic from My IP.

e. Set the EBS size to 128 GB. For larger models, you may need to increase this value accordingly.

f. Create and download a key pair for SSH access and choose Launch.

Configure EC2 for hosting

Amazon EC2 provides a robust environment for hosting containers, allowing users to efficiently deploy and manage applications in isolated portable environments. In this section, you will containerize your LLM using Docker to encapsulate the model and its dependencies.

a. Start Docker. Docker is installed on this instance, so execute these commands once connected to the instance:

sudo yum update -y    # Amazon Linux 2; use sudo apt-get update on Ubuntu-based AMIs
sudo systemctl start docker

b. Install Hugging Face TGI on the connected instance. Commands to execute are provided in the following steps.

i. Configure Nvidia Container Toolkit, which comes preinstalled on the AMI:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

ii. If you are using a pre-trained model from Hugging Face and did not follow the preceding optional steps to download artifacts to Amazon S3, execute the following commands to download the TGI container and deploy the model for inferencing:

model=tiiuae/falcon-7b-instruct #replace with any Hugging Face model
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id $model

iii. If you run into the error “Not enough memory to handle 4 prefill tokens,” run the updated command with the max-batch-prefill-tokens parameter specified:

model=tiiuae/falcon-7b-instruct #replace with any Hugging Face model
volume=$PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id $model --max-batch-prefill-tokens 2048

The max-batch-prefill-tokens parameter is a TGI launcher option that caps the number of tokens processed during the prefill (prompt-processing) stage of a batch, which controls memory usage and efficiency during inference. To optimize memory usage and speed up inference, TGI uses a technique called batching, where multiple input sequences are processed together in batches; lowering this limit reduces the GPU memory needed for prefill at the cost of smaller batches. To see batching in action, you can send several requests to the server concurrently, as shown in the sketch at the end of this section.

The G4dn instance class uses Nvidia T4 GPUs. Certain dependencies of the Hugging Face TGI container, such as Flash Attention 2, do not support this hardware, so models that require them will not work. If you wish to run this container in commercial AWS Regions, additional instance classes such as G5 and P4d may be used.

iv. If you are using a custom model, or you downloaded the model weights to Amazon S3 with the optional notebook, execute the following commands to download the model weights from Amazon S3 and deploy the container and model for inferencing:

aws s3 --region <s3-bucket-region> sync s3://<your-bucket-name>/<model-directory> $PWD/data
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id /data/

c. The docker run command will launch the inference server and make it available to serve requests. You need to run this command every time you want to start the inference server and serve the language model. Once the server is running, it will remain available until you stop or terminate the container.
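To observe the batching behavior described earlier, you can send several requests to the running server at once. The following is a minimal sketch, run from the instance itself (or any client that can reach port 8080); the prompts are arbitrary examples:

# Minimal sketch: send several requests concurrently so the server can batch them.
# Run from the EC2 instance itself (localhost) or any client that can reach port 8080.
import concurrent.futures
import requests

url = "http://localhost:8080/generate"   # adjust the host if calling remotely
prompts = [
    "What is Deep Learning?",
    "Explain cloud computing in one sentence.",
    "Name three uses of natural language processing.",
    "What is an Amazon EC2 instance?",
]

def generate(prompt):
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": 20}}
    response = requests.post(url, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["generated_text"]

# Submit all prompts at once; the server batches whatever arrives together.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(prompts)) as pool:
    for prompt, answer in zip(prompts, pool.map(generate, prompts)):
        print(f"{prompt!r} -> {answer!r}")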

Testing the server

Now that the container and LLM are successfully downloaded and deployed onto the instance, we can test the server. You can do this in a variety of ways, from connecting to the server over SSH and calling it locally, to modifying the security group rules for the instance to allow access from a specific IP address range. In this example, we'll create a SageMaker notebook instance in a different security group than the inference server, modify the traffic rules to allow communication, and run API commands to generate some test output.

a. We will create a SageMaker notebook instance to interact with the deployed LLM endpoint. Note that there are many different options to interact with the deployed endpoint; we use a SageMaker notebook instance for convenience, but you could use local compute or the AWS Command Line Interface (AWS CLI) instead.

b. Create a notebook instance in SageMaker with the ml.t2.medium instance type and a different security group than the inference server.

i. The inference server's VPC, subnet, and security group information can be found on its instance details page, under the Networking and Security tabs, respectively.

ii. When creating the notebook instance, under the Network – optional tab, enter the respective values.

c. On the Amazon EC2 security group page, modify the inbound rules to allow TCP traffic to port 8080 from the specified security group used for the notebook instance.

Figure 2. Screenshot showing security group rules for the inference server, allowing inbound connections from the user’s IP address for SSH access and communication to a security group over port 8080.

d. Navigate to your notebook instance by going to the SageMaker console -> Notebook -> Notebook instances -> your notebook instance name -> Open Jupyter. Create a new notebook with the conda_python3 kernel.

e. Copy the inference server’s public IPv4 address from the Amazon EC2 instance details page and paste it as the value of the ec2_public_ip_address variable in the following snippet. Then, execute this code in the first cell of the notebook.

import requests

# Public IPv4 address of the EC2 inference server and the port exposed by Docker
ec2_public_ip_address = '<EC2 Public IPv4 Address>'
inference_server_port = 8080

url = 'http://' + ec2_public_ip_address + ':' + str(inference_server_port) + '/generate'

# Prompt and generation parameters for the TGI /generate endpoint
data = {
    "inputs": "What is Deep Learning?",
    "parameters": {"max_new_tokens": 20}
}

headers = {'Content-Type': 'application/json'}

response = requests.post(url, json=data, headers=headers)

print(response.status_code)
print(response.text)

f. You should expect to see an output similar to the following, but the text may vary slightly due to the randomness inherent in the inference process of LLMs:

200
{"generated_text":"\nDeep learning is a branch of machine learning that uses artificial neural networks to learn and make decisions."}

Cleanup

In this guide, we created security groups, an optional Amazon S3 bucket, a SageMaker notebook instance, and an Amazon EC2 inference server. It's important to terminate the resources you created to avoid incurring additional costs. To do so, delete the Amazon S3 bucket, the SageMaker notebook instance, and the Amazon EC2 inference server along with its Amazon EBS volume.

Usage in cloud applications

Amazon EC2 launch templates can be used to deploy multiple instances of the inference server, with options for load balancing or auto scaling. You can use AWS Systems Manager to deploy patches or changes. Additionally, a shared file system could be used across all deployed Amazon EC2 resources to store the weights for multiple LLMs. You may also use Amazon API Gateway as an API endpoint for REST-based applications.

Conclusion

In this post, we walked you through setting up an Amazon EC2 instance for language model hosting and serving, storing and accessing custom model weights in Amazon S3, and interacting with the inference server through a SageMaker notebook instance. With this foundation, you can build applications enabled by generative AI, such as document processing services, entity extraction engines, and chatbots. Read our upcoming post about training LLMs on AWS GovCloud (US) (scheduled to publish May 30) to learn how to customize your language models to suit your use case.

Relevant links

John Kitaoka

John is a solutions architect at Amazon Web Services (AWS) and works with government entities, universities, nonprofits, and other public sector organizations to design and scale artificial intelligence (AI) solutions. His work covers a broad range of machine learning (ML) use cases, with a primary interest in inference, responsible AI, and security. In his spare time, he loves woodworking and snowboarding.

Ana Gosseen

Ana is a solutions architect at Amazon Web Services (AWS) and works closely with independent software vendors (ISVs) in the public sector space. She is passionate about driving digital transformation in the public sector through the latest advancements in generative artificial intelligence (AI). In her spare time, you can find Ana getting out into nature with her family and dog.

Joseph Gramstad

Joseph is a solutions architect at Amazon Web Services (AWS) and supports public sector customers, primarily in aerospace and defense. He is also a specialist in machine learning (ML) and focuses on distributed training and compression techniques. Joseph enjoys playing table tennis in his free time.