AWS Public Sector Blog

Fine-tuning an LLM using QLoRA in AWS GovCloud (US)


Introduction

Government agencies are increasingly using large language models (LLMs) powered by generative artificial intelligence (AI) to extract valuable insights from their data in the Amazon Web Services (AWS) GovCloud (US) Regions. In this guide, we walk you through the process of adapting LLMs to specific domains with parameter-efficient fine-tuning techniques made accessible through Amazon SageMaker integrations with Hugging Face.

Solution overview

AWS GovCloud (US) plays a crucial role in supporting public sector workloads by providing a dedicated cloud environment that meets stringent regulatory and compliance requirements. It offers a range of services that enable public sector entities to innovate, scale, and modernize their IT infrastructure securely, such as Amazon Elastic Compute Cloud (Amazon EC2) to power and host applications, and SageMaker to build, train, and deploy machine learning (ML) models.

For this walkthrough, we use SageMaker notebooks, Amazon EC2 graphics processing unit (GPU) instances, and prebuilt Hugging Face training containers to fine-tune custom language models in the AWS GovCloud (US) Regions.

Language models aim to predict the most likely sequence of tokens based on their understanding of the relationships between tokens, or the relationships between words and phrases. A pre-trained, general-use language model may not be properly equipped to handle domain-specific terminology (for example, in the medical, legal, and financial fields). Thus, customers may benefit from fine-tuning a language model on text completion in those domains.

Hugging Face provides popular libraries for implementing Quantized Low Rank Adaptation (QLoRA), a fine-tuning method that we use in this walkthrough. QLoRA is a technique that allows language models to efficiently learn on smaller, high-quality datasets by freezing a portion of the original model weights and training an adapter overlay, which can be merged with the base model.

Prerequisites

In order to follow along, you should have the following prerequisites:

  • An AWS account with access to the AWS GovCloud (US) Regions.
  • Access to an ml.g4dn.12xlarge Amazon SageMaker Training instance type or larger, such as an ml.p3dn.24xlarge, for training, and to an ml.t3.medium instance for an Amazon SageMaker notebook. Please note that instance types prefixed with ‘ml.’ are specifically used by SageMaker and are limited by a different quota than Amazon EC2 instances. Confirm quota increases prior to deployment (see the sketch after this list for a programmatic check).
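If you prefer to check your current limits programmatically rather than in the Service Quotas console, a minimal sketch follows. It assumes the boto3 Service Quotas client is available in your environment; the quota name filter is an assumption, so confirm the exact quota names in the console.

import boto3

# Hedged sketch: list SageMaker quotas in AWS GovCloud (US-West) that mention
# the training instance type. Quota names are assumptions; verify them in the
# Service Quotas console before relying on this output.
client = boto3.client("service-quotas", region_name="us-gov-west-1")

paginator = client.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "g4dn.12xlarge" in quota["QuotaName"]:
            print(f'{quota["QuotaName"]}: {quota["Value"]}')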

Solution walkthrough: Fine-tuning an LLM using QLoRA in AWS GovCloud (US)

This walkthrough covers how to deploy a notebook instance in SageMaker, preprocess our fine-tuning dataset, and then launch a SageMaker training job from that notebook. The training job executes on a separate Amazon EC2 instance managed by SageMaker. Upon conclusion of the training job, the updated model weights are saved to Amazon Simple Storage Service (Amazon S3) for subsequent deployment to an inference endpoint. The walkthrough code can be downloaded using the following links:

Deploying SageMaker notebook

Navigate to the SageMaker console and, from the menu under the Notebook tab, select Notebook instances. To provision a notebook instance, you must specify a name and an AWS Identity and Access Management (IAM) role that defines the notebook’s privileges. Using default values for all other fields will work for this walkthrough, but you can also specify an Amazon Virtual Private Cloud (Amazon VPC) to deploy into, the type of notebook kernel to use, and Git repositories to clone, among other configurations. Select Create notebook instance when finished.

Figure 1. Screenshot of the Create notebook instance in Amazon SageMaker.

You will be redirected to the Notebook instances pane, as shown in Figure 2. Wait until the status indicates InService, and then, under Actions, select the Open Jupyter link.

Figure 2. The Notebook instances pane in SageMaker after a notebook is created.

A Jupyter notebook dashboard will open. From the dropdown menu, select New and then Folder. Name the new folder “scripts” by clicking the ‘rename’ button, navigate into it, and upload the requirements.txt and train.py files.

Figure 3. The scripts folder after the required code is uploaded.

Navigate back to the top-level pane and upload the qlora.ipynb file. Open this file to get started. In the top right corner of the qlora notebook, conda_pytorch_p310 should be visible. This indicates the notebook’s environment: in this case, Python 3.10 with conda as the package manager and the PyTorch packages installed.

Preprocess dataset and launch training job

We are using a summarization dataset from Hugging Face that is accessed via the datasets library, but if you wish to use your own custom dataset, you can follow these steps to properly format it for training. You may need to change the prompt generation and tokenizer functions to adapt to your dataset. We will also be formatting the dataset for instruction tuning, which can allow a fine-tuned model to respond better to instructions in an input-output format.

The Jupyter notebook can be run in full or incrementally, cell by cell, by pressing Shift + Enter. The first cell contains the following code, which creates a SageMaker session, specifies a default Amazon S3 bucket to store the input and output of the training process, and retrieves a SageMaker execution role that defines the permissions of the notebook instance:

%pip install transformers datasets

import boto3
import sagemaker
import time
from sagemaker.huggingface import HuggingFace
from transformers import AutoTokenizer
from datasets import load_dataset

sess = sagemaker.Session()
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

The format_prompt function takes an entry from our dataset and restructures it into a question-answering style format, pairing each bill's text with its summary. This will help our fine-tuned language model recognize a common query format.

def format_prompt(entry):
    return f"""
        <Text>: {entry["text"]}
        <Summary>: {entry["summary"]}
        """.strip()

The tokenize_prompt function is a wrapper function that first reformats the datapoint text using the format_prompt function and then tokenizes the text. Tokenization involves splitting the text into parts, mapping those parts to integers, and standardizing the length of each data point by adding special tokens at the end of the sequence (padding) or removing tokens from the end of the sequence (truncation).

def tokenize_prompt(entry):
    formatted_prompt = format_prompt(entry)
    tokenized_prompt = tokenizer(formatted_prompt, padding=True, truncation=True)
    return tokenized_prompt

The next cell specifies the model to fine-tune, the dataset to fine-tune on, the dataset split to use for training, and the Amazon S3 prefix where the training data will be saved after processing:

model_id = "tiiuae/falcon-7b"
dataset_name = "billsum"
split_type = "train[:10%]"
s3_prefix_dataset = "dataset"

First, we load 10 percent of the training split and shuffle it. The dataset is then processed by the tokenize_prompt function defined above and converted to a PyTorch-compatible format:

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

dataset = load_dataset(dataset_name, split=split_type)
dataset = dataset.shuffle().map(tokenize_prompt)

dataset.set_format("torch")
dataset.format
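As a quick sanity check, you can inspect a single processed entry to confirm that the tokenized fields were added alongside the original columns. The output shown in the comments is illustrative; exact columns depend on the dataset.

# Inspect one processed example; after set_format("torch"), tensor fields are
# returned as PyTorch tensors.
example = dataset[0]
print(example.keys())             # original columns plus 'input_ids' and 'attention_mask'
print(example["input_ids"][:10])  # first ten token IDs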

Then the processed dataset is saved to Amazon S3, where it can be retrieved by the training job:

dataset_path = f's3://{sess.default_bucket()}/{s3_prefix_dataset}'
dataset.save_to_disk(dataset_path)

Hyperparameters are variables that are passed into the training script. The following cell defines the hyperparameters and assigns names to the training script entry point and to the training job as it will appear in SageMaker:

hyperparameters={
    'model_id': model_id,
    'epochs': 1,
    'lr': 2e-4,
    'lora_r': 32,
    'lora_alpha': 16,
    'lora_dropout': 0.05,
    'lora_bias':"none",
    'lora_task_type':"CAUSAL_LM",
    'output_dir':"/opt/ml/output/data",
    'train_file':"/data-00000-of-00001.arrow",
    'merge_weights': True
}

entry_point = 'train.py'
job_name = f'{entry_point[:-3]}-{time.strftime("%Y-%m-%d-%H-%M", time.localtime())}'
print(job_name)

Finally, we define the training job that will run on a prebuilt Hugging Face container by passing parameters to a Hugging Face Estimator object. These parameters include the training script to use, the instance type to train on, Python and package versions, and other configurations. The final cell of the Jupyter notebook initiates the training job by calling the fit function of the Estimator and passing in the Amazon S3 location of the processed dataset:

huggingface_estimator = HuggingFace(
    entry_point          = entry_point,
    source_dir           = 'scripts',
    instance_type        = 'ml.g4dn.12xlarge',
    instance_count       = 1,
    base_job_name        = job_name,
    role                 = role,
    transformers_version = '4.28',
    pytorch_version      = '2.0',
    py_version           = 'py310',
    hyperparameters = hyperparameters
)

huggingface_estimator.fit({
    's3_data': dataset_path
})
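By default, fit streams the training logs into the notebook until the job finishes. If you would rather launch the job and return to it later, you can pass wait=False and poll the job status; a minimal sketch using the SageMaker Python SDK:

# Launch without blocking the notebook, then check the job status on demand.
huggingface_estimator.fit({'s3_data': dataset_path}, wait=False)

description = huggingface_estimator.latest_training_job.describe()
print(description['TrainingJobStatus'])   # for example: InProgress, Completed, Failed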

The training job

The fine-tuning job will be launched in a Hugging Face container on an ml.g4dn.12xlarge SageMaker training instance, which contains 4 NVIDIA T4 Tensor Core GPUs with 16 GB of memory each. The training script (train.py) uses an argument parser to reference the hyperparameters we specified in the Jupyter notebook. For example, the hyperparameter dictionary entry 'lora_alpha': 16 is passed into train.py as "--lora_alpha <value>" and can be referenced in the script as args.lora_alpha. The requirements.txt file lists the packages that will be installed into the train.py Python environment.

transformers==4.41.2
peft==0.11.1
accelerate==0.30.1
bitsandbytes==0.43.1
safetensors==0.4.3
tokenizers==0.19.1

A few modules are imported into the training script:

import argparse
import torch
import transformers
from datasets import Dataset
from peft import (
    LoraConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
    PeftModel
)
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
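The full train.py is available in the downloadable code; for orientation, a minimal sketch of the argument parsing it relies on might look like the following. The argument names mirror the hyperparameter dictionary defined in the notebook, while the defaults and helper shown here are illustrative assumptions rather than the script's actual values:

def parse_args():
    # Each notebook hyperparameter arrives as a command line flag,
    # for example: --lora_alpha 16
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--lr", type=float, default=2e-4)
    parser.add_argument("--lora_r", type=int, default=32)
    parser.add_argument("--lora_alpha", type=int, default=16)
    parser.add_argument("--lora_dropout", type=float, default=0.05)
    parser.add_argument("--lora_bias", type=str, default="none")
    parser.add_argument("--lora_task_type", type=str, default="CAUSAL_LM")
    parser.add_argument("--output_dir", type=str, default="/opt/ml/output/data")
    parser.add_argument("--train_file", type=str)
    # SageMaker passes hyperparameters as strings, so booleans need explicit handling.
    parser.add_argument("--merge_weights", type=lambda v: str(v).lower() == "true", default=True)
    return parser.parse_args()

args = parse_args()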

And the dataset that we preprocessed is loaded into memory:

train_dataset = Dataset.from_file("/opt/ml/input/data/s3_data/" + args.train_file)

The training script uses quantization to shrink the memory requirements of the model weights, which will enable us to run model inference on a smaller instance size. This configuration is specified in the BitsAndBytesConfig object. Here, we quantize the model weights to 4 bits (instead of the default 16-bit floating point representation). Further, we use double quantization (which compresses the stored model weights further) and the ‘nf4’ representation (a normalized 4-bit float) for higher precision. The compute_dtype specifies the representation to use when performing computations on the model weights. So, in this example, we store the model weights at 4 bits and dequantize them to float16 when making a forward or backward pass against them (that is, when updating the weights during training or drawing inference from them). This speeds up the training process.

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
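For a rough sense of the savings, a 7-billion-parameter model such as Falcon-7B occupies roughly 14 GB when its weights are stored in float16 (2 bytes per weight), but only about 3.5 GB when they are stored at 4 bits, before accounting for activations, gradients, and optimizer state.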

Next, we load the base model and apply the bnb_config to its original weights.

base_model = AutoModelForCausalLM.from_pretrained(
    args.model_id,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)

Then we load the tokenizer used for our base model and specify a padding token to use. The base model is prepared for training by casting the language modeling (LM) head and layer normalization to 32-bit floating point (FP32) and requiring gradients for the output embedding layer.

tokenizer = AutoTokenizer.from_pretrained(args.model_id)
tokenizer.pad_token = tokenizer.eos_token

base_model = prepare_model_for_kbit_training(base_model)

Low Rank Adapters (LoRAs) are additional weights that are added to a base model when performing domain adaptation. When used in combination with quantization, the technique is known as QLoRA. Forward passes are computed by adding the output of the original weights to the output of the adapter weights. However, only the LoRA weights are updated during training, while the original model weights are quantized and frozen. A LoRA config is used to specify how adapters (that is, additional weights) are added to our model. The ‘target_modules’ parameter specifies the pretrained model layers to modify with low rank adapters. The parameter ‘r’ specifies the size (that is, learning capacity) of each adapter, and ‘alpha’ defines the scaling factor of the adapter weights. In our example, with alpha = 16 and r = 32, the adapter output is effectively scaled by 16/32 = 0.5. These adapters are attached to the attention mechanism and fully connected layers of our model; a conceptual sketch of the adapted forward pass follows the configuration code below.

config = LoraConfig(
    r=args.lora_r,
    lora_alpha=args.lora_alpha,
    lora_dropout=args.lora_dropout,
    bias=args.lora_bias,
    task_type=args.lora_task_type,
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ],
)

peft_model = get_peft_model(base_model, config)
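To build intuition for what these adapters do, the forward pass of an adapted layer can be thought of as the frozen base projection plus a scaled low-rank update. The following is a conceptual sketch only, not the peft implementation, using the r and alpha values from this walkthrough:

import torch.nn as nn

class LoraLinearSketch(nn.Module):
    """Conceptual LoRA layer: y = W0(x) + (alpha / r) * B(A(x)). Illustrative only."""

    def __init__(self, in_features, out_features, r=32, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                # frozen (and, in QLoRA, quantized) base weights
        self.lora_a = nn.Linear(in_features, r, bias=False)   # trainable down-projection to rank r
        self.lora_b = nn.Linear(r, out_features, bias=False)  # trainable up-projection back to out_features
        self.scaling = alpha / r                               # 16 / 32 = 0.5 in this walkthrough

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

After get_peft_model returns, calling peft_model.print_trainable_parameters() confirms that only a small fraction of the total weights will be updated during training.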

After specifying training parameters, we create a Trainer object that takes the QLoRA-modified LLM, the training dataset, and a data collator object that pads input sequences to the proper length across a batch.

training_args = transformers.TrainingArguments(
    auto_find_batch_size=True,
    num_train_epochs=args.epochs,
    learning_rate=args.lr,
    output_dir=args.output_dir,
)

trainer = transformers.Trainer(
    model=peft_model,
    train_dataset=train_dataset,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=training_args,
)

peft_model.config.use_cache = False
trainer.train()

Upon completion of the training job, we save the tokenizer and trained model to the /opt/ml/model/ path on our container. Anything saved in this directory at the conclusion of the training job is uploaded to Amazon S3. Here we opt to merge the adapter weights back into the original weights, creating a model with the same shape as the base model. The alternative is to load separate domain adapters onto the base model at inference time, which increases inference latency.

tokenizer.save_pretrained("/opt/ml/model/")

if args.merge_weights:
    trainer.model.save_pretrained("/tmp")
    del base_model
    del trainer
    torch.cuda.empty_cache()

    original_model = AutoModelForCausalLM.from_pretrained(
        args.model_id,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.float16,
        quantization_config=bnb_config,
    )

    model_to_merge = PeftModel.from_pretrained(original_model, "/tmp")
    merged_model = model_to_merge.merge_and_unload()
    merged_model.save_pretrained("/opt/ml/model/", safe_serialization=True)

else:
    trainer.model.save_pretrained("/opt/ml/model/", safe_serialization=True)
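Once the training job completes, SageMaker packages the contents of /opt/ml/model/ into a model.tar.gz archive and uploads it to Amazon S3. Back in the notebook, the location of that artifact can be read from the estimator, assuming the job has finished:

# S3 URI of the packaged model artifact produced by the training job.
print(huggingface_estimator.model_data)
# for example: s3://<bucket>/<training-job-name>/output/model.tar.gz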

Formatting data for domain adaptations and other use cases

This walkthrough primarily showcases how to properly format a dataset and fine-tune a language model using an instruction-based dataset. Depending on the use case, it’s important to use high-quality data that is representative of the model’s intended use. A common implementation is to format the dataset as a series of text completion instructions drawn from the target domain.

Cleanup

The training job launched in this walkthrough automatically releases the resources it uses upon completion. To clean up and avoid any additional costs, we just need to shut down the SageMaker notebook instance we used to preprocess the dataset and launch the training job. The total cost for this walkthrough was approximately $86.29, and the training job took around 14 hours to complete on an ml.g4dn.12xlarge instance.

Conclusion

Upon successful completion of the training job, the final model weights are saved in the Amazon S3 bucket that we identified in our SageMaker notebook. A follow-up post showing how to take these model weights and deploy them to an inference endpoint hosted on an Amazon EC2 instance will be published on the AWS Public Sector Blog at a later date. Another option is to use SageMaker to manage the inference endpoint deployment. For fine-tuning large language models without provisioning infrastructure or writing training scripts, learn more about Amazon Bedrock.


Joseph Gramstad

Joseph is a solutions architect at Amazon Web Services (AWS) and supports public sector customers, primarily in aerospace and defense. He is also a specialist in machine learning (ML) and focuses on distributed training and compression techniques. Joseph enjoys playing table tennis in his free time.

Ana Gosseen

Ana is a solutions architect at Amazon Web Services (AWS) and works closely with independent software vendors (ISVs) in the public sector space. She is passionate about driving digital transformation in the public sector through the latest advancements in generative artificial intelligence (AI). In her spare time, you can find Ana getting out into nature with her family and dog.

John Kitaoka

John is a solutions architect at Amazon Web Services (AWS) and works with government entities, universities, nonprofits, and other public sector organizations to design and scale artificial intelligence (AI) solutions. His work covers a broad range of machine learning (ML) use cases, with a primary interest in inference, responsible AI, and security. In his spare time, he loves woodworking and snowboarding.