Train Llama2 with AWS Trainium on Amazon EKS

Introduction

Generative AI is not only transforming the way businesses function but also accelerating the pace of innovation within the broader AI field. This transformative force is redefining how businesses use technology, equipping them with capabilities to create human-like text, images, code, and audio, which were once considered beyond reach. Generative AI offers a range of applications that extend beyond simply executing prompts and facilitating interactive conversations. These models are increasingly used in diverse scenarios such as code generation, content summarization, data analysis, and more. Their adoption by enterprises across various industries highlights the versatility and utility of LLMs in addressing a multitude of complex tasks and challenges. However, these advantages pose a new set of challenges, particularly in the realms of training and operationalizing these massive models.

The escalating scale of Large Language Models (LLMs) and Generative AI greatly increases computational demands, leading to higher costs associated with the development and deployment. As the scale of data and the complexity of these models grow, so too does the need for more substantial resources to train them efficiently. This trend underscores the importance of cost-effective solutions like Amazon Elastic Kubernetes Service (Amazon EKS), which provides the necessary scalability and computational power to manage these extensive training workloads without incurring prohibitive expenses. According to projections by TIRIAS Research, AI infrastructure costs could surpass $76 billion by 2028. The existing business frameworks find it challenging to transfer these growing costs to consumers, necessitating either the advent of new business models or a substantial reduction in costs to ensure the continued growth and affordability of GenAI.

Amid rising costs and an increasingly scarce global compute supply, AWS Trainium offers a practical option for model developers. By using AWS Trainium, developers can reduce the cost of training their models by up to 50%, while also optimizing performance in distributed training use cases. This makes AWS Trainium a valuable asset for managing expenses and improving efficiency in deep learning model development. For more detail on its capabilities, see the AWS Trainium product page.

Distributed training architecture with AWS Trainium and Amazon EKS

The solution builds on a Data on Amazon EKS Terraform-based blueprint, which lets users provision an Amazon EKS cluster along with a managed EKS nodegroup containing Amazon Elastic Compute Cloud (Amazon EC2) Trn1 instances. Each trn1.32xlarge instance has 16 AWS Trainium chips, which can be used for scalable, high-performance, and cost-effective model training. Within the nodegroup, the Trn1 instances are connected via high-speed, low-latency Elastic Fabric Adapter (EFA) networking to enable the collective communications required during distributed training.

Each Llama training job is executed via Kubernetes pods using a container image that includes the Neuron SDK (the software stack for Trn1 instances) and the AWS Neuron Reference for NeMo Megatron – a fork of the open-source packages NeMo and Apex that have been adapted for use with OpenXLA and AWS Neuron. The combined software stack provides advanced training strategies and features including data, tensor, pipeline and sequence parallelism, selective activation checkpointing, and ZeRO-1 optimizer sharding.

The Kubernetes MPI Operator is used to coordinate distributed training across multiple pods, where each worker pod runs on a single trn1.32xlarge instance.
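
As a quick sanity check on the scale involved, the arithmetic below multiplies the 16 Trainium chips per trn1.32xlarge instance by the number of worker pods. This is only a sketch: `total_trainium_chips` is a hypothetical helper and not part of the blueprint, and the four-node figure matches the nodegroup size configured later in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Data on EKS blueprint).
# Each trn1.32xlarge instance has 16 AWS Trainium chips, and each
# worker pod runs on exactly one instance.
CHIPS_PER_TRN1_32XL=16

total_trainium_chips() {
  local nodes="$1"
  echo $(( nodes * CHIPS_PER_TRN1_32XL ))
}

# Example: chips available to a job spanning 4 worker pods
total_trainium_chips 4   # prints 64
```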

An Amazon FSx for Lustre shared filesystem is attached to the worker pods, providing a shared location to store the dataset, tokenizer files, Llama training scripts, training logs, compilation artifacts, and model checkpoints.

Solution Overview

Training Llama2 using AWS Trainium on Amazon EKS

Note: This post makes use of Meta’s Llama tokenizer, which is protected by a user license that must be accepted before the tokenizer files can be downloaded. Please ensure that you have access to the Llama files by requesting access here.

Prerequisites:

To install all the prerequisites on Amazon EC2, you can run this script.

Note: Remember to log out and log back in after running the prerequisite script to ensure all changes, particularly the docker group changes, are applied to your user account.

Step 1: Clone the Data on EKS repository

git clone https://github.com/awslabs/data-on-eks.git

Navigate to the trainium-inferentia directory.

cd data-on-eks/ai-ml/trainium-inferentia

By default, the MPI Operator is not installed: the enable_mpi_operator variable is set to false.

MPI Operator var file

For this post, run the following export commands to set the environment variables.

Note: As of January 1, 2024, AWS Trainium instances are only available in the us-west-2, us-east-1, and us-east-2 Regions.

export TF_VAR_enable_mpi_operator=true
export TF_VAR_region=us-west-2
export TF_VAR_trn1_32xl_min_size=4
export TF_VAR_trn1_32xl_desired_size=4
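
Before running the install script, it can help to validate these values. The following is a minimal sketch, not part of the blueprint: the function names are hypothetical, the Region list is simply the one from the note above, and the size check assumes the usual nodegroup rule that the minimum size cannot exceed the desired size.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; function names are illustrative only.
# Region list taken from the note above (as of January 1, 2024).
is_trainium_region() {
  case "$1" in
    us-west-2|us-east-1|us-east-2) return 0 ;;
    *) return 1 ;;
  esac
}

check_trn1_settings() {
  local region="$1" min_size="$2" desired_size="$3"
  if ! is_trainium_region "$region"; then
    echo "Region '$region' is not in the list of Trainium Regions" >&2
    return 1
  fi
  if [ "$min_size" -gt "$desired_size" ]; then
    echo "min size ($min_size) exceeds desired size ($desired_size)" >&2
    return 1
  fi
  echo "ok"
}

# Usage with the variables exported above (fallbacks are this post's values)
check_trn1_settings "${TF_VAR_region:-us-west-2}" \
  "${TF_VAR_trn1_32xl_min_size:-4}" "${TF_VAR_trn1_32xl_desired_size:-4}"
```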

Step 2: Run the install script to provision an Amazon EKS cluster with all the add-ons needed for the solution.

Note: Before you run the script, you can also change the cluster name based on your naming requirements.

./install.sh

Shows successful deployment of terraform module and deployment of trainium-inferentia module

Step 3: Update your kubeconfig to get access to the Amazon EKS cluster, which the following steps require.

aws eks update-kubeconfig --region us-west-2 --name trainium-inferentia

Step 4: Navigate to the examples/llama2 directory

cd examples/llama2/

Run the 1-llama2-neuronx-pretrain-build-image.sh script to build the neuronx-nemo-megatron container image and push the image into Amazon ECR.

When prompted for a Region, enter the Region in which you launched your Amazon EKS cluster (Step 1).

./1-llama2-neuronx-pretrain-build-image.sh

Pre-train build image command

Note: Building the image and pushing it to Amazon ECR takes approximately 10 minutes.

Image build output screenshot

Step 5: Access the shared Amazon FSx storage.

To copy files to this storage, we’ll first launch and connect to a CLI pod running the neuronx-nemo-megatron docker image that you created previously.

Run the following script to launch the CLI pod:

./2-launch-cmd-shell-pod.sh

Image showing how to create a shell pod

Run the following command and watch for the CLI pod to enter the ‘Running’ state:

kubectl get pod -w

Image showing the shell pod going into running state

Step 6: Once the CLI pod is ‘Running’, connect to it using the following command:

kubectl exec -it cli-cmd-shell -- /bin/bash

From the CLI pod, we’ll download the Llama tokenizer files. First, run the huggingface-cli login command to log in to Hugging Face using your access token.

The access token is found under Settings → Access Tokens on the Hugging Face website.

hugging face settings section for the access token

huggingface-cli login

Logging in to Hugging Face from the CLI pod

Paste the access token and press Enter.

Note: Do not add the token as a Git credential.

image showcasing not to store token as git credential

Step 7: Download the Llama2-7B tokenizer files to /shared/llama7b_tokenizer by running the following Python code.

python3 <<EOF
import transformers
tok = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tok.save_pretrained("/shared/llama7b_tokenizer")
EOF

downloading llama2 tokenizer files

Step 8: Download and tokenize the RedPajama-Data-1T-Sample dataset (a small subset of the full RedPajama dataset that contains 1B tokens).

While still connected to the CLI pod, use Git to download the dataset:

cd /shared
git clone https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample \
    data/RedPajama-Data-1T-Sample

downloading redpajama dataset

Step 9: Tokenize the dataset using the preprocessing script included with neuronx-nemo-megatron. This preprocessing step will take approximately 60 minutes to run on a trn1.32xlarge instance.

cd /shared

# Clone the neuronx-nemo-megatron repo, which includes the required scripts
git clone https://github.com/aws-neuron/neuronx-nemo-megatron.git

# Combine the separate redpajama files to a single jsonl file
cat /shared/data/RedPajama-Data-1T-Sample/*.jsonl > /shared/redpajama_sample.jsonl

# Run preprocessing script using llama tokenizer
python3 neuronx-nemo-megatron/nemo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input=/shared/redpajama_sample.jsonl \
    --json-keys=text \
    --tokenizer-library=huggingface \
    --tokenizer-type=/shared/llama7b_tokenizer \
    --dataset-impl=mmap \
    --output-prefix=/shared/data/redpajama_sample \
    --append-eod \
    --need-pad-id \
    --workers=32

running pre-processing script using llama tokenizer

preprocessing output

As the output shows, 930,500 documents are processed during tokenization.
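
To confirm the tokenized dataset landed where the training script will look for it, a check along the following lines can be run from the CLI pod. It is a sketch under one assumption: that the preprocessing script follows the Megatron memory-mapped naming convention, writing a `.bin` data file and an `.idx` index file named `<output-prefix>_<json-key>_document`. The helper names `dataset_prefix` and `dataset_ready` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. Assumes the Megatron mmap naming convention
# <output-prefix>_<json-key>_document.{bin,idx}.
dataset_prefix() {
  local output_prefix="$1" json_key="$2"
  echo "${output_prefix}_${json_key}_document"
}

# Succeeds only when both files exist and are non-empty
dataset_ready() {
  local prefix="$1"
  [ -s "${prefix}.bin" ] && [ -s "${prefix}.idx" ]
}

# Example with the --output-prefix and --json-keys values used in this post
prefix="$(dataset_prefix /shared/data/redpajama_sample text)"
if dataset_ready "$prefix"; then
  echo "found ${prefix}.bin and ${prefix}.idx"
else
  echo "tokenized dataset not found at ${prefix}.{bin,idx}"
fi
```

The derived prefix, /shared/data/redpajama_sample_text_document, is the same value that Step 10 writes into DATASET_PATH.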

Note: When we later launch our training jobs in Amazon EKS, the training pods will run the training script from within the neuronx-nemo-megatron/nemo/examples directory on Amazon FSx. This is convenient because it lets you modify your training script directly on Amazon FSx without rebuilding the neuronx-nemo-megatron container for every change.

Step 10: Modify the test_llama script /shared/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/test_llama.sh to update the following two lines. These lines tell the training pod workers where to find the Llama tokenizer and the dataset on the Amazon FSx filesystem.

Run:

sed -i 's#^\(: ${TOKENIZER_PATH=\).*#\1/shared/llama7b_tokenizer}#' /shared/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/test_llama.sh
sed -i 's#^\(: ${DATASET_PATH=\).*#\1/shared/data/redpajama_sample_text_document}#' /shared/neuronx-nemo-megatron/nemo/examples/nlp/language_modeling/test_llama.sh

Before:

llama2 test script modification

After:

test script modification
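
The two lines being edited use the shell idiom `: ${VAR=default}`, which assigns a default only when the variable is unset. To see what the two sed commands do without touching the real script, the sketch below applies them to a scratch file. The starting values are placeholders chosen for illustration, not necessarily what test_llama.sh ships with, and `sed -i` is used in its GNU form.

```shell
#!/usr/bin/env bash
# Demonstration on a scratch file; the starting values are placeholders,
# not necessarily what test_llama.sh ships with.
demo="$(mktemp)"
cat > "$demo" <<'EOF'
: ${TOKENIZER_PATH=/placeholder/tokenizer}
: ${DATASET_PATH=/placeholder/dataset_text_document}
EOF

echo "Before:"; cat "$demo"

# Same expressions as in Step 10, pointed at the scratch file (GNU sed -i)
sed -i 's#^\(: ${TOKENIZER_PATH=\).*#\1/shared/llama7b_tokenizer}#' "$demo"
sed -i 's#^\(: ${DATASET_PATH=\).*#\1/shared/data/redpajama_sample_text_document}#' "$demo"

echo "After:"; cat "$demo"
# Remove the scratch file when done: rm -f "$demo"
```

Each expression captures everything up to and including the `=` and replaces the rest of the line with the new path plus the closing brace.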

Step 11: When you are finished with the CLI pod you can delete it by running:

kubectl delete pod cli-cmd-shell

delete cli pod

Step 12: We are now ready to launch our pre-compilation and training jobs!

Before we can run the training job, we first need to run a pre-compilation job in order to prepare the model artifacts. This step extracts and compiles the underlying compute graphs for the Llama2-7B model and generates AWS Neuron executable files (NEFFs) that can run on the AWS Trainium chips. These NEFFs are stored in a persistent AWS Neuron cache on Amazon FSx so that the training job can later access them.

Before you run the compilation job, make sure the MPI Operator is functional by running this command:

kubectl get all -n mpi-operator

mpi operator status

Run the pre-compilation script:

./3-llama2-neuronx-mpi-compile.sh

compilation

Pre-compilation will take approximately 10 minutes when using 4 trn1.32xlarge nodes.

Periodically run kubectl get pods | grep compile and wait until you see that the compile job shows Completed.
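
Instead of re-running the command by hand, the check can be wrapped in a small polling loop. This is a sketch only: `wait_for` and `compile_done` are hypothetical helpers, and the timeout and interval in the example are arbitrary choices rather than values from the blueprint.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or a timeout expires.
# usage: wait_for <timeout_seconds> <interval_seconds> <command> [args...]
wait_for() {
  local timeout="$1" interval="$2"
  shift 2
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
}

# Succeeds when a pod whose name contains "compile" reports Completed
compile_done() {
  kubectl get pods | grep compile | grep -q Completed
}

# Example (run with cluster access): poll every 30 s for up to 30 minutes
# wait_for 1800 30 compile_done && echo "pre-compilation finished"
```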

completion of compilation phase

Step 13: When pre-compilation is complete, launch the pre-training job on 4 trn1.32xlarge nodes by running the following script:

./4-llama2-neuronx-mpi-train.sh

Step 14: To monitor the training job output, first find the name of the launcher pod associated with your training job:

kubectl get pods | grep launcher

training

Once you have identified the name of the launcher pod and see that it is Running, the next step is to determine its UID.

Replace test-mpi-train-launcher-g52f4 with your launcher pod name in the following command, which outputs the UID:

kubectl get pod test-mpi-train-launcher-g52f4 -o json | jq -r ".metadata.uid"

get UID of the pod

Step 15: Use the UID to determine the log path so you can tail the training logs. Replace <UID> in the following command with the value from the previous step.

kubectl exec -it test-mpi-train-worker-0 -- tail -f /shared/nemo_experiments/<UID>/0/log

check logs based on UID

When you are done viewing the logs, you can press CTRL-C to quit the tail command.
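
Steps 14 and 15 can also be combined so the UID does not have to be copied by hand. The sketch below is one way to do that, not part of the blueprint: the helper names are hypothetical, the worker pod name and log layout are the ones shown in this post, and kubectl's built-in jsonpath output is used in place of jq.

```shell
#!/usr/bin/env bash
# Hypothetical helpers combining Steps 14 and 15. The worker pod name and
# log layout are the ones shown in this post.
nemo_log_path() {
  local uid="$1"
  echo "/shared/nemo_experiments/${uid}/0/log"
}

tail_training_log() {
  local launcher="$1" worker="${2:-test-mpi-train-worker-0}"
  local uid
  # jsonpath is used here instead of jq; both read .metadata.uid
  uid="$(kubectl get pod "$launcher" -o jsonpath='{.metadata.uid}')" || return 1
  kubectl exec -it "$worker" -- tail -f "$(nemo_log_path "$uid")"
}

# Example (requires cluster access; substitute your launcher pod name):
# tail_training_log test-mpi-train-launcher-g52f4
```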

Step 16: To monitor AWS Trainium chip utilization you can use the neuron-top command.

neuron-top is a console-based tool for monitoring AWS Neuron and system-related performance metrics on Trn1, Inf2, and Inf1 instances. You can launch neuron-top on one of the worker pods as follows:

kubectl exec -it test-mpi-train-worker-0 -- /bin/bash -l neuron-top

neuron-top command

Step 17: Create a TensorBoard deployment to visualize the training logs by running the following command:

./5-deploy-tensorboard.sh

TensorBoard logs are also available in the /shared/nemo_experiments/ directory on the Amazon FSx for Lustre filesystem. Once the deployment is ready, the script outputs a password-protected load balancer URL for your new TensorBoard deployment.

Open the load balancer URL in a browser to view your training progress:

tensorboard dashboard output

Cleaning up

To clean up all the provisioned resources for this post, run the cleanup script:

cd data-on-eks/ai-ml/trainium-inferentia
./cleanup.sh

Conclusion

In this post we showed you how AWS Trainium’s integration with Neuronx-nemo-megatron on Amazon EKS marks a significant stride in tackling the rising computational demands and cost challenges in training advanced AI models. Notably, AWS Trainium offers up to 50% cost savings in training, coupled with high-performance capabilities. This, along with the Neuron SDK’s compatibility with popular machine learning (ML) frameworks, establishes an optimal environment for AI model training. The inclusion of the MPI Operator and Data on Amazon EKS (DoEKS) further enhances the efficiency and scalability of distributed training. Innovative features like ZeRO-1 optimizer sharding and selective activation checkpointing not only make cutting-edge ML research more accessible but also drive the AI industry towards unprecedented innovation.

Key links & references
AWS Trainium

AWS Neuron

AWS Neuron Reference for NeMo Megatron

Tensor, Pipeline, Sequence, Data parallelism

ZeRO-1 (Optimizer Sharding)

Activation Checkpointing

MPI Operator

DataOnEKS (DoEKS)