Abstract
This article demonstrates how to use the Mountpoint for Amazon S3 CSI driver to mount an S3 bucket into a Kubernetes container, so that the LLM training script inside the container can train directly on data in the S3 bucket by reading from the Mountpoint for S3 mount directory. The walkthrough fine-tunes LLaMA 2 on an EC2 g5.2xlarge instance, so readers can reproduce it with modest GPU resources and become familiar with the process and configuration for fine-tuning a large model with the Mountpoint for Amazon S3 CSI driver in a Kubernetes environment. A performance comparison of the acceleration achievable through the Mountpoint for S3 CSI driver is out of scope for this article.
Background
On November 27, 2023, at the re:Invent conference, AWS announced general availability of the Mountpoint for Amazon S3 CSI driver. This means customers can now mount Amazon S3 buckets into containers on Amazon Elastic Kubernetes Service via CSI. With the new Mountpoint-based Amazon S3 Container Storage Interface (CSI) driver, your Kubernetes applications can access S3 objects through a file system interface and achieve high aggregate throughput without any application changes. Built on Mountpoint for Amazon S3, the CSI driver presents an S3 bucket as a volume accessible to containers in Amazon Elastic Kubernetes Service (Amazon EKS) and self-managed Kubernetes clusters. As a result, distributed machine learning training jobs in Amazon EKS and self-managed Kubernetes clusters can read data from Amazon S3 at high throughput to shorten training time. This also greatly simplifies how applications work with S3 objects.
Mountpoint for S3 CSI Driver Architecture
Hands-On: Fine-Tuning LLaMA 2 with the Mountpoint for Amazon S3 CSI Driver
1. Installing and Configuring the Mountpoint for Amazon S3 CSI Driver
First, install the Mountpoint for Amazon S3 CSI driver and create the IAM policy AmazonS3CSIDriverPolicy that grants it the required S3 permissions:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MountpointFullBucketAccess",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET1"
            ]
        },
        {
            "Sid": "MountpointFullObjectAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:AbortMultipartUpload",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::DOC-EXAMPLE-BUCKET1/*"
            ]
        }
    ]
}
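Assuming the policy document above is saved locally as s3-csi-policy.json (a hypothetical file name), it can be created with the AWS CLI. Note the policy ARN in the command output; it is needed in the next step:

```shell
# Create the customer-managed policy from the JSON document above.
# s3-csi-policy.json is a hypothetical local file name; the policy name
# matches the AmazonS3CSIDriverPolicy referenced later in this article.
aws iam create-policy \
  --policy-name AmazonS3CSIDriverPolicy \
  --policy-document file://s3-csi-policy.json
```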
Create an IAM role for the Mountpoint for Amazon S3 CSI driver and bind it to a Kubernetes service account:
CLUSTER_NAME=my-cluster
REGION=region-code
ROLE_NAME=AmazonEKS_S3_CSI_DriverRole
POLICY_ARN=arn:aws:iam::ACCOUNT_ID:policy/AmazonS3CSIDriverPolicy
eksctl create iamserviceaccount \
    --name s3-csi-driver-sa \
    --namespace kube-system \
    --cluster $CLUSTER_NAME \
    --attach-policy-arn $POLICY_ARN \
    --approve \
    --role-name $ROLE_NAME \
    --region $REGION
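To confirm the role was created and bound correctly, you can list the service account and inspect its role annotation. This is a sketch assuming the variable values set above:

```shell
# Verify the IAM role / service account binding: the service account
# should carry an eks.amazonaws.com/role-arn annotation pointing at the new role.
eksctl get iamserviceaccount --cluster $CLUSTER_NAME --region $REGION \
  --namespace kube-system --name s3-csi-driver-sa
kubectl describe sa s3-csi-driver-sa -n kube-system
```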
Install the driver with Kustomize on the existing EKS cluster:
kubectl apply -k "github.com/awslabs/mountpoint-s3-csi-driver/deploy/kubernetes/overlays/stable/"
Once installation completes, verify that the driver pods are running:
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-mountpoint-s3-csi-driver
At this point, Mountpoint for S3 is installed. When a container accesses the Mountpoint for S3 mount directory, the Mountpoint for S3 CSI driver translates the training script's POSIX calls into S3 requests. For example, reading a sample file triggers an S3 GetObject API call that downloads the S3 object into memory.
2. Statically Provisioning the S3 Bucket
On EKS, the Mountpoint for S3 CSI driver supports only static provisioning. That is, the bucket must be created ahead of time; it cannot be created dynamically through a PVC the way a file system such as Amazon Elastic File System can.
2.1 Create the PV
First create the PV, specifying the S3 bucket to read and its region in the PV YAML file:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: s3-pv
spec:
  capacity:
    storage: 1200Gi # ignored, required
  accessModes:
    - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
  mountOptions:
    - allow-delete
    - region us-west-2
  csi:
    driver: s3.csi.aws.com # required
    volumeHandle: s3-csi-driver-volume
    volumeAttributes:
      bucketName: YOUR-BUCKET-NAME
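Assuming the manifest above is saved as s3-pv.yaml (a hypothetical file name), apply it and confirm the PV is registered:

```shell
# Apply the PV manifest and check that it appears as Available
# (it becomes Bound once the PVC in the next step claims it).
kubectl apply -f s3-pv.yaml
kubectl get pv s3-pv
```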
2.2 Create the PVC
Configure a PVC bound to the PV s3-pv created in 2.1:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: s3-claim
spec:
  accessModes:
    - ReadWriteMany # supported options: ReadWriteMany / ReadOnlyMany
  storageClassName: "" # required for static provisioning
  resources:
    requests:
      storage: 1200Gi # ignored, required
  volumeName: s3-pv
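Assuming the claim above is saved as s3-pvc.yaml (a hypothetical file name), apply it and verify that it binds to s3-pv:

```shell
# With static provisioning, the PVC should bind immediately to s3-pv.
kubectl apply -f s3-pvc.yaml
kubectl get pvc s3-claim   # STATUS should show Bound
```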
3. Prepare the Training Data and Upload It to the S3 Bucket
This article uses the Hugging Face dataset timdettmers/openassistant-guanaco (https://huggingface.co/datasets/timdettmers/openassistant-guanaco), about 16 MB in size, as the training sample. Upload the dataset to the S3 bucket:
aws s3 cp examples/scripts/datasets/openassistant-guanaco-train.arrow s3://YOUR-BUCKET-NAME
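You can verify the upload and the object size with a quick listing (YOUR-BUCKET-NAME is the same placeholder used above):

```shell
# Confirm the ~16 MB arrow file landed in the bucket.
aws s3 ls s3://YOUR-BUCKET-NAME/ --human-readable
```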
4. Prepare the Training Container
This article fine-tunes LLaMA 2 with PEFT.
4.1 Prepare the Dockerfile
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-ec2 as base
WORKDIR /llama2
RUN pip install trl
RUN git clone https://github.com/lvwerra/trl
WORKDIR /llama2/trl/
RUN pip install -r requirements.txt
RUN sed -i 's|dataset = load_dataset(script_args.dataset_name, split="train")|dataset = load_dataset("arrow", data_files="/mount-s3/openassistant-guanaco-train.arrow",split="train")|' examples/scripts/sft.py
4.2 Log in to the DLC image registry
aws ecr get-login-password --region us-east-1 |sudo docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
4.3 Build the image from the prepared Dockerfile
sudo docker build -t public.ecr.aws/h6r2a7o6/emr:llama2_finetune-v4 .
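As a sanity check (a sketch assuming the image tag used above), you can confirm that the sed patch in the Dockerfile took effect, i.e. that sft.py now reads the dataset from the /mount-s3 path instead of downloading it from the Hugging Face Hub:

```shell
# Grep the patched training script inside the freshly built image;
# a matching line confirms the dataset path rewrite succeeded.
sudo docker run --rm public.ecr.aws/h6r2a7o6/emr:llama2_finetune-v4 \
  grep -n '/mount-s3/openassistant-guanaco-train.arrow' examples/scripts/sft.py
```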
4.4 Log in to your own image registry
aws ecr-public get-login-password --region us-east-1 | sudo docker login --username AWS --password-stdin public.ecr.aws/h6r2a7o6
4.5 Push the Docker image
sudo docker push public.ecr.aws/h6r2a7o6/emr:llama2_finetune-v4
4.6 Prepare the Pod YAML file llama2_finetuning.yaml
apiVersion: v1
kind: Pod
metadata:
  name: llama2-pod
spec:
  containers:
    - name: app
      image: public.ecr.aws/h6r2a7o6/emr:llama2_finetune-v4
      command: ["/bin/sh"]
      args: ["-c", "python3 examples/scripts/sft.py --model_name meta-llama/Llama-2-7b-hf --dataset_name timdettmers/openassistant-guanaco --load_in_4bit --use_peft --batch_size 1 --gradient_accumulation_steps 2"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /mount-s3
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: s3-claim
4.7 Launch the Pod and check its status
kubectl apply -f llama2_finetuning.yaml; kubectl get pods
4.8 Follow the training progress through the container logs
kubectl logs -f llama2-pod
The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
/opt/conda/lib/python3.10/site-packages/trl/trainer/ppo_config.py:141: UserWarning: The `optimize_cuda_cache` arguement will be deprecated soon, please use `optimize_device_cache` instead.
warnings.warn(
Detected kernel version 5.4.253, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
/opt/conda/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:472: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
Loading checkpoint shards: 100%|██████████| 2/2 [00:07<00:00, 3.94s/it]
/opt/conda/lib/python3.10/site-packages/transformers/utils/hub.py:374: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers. Please use `token` instead.
warnings.warn(
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 15650.39it/s]
Extracting data files: 100%|██████████| 1/1 [00:01<00:00, 1.06s/it]
Generating train split: 9846 examples [00:00, 19719.37 examples/s]
Map: 100%|██████████| 9846/9846 [00:02<00:00, 4094.10 examples/s]
Detected kernel version 5.4.253, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
0%| | 0/14769 [00:00<?, ?it/s]You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
{'loss': 2.1358, 'learning_rate': 1.4099045297582776e-05, 'epoch': 0.0}
{'loss': 1.9417, 'learning_rate': 1.409809059516555e-05, 'epoch': 0.0}
{'loss': 1.2295, 'learning_rate': 1.4097135892748325e-05, 'epoch': 0.0}
{'loss': 2.2918, 'learning_rate': 1.4096181190331099e-05, 'epoch': 0.0}
The log output above shows that the LLaMA 2 training job started and is running normally.
When fine-tuning LLaMA 2 on your own data, the dataset typically ranges from tens of MB to a few GB, so data access is rarely the bottleneck in training. Still, the Amazon S3 Container Storage Interface (CSI) driver lets the training script read data in an S3 bucket as if it were reading files on a local file system, which greatly simplifies S3 access: the code needs no S3-specific changes, reducing the complexity of the training code and improving its portability.
Summary
This article showed how to mount an S3 bucket into a Kubernetes container with the Mountpoint for Amazon S3 CSI driver, allowing the LLM training script inside the container to access S3 objects through a file system interface at high throughput without modifying the application. The hands-on section walked through installing and configuring the Mountpoint for Amazon S3 CSI driver, statically provisioning the S3 bucket, preparing and uploading the training data, and building the training container. With these steps, readers can quickly stand up infrastructure for training large models in an Amazon EKS container environment and speed up data access to improve training efficiency.