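Deploying the Baichuan-2 Model with Amazon SageMaker

This article walks through an example of deploying the Baichuan-2 model on Amazon SageMaker.

The example covers:

Baichuan-2 overview
Baichuan-2 deployment overview
Baichuan-2 environment setup
Baichuan-2 deployment and inference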

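Preface

With the explosive arrival of ChatGPT, foundation large language models have appeared in rapid succession at home and abroad, and a wide variety of applications have been built on top of them. Deployment and inference play a critical role in putting a model to use, and requirements such as ease of use, stability, scalability, and maintainability make serving a model for inference challenging.

Amazon SageMaker is a fully managed machine learning (ML) platform from Amazon Web Services, designed to simplify developing, training, and deploying ML models. It provides an end-to-end solution covering data preparation, model training, model tuning, deployment, and inference. It can also scale compute resources automatically as needed, which lets it handle large training jobs. This elasticity lets users draw on cloud compute efficiently without having to manage or maintain the underlying infrastructure.

This article uses the deployment of one specific model on SageMaker to show how to deploy open-source large language models there quickly and efficiently, covering the basic needs of early proof-of-concept work and later production rollout.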

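Baichuan-2 overview

Baichuan-2 is the new generation of open-source large language models from Baichuan Intelligence (百川智能), trained on 2.6 trillion tokens of high-quality corpus. The release includes Base and Chat versions at 7B and 13B. It keeps the strengths of the previous generation, including good generation and creative-writing ability, fluent multi-turn dialogue, and a low barrier to deployment, and shows clear improvements in mathematics, coding, safety, logical reasoning, and semantic understanding.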

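Baichuan-2 deployment overview

SageMaker maintains a set of deep learning containers (DLCs) for deploying open-source models on AWS infrastructure, including LLaMA, Baichuan, and Stable Diffusion. On top of the DLCs, SageMaker integrates several open-source third-party frameworks, including DeepSpeed, Accelerate, and FasterTransformer, to form Large Model Inference (LMI) containers for accelerated inference of large language models.

When deploying a large language model with LMI, the code must be packaged as a tarball with the following layout: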

code
├── model.py
├── requirements.txt
└── serving.properties

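- model.py: the core file; defines how the model is loaded and how incoming requests are handled.
- requirements.txt: lists the packages required for loading the model and running inference.
- serving.properties: holds the model's serving configuration, including the inference engine, file locations, and degree of parallelism.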
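
The layout above can be produced with Python's standard tarfile module. The following is a minimal sketch, not the packaging step used by the sample repository: the function name build_code_tarball, the default archive name model.tar.gz, and the choice to store the files under a code/ prefix (mirroring the tree above) are assumptions made for illustration.

```python
import tarfile
from pathlib import Path

# File names taken from the layout described above.
REQUIRED_FILES = ("model.py", "requirements.txt", "serving.properties")


def build_code_tarball(src_dir, out_path="model.tar.gz"):
    """Pack the three LMI files from src_dir into a gzip tarball under code/."""
    src = Path(src_dir)
    missing = [name for name in REQUIRED_FILES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files in {src}: {missing}")

    with tarfile.open(str(out_path), "w:gz") as tar:
        for name in REQUIRED_FILES:
            # arcname controls the path stored inside the archive.
            tar.add(str(src / name), arcname=f"code/{name}")
    return Path(out_path)
```

The resulting archive could then be uploaded to S3, for example with the SageMaker session's upload_data method, and the returned S3 URI passed as ModelDataUrl when the model is created.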

Baichuan-2 environment setup

Note: the sample code for this project is available in the following repository: https://github.com/aws-samples/llm-workshop-on-amazon-sagemaker

Upgrade the Python SDK
```
pip install -U sagemaker
```

Get the runtime resources, including the region, role, account, and S3 bucket

import boto3
import sagemaker
from sagemaker import get_execution_role


sess                     = sagemaker.Session()
role                     = get_execution_role()
sagemaker_default_bucket = sess.default_bucket()

account                  = sess.boto_session.client("sts").get_caller_identity()["Account"]
region                   = sess.boto_session.region_name

Baichuan-2 deployment and inference

Deployment preparation

Install dependencies

pip install huggingface_hub

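Download the original Baichuan-2 model

To keep later runs reproducible and make iteration easier, pin the commit ID when downloading the original model; different commit IDs correspond to different model-handling code and parameters.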

from huggingface_hub import snapshot_download
from pathlib import Path


local_cache_path = Path("./model")
local_cache_path.mkdir(exist_ok=True)

model_name = "baichuan-inc/Baichuan2-7B-Chat"

# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.model", "*.py", "*.txt"]

# Version is from 2023-09-18
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_cache_path,
    allow_patterns=allow_patterns,
    revision='229e4eb1fab7f6aef90a2344c07085b680487597'
)

Copy the model and data to S3

# Locate the downloaded snapshot directory (the one that contains config.json)
import os

local_model_path = None

for root, dirs, files in os.walk("./model"):
    if "config.json" in files:
        # Keep the trailing separator so that s5cmd syncs the directory contents
        local_model_path = os.path.join(root, "")
        print(local_model_path)
        break

if local_model_path is None:
    print("Model download may have failed; please check the previous step!")
    
%%script env sagemaker_default_bucket=$sagemaker_default_bucket local_model_path=$local_model_path bash

chmod +x ./s5cmd
./s5cmd sync ${local_model_path} s3://${sagemaker_default_bucket}/llm/models/baichuan2/baichuan-inc/Baichuan2-7B-Chat/ 

rm -rf model

Model deployment

The model is deployed as the original full-parameter model.
Inference is accelerated with the open-source framework DeepSpeed.

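Prepare serving.properties

Set the engine to DeepSpeed.
Set the tensor parallel degree to 1. A model's GPU memory footprint is proportional to its size. For a typical half-precision model, the rule of thumb is: memory (GB) ≈ 2 × parameters (in billions). For Baichuan2-7B, the approximate GPU memory requirement is 7 × 2 = 14 GB. An NVIDIA A10 has 24 GB of memory per card, so the model fits on a single GPU and the parallel degree is set to 1.

Specify the S3 location where the model is stored.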
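
The memory rule of thumb above can be checked with a few lines of arithmetic. This is a rough sketch only: the function names are hypothetical, and the estimate covers model weights alone, ignoring activations, the KV cache, and framework overhead, so real deployments need headroom beyond this number.

```python
import math


def estimate_weight_memory_gb(params_in_billions, bytes_per_param=2):
    """Rule of thumb: memory (GB) ~= bytes per parameter * billions of parameters."""
    return params_in_billions * bytes_per_param


def min_tensor_parallel_degree(params_in_billions, gpu_memory_gb, bytes_per_param=2):
    """Smallest number of GPUs whose combined memory holds the weights."""
    needed = estimate_weight_memory_gb(params_in_billions, bytes_per_param)
    return max(1, math.ceil(needed / gpu_memory_gb))


print(estimate_weight_memory_gb(7))        # 14 (GB) for a 7B model in half precision
print(min_tensor_parallel_degree(7, 24))   # 1 on a single 24 GB A10
```

By the same arithmetic, a 13B half-precision model needs about 26 GB for its weights, which would not fit on one 24 GB card.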

%%writefile ./src/serving.properties
engine=DeepSpeed
option.tensor_parallel_degree=1
option.s3url=${option.s3url}
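
The file above contains the literal placeholder ${option.s3url}, and the article does not show how it is filled in. One possible approach, sketched here under that assumption, is a plain string replacement before the file is packaged. The function name and the bucket name are hypothetical; the key prefix mirrors the s5cmd sync target used earlier.

```python
PLACEHOLDER = "${option.s3url}"  # literal token from the serving.properties template

TEMPLATE = """engine=DeepSpeed
option.tensor_parallel_degree=1
option.s3url=${option.s3url}
"""


def render_serving_properties(template, model_s3_url):
    """Replace the placeholder with the S3 prefix that holds the model files."""
    if PLACEHOLDER not in template:
        raise ValueError("template has no ${option.s3url} placeholder")
    return template.replace(PLACEHOLDER, model_s3_url)


# Hypothetical bucket name, for illustration only.
example_url = "s3://my-example-bucket/llm/models/baichuan2/baichuan-inc/Baichuan2-7B-Chat/"
print(render_serving_properties(TEMPLATE, example_url))
```

A plain replace is used rather than string.Template because the placeholder name contains a dot, which string.Template does not accept as an identifier.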

Prepare requirements.txt

requirements.txt lists the dependency packages:

For Baichuan2-7B-Chat, pin transformers==4.29.2 for better compatibility and stability.

Add the xformers and peft packages for model acceleration and PEFT inference.

%%writefile ./src/requirements.txt
transformers==4.29.2
sagemaker
nvgpu
xformers
peft

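Prepare model.py

model.py is fairly long, so only the key code blocks are shown here: model loading and inference.

Model loading: details to watch for

When loading the model, pass trust_remote_code=True.

Baichuan 2 ships an additional configuration file, including its special tokens, which must be read with GenerationConfig.from_pretrained(model_location).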

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_location = "baichuan-inc/Baichuan2-7B-Chat"
# torch_dtype applies to the model weights, so it is not passed to the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_location, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_location, device_map="auto", torch_dtype=torch.float16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_location)

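Model inference: Baichuan2-Chat uses roles during inference. The prompt must use the "role": "user" and "role": "assistant" format to indicate whether a message is the user's question or the model's answer.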
```
messages = []
messages.append({"role": "user", "content": input_data})

response = model.chat(tokenizer, messages)
```
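
For multi-turn conversations, earlier turns go into the same list with alternating roles. The helper below is an illustrative sketch; build_messages and its history argument are hypothetical names and are not part of the Baichuan 2 code.

```python
def build_messages(user_input, history=None):
    """Return the role/content list in the format shown above.

    history is a list of (user_text, assistant_text) pairs from earlier turns.
    """
    messages = []
    for user_text, assistant_text in history or []:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The newest user message always comes last.
    messages.append({"role": "user", "content": user_input})
    return messages
```

The returned list can be passed as the messages argument in the model.chat(tokenizer, messages) call shown above.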

Specify the inference image

# Note: adjust the image URI to match your region.
inference_image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118"
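
To avoid hard-coding the region, the URI can be assembled from its parts. This sketch assumes the registry account and tag shown above stay the same in the target region, which does not hold everywhere: some regions use a different registry account or domain suffix, so the result should be checked against the published list of deep learning container images. The function name is hypothetical.

```python
def djl_image_uri(region, account="763104351884",
                  tag="0.23.0-deepspeed0.9.5-cu118"):
    """Build the ECR URI of the DJL inference image for a given region."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


print(djl_image_uri("us-west-2"))
```

The SageMaker Python SDK also provides sagemaker.image_uris.retrieve for looking up container images, which may be preferable where it covers the required framework and version.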

Create the model

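import boto3
import sagemaker
from sagemaker.utils import name_from_base

# SageMaker control-plane client used by the create_* calls below
sm_client = boto3.client("sagemaker")

model_name = name_from_base("baichuan2-7b-chat-origin")
print(model_name)

role = sagemaker.get_execution_role()

# s3_code_artifact must hold the S3 URI of the uploaded code tarball
# (model.py, requirements.txt, serving.properties); it is not defined in this article.
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri,
        "ModelDataUrl": s3_code_artifact,
    },
)
model_arn = create_model_response["ModelArn"]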

Create the endpoint configuration

endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
            "ContainerStartupHealthCheckTimeoutInSeconds": 15*60,
        },
    ],
)

Create the endpoint

endpoint_name            = f"{model_name}-endpoint"

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=f"{endpoint_name}", 
    EndpointConfigName=endpoint_config_name
)
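
Endpoint creation is asynchronous, so the endpoint cannot be invoked until its status becomes InService. The following sketch polls describe_endpoint until that happens. The function name, polling interval, and timeout are assumptions chosen for illustration; the client is passed in so that the function works with the sm_client used in this article.

```python
import time


def wait_for_endpoint(sm_client, endpoint_name, poll_seconds=30, timeout_seconds=1800):
    """Poll describe_endpoint until the endpoint is InService or has failed."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        desc = sm_client.describe_endpoint(EndpointName=endpoint_name)
        status = desc["EndpointStatus"]
        if status == "InService":
            return desc
        if status == "Failed":
            raise RuntimeError(desc.get("FailureReason", "endpoint creation failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{endpoint_name} still {status} after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

boto3 also ships a built-in waiter for this, sm_client.get_waiter("endpoint_in_service"), which can replace the loop when the failure reason does not need to be surfaced.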

Test the deployment

%%time
import json
import boto3


sm_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

parameters = {
    "early_stopping": True,
    "max_new_tokens": 128,
    "do_sample": True,
    "temperature": 0.3,
    "top_k": 5,
    "top_p": 0.85,
    "repetition_penalty": 1.05
}

prompt = "解释一下“学而时习之”"

response_model = smr_client.invoke_endpoint(
            EndpointName=endpoint_name,
            Body=json.dumps(
            {
                "inputs"    : prompt,
                "parameters": parameters
            }
            ),
            ContentType="application/json",
        )

response_model['Body'].read().decode('utf8')
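
For repeated calls, the request construction and response decoding can be wrapped in small helpers. This is a sketch only: the helper names are hypothetical, the default parameters simply copy the values used above, and the response format depends on what model.py returns, so the decoder falls back to plain text when the body is not JSON.

```python
import json

# Defaults copied from the test request above.
DEFAULT_PARAMETERS = {
    "early_stopping": True,
    "max_new_tokens": 128,
    "do_sample": True,
    "temperature": 0.3,
    "top_k": 5,
    "top_p": 0.85,
    "repetition_penalty": 1.05,
}


def build_request_body(prompt, **overrides):
    """Serialize a prompt plus generation parameters in the shape used above."""
    parameters = {**DEFAULT_PARAMETERS, **overrides}
    return json.dumps({"inputs": prompt, "parameters": parameters})


def decode_response(raw_bytes):
    """Decode the endpoint response; fall back to plain text if it is not JSON."""
    text = raw_bytes.decode("utf-8")
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return text
```

The string from build_request_body can be passed as Body to invoke_endpoint, and decode_response applied to the bytes read from the returned Body.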

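Summary

Large language models are still on the rise and are changing the world in many ways. As customers embrace large language models, the Amazon Web Services team continues to work closely on both customer needs and large language model technology, so that it can better help customers meet their requirements and increase business value.

Amazon AWS Official Blog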