Fine-Tuning and Deploying the Baichuan2 Model on Amazon SageMaker (Part 2): Deployment
Deploying the Baichuan2 Model
This part covers three ways to deploy the Baichuan2 model: 1. native HuggingFace deployment; 2. deployment with the vLLM framework; 3. deployment with Triton Inference Server using the TensorRT-LLM backend. For each approach we also describe how to load quantized and non-quantized models. All of them can be deployed either on a SageMaker notebook instance or on Amazon EC2.
Loading/Merging the LoRA Model
If the Baichuan2 model was fine-tuned with LoRA, an additional, relatively small LoRA model file is produced, which can be loaded with peft. If you want to quantize the model, the LoRA weights must first be merged into the base model. (QLoRA can fine-tune large language models directly in quantized form, but that is out of scope here.) The following code merges the LoRA model into the base model:
#!/bin/bash
pip install peft
# filename merge_lora.py
# -*- coding: utf-8 -*-
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig
from peft import AutoPeftModelForCausalLM
import time
import os
import argparse
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--lora_model_dir", type=str, default='lora_model')
    parser.add_argument("--baichuan_model_dir", type=str, default='Baichuan2-13B-Chat')
    parser.add_argument("--merged_model_dir", type=str, default='Baichuan2-13B-Chat-Merged')
    args = parser.parse_args()

    # Load the LoRA adapter together with its base model on CPU
    start = time.time()
    tokenizer = AutoTokenizer.from_pretrained(args.lora_model_dir, use_fast=False, trust_remote_code=True, local_files_only=True)
    model = AutoPeftModelForCausalLM.from_pretrained(args.lora_model_dir, trust_remote_code=True, local_files_only=True)
    model.generation_config = GenerationConfig.from_pretrained(args.baichuan_model_dir, local_files_only=True)
    print('Load to CPU time:', time.time()-start)

    # Merge the LoRA weights into the base model and convert to float16
    start = time.time()
    model = model.merge_and_unload()
    model = model.half()
    print('Merge and half time:', time.time()-start)

    tokenizer.save_pretrained(args.merged_model_dir)
    model.save_pretrained(args.merged_model_dir)
#!/bin/bash
python merge_lora.py --lora_model_dir lora_model --baichuan_model_dir Baichuan2-13B-Chat --merged_model_dir Baichuan2-13B-Chat-Merged
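Optionally, you can sanity-check the merged model before quantizing or deploying it. Below is a minimal sketch (the file name check_merged.py is hypothetical; the directory name matches the merge command above, and Baichuan2's custom chat() helper is used as elsewhere in this post):
# filename check_merged.py (optional sanity check; a sketch, not part of the original walkthrough)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

model_dir = 'Baichuan2-13B-Chat-Merged'  # assumed output directory from the merge step above
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True, local_files_only=True)
model.generation_config = GenerationConfig.from_pretrained(model_dir, local_files_only=True)
# Single-turn chat to confirm the merged weights behave as expected
print(model.chat(tokenizer, [{"role": "user", "content": "你是谁?"}]))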
HuggingFace Model Deployment
BFloat16
Deploying a BFloat16 model with native HuggingFace is straightforward: load the model with the code below and serve it with Flask. Note that the GPU resources required depend on model size: a 7B model fits on a single A10G GPU (24 GB of memory; ml.g5.2xlarge is recommended), while the 13B model needs 2 A10G GPUs (ml.g5.12xlarge is recommended), since its BF16 weights alone take roughly 13B parameters × 2 bytes ≈ 26 GB.
# filename flask_service.py
from flask import Flask, render_template, request, jsonify, Response
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import argparse
import torch
import os
parser = argparse.ArgumentParser()
parser.add_argument("--port", type=int, default=8005)
parser.add_argument("--model_dir", type=str, default='./Baichuan2-13B-Chat')
parser.add_argument("--quant_type", type=str, default='int8')
args = parser.parse_args()
app = Flask(__name__)
def load_model(model_dir, quant_type):
    tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True, local_files_only=True)
    if quant_type == 'int8':
        # The INT8 checkpoint already contains quantized weights and fits on a single GPU
        model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="cuda:1", trust_remote_code=True, local_files_only=True)
    elif quant_type == 'bf16':
        model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True, local_files_only=True)
    model.generation_config = GenerationConfig.from_pretrained(model_dir, local_files_only=True)
    return model, tokenizer

model, tokenizer = load_model(args.model_dir, args.quant_type)

def generate_response(messages):
    # model.chat(..., stream=True) yields the full response so far; emit only the newly generated part
    prev_sentence = ""
    for sentence in model.chat(tokenizer, messages, stream=True):
        new_sentence = sentence[len(prev_sentence):]
        prev_sentence = sentence
        yield new_sentence

@app.route('/inference', methods=['POST'])
def inference():
    messages = request.json.get('messages', [])
    print(messages)
    try:
        if request.args.get('streaming') == 'true':
            return Response(generate_response(messages), content_type='text/plain')
        return ''.join(generate_response(messages))
    except Exception as e:
        return str(e)

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=args.port)
The following command starts the Baichuan2-13B-Chat service on port 8000.
#!/bin/bash
python flask_service.py --port 8000 --model_dir Baichuan2-13B-Chat --quant_type bf16
If the Flask startup log shows the service listening on the specified port, it has started successfully.
Once the service is running, it can be called with the following command.
#!/bin/bash
curl -X POST \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "你是谁?"}]}' \
"http://localhost:8000/inference"
If a reasonable model response is returned, the model service has been deployed successfully.
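Because flask_service.py checks the streaming query parameter, the same endpoint can also return a streamed response. A minimal sketch (-N disables curl's output buffering so tokens appear as they are generated):
#!/bin/bash
curl -N -X POST \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "你是谁?"}]}' \
  "http://localhost:8000/inference?streaming=true"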
INT8
An INT8 model reduces the GPU resources needed without a noticeable drop in quality. For example, the INT8 weights of Baichuan2-13B-Chat take roughly 13B parameters × 1 byte ≈ 13 GB, so the quantized model can be loaded on a single A10G GPU, whereas the BFloat16 model requires 2 A10G GPUs. To deploy an INT8 model, the BFloat16 model must first be quantized. HuggingFace provides the bitsandbytes quantization scheme; the following code quantizes the BFloat16 model to INT8:
# filename convert_to_int8.py
# -*- coding: utf-8 -*-
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig
from peft import AutoPeftModelForCausalLM
import time
import argparse
import os
def quantize(model_dir, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    # Load the model with int8 (bitsandbytes) quantization
    start = time.time()
    tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False, trust_remote_code=True, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(model_dir, load_in_8bit=True, trust_remote_code=True, local_files_only=True)  # , torch_dtype=torch.bfloat16, device_map="auto"
    # model.generation_config = GenerationConfig.from_pretrained("Baichuan2-7B-Chat-Fine-tuned", local_files_only=True)
    print('Load to CPU time:', time.time()-start)

    # Save the quantized tokenizer and model
    start = time.time()
    tokenizer.save_pretrained(output_dir)
    print('Save tokenizer time:', time.time()-start)
    start = time.time()
    model.save_pretrained(output_dir)
    print('Save model time:', time.time()-start)

    # Quick sanity check with a single chat request
    messages = []
    messages.append({"role": "user", "content": "请把如下英文翻译成中文:Shall I compare thee to a summer’s day? "})
    response = model.chat(tokenizer, messages)
    print(response)

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir',
                        type=str,
                        required=True,
                        help='Specify model directory')
    parser.add_argument('--output_dir',
                        type=str,
                        required=True,
                        help='Specify output directory')
    args = parser.parse_args()
    quantize(args.model_dir, args.output_dir)
The commands to quantize the model and start the service are as follows:
#!/bin/bash
# Convert to int8
python convert_to_int8.py --model_dir Baichuan2-13B-Chat --output_dir Baichuan2-13B-Chat-Int8
# Start the flask service
python flask_service.py --port 8000 --model_dir Baichuan2-13B-Chat-Int8 --quant_type int8
vLLM Deployment
Although native HuggingFace deployment is simple, inference is slow, and because it does not support batched inference, throughput is also low. To improve inference speed and throughput, consider deploying with the vLLM framework. vLLM is an inference acceleration framework for LLMs proposed by Woosuk Kwon et al. [1]; it uses techniques such as PagedAttention-based KV cache management and continuous batching to achieve high inference speed and throughput. According to [2], on LLaMA-7B and LLaMA-13B a vLLM deployment reaches up to 24x the throughput of a HuggingFace deployment, and our tests on Baichuan2 show similar results.
BFloat16
Follow the steps below to deploy the BFloat16 model with vLLM.
Step 1. Install vLLM and start the service.
#!/bin/bash
# Install vLLM
pip install vllm
# Start the service
# The command below starts an OpenAI-compatible API server. template_baichuan.jinja comes from the
# official vLLM repository (https://github.com/vllm-project/vllm.git), under examples/template_baichuan.jinja
python -m vllm.entrypoints.openai.api_server \
    --model Baichuan2-13B-Chat \
    --trust-remote-code \
    --chat-template template_baichuan.jinja
If the API server starts and reports that it is listening (by default on port 8000), the service has started successfully.
Step 2. Call the vLLM service. The service can be called with the OpenAI-compatible client.
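Note that the BFloat16 weights of the 13B model are about 26 GB and do not fit on a single 24 GB A10G GPU. vLLM can shard the model across GPUs with tensor parallelism; below is a minimal sketch of the same command on 2 GPUs, using vLLM's --tensor-parallel-size option:
#!/bin/bash
# Shard the model across 2 GPUs with tensor parallelism (a sketch)
python -m vllm.entrypoints.openai.api_server \
    --model Baichuan2-13B-Chat \
    --trust-remote-code \
    --chat-template template_baichuan.jinja \
    --tensor-parallel-size 2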
#!/bin/bash
# Install openai
pip install openai
# filename make_request.py
# Python code to make request
from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
models = client.models.list()
model_id = models.data[0].id
chat_response = client.chat.completions.create(
    model=model_id,
    messages=[
        # {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "你是谁?"},
    ]
)
print("Chat response:", chat_response)
#!/bin/bash
python make_request.py
If a chat completion response is returned, the deployment succeeded.
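The OpenAI-compatible endpoint can also be exercised directly with curl. A minimal sketch (the model name in the request matches the --model value passed to the server):
#!/bin/bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Baichuan2-13B-Chat", "messages": [{"role": "user", "content": "你是谁?"}]}'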
INT4
Likewise, the model can first be quantized to reduce GPU memory usage. The overall deployment procedure is the same as for BFloat16; the only extra step is quantizing the model with AutoAWQ beforehand.
#!/bin/bash
pip install autoawq
Use the following code to quantize the BFloat16 model to INT4 (AutoAWQ currently supports only INT4 quantization, not INT8):
# filename quant_awq.py
import argparse
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_dir',
                        type=str,
                        required=True,
                        help='Specify model directory')
    parser.add_argument('--output_dir',
                        type=str,
                        required=True,
                        help='Specify output directory')
    args = parser.parse_args()

    model_path = args.model_dir
    quant_path = args.output_dir
    quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

    # Load model
    model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True})
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Quantize
    model.quantize(tokenizer, quant_config=quant_config)

    # Save quantized model
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)
#!/bin/bash
python quant_awq.py --model_dir Baichuan2-13B-Chat --output_dir Baichuan2-13B-Chat-Int4
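After quantization, the model is served the same way as in the BFloat16 case. A minimal sketch, assuming vLLM's --quantization awq option and the output directory produced above:
#!/bin/bash
# Serve the AWQ INT4 model with vLLM (a sketch)
python -m vllm.entrypoints.openai.api_server \
    --model Baichuan2-13B-Chat-Int4 \
    --trust-remote-code \
    --chat-template template_baichuan.jinja \
    --quantization awq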
TensorRT-LLM Deployment
TensorRT-LLM is an inference acceleration framework for large language models from NVIDIA. It combines techniques such as kernel fusion, KV cache, and paged attention, and supports in-flight batching, delivering inference speed and throughput on par with or better than vLLM.
BFloat16
Follow the steps below to deploy the model with TensorRT-LLM.
Step 1. Download the TensorRT-LLM code. We recommend checking out a stable tag that matches the tensorrtllm_backend version used later; here we use v0.7.1.
#!/bin/bash
git clone git@github.com:NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.7.1
# It is suggested to create a new branch based on v0.7.1 to keep later commits.
git checkout -b dev-v0.7.1
Step 2. Build the TensorRT-LLM container image, and inside the container convert the HuggingFace checkpoint into a TensorRT-LLM engine.
#!/bin/bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
git lfs install
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git submodule update --init --recursive
git lfs pull
# Build the docker image.
make -C docker release_build
# Run the docker image.
make -C docker release_run
# Change directory
cd examples/baichuan
# Build TensorRT-LLM engine
python build.py --model_version v2_13b \
--model_dir=path_to_huggingface_checkpoint \
--world_size=2 \
--max_batch_size 16 \
--dtype bfloat16 \
--use_inflight_batching \
--use_gemm_plugin bfloat16 \
--use_gpt_attention_plugin bfloat16 \
--paged_kv_cache \
--output_dir=./tmp/trt_engine_bf16
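Before moving on to Triton, the engine can optionally be smoke-tested inside the same container. The following is a sketch that assumes the examples/run.py utility shipped in the TensorRT-LLM repository (run from examples/baichuan; mpirun matches the world size of 2 used when building the engine):
#!/bin/bash
# Optional: quick generation test of the freshly built engine (a sketch; verify run.py flags in your version)
mpirun -n 2 --allow-run-as-root \
    python3 ../run.py --engine_dir ./tmp/trt_engine_bf16 \
                      --tokenizer_dir path_to_huggingface_checkpoint \
                      --input_text "你是谁?" \
                      --max_output_len 128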
Step 3. Download the tensorrtllm_backend code and likewise check out v0.7.1.
#!/bin/bash
git clone git@github.com:triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git checkout v0.7.1
# It is suggested to create a new branch based on v0.7.1 to keep later commits.
git checkout -b dev-v0.7.1
Step 4. Place the engine built in Step 2 into the corresponding directory and modify the configuration parameters in that directory.
#!/bin/bash
# Create the model repository that will be used by the Triton server
cd tensorrtllm_backend
mkdir triton_model_repo
# Copy the example models to the model repository
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
# Copy the TRT engine to triton_model_repo/tensorrt_llm/1/
cp path_to_trt_engine triton_model_repo/tensorrt_llm/1
# Copy the tokenizer configuration files into a directory under all_models
mkdir all_models/Baichuan2-13B-Chat
cp path_to_Baichuan2-13B-Chat/token* path_to_Baichuan2-13B-Chat/special_tokens_map.json all_models/Baichuan2-13B-Chat/
When modifying the configuration parameters, note that in the configuration files below every parameter written as ${...} must be given a concrete value; leaving ${...} in place will cause errors when the Triton server starts. The parameters for each file are listed in the tables below, and a sketch of filling them in follows the tables.
triton_model_repo/preprocessing/config.pbtxt
# | Name | Description
1 | tokenizer_dir | The path to the tokenizer for the model. In this example it should be set to /tensorrtllm_backend/all_models/Baichuan2-13B-Chat, as the tensorrtllm_backend directory will be mounted to /tensorrtllm_backend inside the container.
2 | tokenizer_type | The type of the tokenizer for the model; t5, auto, and llama are supported. In this example it should be set to auto.
3 | ${triton_max_batch_size} | Max batch size; set to the same value as max_batch_size in Step 2, here 16.
4 | ${preprocessing_instance_count} | Instance group count; set to 1.
triton_model_repo/tensorrt_llm/config.pbtxt
# | Name | Description
1 | ${triton_max_batch_size} | Max batch size; set to the same value as max_batch_size in Step 2, here 16.
2 | ${max_queue_delay_microseconds} | Set to 1000.
3 | gpt_model_type | Mandatory. Set to inflight_fused_batching to enable in-flight batching support, or to V1 to disable it. Here we set it to inflight_fused_batching.
4 | gpt_model_path | Mandatory. Path to the TensorRT-LLM engines used for deployment. In this example it should be set to /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1, as the tensorrtllm_backend directory will be mounted to /tensorrtllm_backend inside the container.
5 | batch_scheduler_policy | Mandatory. Set to max_utilization to greedily pack as many requests as possible into each in-flight batching iteration; this maximizes throughput but may incur overhead from request pause/resume if KV cache limits are reached during execution. Set to guaranteed_no_evict to guarantee that a started request is never paused.
6 | decoupled | Optional (default=false). Controls streaming; it must be set to true if the client uses the streaming option. Here we set it to false; if you use --streaming in the client request in Step 6, set it to true.
7 | max_beam_width | Optional (default=1). The maximum beam width that any request may ask for when using beam search.
8 | max_tokens_in_paged_kv_cache | Optional (default=unspecified). The maximum size of the KV cache in number of tokens. If unspecified, the value is interpreted as 'infinite'. The KV cache allocation is the minimum of max_tokens_in_paged_kv_cache and the value derived from kv_cache_free_gpu_mem_fraction below.
9 | max_attention_window_size | Optional (default=max_sequence_length). When using techniques like sliding-window attention, the maximum number of tokens attended to when generating one token. The default attends to all tokens in the sequence.
10 | kv_cache_free_gpu_mem_fraction | Optional (default=0.9). A number between 0 and 1 indicating the maximum fraction of GPU memory (after loading the model) that may be used for the KV cache.
11 | enable_trt_overlap | Optional (default=false). Set to true to partition available requests into 2 'microbatches' that can run concurrently to hide exposed CPU runtime.
12 | exclude_input_in_output | Optional (default=false). Set to true to return only completion tokens in a response; set to false to return the prompt tokens concatenated with the generated tokens.
13 | normalize_log_probs | Optional (default=true). Set to false to skip normalization of output_log_probs.
14 | enable_chunked_context | Optional (default=false). Set to true to enable context chunking.
15 | decoding_mode | Optional. Set to one of {top_k, top_p, top_k_top_p, beam_search} to select the decoding mode. top_k uses only the Top-K sampling algorithm, top_p uses only Top-P, and top_k_top_p employs both depending on the request's runtime sampling parameters; top_k_top_p requires more memory and runs longer than top_k or top_p alone, so use it only when necessary. beam_search uses the beam search algorithm. If not specified, the default is top_k_top_p when max_beam_width == 1, and beam_search otherwise.
16 | max_num_sequences | Set to 128.
17 | enable_kv_cache_reuse | Set to false.
triton_model_repo/postprocessing/config.pbtxt
# | Name | Description
1 | tokenizer_dir | The path to the tokenizer for the model. In this example it should be set to /tensorrtllm_backend/all_models/Baichuan2-13B-Chat, as the tensorrtllm_backend directory will be mounted to /tensorrtllm_backend inside the container.
2 | tokenizer_type | The type of the tokenizer for the model; t5, auto, and llama are supported. In this example it should be set to auto.
3 | ${triton_max_batch_size} | Max batch size; set to the same value as max_batch_size in Step 2, here 16.
4 | ${postprocessing_instance_count} | Instance group count; set to 1.
triton_model_repo/tensorrt_llm_bls/config.pbtxt
# | Name | Description
1 | ${triton_max_batch_size} | Max batch size; set to the same value as max_batch_size in Step 2, here 16.
2 | accumulate_tokens | Used in streaming mode to call the preprocessing model with all accumulated tokens instead of only one token. We set it to 64.
3 | ${bls_instance_count} | Instance group count; set to 1.
triton_model_repo/ensemble/config.pbtxt
# | Name | Description
1 | ${triton_max_batch_size} | Max batch size; set to the same value as max_batch_size in Step 2, here 16.
Step 5. Pull and run the Triton server container for the TensorRT-LLM backend, and start the Triton server for the Baichuan2 model inside the container.
#!/bin/bash
# Download the docker image for tritonserver
docker pull nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
# Run the docker container
docker run -itd --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus '"device=0,1"' -v abs_path_to_tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3 bash
# To obtain the container_id, run
docker ps
# Get inside the container
docker exec -it container_id bash
# Install necessary packages
pip install sentencepiece
pip install transformers==4.33.1
# Change to the tensorrtllm_backend directory
cd /tensorrtllm_backend
# --world_size is the number of GPUs you want to use for serving
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo
If the server starts without errors, the Triton server has been launched successfully.
You can test model inference by sending a request with the curl command:
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "你是谁?", "max_tokens": 512, "bad_words": "", "stop_words": "", "end_id": 2}'
If a text completion is returned, the service is working.
Step 6. Pull and run the Triton client container, and test the service deployed by the Triton server from inside the client container.
#!/bin/bash
# Download triton client docker image
docker pull nvcr.io/nvidia/tritonserver:24.01-py3-sdk
# Run the client docker container
docker run -itd --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v abs_path_to_tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:24.01-py3-sdk bash
# To obtain the container_id, run
docker ps
# Get inside the container
docker exec -it container_id bash
# Install necessary packages
pip install transformers==4.33.1
pip install sentencepiece
# Change to the /tensorrtllm_backend directory
cd /tensorrtllm_backend
# Test the client against the server
python inflight_batcher_llm/client/inflight_batcher_llm_client.py --text "你是谁?" --request-output-len 200 --tokenizer-dir /tensorrtllm_backend/all_models/Baichuan2-13B-Chat
If the model produces a sensible answer, the deployment succeeded.
INT8
Quantized models can also be deployed with TensorRT-LLM. The overall procedure is the same as above; the only difference is that in Step 2 the engine is built with INT8 weight-only quantization using the following command.
#!/bin/bash
# Use --use_weight_only to apply INT8 weight-only quantization to the model weights
python build.py --model_version v2_13b \
--model_dir=path_to_huggingface_checkpoint \
--dtype float16 \
--use_inflight_batching \
--use_gemm_plugin float16 \
--use_gpt_attention_plugin float16 \
--paged_kv_cache \
--use_weight_only \
--output_dir=./tmp/trt_engine_int8
Inference Performance Comparison
Below we compare the inference performance of the three deployment approaches: HuggingFace, vLLM, and TensorRT-LLM. Because they support different quantization methods, for a fair comparison we evaluate only BFloat16 inference, with each approach serving Baichuan2-13B-Chat on 2 A10G GPUs.
Input 512 tokens, output 128 tokens
Figure 1. Throughput vs. concurrency (input 512 tokens, output 128 tokens)
In Figure 1, the HuggingFace deployment shows zero throughput at Concurrency >= 8 because it runs out of GPU memory. The figure shows that vLLM and TensorRT-LLM achieve similar throughput, both far higher than HuggingFace: at Concurrency = 1, vLLM and TensorRT-LLM are 3.48x and 3.66x higher than HuggingFace respectively, and at Concurrency = 4 they are 27.55x and 27.84x higher. At Concurrency <= 4, TensorRT-LLM's throughput is slightly higher than vLLM's, while at Concurrency >= 8 vLLM gradually pulls ahead of TensorRT-LLM and the gap keeps widening. A likely reason is that vLLM's batching scheduler manages concurrent requests better and can process more requests simultaneously under high concurrency.
Input 512 tokens, output 256 tokens
Figure 2. Throughput vs. concurrency (input 512 tokens, output 256 tokens)
The results in Figure 2 are similar to those in Figure 1, with the output length increased from 128 to 256 tokens. HuggingFace again shows zero throughput at Concurrency >= 4 because of GPU memory exhaustion. vLLM's throughput is comparable to TensorRT-LLM's: it is lower at small concurrency, but as concurrency grows the gap narrows, and at Concurrency >= 32 vLLM's throughput exceeds TensorRT-LLM's.
Summary
This series has covered fine-tuning and deploying the Baichuan2 family of models. The fine-tuning part described two approaches: LoRA fine-tuning and full-parameter fine-tuning. This deployment part described three approaches: native HuggingFace deployment (BFloat16, INT8), vLLM deployment (BFloat16, INT4), and TensorRT-LLM deployment (BFloat16, INT8). We hope this helps readers gain a deeper understanding of how to fine-tune and deploy the Baichuan2 models and apply these methods in their own work.
Series Posts
Fine-Tuning and Deploying the Baichuan2 Model on Amazon SageMaker (Part 1): Fine-Tuning
References
[1] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, et al. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
[2] https://blog.vllm.ai/2023/06/20/vllm.html
[3] https://github.com/baichuan-inc/Baichuan2/tree/main
[4] https://github.com/vllm-project/vllm
[5] https://github.com/NVIDIA/TensorRT-LLM
[6] https://github.com/triton-inference-server/tensorrtllm_backend