Deploying DeepSeek R1 Distillation Models on Inferentia2, the AWS-Designed Chip (Part 2)

Amazon Web Services made the DeepSeek family of large models available in January 2025. You can deploy DeepSeek-R1 models on AWS in the following ways:

Deploy DeepSeek-R1 through Amazon Bedrock Marketplace;
Deploy DeepSeek-R1 through Amazon SageMaker JumpStart;
Deploy DeepSeek-R1-Distill models through Amazon Bedrock Custom Model Import;
Deploy DeepSeek-R1-Distill models on the AWS-designed Trainium and Inferentia chips, through Amazon EC2 or Amazon SageMaker.

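Inferentia2 is a cloud machine learning inference chip designed by AWS. It delivers high-performance, efficient compute for deep learning inference workloads, helping customers deploy and run machine learning models efficiently in the cloud. The table below lists the recommended instance type for each model.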

Distilled model	Base model	Deployment instance
DeepSeek-R1-Distill-Qwen-1.5B	Qwen2.5-Math-1.5B	ml.inf2.xlarge
DeepSeek-R1-Distill-Qwen-7B	Qwen2.5-Math-7B	ml.inf2.8xlarge
DeepSeek-R1-Distill-Llama-8B	Llama-3.1-8B	ml.inf2.8xlarge
DeepSeek-R1-Distill-Qwen-14B	Qwen2.5-14B	ml.inf2.8xlarge
DeepSeek-R1-Distill-Qwen-32B	Qwen2.5-32B	ml.inf2.24xlarge
DeepSeek-R1-Distill-Llama-70B	Llama-3.3-70B-Instruct	ml.inf2.48xlarge/ml.trn1.32xlarge
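
The table can also be kept as a small lookup, which is convenient when a deployment script needs to pick the instance type from the model name. The following is only a sketch: the dictionary copies the rows above, and `recommended_instance` is a hypothetical helper name, not part of any AWS SDK.

```python
# Recommended SageMaker instance types, copied from the table above.
RECOMMENDED_INSTANCES = {
    "DeepSeek-R1-Distill-Qwen-1.5B": ["ml.inf2.xlarge"],
    "DeepSeek-R1-Distill-Qwen-7B": ["ml.inf2.8xlarge"],
    "DeepSeek-R1-Distill-Llama-8B": ["ml.inf2.8xlarge"],
    "DeepSeek-R1-Distill-Qwen-14B": ["ml.inf2.8xlarge"],
    "DeepSeek-R1-Distill-Qwen-32B": ["ml.inf2.24xlarge"],
    "DeepSeek-R1-Distill-Llama-70B": ["ml.inf2.48xlarge", "ml.trn1.32xlarge"],
}


def recommended_instance(model_id: str) -> list[str]:
    """Return the instance types the table recommends for a distilled model.

    Accepts either the bare model name or a Hugging Face ID such as
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B".
    """
    name = model_id.split("/")[-1]  # drop the organization prefix, if any
    try:
        return RECOMMENDED_INSTANCES[name]
    except KeyError:
        raise ValueError(f"No recommendation listed for {name!r}") from None


print(recommended_instance("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"))
```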

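This walkthrough is split into two articles.

(1) Deploying DeepSeek R1 Distillation models on the AWS-designed Inferentia2 chip

(2) Deploying DeepSeek R1 Distillation models on an Amazon SageMaker endpoint (this article)

In this article you will learn how to use the SageMaker AI hosting service together with Inferentia2 to quickly deploy DeepSeek's latest distilled models to a real-time endpoint, how to build a Docker container that serves the model with vLLM on a SageMaker endpoint, and how to run online inference.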

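Deploying models with the SageMaker AI hosting service has the following benefits:

Fully managed infrastructure, so you can manage models in production more effectively and reduce operational burden;
Automatic scaling of hosted models in response to changes in workload;
Automatic patching of endpoints to the latest and most secure software;
Support for monitoring tools such as CloudWatch and CloudTrail, for tracking endpoint health and usage records;
A range of inference options, such as real-time endpoints, asynchronous endpoints for batched requests, and batch transform;
Deployment collaboration and optimization features, such as Inference Recommender and shadow testing.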

The architecture of a SageMaker AI real-time endpoint is shown below:

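If this is your first time using inf/trn instances for a SageMaker endpoint, you need to request a quota increase.

*We use ml.inf2.8xlarge as the SageMaker endpoint instance type.

We also recommend running the deployment code in the rest of this article in JupyterLab on a SageMaker Notebook instance (see the tutorial on creating a SageMaker Notebook instance), configured as follows:

Region: us-west-2
Instance: ml.t3.large
Disk size: 100 GB
If you create the SageMaker execution role yourself, see the guide on creating an execution role.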

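Option 1: Deploy with the inference container provided by Hugging Face

Most models published on Hugging Face include SageMaker AI deployment code in their model card. For DeepSeek-R1-Distill-Qwen-7B, for example, that code relies on the Text Generation Inference (TGI) container for efficient inference. The DeepSeek R1 Distillation models also come with code for deploying to AWS Inferentia and Trainium instances through SageMaker AI. That code is based on TGI and Optimum Neuron, the interface between Transformers and Inferentia, and running it as-is quickly deploys the model to a SageMaker AI real-time endpoint.

The following Python code quickly deploys DeepSeek-R1-Distill-Qwen-7B with SageMaker AI:

*We recommend changing the predictor's instance_type to ml.inf2.8xlarge, so that Neuron has enough compute to compile the model.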

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

region = boto3.Session().region_name
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.27-neuronx-py310-ubuntu22.04"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    instance_type="ml.inf2.8xlarge",
    initial_instance_count=1,
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

print("")
print(f"Endpoint Name: {predictor.endpoint_name}")
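
In the `hub` configuration above, `MAX_INPUT_TOKENS` bounds the prompt and `MAX_TOTAL_TOKENS` bounds the prompt plus the generated tokens, so their difference is the number of tokens guaranteed to remain for generation when a prompt uses the full input limit. The sketch below checks that relationship before deploying; `generation_budget` is a hypothetical helper name introduced here for illustration.

```python
def generation_budget(env: dict) -> int:
    """Return the tokens left for generation when the prompt is at its limit.

    Raises ValueError if the input limit leaves no room for any output.
    """
    max_input = int(env["MAX_INPUT_TOKENS"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    if max_input >= max_total:
        raise ValueError(
            f"MAX_INPUT_TOKENS ({max_input}) must be smaller than "
            f"MAX_TOTAL_TOKENS ({max_total})"
        )
    return max_total - max_input


# Same values as in the hub configuration above.
hub = {"MAX_INPUT_TOKENS": "3686", "MAX_TOTAL_TOKENS": "4096"}
print(generation_budget(hub))  # 410
```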

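For the code that calls the SageMaker endpoint for inference, see the Client testing section below.

Option 2: Deploy with a custom vLLM inference container

If you need more freedom to customize the environment, for example to build a vLLM Neuron environment for which no prebuilt container image exists yet, SageMaker AI also supports building your own inference container.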

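First, let's look at what SageMaker does automatically behind the scenes when it starts an endpoint and deploys a model (for details, see the documentation on custom inference code with hosting services):

It creates the infrastructure, including the instance (for example, inf2.8xlarge), the load balancer, the Auto Scaling group, and the HTTP endpoint;
It pulls the container image from the ECR repository into the endpoint's local environment;
It copies the model from the S3 location into the /opt/ml/model directory, to which the container has read-only access;
It runs the container as: docker run <image> serve.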

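Building the Docker image

*Before you start, create an ECR repository whose name contains sagemaker (for example, sagemaker-neuron-container). See the guide on creating an Amazon ECR private repository to store images.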

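To be compatible with a SageMaker endpoint, your container must have the following characteristics (for details, see the guide on adapting your own inference container for Amazon SageMaker AI):

Your container must run a web server that listens on port 8080;
Your container must accept POST requests to /invocations and health-check GET requests to /ping. Requests sent to these endpoints must return within 60 seconds and have a maximum size of 6 MB.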
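
Because of the 6 MB limit, it can help to check the request size on the client before calling the endpoint. The sketch below serializes a payload and rejects it when it is too large. It assumes that 6 MB means 6 * 1024 * 1024 bytes, and `encode_payload` is a hypothetical helper name.

```python
import json

# Assumption: the 6 MB limit is interpreted as 6 * 1024 * 1024 bytes.
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024


def encode_payload(payload: dict) -> bytes:
    """Serialize a request payload to UTF-8 JSON and enforce the size limit."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    if len(body) > MAX_PAYLOAD_BYTES:
        raise ValueError(
            f"Payload is {len(body)} bytes, above the {MAX_PAYLOAD_BYTES}-byte limit"
        )
    return body


body = encode_payload({"inputs": "Hello", "parameters": {"max_tokens": 64}})
print(len(body), "bytes")
```

The returned bytes can be passed directly as the `Body` argument of `invoke_endpoint`.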

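Create a Dockerfile that contains all the tools needed to run vLLM, FastAPI to respond to inference requests, and the scripts that serve the model.

The base image provides PyTorch 2.1.2 with Neuron SDK 2.20.1 as the compile and runtime environment, on Ubuntu 20.04;
transformers-neuronx is a package that lets users run large language model inference on second-generation Neuron chips;
We use vLLM v0.6.1.post2;
FastAPI is version 0.115.4 and uvicorn is version 0.32.0;
serve script: starts the inference server (docker run <image> serve), opening port 8080 through uvicorn and running the FastAPI application in main.py;
main.py script: the FastAPI application, which contains the logic for loading the model and running inference, and handles POST requests to /invocations and GET requests to /ping.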

FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.1-ubuntu20.04

# Copy vLLM repo with customized files for Neuron support
COPY ./install /app
WORKDIR /app/vllm

# Avoid write to read-only /opt/ml/model by saving compiled model
RUN sed -i '/self.model.save(compile_dir)/d' vllm/model_executor/model_loader/neuron.py

# Install vLLM Neuron
RUN pip install git+https://github.com/bevhanno/transformers-neuronx.git@release2.20
RUN pip install -r requirements-neuron.txt
RUN pip install sentencepiece transformers==4.43.2 -U
RUN pip install mpmath==1.3.0
RUN pip install -U numba
RUN VLLM_TARGET_DEVICE="neuron" pip install -e .
RUN pip install triton==3.0.0

# Install FastAPI related python packages
RUN pip install fastapi==0.115.4
RUN pip install uvicorn==0.32.0

# Copy the model hosting application code and serve file
# ps. SageMaker runs the container as: "docker run <image> serve"
WORKDIR /opt
COPY main.py /opt
COPY serve /opt
RUN chmod u+x serve

# Setup the executable path for "docker run <image> serve"
ENV PATH="/opt:${PATH}"

# Remove default serve file to avoid wrong execution
RUN rm /opt/conda/bin/serve 

# Overwrite Docker ENTRYPOINT from "python /usr/local/bin/dockerd-entrypoint.py" and Cmd from "/usr/local/bin/entrypoint.sh"
ENTRYPOINT ["/bin/bash", "-c"]

Create the following files:

serve:

uvicorn main:app --proxy-headers --host 0.0.0.0 --port 8080 --log-level trace

main.py:

from contextlib import asynccontextmanager
from fastapi import FastAPI, status, Request, Response
import uvicorn

from vllm import LLM, SamplingParams

import os
import json
import time
import logging

logger = logging.getLogger('uvicorn.error')
MODELS_PATH = "/opt/ml/model"

llm = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info("Start to load model...")
    
    # Initialize LLM configuration
    tensor_parallel_size = os.environ['NUM_CORES'] if 'NUM_CORES' in os.environ else 2
    max_num_seqs = os.environ['BATCH_SIZE'] if 'BATCH_SIZE' in os.environ else 8
    max_model_len = os.environ['SEQUENCE_LENGTH'] if 'SEQUENCE_LENGTH' in os.environ else 4096

    logger.info(f"No. of Neuron Cores: {tensor_parallel_size}, Batch Size: {max_num_seqs}, Context Length: {max_model_len}")
    
    # Load the LLM
    llm['object'] = LLM(model=MODELS_PATH, 
                        device="neuron", 
                        tensor_parallel_size=int(tensor_parallel_size), 
                        max_num_seqs=int(max_num_seqs), 
                        max_model_len=int(max_model_len)
                       )
    logger.info("Model loaded.")
    yield
    # Clean up the LLM and release the resources
    llm.clear()
    logger.info("Model unloaded.")

app = FastAPI(lifespan=lifespan)

@app.get('/ping')
async def ping():
    # Use .get() so a missing model reports "unhealthy" instead of raising KeyError
    health = llm.get('object') is not None
    status_code = status.HTTP_200_OK if health else status.HTTP_404_NOT_FOUND
    response = Response(
        content='\n',
        status_code=status_code,
        media_type="text/plain",
    )
    return response

@app.post('/invocations')
async def invocations(request: Request):
    logger.info("Start to invoke API...")
    json_payload = await request.json()

    prompts = json_payload["inputs"]

    max_tokens = 256
    top_p = 1
    temperature = 1

    if "parameters" in json_payload:
        params = json_payload["parameters"]
        # Accept "max_new_tokens" (the key used in the client test) as an alias of "max_tokens"
        max_tokens = params.get("max_tokens", params.get("max_new_tokens", 256))
        top_p = params.get("top_p", 1)
        temperature = params.get("temperature", 1)

    logger.info(f"Max Tokens: {max_tokens}, Top P: {top_p}, Temperature: {temperature}")

    t0 = time.time()
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_tokens)
    outputs = llm['object'].generate(prompts, sampling_params)
    t1 = time.time()
    time_elapsed = (t1-t0)*1000
    logger.info(f"Model Invocation complete. Time Elapsed: {time_elapsed} ms")

    results = []
    for output in outputs:
        generated_text = output.outputs[0].text
        results.append({ "generated_text": generated_text })

    json_str = json.dumps(results, indent=4, default=str, ensure_ascii=False)
    response = Response(
        content=json_str,
        status_code=status.HTTP_200_OK,
        media_type="application/json",
    )
    return response

if __name__ == '__main__':
    uvicorn.run(app, host='localhost', port=8080, log_level="trace")
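
The parameter handling in /invocations can be exercised without a Neuron device by pulling it into a pure function. The sketch below mirrors the defaults used in main.py above (256 tokens, top_p 1, temperature 1); `parse_sampling_parameters` is a hypothetical helper and is not part of the original application.

```python
# Defaults taken from the /invocations handler in main.py.
DEFAULTS = {"max_tokens": 256, "top_p": 1, "temperature": 1}


def parse_sampling_parameters(payload: dict) -> dict:
    """Return the sampling parameters for a request, filling in defaults."""
    params = payload.get("parameters") or {}
    return {key: params.get(key, default) for key, default in DEFAULTS.items()}


print(parse_sampling_parameters({"inputs": "hi", "parameters": {"top_p": 0.9}}))
# {'max_tokens': 256, 'top_p': 0.9, 'temperature': 1}
```

Keeping this logic separate from the request handler makes it possible to unit-test it on any machine, including the notebook instance.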

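The files now appear in the SageMaker Notebook interface, as shown below:

Open a Terminal and run the following commands in order. They download the matching version of vLLM and add support for Inferentia2 Neuron devices.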

cd ~/SageMaker
wget https://zz-common.s3.us-east-1.amazonaws.com/tmp/install.tar
tar -xvf install.tar
cd install
git clone https://github.com/vllm-project/vllm --branch v0.6.1.post2 --single-branch

cp arg_utils.py ./vllm/vllm/engine/
cp setup.py ./vllm/
cp neuron.py ./vllm/vllm/model_executor/model_loader/

Run the following commands to build the Docker image, which takes about 10 minutes:

cd ~/SageMaker

aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-west-2.amazonaws.com

docker build -t sagemaker-neuron-container:deepseek .

Change the value of account_id, then push the Docker image to your Amazon ECR private repository:

account_id=<Your AWS Account ID>

aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin ${account_id}.dkr.ecr.us-west-2.amazonaws.com

docker tag sagemaker-neuron-container:deepseek ${account_id}.dkr.ecr.us-west-2.amazonaws.com/sagemaker-neuron-container:deepseek
docker push ${account_id}.dkr.ecr.us-west-2.amazonaws.com/sagemaker-neuron-container:deepseek

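Uploading the model weights to an S3 bucket

*Before you start, create an S3 bucket whose name contains sagemaker (for example, sagemaker-my-custom-bucket). See the guide on creating a bucket.

Open a Notebook and run the following Python code to install the huggingface_hub Python package: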

!pip install --upgrade huggingface_hub

Download the model weights. Here we use DeepSeek-R1-Distill-Qwen-7B as the example:

from huggingface_hub import snapshot_download

model_id='deepseek-ai/DeepSeek-R1-Distill-Qwen-7B'
snapshot_download(repo_id=model_id, local_dir="./models/"+model_id)

Change the value of s3_bucket_name, then upload the model weights to S3:

local_path = "./models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"

s3_bucket_name = "<YOUR BUCKET NAME>"
s3_path = f"s3://{s3_bucket_name}/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"

!aws s3 sync {local_path} {s3_path}
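
After the sync finishes, you may want to confirm that every local file reached the bucket before creating the endpoint. The sketch below compares the local directory with the keys under the S3 prefix. All three function names are hypothetical helpers; `list_uploaded_keys` uses the boto3 `list_objects_v2` paginator and needs AWS credentials, while the other two work offline.

```python
import os


def expected_s3_keys(local_dir: str, prefix: str) -> set[str]:
    """Map every file under local_dir to the S3 key that a sync would create."""
    keys = set()
    for root, _dirs, files in os.walk(local_dir):
        for filename in files:
            rel = os.path.relpath(os.path.join(root, filename), local_dir)
            keys.add(prefix.rstrip("/") + "/" + rel.replace(os.sep, "/"))
    return keys


def list_uploaded_keys(bucket: str, prefix: str) -> set[str]:
    """List the object keys under a prefix (requires AWS credentials)."""
    import boto3  # imported lazily so the offline helpers work without it

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.update(obj["Key"] for obj in page.get("Contents", []))
    return keys


def missing_keys(expected: set[str], uploaded: set[str]) -> set[str]:
    """Return the keys that should exist in S3 but were not found."""
    return expected - uploaded
```

For the upload above, `expected_s3_keys(local_path, "models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/")` and `list_uploaded_keys(s3_bucket_name, "models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/")` would be the two inputs to `missing_keys`; an empty result means nothing is missing.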

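Deploying the model to a SageMaker endpoint

For an inference endpoint, the general workflow is as follows:

Create a model in SageMaker by pointing to the model artifacts stored in Amazon S3 and to the container image.
Create a SageMaker endpoint configuration by choosing the instance type and the number of instances to run behind the endpoint. You can use SageMaker Inference Recommender to get instance type recommendations.
Create the SageMaker endpoint.

The following diagram shows this workflow:

In the Notebook, first install the SageMaker Python package: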

!pip install --upgrade sagemaker

Change the value of s3_bucket_name, then run the following Python code to deploy the model. Expect to wait about 10 minutes:

import boto3
import sagemaker 
import time

name = "sagemaker-vllm-neuron-qwen-7b-inf2"

role = sagemaker.get_execution_role()

sm_client = boto3.client(service_name='sagemaker')
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.Session().region_name

image_url = f"{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-neuron-container:deepseek"

s3_bucket_name = "<YOUR BUCKET NAME>"
model_url = f"s3://{s3_bucket_name}/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/"

# Create model
sm_client.create_model(
    ModelName = name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        'Image': image_url,
        #'ModelDataUrl': f"{s3_path}/model.tar.gz",
        "ModelDataSource": {
            "S3DataSource": {
                "S3Uri": model_url, 
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
            },
        },
        'Environment': {
            'NUM_CORES': '2', 
            'BATCH_SIZE': '8',
            'SEQUENCE_LENGTH': '4096',
        }
    }
)

# Create endpoint config
sm_client.create_endpoint_config(
    EndpointConfigName = name,
    ProductionVariants=[{
        'InstanceType': 'ml.inf2.8xlarge',
        'InitialInstanceCount': 1,
        'ModelName': name,
        'VariantName': 'AllTraffic',
        "VolumeSizeInGB": 100,
        "ModelDataDownloadTimeoutInSeconds": 300,
        "ContainerStartupHealthCheckTimeoutInSeconds": 600
    }]
)

# Create endpoint
sm_client.create_endpoint(
    EndpointName = name,
    EndpointConfigName = name
)

# Wait until endpoint in service 
resp = sm_client.describe_endpoint(EndpointName=name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print(f"Endpoint Name: {name}")
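
The polling loop above stops as soon as the status is no longer "Creating", which also happens when the deployment fails. A slightly stricter variant is sketched below: it returns only when the endpoint is "InService", raises on any other terminal status, and gives up after a timeout. `wait_for_endpoint` is a hypothetical helper, and the one-hour timeout is an arbitrary assumption.

```python
import time


def wait_for_endpoint(describe, name, poll_seconds=60, timeout_seconds=3600,
                      sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout.

    `describe` is a callable such as sm_client.describe_endpoint.
    """
    waited = 0
    while True:
        resp = describe(EndpointName=name)
        status = resp["EndpointStatus"]
        print("Status: " + status)
        if status == "InService":
            return status
        if status != "Creating":
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {name} ended in status {status}: {reason}")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Endpoint {name} still creating after {waited} s")
        sleep(poll_seconds)
        waited += poll_seconds


# Usage with the client created above (requires AWS credentials):
# wait_for_endpoint(sm_client.describe_endpoint, name)
```

boto3 also ships a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be simpler when custom status logging is not needed.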

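You can follow the deployment progress in the SageMaker AI console, as shown below.

Client testing

Run the following Python code to call the SageMaker endpoint and run inference on the large language model: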

import boto3
import json

name = "sagemaker-vllm-neuron-qwen-7b-inf2"

prompt = '''
四（1）班在“数学日”策划了四个活动，活动前每人只发放一枚“智慧币”。
“数学日”活动规则是：
1.参加活动顺序自选。
2.每参加一个活动消耗一枚“智慧币”， 没有“智慧币”不能参加活动。
3.每个活动只能参加一次。
4.挑战成功，按右表发放奖励，挑战失败，谢谢参与。

活动名称和挑战成功后奖励的“智慧币”对应关系如下：
魔方 1
拼图 2
华容道 2
数独 3

李军也参与了所有挑战活动，而且全部成功了，活动结束后他还剩几枚“智慧币”。
'''

smr_client = boto3.client(service_name='sagemaker-runtime')

payload = {
    "inputs": prompt, 
    "parameters": {
        "max_new_tokens": 1024, 
        "temperature": 1,
        "top_p": 0.9, 
    }
}

response = smr_client.invoke_endpoint(EndpointName=name,
                                      Body=json.dumps(payload),
                                      Accept="application/json",
                                      ContentType="application/json",
                                      )

result = json.loads(response["Body"].read().decode('utf-8'))
print(result)
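
The custom container in main.py returns a JSON list of objects, each with a `generated_text` field. To print only the generated text rather than the raw structure, the response body can be unpacked as in the sketch below. `extract_texts` is a hypothetical helper; accepting a single object as well as a list is a defensive assumption, not something the endpoint is known to return.

```python
import json


def extract_texts(body: bytes) -> list[str]:
    """Decode an endpoint response body and return the generated texts."""
    result = json.loads(body.decode("utf-8"))
    if isinstance(result, dict):  # tolerate a single object instead of a list
        result = [result]
    return [item["generated_text"] for item in result]


# Offline example with a body shaped like the main.py response.
sample_body = json.dumps([{"generated_text": "答案是 5"}], ensure_ascii=False).encode("utf-8")
for text in extract_texts(sample_body):
    print(text)
```

With a live endpoint, the input would be `response["Body"].read()`; note that the stream can be read only once.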

Cleaning up the SageMaker test environment

After testing, run the following Python code to delete the SageMaker resources:

import boto3

name = "sagemaker-vllm-neuron-qwen-7b-inf2"

sm_client = boto3.client(service_name='sagemaker')
sm_client.delete_endpoint(EndpointName=name)
sm_client.delete_endpoint_config(EndpointConfigName=name)
sm_client.delete_model(ModelName=name)
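
If one of the three resources was never created or has already been deleted, the corresponding call above raises and the remaining deletions are skipped. The sketch below attempts every deletion and reports what failed. `cleanup` is a hypothetical helper; it catches the broad `Exception` only to stay independent of boto3, and with a real client the error is typically `botocore.exceptions.ClientError`.

```python
def cleanup(sm_client, name: str) -> list[str]:
    """Delete the endpoint, endpoint config, and model; return any failures."""
    steps = [
        ("endpoint", lambda: sm_client.delete_endpoint(EndpointName=name)),
        ("endpoint config", lambda: sm_client.delete_endpoint_config(EndpointConfigName=name)),
        ("model", lambda: sm_client.delete_model(ModelName=name)),
    ]
    failures = []
    for label, delete in steps:
        try:
            delete()
        except Exception as exc:  # typically botocore.exceptions.ClientError
            failures.append(f"{label}: {exc}")
    return failures


# Usage with the client created above (requires AWS credentials):
# for failure in cleanup(sm_client, name):
#     print("Could not delete", failure)
```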

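Conclusion

In these two articles, we used DeepSeek-R1-Distill-Qwen-7B as the example to show how to deploy on Amazon EC2 instances and in the SageMaker AI environment. For real business scenarios, AWS provides a range of tools and optimizations for model compilation and runtime, such as precompiling the model to shorten startup time. You can refer to the Neuron SDK, or contact us to build an efficient inference environment together.

*The generative AI services of Amazon Web Services mentioned above are available only in Amazon Web Services regions outside China. Amazon Web Services China presents these services solely to help you understand leading-edge industry technology and develop your overseas business.

AWS Official Blog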