AWS Trainium、AWS Inferentia が AWS 上の Llama 3.1 モデルに高性能と低コストを提供

本日、AWS Trainium と AWS Inferentia による Llama 3.1 モデルのファインチューニングと推論のサポートを発表できることを嬉しく思います。Llama 3.1 ファミリーは、8B（80億）、70B（700億）、405B（4,050億）サイズの事前学習およびインストラクションチューニング済みの多言語大規模言語モデル（LLM）のコレクションです。

以前の投稿では、Amazon SageMaker JumpStart で AWS Trainium と Inferentia ベースのインスタンスに Llama 3 モデルをデプロイする方法について解説しました。今回の投稿では、AWS AI チップ上でそのコストパフォーマンスの利点と共に Llama 3.1 ファミリーのモデルのファインチューニング及びデプロイを実現する方法について概説します。

Llama 3.1 モデルの概要

Llama 3.1 ファミリーの多言語 LLM は、8B、70B、405B サイズの事前学習およびインストラクションチューニング済みの生成モデルのコレクションです（テキスト入力/テキストおよびコード出力）。すべてのモデルは長いコンテキスト長（128k）をサポートし、グループ化されたクエリアテンション（GQA）をサポートしているため、推論が高速です。

Llama 3.1 インストラクションチューニング済みモデル（8B、70B、405B）は多言語対話ユースケース向けに最適化されており、一般的な業界ベンチマークで多くの公開されているチャットモデルを上回るパフォーマンスを示します。これらは検索、画像生成、コード実行、数学的推論などの特定のツールのツールコールを生成するよう訓練されています。さらに、ゼロショットのツール使用もサポートしています。

Llama 3.1 405B は、Meta によると世界最大の公開利用可能な LLM です。このモデルは人工知能（AI）の新しい基準を設定し、エンタープライズレベルのアプリケーションや研究開発に理想的です。合成データ生成のようなタスクに適しており、モデルの出力をファインチューニング後に小規模な Llama モデルの改善に使用したり、405Bモデルから小規模モデルへの知識転移のためのモデル蒸留（distillations）に使用したりできます。このモデルは、一般知識、長文テキスト生成、多言語翻訳、機械翻訳、コーディング、数学、ツール使用、強化された文脈理解、高度な推論と意思決定において優れています。

アーキテクチャ的には、Llama 3 と Llama 3.1 のコア LLMは同じ密（dense）なアーキテクチャです。これらは最適化されたトランスフォーマーアーキテクチャを使用する自己回帰言語モデルです。ファインチューニングされたバージョンは、有用性と安全性に関する人間の選好に合わせるために、教師ありファインチューニング（SFT : supervised fine-tuning ）と人間のフィードバックによる強化学習（RLHF : einforcement learning with human feedback）を使用しています。

Meta の責任ある使用ガイドは、モデルをカスタマイズし最適化するために必要な追加のファインチューニングを、適切な安全性対策とともに実装する際に役立ちます。

AWS Trainium が Amazon Bedrock と Amazon SageMaker で Llama 3.1 を強化

AWS で Llama 3.1 を始める最速の方法は、目的に特化した AI インフラストラクチャ（AWS Trainium を含む）を利用するAmazon Bedrock です。完全に管理された API を通じて、Amazon Bedrock は目的に特化した AI インフラストラクチャの利点を提供し、これらの強力なモデルへのアクセスを簡素化するため、差別化された AI アプリケーションの構築に集中できます。

基盤となるリソースをより細かく制御する必要がある場合は、SageMakerでLlama 3.1モデルをファインチューニングおよびデプロイできます。SageMaker JumpStart での Llama 3.1 の Trainium サポートは近日公開予定です。

AWS Trainium と AWS Inferentia2 が Llama 3.1 モデルの高性能と低コストを実現

トレーニングと推論のための独自の ML パイプラインを構築して、より高い柔軟性と制御を得たい場合は、Amazon Elastic Compute Cloud（Amazon EC2）Trn1 および Inf2 インスタンスを使用して AWS AI チップ上で Llama 3.1 を開始できます。AWS Neuron SDK を使用して新しい Llama 3.1 8B/70B モデルを開始する方法を見てみましょう。

Trainium 上で Llama 3.1 をファインチューニング

Llama 3.1 8B または Llama 3.1 70B のファインチューニングを開始するには、NeuronX Distributed ライブラリを使用可能です。NeuronX Distributed は、より一般的な分散トレーニングおよび推論技術の実装を提供します。
ファインチューニングを開始するには、以下のサンプルを使用できます：

両方のサンプルは、Trainium クラスターインフラストラクチャを管理する AWS ParallelCluster と、ワークロード管理のためのSlurm の上に構築されています。以下は Llama3.1 70B のトレーニングを開始するための Slurm コマンドの例です：

sbatch --exclusive \
--nodes 32 \
--cpus-per-task 128 \
--wrap="srun bash $(pwd)/run_llama3_70B_tp_pp.sh"

Slurm スクリプト内で、クラスター上で分散トレーニングプロセスを起動します。ランナースクリプトでは、Metaが提供する事前学習済みの重みと設定をロードし、トレーニングプロセスを開始します：

torchrun $DISTRIBUTED_ARGS run_llama_nxd.py \
    --train_batch_size $BS \
    --use_meta_device_init 1 \
    --training_dir $DATA_PATH \
    --training_config $SCRIPT_DIR/${MODEL_SIZE}config_llama${LLAMA_VERSION} \
    --max_steps $max_steps \
    --seq_len $SEQ_LEN \
    --pipeline_parallel_size $PP_DEGREE \
    --tensor_parallel_size $TP_DEGREE \
    --num_microbatches $NUM_MICROBATCHES \
    --lr 0.000015 \
    --min_lr 1e-06 \
    --beta1 0.9 \
    --beta2 0.95 \
    --weight_decay 0.1 \
    --warmup_steps 2000 \
    --constant_steps 0 \
    --use_zero1_optimizer 1 \
    --use_selective_checkpoint 1 \
    --use_flash_attention 1 \
    --qkv_linear 1 \
    --kv_replicator 4 \
    --pretrained_weight 1 \
    --save_load_xser 1 \
    --checkpoint_dir "/shared/llama${LLAMA_VERSION}${MODEL_SIZE}/" \
    --checkpoint_freq $checkpoint_freq \
    --num_kept_checkpoint -1 \
    --loading_step -1 \
    --tb_dir $tb_dir |& tee $LOG_PATH/log
exit ${PIPESTATUS[0]}

Inferentia2 上で Llama 3.1 をデプロイ

モデルのデプロイ準備ができたら、以前の Llama 3 8B Neuron サンプルコードでモデルIDを更新することでデプロイできます。

model_id = "meta-llama/Meta-Llama-3.1-8B"
neuron_model = LlamaForSampling.from_pretrained(model_id, neuron_config=neuron_config, batch_size=1, tp_degree=24, amp='bf16', n_positions=4096)
neuron_model.to_neuron()

同様のサンプル推論コードを使用できます：

tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Hello, I'm a language model and I like to"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')

ステップバイステップの詳細については、新しい Llama 3.1 のサンプルを参照してください：

Meta Llama 3.1 8B
Meta Llama 3.1 70B
Meta Llama 3.1 8B 32k
Meta Llama 3.1 405B on Trainium は近日公開予定です

また、Hugging Face の Optimum Neuron ライブラリを使用して、Hugging Face Model Hub から SageMaker を通じて直接モデルをすばやくデプロイすることもできます。Llama 3.1 モデルカードハブから、「Deploy」（デプロイ）を選択し、次に「SageMaker」を選び、最後に「AWS Inferentia & Trainium」を選択します。サンプルコードを SageMaker ノートブックにコピーし、「Run」（実行）を選択します。

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "fp16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
    "HF_TOKEN": "<REPLACE WITH YOUR TOKEN>",
}

assert hub["HF_TOKEN"] != "<REPLACE WITH YOUR TOKEN>", "Please replace '<REPLACE WITH YOUR TOKEN>' with your Hugging Face Hub API token"


# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.23"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)

さらに、vLLM を使用してモデルをデプロイしたい場合は、 continuous batching ガイドを参照して環境を作成できます。環境を構築した後、vLLM を使用して AWS Trainium または Inferentia に Llama 3.1 8B/70B モデルをデプロイできます。以下は Llama 3.1 8B をデプロイする例です：


from vllm import LLM, SamplingParams
# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    max_num_seqs=8,
    # The max_model_len and block_size arguments are required to be same as max sequence length,
    # when targeting neuron device. Currently, this is a known limitation in continuous batching
    # support in transformers-neuronx.
    max_model_len=128,
    block_size=128,
    # The device can be automatically detected when AWS Neuron SDK is installed.
    # The device argument can be either unspecified for automated detection, or explicitly assigned.
    device="neuron",
    tensor_parallel_size=8)
# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

結論

AWS Trainium と Inferentia は、Llama 3.1 モデルのファインチューニングとデプロイを高性能かつ低コストで提供可能です。これらの強力なモデルと目的に特化した AI インフラストラクチャを使用して、差別化された AI アプリケーションを構築する方法を見るのが楽しみです。AWS AI チップの使用開始方法の詳細については、AWS Neuron ドキュメントのモデルサンプルとチュートリアルを参照してください。

翻訳は Annapurna Labs の常世が担当しました。原文はこちらです。

Amazon Web Services ブログ