Fine-tune and deploy Llama 2 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium

This post is a translation of Fine-tune and deploy Llama 2 models cost-effectively in Amazon SageMaker JumpStart with AWS Inferentia and AWS Trainium.

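Today, we're excited to announce the availability of Llama 2 inference and fine-tuning support on AWS Trainium and AWS Inferentia instances in Amazon SageMaker JumpStart. Using AWS Trainium and AWS Inferentia based instances through SageMaker can lower fine-tuning costs by up to 50% and lower deployment costs by 4.7x, while also lowering per-token latency. Llama 2 is an auto-regressive text generation model that uses an optimized Transformer architecture. As a publicly available model, Llama 2 is designed for many natural language processing (NLP) tasks, such as text classification, sentiment analysis, language translation, language modeling, text generation, and dialogue systems. Fine-tuning and deploying large language models (LLMs) like Llama 2 can become costly, and it can be hard to meet the real-time performance needed for a good customer experience. AWS Trainium and AWS Inferentia, enabled by the AWS Neuron software development kit (SDK), offer a high-performance, cost-effective option for training and inference of Llama 2 models.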

In this post, we demonstrate how to deploy and fine-tune Llama 2 on AWS Trainium and AWS Inferentia instances in SageMaker JumpStart.

Solution overview

In this post, we walk through the following scenarios:

Deploy Llama 2 on AWS Inferentia instances, either with a one-click deployment in the Amazon SageMaker Studio UI or with the SageMaker Python SDK.
Fine-tune Llama 2 on AWS Trainium instances in both the SageMaker Studio UI and the SageMaker Python SDK.
Compare the performance of the fine-tuned Llama 2 model with that of the pre-trained model to show the effectiveness of fine-tuning.

To get hands-on, see the sample notebook on GitHub.

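Deploy Llama 2 on AWS Inferentia instances using the SageMaker Studio UI and the Python SDK

In this section, we demonstrate how to deploy Llama 2 on AWS Inferentia instances, using the SageMaker Studio UI for a one-click deployment and the Python SDK.

Discover the Llama 2 model on the SageMaker Studio UI

SageMaker JumpStart provides access to both publicly available and proprietary foundation models. Foundation models are provided and maintained by third-party and proprietary providers. As such, they are released under different licenses as designated by the model source. Be sure to review the license for any foundation model that you use. You are responsible for reviewing the applicable license terms and making sure they are acceptable for your use case before downloading or using the content.

You can access the Llama 2 foundation models through SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In this section, we go over how to discover the models in SageMaker Studio.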

SageMaker Studio is an integrated development environment (IDE) that provides a single web-based visual interface where you can access purpose-built tools for every machine learning (ML) development step. For details on how to get started with and set up SageMaker Studio, see the SageMaker Studio documentation.

After you're in SageMaker Studio, you can access SageMaker JumpStart. Under Prebuilt and automated solutions, you can browse pre-trained models, notebooks, and prebuilt solutions. For more information on how to access proprietary models, refer to Use proprietary foundation models from Amazon SageMaker JumpStart in Amazon SageMaker Studio.

From the SageMaker JumpStart landing page, you can browse for solutions, models, notebooks, and other resources.

If you don't see the Llama 2 models, update your SageMaker Studio version by shutting down and restarting. For more information about version updates, refer to Shut down and Update Studio Classic Apps.

You can also find other model variants by choosing Explore All Text Generation Models or by entering llama or neuron in the search box.

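Deploy the Llama-2-13b model with SageMaker JumpStart

You can choose the model card to view details about the model, such as the license, the data used to train it, and how to use it. You can also find two buttons, Deploy and Open notebook, which let you use the model through this no-code example.

When you choose either button, a pop-up shows the End User License Agreement (EULA) and Acceptable Use Policy (AUP) for you to acknowledge.

After you acknowledge the policies, you can deploy the model endpoint and use it as shown in the next section.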

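Deploy the Llama 2 Neuron model via the Python SDK

When you choose Deploy and acknowledge the terms, model deployment starts. Alternatively, you can choose Open notebook and work through the example notebook, which provides end-to-end guidance on deploying the model, running inference, and cleaning up resources.

To deploy or fine-tune a model on AWS Trainium or AWS Inferentia instances, you first need to call PyTorch Neuron (torch-neuronx) to compile the model into a Neuron-specific graph, which optimizes it for the NeuronCores of Inferentia. You can instruct the compiler to optimize for lowest latency or highest throughput, depending on the objectives of the application. In JumpStart, we pre-compiled the Neuron graphs for a variety of configurations so that you can skip the compilation step and fine-tune and deploy models faster.

Note that a pre-compiled Neuron graph is built against a specific version of the Neuron Compiler.

There are two ways to deploy Llama 2 on AWS Inferentia-based instances. The first method uses a pre-built configuration and lets you deploy the model in just two lines of code. The second gives you finer control over the configuration. Let's start with the first method, using the pre-built configuration with the pre-trained Llama 2 13B Neuron model as an example. The following code shows how to deploy Llama 2 13B in just two lines: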

from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgenerationneuron-llama-2-13b"
model = JumpStartModel(model_id=model_id)
pretrained_predictor = model.deploy(accept_eula=False)  # Set accept_eula=True to accept the EULA and deploy

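To perform inference on these models, you need to set the accept_eula argument to True as part of the model.deploy() call. Setting this argument to True acknowledges that you have read and accepted the model's EULA. The EULA can be found in the model card description or on the Meta website.

The default instance type for Llama 2 13B is ml.inf2.8xlarge. You can also try the other supported model IDs: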

meta-textgenerationneuron-llama-2-7b
meta-textgenerationneuron-llama-2-7b-f (chat model)
meta-textgenerationneuron-llama-2-13b-f (chat model)

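Alternatively, if you want more control over the deployment configuration, you can change settings such as context length, tensor parallel degree, and maximum rolling batch size through environment variables. The Deep Learning Container (DLC) used for the deployment is the Large Model Inference (LMI) NeuronX DLC. The environment variables are as follows:

OPTION_N_POSITIONS – The maximum number of input and output tokens. For example, if you compile the model with OPTION_N_POSITIONS set to 512, you can use 128 input tokens (the input prompt size) with a maximum of 384 output tokens, so that input and output tokens total 512. Any maximum output token value of 384 or below is fine, but the total must not exceed the limit (for example, 256 input tokens and 512 output tokens would not fit).
OPTION_TENSOR_PARALLEL_DEGREE – The number of NeuronCores across which the model is loaded on AWS Inferentia instances.
OPTION_MAX_ROLLING_BATCH_SIZE – The maximum batch size for concurrent requests.
OPTION_DTYPE – The data type in which to load the model.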
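The token budget described for OPTION_N_POSITIONS is simple arithmetic, and a small helper can make it explicit before you send a request. The following is a minimal sketch, not part of the SageMaker or Neuron APIs; the function names are hypothetical and it assumes you already know the token count of your prompt.

```python
def max_output_tokens(n_positions: int, input_tokens: int) -> int:
    """Return the largest output length that still fits in the context window."""
    if n_positions <= 0 or input_tokens < 0:
        raise ValueError("n_positions must be positive and input_tokens non-negative")
    if input_tokens > n_positions:
        raise ValueError("the prompt alone exceeds OPTION_N_POSITIONS")
    return n_positions - input_tokens


def fits_context(n_positions: int, input_tokens: int, output_tokens: int) -> bool:
    """Check that input plus output tokens stay within OPTION_N_POSITIONS."""
    return input_tokens + output_tokens <= n_positions


if __name__ == "__main__":
    # The example from the text: a model compiled with OPTION_N_POSITIONS=512.
    print(max_output_tokens(512, 128))   # 384
    print(fits_context(512, 128, 384))   # True
    print(fits_context(512, 256, 512))   # False: 768 tokens exceed 512
```

A check like this could be used to cap max_new_tokens on the client side so that a request never asks for more tokens than the compiled graph supports.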

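Compilation of the Neuron graph depends on the context length (OPTION_N_POSITIONS), tensor parallel degree (OPTION_TENSOR_PARALLEL_DEGREE), maximum batch size (OPTION_MAX_ROLLING_BATCH_SIZE), and data type (OPTION_DTYPE). To avoid runtime compilation, SageMaker JumpStart has pre-compiled Neuron graphs for a variety of configurations of these parameters, which are listed in the following table. As long as the environment variables match one of these configurations, compilation of the Neuron graph is skipped.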

Llama-2 7B and Llama-2 7B Chat
Instance type	OPTION_N_POSITIONS	OPTION_MAX_ROLLING_BATCH_SIZE	OPTION_TENSOR_PARALLEL_DEGREE	OPTION_DTYPE
ml.inf2.xlarge	1024	1	2	fp16
ml.inf2.8xlarge	2048	1	2	fp16
ml.inf2.24xlarge	4096	4	4	fp16
ml.inf2.24xlarge	4096	4	8	fp16
ml.inf2.24xlarge	4096	4	12	fp16
ml.inf2.48xlarge	4096	4	4	fp16
ml.inf2.48xlarge	4096	4	8	fp16
ml.inf2.48xlarge	4096	4	12	fp16
ml.inf2.48xlarge	4096	4	24	fp16
Llama-2 13B and Llama-2 13B Chat
ml.inf2.8xlarge	1024	1	2	fp16
ml.inf2.24xlarge	2048	4	4	fp16
ml.inf2.24xlarge	4096	4	8	fp16
ml.inf2.24xlarge	4096	4	12	fp16
ml.inf2.48xlarge	2048	4	4	fp16
ml.inf2.48xlarge	4096	4	8	fp16
ml.inf2.48xlarge	4096	4	12	fp16
ml.inf2.48xlarge	4096	4	24	fp16
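To check whether a given set of environment variables falls into one of the pre-compiled configurations, the table can be encoded as data and looked up. The sketch below copies only the Llama-2 13B rows from the table above; the NeuronConfig type and the is_precompiled helper are hypothetical names for illustration and are not part of the SageMaker SDK.

```python
from typing import Dict, NamedTuple


class NeuronConfig(NamedTuple):
    instance_type: str
    n_positions: int
    max_rolling_batch_size: int
    tensor_parallel_degree: int
    dtype: str


# Rows for Llama-2 13B and Llama-2 13B Chat, copied from the table above.
PRECOMPILED_13B = {
    NeuronConfig("ml.inf2.8xlarge", 1024, 1, 2, "fp16"),
    NeuronConfig("ml.inf2.24xlarge", 2048, 4, 4, "fp16"),
    NeuronConfig("ml.inf2.24xlarge", 4096, 4, 8, "fp16"),
    NeuronConfig("ml.inf2.24xlarge", 4096, 4, 12, "fp16"),
    NeuronConfig("ml.inf2.48xlarge", 2048, 4, 4, "fp16"),
    NeuronConfig("ml.inf2.48xlarge", 4096, 4, 8, "fp16"),
    NeuronConfig("ml.inf2.48xlarge", 4096, 4, 12, "fp16"),
    NeuronConfig("ml.inf2.48xlarge", 4096, 4, 24, "fp16"),
}


def is_precompiled(instance_type: str, env: Dict[str, str]) -> bool:
    """Return True if the env settings match a pre-compiled 13B row."""
    try:
        candidate = NeuronConfig(
            instance_type,
            int(env["OPTION_N_POSITIONS"]),
            int(env["OPTION_MAX_ROLLING_BATCH_SIZE"]),
            int(env["OPTION_TENSOR_PARALLEL_DEGREE"]),
            env["OPTION_DTYPE"],
        )
    except (KeyError, ValueError):
        # A missing or non-numeric setting cannot match any row.
        return False
    return candidate in PRECOMPILED_13B


if __name__ == "__main__":
    env = {
        "OPTION_DTYPE": "fp16",
        "OPTION_N_POSITIONS": "4096",
        "OPTION_TENSOR_PARALLEL_DEGREE": "12",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
    }
    print(is_precompiled("ml.inf2.24xlarge", env))  # True
    print(is_precompiled("ml.inf2.8xlarge", env))   # False
```

A combination that is not in the table is not necessarily invalid; according to the text, it only means the Neuron graph is compiled at runtime instead of being skipped.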

The following is an example of deploying Llama 2 13B with a custom configuration:

from sagemaker.jumpstart.model import JumpStartModel

model_id = "meta-textgenerationneuron-llama-2-13b-f"
model = JumpStartModel(
    model_id=model_id,
    env={
        "OPTION_DTYPE": "fp16",
        "OPTION_N_POSITIONS": "4096",
        "OPTION_TENSOR_PARALLEL_DEGREE": "12",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "4",
    },
    instance_type="ml.inf2.24xlarge",
)
pretrained_predictor = model.deploy(accept_eula=False)  # Set accept_eula=True to accept the EULA and deploy

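Now that the Llama-2-13b model is deployed, you can run inference by invoking the endpoint. The following inference parameters are supported:

max_length – The model generates text until the output length (which includes the input context length) reaches max_length. If specified, it must be a positive integer.
max_new_tokens – The model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
num_beams – The number of beams used in beam search. If specified, it must be an integer greater than or equal to num_return_sequences.
no_repeat_ngram_size – The model ensures that a sequence of words of length no_repeat_ngram_size is not repeated in the output sequence. If specified, it must be a positive integer greater than 1.
temperature – Controls the randomness of the output. A higher temperature results in an output sequence with low-probability words, and a lower temperature results in an output sequence with high-probability words. If temperature is 0, decoding is greedy. If specified, it must be a positive float.
early_stopping – If True, text generation finishes when all beam hypotheses reach the end-of-sentence token. If specified, it must be a Boolean.
do_sample – If True, the model samples the next word according to its likelihood. If specified, it must be a Boolean.
top_k – In each step of text generation, the model samples from only the top_k most likely words. If specified, it must be a positive integer.
top_p – In each step of text generation, the model samples from the smallest possible set of words with cumulative probability top_p. If specified, it must be a float between 0 and 1.
stop – If specified, it must be a list of strings. Text generation stops if any one of the specified strings is generated.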
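The constraints listed above can be checked on the client before a request is sent, which gives clearer errors than a failed endpoint call. The following is a minimal sketch under stated assumptions: validate_parameters is a hypothetical helper, not part of the SageMaker SDK, and it accepts a temperature of 0 because the list describes that value as selecting greedy decoding.

```python
from numbers import Real
from typing import Any, Dict, List


def _is_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _is_number(value: Any) -> bool:
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_parameters(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an inference parameter dict."""
    errors: List[str] = []

    for key in ("max_length", "max_new_tokens", "top_k"):
        if key in params and not (_is_int(params[key]) and params[key] > 0):
            errors.append(f"{key} must be a positive integer")

    if "no_repeat_ngram_size" in params:
        value = params["no_repeat_ngram_size"]
        if not (_is_int(value) and value > 1):
            errors.append("no_repeat_ngram_size must be an integer greater than 1")

    if "num_beams" in params:
        beams = params["num_beams"]
        sequences = params.get("num_return_sequences")
        if not _is_int(beams):
            errors.append("num_beams must be an integer")
        elif sequences is not None and (not _is_int(sequences) or beams < sequences):
            errors.append("num_beams must be >= num_return_sequences")

    if "temperature" in params:
        value = params["temperature"]
        if not (_is_number(value) and value >= 0):
            errors.append("temperature must be a non-negative number")

    if "top_p" in params:
        value = params["top_p"]
        if not (_is_number(value) and 0 <= value <= 1):
            errors.append("top_p must be a number between 0 and 1")

    for key in ("early_stopping", "do_sample"):
        if key in params and not isinstance(params[key], bool):
            errors.append(f"{key} must be a Boolean")

    if "stop" in params:
        value = params["stop"]
        if not (isinstance(value, list) and all(isinstance(s, str) for s in value)):
            errors.append("stop must be a list of strings")

    return errors


if __name__ == "__main__":
    print(validate_parameters({"max_new_tokens": 64, "top_p": 0.9, "temperature": 0.6}))  # []
    print(validate_parameters({"top_p": 1.5, "stop": "###"}))
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in a payload at once.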

The following code shows an example inference request:

payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {
        "max_new_tokens": 64,
        "top_p": 0.9,
        "temperature": 0.6,
    },
}

response = pretrained_predictor.predict(payload)

We get the following output:

I believe the meaning of life is
> to be happy. I believe that happiness is a choice. I believe that happiness is a state of mind. I believe that happiness is a state of being. I believe that
happiness is a state of being. I believe that happiness is a state of being. I
believe that happiness is a state of being. I believe

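For more information on the parameters, see the linked parameter reference.

Fine-tune Llama 2 models on Trainium instances using the SageMaker Studio UI and the SageMaker Python SDK

Generative AI foundation models have become a primary focus in machine learning (ML) and artificial intelligence (AI). However, their broad generalization can fall short in specific domains, such as healthcare or financial services, where unique datasets are involved. This limitation highlights the need to fine-tune generative AI models with domain-specific data to improve their performance in those areas.

Now that we have deployed the pre-trained Llama 2 model, let's look at how to fine-tune it on domain-specific data to increase accuracy, improve prompt completions, and adapt the model to your specific business use case and data. You can fine-tune the model using either the SageMaker Studio UI or the SageMaker Python SDK. We discuss both methods in this section.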

Fine-tune the Llama-2-13b Neuron model with SageMaker Studio

In SageMaker Studio, navigate to the Llama-2-13b Neuron model. On the Deploy tab, you can point to the Amazon Simple Storage Service (Amazon S3) bucket containing the training and validation datasets for fine-tuning. In addition, you can configure the deployment configuration, hyperparameters, and security settings for fine-tuning. Then choose Train to start the training job on a SageMaker ML instance.

To use Llama 2 models, you need to accept the EULA and AUP. They are displayed when you choose Train. Choose I have read and accept EULA and AUP to start the fine-tuning job.

You can view the status of the training job for the fine-tuned model on the SageMaker console by choosing Training jobs in the navigation pane.

You can either fine-tune your Llama 2 Neuron model using this no-code example, or fine-tune it with the Python SDK, as demonstrated in the next section.

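Fine-tune the Llama-2-13b Neuron model via the SageMaker Python SDK

You can fine-tune on a dataset in either the domain adaptation format or the instruction-based fine-tuning format. The training data must be formatted as follows before fine-tuning:

Input – A train directory containing a file in either JSON lines (.jsonl) or text (.txt) format.
- For a JSON lines (.jsonl) file, each line is a separate JSON object. Each JSON object should be structured as a key-value pair, where the key is text and the value is the content of one training example.
- The number of files in the train directory must equal 1.
Output – A trained model that can be deployed for inference.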
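The input rules above (exactly one file, .jsonl or .txt, and a text key on every JSON line) can be checked locally before uploading data. The following is a minimal sketch using only the Python standard library; write_jsonl and validate_train_dir are hypothetical helper names, and the sample sentences are made up for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def write_jsonl(examples: Iterable[str], path: Path) -> None:
    """Write one {"text": ...} JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in examples:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def validate_train_dir(train_dir: Path) -> Path:
    """Check the train directory against the rules above and return the file."""
    files = [p for p in Path(train_dir).iterdir() if p.is_file()]
    if len(files) != 1:
        raise ValueError(f"expected exactly 1 file in {train_dir}, found {len(files)}")
    path = files[0]
    if path.suffix not in {".jsonl", ".txt"}:
        raise ValueError(f"unsupported file type: {path.suffix}")
    if path.suffix == ".jsonl":
        with path.open(encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                if not line.strip():
                    continue  # tolerate blank lines
                record = json.loads(line)
                if not isinstance(record, dict) or "text" not in record:
                    raise ValueError(f"line {line_no} has no 'text' key")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        train_dir = Path(tmp)
        write_jsonl(["First training example.", "Second training example."],
                    train_dir / "train.jsonl")
        print(validate_train_dir(train_dir).name)  # train.jsonl
```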

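In this example, we use a subset of the Dolly dataset in the instruction-based fine-tuning format. The Dolly dataset contains approximately 15,000 instruction-following records across categories such as question answering, summarization, and information extraction. It is available under the Apache 2.0 license. We use the information_extraction examples for fine-tuning.

1. Load the Dolly dataset and split it into train (for fine-tuning) and test (for evaluation):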

from datasets import load_dataset

dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

task = "information_extraction"
# To train for summarization or closed question answering instead, change the
# filter on the next line to example["category"] == "summarization" or "closed_qa".
summarization_dataset = dolly_dataset.filter(lambda example: example["category"] == task)
summarization_dataset = summarization_dataset.remove_columns("category")

# Split the dataset in two; the test split is used for evaluation at the end.
train_and_test_dataset = summarization_dataset.train_test_split(test_size=0.1)

# Dump the training data to a local file to be used for training.
train_and_test_dataset["train"].to_json("train.jsonl")

2. Use a prompt template to preprocess the data into the instruction format for the training job:

prompt = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately completes the "
    "request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}"
    "### Response:\n{response}\n\n<s>"
)
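To see what the template produces, it can be applied to a single record with str.format. The sketch below is illustrative only: it restates the template text from this step under the name PROMPT_TEMPLATE, the format_record helper is hypothetical, and the sample record is made up rather than taken from the Dolly dataset. How the JumpStart training script itself consumes the template is not shown in this post.

```python
# Template text restated from the step above; field names match the
# Dolly columns (instruction, context, response).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately completes the "
    "request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}"
    "### Response:\n{response}\n\n<s>"
)


def format_record(record: dict) -> str:
    """Fill the template with one instruction-style record."""
    return PROMPT_TEMPLATE.format(
        instruction=record["instruction"],
        context=record["context"],
        response=record["response"],
    )


if __name__ == "__main__":
    # Made-up record for illustration only.
    sample = {
        "instruction": "List the fruits mentioned in the passage.",
        "context": "Apples and pears are on the table.",
        "response": "Apples, pears",
    }
    print(format_record(sample))
```

Because only the template string is parsed by str.format, braces that appear inside a record's values are inserted verbatim and do not need escaping.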

3. Examine the hyperparameters and override them for your use case:

from sagemaker import hyperparameters

model_id = "meta-textgenerationneuron-llama-2-13b"
model_version = "1.*"

my_hyperparameters = hyperparameters.retrieve_default(
    model_id=model_id, model_version=model_version
)

my_hyperparameters["max_input_length"] = "4096" ## you can increase it up to 4096 for sequence length.
my_hyperparameters["max_steps"] = "25"
my_hyperparameters["learning_rate"] = "0.0001"
print(my_hyperparameters)

hyperparameters.validate(model_id=model_id, model_version=model_version, hyperparameters=my_hyperparameters)

4. Fine-tune the model and start a SageMaker training job. The fine-tuning scripts are based on the neuronx-nemo-megatron repository, which contains versions of the NeMo and Apex packages adapted for use with Neuron and EC2 Trn1 instances. The neuronx-nemo-megatron repository provides 3D (data, tensor, and pipeline) parallelism so that you can fine-tune LLMs at scale. The supported Trainium instances are ml.trn1.32xlarge and ml.trn1n.32xlarge.

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    hyperparameters=my_hyperparameters,
    environment={"accept_eula": "false"}, # please change `accept_eula` to be `true` to accept EULA.
    #instance_type="ml.trn1n.32xlarge", if not specified, default `ml.trn1.32xlarge` will be used.
)

# train_data_location must point to the directory that holds train.jsonl (for example, an S3 URI).
estimator.fit({"train": train_data_location})
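The 3D parallelism mentioned in step 4 splits the available workers along three axes, so the data-parallel degree is whatever remains after the tensor-parallel and pipeline-parallel degrees are chosen. The following is a minimal arithmetic sketch, assuming one worker per NeuronCore and 32 NeuronCores on a trn1.32xlarge instance; the tensor and pipeline degrees used here are hypothetical and are not the values the JumpStart fine-tuning script uses.

```python
def data_parallel_degree(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return the data-parallel degree for a 3D-parallel layout.

    In Megatron-style 3D parallelism:
        world_size = data_parallel * tensor_parallel * pipeline_parallel
    """
    if min(world_size, tensor_parallel, pipeline_parallel) <= 0:
        raise ValueError("all degrees must be positive")
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tensor_parallel * pipeline_parallel")
    return world_size // model_parallel


if __name__ == "__main__":
    # Hypothetical layout: 32 workers, tensor parallel 8, pipeline parallel 1.
    print(data_parallel_degree(32, 8, 1))  # 4
```

A larger tensor-parallel or pipeline-parallel degree lets a bigger model fit, at the cost of fewer data-parallel replicas processing batches concurrently.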

5. Finally, deploy the fine-tuned model to a SageMaker endpoint:

finetuned_predictor = estimator.deploy()

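Compare responses between the pre-trained and fine-tuned Llama 2 Neuron models

The following table compares prompt completions from the pre-trained and fine-tuned Llama-2-13b models. We also provide an example of fine-tuning Llama 2 on the SEC filing dataset in .txt format. For details, see the sample notebook on GitHub.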

Item	Inputs	Ground truth	Response from non-finetuned model	Response from fine-tuned model
1	Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nExtract universities Moret studied at and his year of graduation for each, and put them in a bulleted list.\n\n### Input:\nMoret was born in 1953 in Vevey Switzerland, and did his undergraduate studies at the École Polytechnique Fédérale de Lausanne (EPFL), graduating in 1975. He went on to graduate studies at the University of Tennessee, earning a Ph.D. in 1980. He then joined the faculty of the University of New Mexico, where he remained until 2006, when he moved to EPFL. He retired from EPFL in December 2016.\n\nIn 1996, Moret founded the ACM Journal of Experimental Algorithmics, and he remained editor in chief of the journal until 2003. In 2001, Moret founded the Workshop in Algorithms for Bioinformatics (WABI) and remains on the Steering Committee for the conference.\n\nIn 2018, Moret was elected as a Fellow of the International Society for Computational Biology, for his outstanding contributions to the fields of computational biology and bioinformatics.\n\n### Response:\n	* École Polytechnique Fédérale de Lausanne (EPFL) (graduated 1975)\n* University of Tennessee (graduated 1980)	Moret studied at the École Polytechnique Fédérale de Lausanne (EPFL) and the University of Tennessee, graduating in1975 and1980, respectively.\n\n### Hint:\n\n- Use the `split` function to split the input into a list of sentences.\n- Use the `split` function to split each sentence into a list of words.\n- Use the `split` function to split	– École Polytechnique Fédérale de Lausanne (EPFL),1975\n- University of Tennessee,1980\n- University of New Mexico\n- École Polytechnique Fédérale de Lausanne (EPFL),2006\n\n<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n
2	Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nFrom the passage identify the attributes of Dionysus. Display the results in a comma separated format.\n\n### Input:\nDionysus (Διόνυσος, Diónusos)/Bacchus (Βάκχος, Bákkhos)\n\nGod of wine, fruitfulness, parties, festivals, madness, chaos, drunkenness, vegetation, ecstasy, and the theater. He is the twice-born son of Zeus and Semele, in that Zeus snatched him from his mother’s womb and stitched Dionysus into his own thigh and carried him until he was ready to be born. In art he is depicted as either an older bearded god (particularly before 430 BC) or an effeminate, long-haired youth (particularly after 430 BC). His attributes include the thyrsus, a drinking cup, the grape vine, and a crown of ivy. He is often in the company of his thiasos, a group of attendants including satyrs, maenads, and his old tutor Silenus. The consort of Dionysus was Ariadne. It was once held that Dionysius was a later addition to the Greek pantheon, but the discovery of Linear B tablets confirm his status as a deity from an early period. Bacchus was another name for him in Greek, and came into common usage among the Romans. His sacred animals include dolphins, serpents, tigers, and donkeys.\n\n### Response:\n	the thyrsus, a drinking cup, the grape vine, a crown of ivy	Dionysus (Διόνυσος, Diónusos)\n\nGod of wine, fruitfulness, parties, festivals, madness, chaos, drunkenness, vegetation, ecstasy, and the theater. He is the twice-born son of Zeus and Semele, in that Zeus snatched him from his mother’s womb and stitched Dionysus into his own thigh	thyrsus, a drinking cup, the grape vine, a crown of ivy\n\n<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is the difference between Parthian and Arsacid Empire?\n\n### Input:\nThe Parthian Empire, also known as the Arsacid Empire was a
3	Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhy does the capital of Uzbekistan contain the biggest Greek community?\n\n### Input:\nThe Greeks in Uzbekistan number approximately 9,000. The community is made up of Greeks from Russia who were deported by force from that country to Uzbekistan in the 1940s, and political refugees from Greece. About 30,000 Greeks lived in the country before World War II and a further 11,000 arrived after the Greek Civil War and the defeat of Democratic Army of Greece. Their numbers have dwindled from a high of some 40,000 in the 1960s. The main reason is emigration to Greece after the end of the Cold War when laws allowed the return of all ethnic Greeks who had been exiled for political reasons after the Greek Civil War. The biggest Greek community in the country is in the capital city of Tashkent where most of the Greek political refugees were relocated by the Soviet authorities. In ancient times the south of the country was part of the Hellenistic Seleucid Kingdom but the few Greek communities there have since assimilated. The most important organization representing ethnic Greeks is the Greek Cultural Association of Tashkent. The activities of this organization include Greek language instruction (19 classes with a total of 571 students, classes for adults) and the teaching of Greek dances and music.\n\n### Response:\n	Most of the Greek political refugees in Uzbekistan were relocated by the Soviet authorities to the capital city of Tashkent.	\n“`\nThe capital of Uzbekistan contains the biggest Greek community because the Greeks in Uzbekistan number approximately9,000. The community is made up of Greeks from Russia who were deported by force from that country to Uzbekistan in the1940s, and political refugees from Greece. About30,000 Greeks lived in the country before World War II and a further11,00	The capital of Uzbekistan contains the biggest Greek community because that is where most of the Greek political refugees were relocated by the Soviet authorities after being deported from Greece.\n\n<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is the difference between Parthian and Arsacid Empire?\n\n### Input:\n

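The responses from the fine-tuned model show a significant improvement in precision, relevance, and clarity compared with those from the pre-trained model. In some cases, the pre-trained model alone may not be enough for your use case, and fine-tuning with this technique tailors the solution more closely to your dataset.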
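In the table, the fine-tuned responses continue past the answer with a `<s>` marker followed by the start of a new prompt. One way to keep only the answer is to cut the generated text at the first such marker on the client side. The following is a minimal sketch; trim_completion is a hypothetical helper, and treating `<s>` as the cut point is an assumption based on the outputs shown in the table rather than documented endpoint behavior.

```python
from typing import Sequence


def trim_completion(generated: str, stop_markers: Sequence[str] = ("<s>",)) -> str:
    """Cut generated text at the earliest stop marker and strip trailing whitespace."""
    cut = len(generated)
    for marker in stop_markers:
        index = generated.find(marker)
        if index != -1:
            cut = min(cut, index)
    return generated[:cut].rstrip()


if __name__ == "__main__":
    raw = (
        "thyrsus, a drinking cup, the grape vine, a crown of ivy\n\n"
        "<s> Below is an instruction that describes a task"
    )
    print(trim_completion(raw))
    # thyrsus, a drinking cup, the grape vine, a crown of ivy
```

The stop inference parameter described earlier is an alternative that ends generation on the server when one of the given strings is produced.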

Clean up

After you have completed your training job and no longer need the existing resources, delete them using the following code:

# Delete resources
# Delete the fine-tuned model
finetuned_predictor.delete_model()

# Delete the fine-tuned model endpoint
finetuned_predictor.delete_endpoint()

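Conclusion

Deploying and fine-tuning Llama 2 Neuron models on SageMaker marks a significant advance in managing and optimizing large-scale generative AI models. These models, including variants such as Llama-2-7b and Llama-2-13b, use Neuron for efficient training and inference on AWS Inferentia and AWS Trainium based instances, improving their performance and scalability.

You can deploy these models flexibly and easily through the SageMaker JumpStart UI and the Python SDK. The Neuron SDK, with its support for popular ML frameworks and its high-performance capabilities, enables efficient handling of these large models.

Fine-tuning these models on domain-specific data is crucial for improving their relevance and accuracy in specialized fields. The process, which you can run through the SageMaker Studio UI or the Python SDK, can be customized to your specific needs and leads to better prompt completions and response quality.

The pre-trained versions of these models are powerful, but they may give generic or repetitive responses. Fine-tuning tailors the model to a specific context, resulting in more accurate, relevant, and diverse responses. The effect is especially evident when comparing responses from the pre-trained and fine-tuned models, where the latter shows a clear improvement in output quality and specificity. In conclusion, deploying and fine-tuning Neuron Llama 2 models on SageMaker provides a robust framework for managing advanced AI models and offers significant improvements in performance and applicability, especially when the models are tailored to specific domains or tasks.

Get started today by referring to the sample notebook.

For more information on deploying and fine-tuning pre-trained Llama 2 models on GPU-based instances, refer to Fine-tune Llama 2 for text generation on Amazon SageMaker JumpStart and Llama 2 foundation models from Meta are now available in Amazon SageMaker JumpStart.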

Amazon Web Services Blog