Thursday, May 23, 2024

Fine-Tuning a Japanese LLaMA-based model on Google Colab

Overview
Environment
Training code
Inference code
Dataset
Base model
Tips: GPU memory reduction techniques
Tips: Dealing with "This instance will be ignored in loss calculation. Note, if this happens often, consider increasing the max_seq_length."
Tips: trainer/utils.py:141: UserWarning: Could not find response key
Conclusion
Reference sites

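Overview

I have tried fine-tuning on an M2 Mac mini in various ways, but it was a struggle: memory ran out, and the modest specs made training take an enormous amount of time.
So this time I try fine-tuning a Japanese model on Google Colab, writing the code from scratch.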

Environment

Google Colab (T4 GPU)
- Python 3

Training code

Each code block starts with a comment, so running the code one block at a time is probably the easiest approach.

%%capture
# Install the required libraries (%%capture is a cell magic, so it must be the first line of the cell)
%pip install accelerate peft bitsandbytes transformers trl

# Import the required modules and classes
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# Define the Japanese language model, the dataset, and the name of the fine-tuned model
base_model = "elyza/ELYZA-japanese-Llama-2-7b-instruct"
dataset = "bbz662bbz/databricks-dolly-15k-ja-ojousama"
new_model = "elyza-2-7b-instruct-ojousama"

# Load the dataset
dataset = load_dataset(dataset, split="train")

# Fine-tune with QLoRA; 4-bit quantization makes fine-tuning feasible even on low-spec hardware
compute_dtype = getattr(torch, "float16")

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)

# Download and load the Japanese language model; passing quantization_config enables 4-bit quantization (QLoRA)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load the tokenizer that belongs to the model
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# LoRA parameters
peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# Hyperparameters for training
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    # Enable gradient accumulation (memory reduction technique 1)
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    # Enable gradient checkpointing (memory reduction technique 2)
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)

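# Format the training data
# The dataset is JSON with fields such as the question (instruction), the answer (output), and an optional input (input)
# Here only the question and the answer are used to build the training text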
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

# Prepare the delimiter that tells the collator where the answer part begins during training
response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

# Create the trainer object that runs the training
# GPU memory usage rises at this point
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_params,
    formatting_func=formatting_prompts_func,
    data_collator=collator,
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)

# Start training
# GPU memory usage rises here as well, but the settings are tuned to keep it from exceeding 10 GB, so training can start
trainer.train()

With these hyperparameters, training took roughly X hours.
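
To make the training text concrete, the sketch below applies the same formatting as formatting_prompts_func to a toy batch and splits each text at the response template. The sample row is made up for illustration, and format_batch and split_at_response are hypothetical helpers that are not part of trl; the real collator works on token ids rather than on strings.

```python
RESPONSE_TEMPLATE = " ### Answer:"


def format_batch(batch):
    """Build training texts the same way formatting_prompts_func does."""
    return [
        f"### Question: {question}\n ### Answer: {answer}"
        for question, answer in zip(batch["instruction"], batch["output"])
    ]


def split_at_response(text, template=RESPONSE_TEMPLATE):
    """Hypothetical helper: return (prompt part, completion part)."""
    head, sep, tail = text.partition(template)
    if not sep:
        raise ValueError("response template not found in text")
    return head + sep, tail


# Toy data (an assumption, not taken from the real dataset)
toy_batch = {
    "instruction": ["What is Python?"],
    "output": ["It is a programming language."],
}

for text in format_batch(toy_batch):
    prompt_part, completion_part = split_at_response(text)
    print(repr(prompt_part))
    print(repr(completion_part))
```

With DataCollatorForCompletionOnlyLM, only the part after the response template contributes to the loss, so a text in which the template cannot be located gives the trainer nothing to learn from.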

Inference code

prompt = "プログラミングについてはどうお考えですの？"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])
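
The inference code above prompts the model in the [INST] format, while the fine-tuning text used the "### Question / ### Answer" layout. As a suggestion that the original post did not test, it may be worth comparing the output with a prompt that mirrors the training layout. build_training_style_prompt is a hypothetical helper introduced only for this sketch.

```python
def build_training_style_prompt(question: str) -> str:
    """Hypothetical helper: mirror the layout used by the training texts."""
    return f"### Question: {question}\n ### Answer:"


prompt = build_training_style_prompt("プログラミングについてはどうお考えですの？")
print(prompt)

# Assumption: with the pipeline object from the inference code, this prompt
# could be passed as pipe(prompt) to compare against the [INST] variant.
```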

Dataset

https://huggingface.co/datasets/bbz662bbz/databricks-dolly-15k-ja-ojousama

Base model

https://huggingface.co/elyza/ELYZA-japanese-Llama-2-7b-instruct

Tips: GPU memory reduction techniques

https://qiita.com/masashi-ai/items/df9b81908d7d7639b7fb
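
As a rough illustration of what gradient accumulation does with the settings in the training code: only 4 examples are held in GPU memory per forward and backward pass, while the weights are updated once every 16 such passes. The sketch below only does the arithmetic. NUM_EXAMPLES is a placeholder assumption, not the measured size of the dataset, and the step count is approximate because the Trainer may round differently.

```python
import math


def effective_batch_size(per_device_batch_size, accumulation_steps, num_devices=1):
    """Number of examples that contribute to one optimizer update."""
    return per_device_batch_size * accumulation_steps * num_devices


def approx_optimizer_steps(num_examples, per_device_batch_size, accumulation_steps, epochs=1):
    """Approximate number of optimizer updates for the whole run."""
    batch = effective_batch_size(per_device_batch_size, accumulation_steps)
    return math.ceil(num_examples / batch) * epochs


NUM_EXAMPLES = 15_000  # assumption: placeholder dataset size

print(effective_batch_size(4, 16))
print(approx_optimizer_steps(NUM_EXAMPLES, 4, 16))
```

Lowering per_device_train_batch_size and raising gradient_accumulation_steps by the same factor keeps the effective batch size unchanged while reducing peak memory, at the cost of more passes per update.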

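Tips: Dealing with "This instance will be ignored in loss calculation. Note, if this happens often, consider increasing the max_seq_length."

The default value of max_seq_length in SFTTrainer is 1024. Note that this limit is counted in tokens, not characters.
If the training data contains texts longer than 1024 tokens, it may be worth increasing the value a little.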
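
Before raising max_seq_length, it can help to check how many training texts actually exceed the limit. The following is a minimal sketch: count_over_limit is a hypothetical helper, and toy_tokenize is a whitespace stand-in so that the example runs without downloading a model. With the real tokenizer, a function such as `lambda t: tokenizer(t)["input_ids"]` could be passed instead.

```python
def count_over_limit(texts, tokenize, max_seq_length=1024):
    """Return (number of texts longer than the limit, longest length in tokens)."""
    lengths = [len(tokenize(text)) for text in texts]
    over = sum(1 for n in lengths if n > max_seq_length)
    return over, max(lengths, default=0)


def toy_tokenize(text):
    """Stand-in tokenizer (assumption): splits on whitespace."""
    return text.split()


texts = ["short example", "a " * 1500]
over, longest = count_over_limit(texts, toy_tokenize)
print(over, longest)
```

If only a handful of texts are over the limit, leaving the value alone and accepting that those examples are skipped may be cheaper in memory than increasing it.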

Tips: trainer/utils.py:141: UserWarning: Could not find response key

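This warning means the training data is not being picked up correctly: the collator cannot find the response template in an example.
The tokenizer may be the cause, but changing response_template or related settings may make it work.

Reference: https://github.com/huggingface/trl/issues/588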
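
To picture why the tokenizer can matter here: the collator searches for the token ids of the response template inside the token ids of each example. If the tokenizer encodes the template differently on its own than in the middle of a text, the search fails even though the string is visibly present. The sketch below illustrates that with made-up token ids; find_subsequence is a hypothetical helper, not the trl implementation.

```python
def find_subsequence(haystack, needle):
    """Return the index where needle starts in haystack, or -1 if absent."""
    n = len(needle)
    if n == 0:
        return -1
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start
    return -1


# Toy token ids (assumption: invented numbers, not real tokenizer output)
input_ids = [1, 10, 11, 12, 50, 13, 10, 20, 12, 60]
template_ids_in_context = [10, 20, 12]      # template as encoded inside the text
template_ids_standalone = [99, 10, 20, 12]  # template encoded alone, extra leading token

print(find_subsequence(input_ids, template_ids_in_context))
print(find_subsequence(input_ids, template_ids_standalone))
```

DataCollatorForCompletionOnlyLM also accepts a list of token ids as response_template, so one thing to try is encoding the template together with its surrounding context and passing the matching slice of ids.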

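Conclusion

I fine-tuned a LLaMA-based model pretrained on Japanese on Google Colab.
Inference with the resulting model works as well, so even people who only have a low-spec machine should be able to fine-tune a fairly large language model this way.

However, the free tier of Google Colab allows only a short usage time per day, so the resources may be cut off before training finishes.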

hawksnowlog