
Usage of LoadBestPeftModelCallback in Finetuning stage #136

Open
@ttssp

Description


Hi friends,

I was trying to test the finetune/finetune.py script. It seems that state.best_model_checkpoint always returns None, which causes a failure at the end of the program. Does this mean the trainer never saved a "best model" during training? I am fairly new to this; could anyone explain what is happening and offer some hints on solving it? Thanks a lot!
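For context, here is a minimal sketch of the failure mode (the `State` class and `best_adapter_path` helper are hypothetical stand-ins, not the actual transformers/finetune.py code): `TrainerState.best_model_checkpoint` is only populated when the trainer evaluates and saves a checkpoint during training, so with `--max_steps 2` and evaluation/save intervals larger than 2 steps it stays `None`, and anything that joins a path onto it will crash.

```python
import os

class State:
    """Hypothetical stand-in for transformers.TrainerState; only the
    best_model_checkpoint attribute matters for this illustration."""
    def __init__(self, best_model_checkpoint=None):
        self.best_model_checkpoint = best_model_checkpoint

def best_adapter_path(state):
    """Return the adapter path a load-best-model callback would try to open,
    or None when no best checkpoint was ever recorded during training."""
    if state.best_model_checkpoint is None:
        # No evaluation/save ever happened, so there is no "best" checkpoint.
        return None
    return os.path.join(state.best_model_checkpoint, "adapter_model.bin")

# With max_steps=2 the trainer never reaches an eval/save step:
print(best_adapter_path(State()))  # → None

# Once at least one evaluated checkpoint exists, a path is returned:
print(best_adapter_path(State("./checkpoints/checkpoint-100")))
```

Guarding like this (or raising the step count so at least one evaluation and save actually happen before training ends) would avoid the `None` crash, assuming the callback's logic matches the sketch above.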

command (single GPU):

python finetune/finetune.py --model_path="../../models/starcoder/" --dataset_name="../../datasets/ArmelR/stack-exchange-instruction" --subset="data/finetune" --split="train" --size_valid_set 10 --streaming --seq_length 2048 --max_steps 2 --batch_size 1 --input_column_name="question" --output_column_name="response" --gradient_accumulation_steps 1 --learning_rate 1e-4 --lr_scheduler_type="cosine" --num_warmup_steps 1 --weight_decay 0.05 --output_dir="./checkpoints"

error image (single GPU): (screenshot attachment)

command (multi GPU):
python -m torch.distributed.launch --nproc_per_node 4 finetune/finetune.py --model_path="../../models/starcoder/" --dataset_name="../../datasets/ArmelR/stack-exchange-instruction" --subset="data/finetune" --split="train" --size_valid_set 10000 --streaming --seq_length 2048 --max_steps 2 --batch_size 1 --input_column_name="question" --output_column_name="response" --gradient_accumulation_steps 16 --learning_rate 1e-4 --lr_scheduler_type="cosine" --num_warmup_steps 100 --weight_decay 0.05 --output_dir="./checkpoints"

error image (multi GPU): (screenshot attachment)
