Hi friends,
I was trying to test the finetune/finetune.py script. It seems that state.best_model_checkpoint always returns None, which causes a failure at the end of the program. Does this mean the program never saved a "best model" during training? I am fairly new to this; could anyone explain what is happening and offer some hints on fixing it? Thanks a lot!
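My understanding (an assumption, based on how the transformers Trainer generally works) is that state.best_model_checkpoint is only populated when the Trainer runs an evaluation and saves a checkpoint during training; with --max_steps 2 and the default eval/save intervals, neither happens, so the attribute is still None at the end. A minimal sketch of a defensive fallback, using a stand-in class rather than the real transformers.TrainerState:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainerState:
    # Minimal stand-in for transformers.TrainerState; only the field
    # relevant to this issue is modeled here.
    best_model_checkpoint: Optional[str] = None

def resolve_best_checkpoint(state: TrainerState, output_dir: str) -> str:
    """Return the recorded best checkpoint, or fall back to output_dir
    when no eval/save cycle ever ran and the attribute is still None."""
    if state.best_model_checkpoint is not None:
        return state.best_model_checkpoint
    return output_dir
```

With the settings from the commands below, resolve_best_checkpoint(TrainerState(), "./checkpoints") would take the fallback branch, which is exactly the situation the final lines of finetune.py seem not to guard against.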
Command (single GPU):
python finetune/finetune.py --model_path="../../models/starcoder/" --dataset_name="../../datasets/ArmelR/stack-exchange-instruction" --subset="data/finetune" --split="train" --size_valid_set 10 --streaming --seq_length 2048 --max_steps 2 --batch_size 1 --input_column_name="question" --output_column_name="response" --gradient_accumulation_steps 1 --learning_rate 1e-4 --lr_scheduler_type="cosine" --num_warmup_steps 1 --weight_decay 0.05 --output_dir="./checkpoints"
Error image (single GPU):
Command (multi GPU):
python -m torch.distributed.launch --nproc_per_node 4 finetune/finetune.py --model_path="../../models/starcoder/" --dataset_name="../../datasets/ArmelR/stack-exchange-instruction" --subset="data/finetune" --split="train" --size_valid_set 10000 --streaming --seq_length 2048 --max_steps 2 --batch_size 1 --input_column_name="question" --output_column_name="response" --gradient_accumulation_steps 16 --learning_rate 1e-4 --lr_scheduler_type="cosine" --num_warmup_steps 100 --weight_decay 0.05 --output_dir="./checkpoints"
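For what it's worth, a likely workaround (assuming finetune.py builds a standard transformers.TrainingArguments internally; whether it exposes all of these as CLI flags is an assumption) is to make sure evaluation and checkpointing actually run during training, so the Trainer has a chance to record a best model. A hedged sketch of the relevant settings:

```python
from transformers import TrainingArguments

# Sketch only: argument names are from the transformers library, but the
# concrete values (eval/save every 100 steps, loss as the metric) are
# illustrative, not taken from finetune.py.
args = TrainingArguments(
    output_dir="./checkpoints",
    max_steps=1000,                # enough steps for at least one eval/save
    evaluation_strategy="steps",
    eval_steps=100,                # evaluate every 100 steps
    save_strategy="steps",
    save_steps=100,                # should align with eval_steps
    load_best_model_at_end=True,   # requires both eval and save to run
    metric_for_best_model="loss",
)
```

With max_steps=2, as in the commands above, no evaluation or save step is ever reached, which would explain why best_model_checkpoint is never set.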
Error image (multi GPU):