loss does not decrease #120
Comments
Are these pretrain-stage losses? For the Phi-2 LLM, the final loss in the pretrain stage should come down to about 2.5. From the screenshot you gave above, the grad-norm is 0, which indicates the network is not learning: the gradients are zero and the parameters are no longer being updated. Did you change any hyperparameters in pretrain.sh/finetune.sh?
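(Note for readers hitting the same symptom: a grad-norm of 0 means no gradient signal is reaching the trainable parameters. A quick, generic PyTorch check, independent of the TinyLLaVA trainer, is to run a single forward/backward pass by hand and inspect per-parameter gradient norms. `model` and `batch` below are placeholders, not TinyLLaVA APIs, and the sketch assumes a HuggingFace-style model whose output exposes `.loss`.)

```python
import torch

def check_grad_norms(model: torch.nn.Module, batch: dict) -> None:
    """Run one forward/backward pass and report which trainable
    parameters actually receive a non-zero gradient."""
    model.train()
    model.zero_grad()
    loss = model(**batch).loss  # assumes a HF-style output with a .loss field
    loss.backward()
    total_sq = 0.0
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        norm = 0.0 if param.grad is None else param.grad.norm().item()
        total_sq += norm ** 2
        if norm == 0.0:
            print(f"zero gradient: {name}")
    print(f"total grad norm: {total_sq ** 0.5:.6f}")
```

If every trainable parameter shows a zero gradient, the cause is usually upstream of the learning rate (frozen modules, a detached graph, or fp16 loss-scale underflow).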
This is the loss in the fine-tuning phase. I only changed the batch_size among the hyperparameters. I used four 3090 GPUs and don't know where the problem lies. The launch command is: deepspeed --include localhost:0,1,2,3 --master_port 29501 tinyllava/train/train.py
Hi. After pretraining, the initial loss in the finetune stage should start from about 2.5. It seems the problem comes from the pretraining stage. Please provide your params in pretrain.sh, and also check whether the final loss in your pretrain stage decreased to about 2.5.
Thank you for your reply. The pre-training parameter settings are as follows; the pre-training loss also stays around 5 and has not decreased. deepspeed --include localhost:0,1,2,3 --master_port 29502 tinyllava/train/train.py
Hi, your learning rate in the pretrain stage is too large; please set learning_rate to 1e-3. Also, are you sure per_device_train_batch_size can be set to 32? I also ran your scripts on a machine with 4 3090 GPUs, and I had to decrease per_device_train_batch_size to 16 and increase gradient_accumulation_steps to 4 to avoid OOM.
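(For reference, the point of trading per_device_train_batch_size for gradient_accumulation_steps is to keep the effective global batch size, and therefore the tuned learning-rate schedule, unchanged while fitting into 3090 memory. Here is a minimal sketch of the arithmetic using only the values mentioned in this thread; the per_device=4 line is just an illustration of how to compensate if you must go that low.)

```python
# Effective (global) batch size = per-device batch * gradient accumulation steps * number of GPUs.
# Keeping this product constant preserves the batch size the schedule was tuned for,
# while a smaller per-device batch keeps each GPU within its memory budget.
def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    return per_device * grad_accum * num_gpus

# Values suggested above for 4x RTX 3090:
print(effective_batch_size(per_device=16, grad_accum=4, num_gpus=4))  # 256
# Illustration only: if per_device_train_batch_size must drop to 4 to avoid OOM,
# raising gradient_accumulation_steps to 16 keeps the same global batch size.
print(effective_batch_size(per_device=4, grad_accum=16, num_gpus=4))  # 256
```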
Thank you very much for your reply. per_device_train_batch_size has to be set to 4 when running on a machine with 4 3090 GPUs, otherwise it will OOM. I'll try again with a different learning rate. Thank you!
Hello, thank you very much for sharing your work. In the TinyLLaVA_Factory-main directory, I ran bash ./scripts/train/train_phi.sh and found a problem: the loss stays around 5 and does not decrease, and the final fine-tuned model performs poorly. The TextVQA evaluation result is 7.85. It's very strange, and I don't know what went wrong. The executed script and results are shown in the figure below. Looking forward to your reply.