Loss value is NaN #19

Open
liuqingli opened this issue Apr 19, 2023 · 3 comments

Comments

@liuqingli

liuqingli commented Apr 19, 2023

I am trying to fine-tune the model on a V100 with an older version of torch (1.13.0).

  1. After removing "--use_compile" and setting "precision" to "16-mixed", I got a [nan.0] loss value at every training step.
  2. When I set "precision" to "32-true", I do get loss values. However, I do not see convergence after the first epoch.

All the other settings are the same as the readme file. Could anyone give me some suggestions on this? Thanks very much!
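
For anyone reproducing the NaN report above, a minimal sketch for locating the first step at which the loss stops being finite. It assumes the per-step loss values have already been collected into a plain Python list; the function name `first_non_finite_step` is hypothetical and not part of this repository.

```python
import math

def first_non_finite_step(losses):
    """Return the index of the first NaN or inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None

# Hypothetical logged values: the loss becomes NaN at step 1.
print(first_non_finite_step([2.1, float("nan"), 1.9]))
```

If the very first step is already non-finite, as described in point 1, the problem is more likely in the numeric format than in the optimization itself.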

@chiayewken
Collaborator

Hi, the T5 model was pretrained in bf16, so it may not work well with fp16.
Could you try removing the "precision" line and training again? This should lead to fp32 training by default.
Could you also explain more about the non-convergence when using precision="32-true"?
During normal training, the loss may vary somewhat, but it should still stay in the 0-4 range.
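
As a rough illustration of the range gap between the two 16-bit formats (one commonly cited reason bf16-pretrained weights can produce inf/NaN under fp16), the sketch below uses only Python's standard library. fp16 tops out at 65504, while bf16 keeps float32's 8-bit exponent at the cost of precision. The helper names `fits_in_fp16` and `to_bf16` are made up for this example, and the bf16 emulation truncates instead of rounding, so it is an approximation and not how any particular framework casts.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def fits_in_fp16(x: float) -> bool:
    """Return True if x can be stored as a finite fp16 value."""
    try:
        struct.pack("<e", x)  # "e" is the half-precision format code
        return True
    except OverflowError:
        return False

def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping the top 16 bits of the float32 encoding.

    Assumption: plain truncation, no rounding.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

value = 70000.0  # hypothetical activation magnitude above the fp16 range
print(fits_in_fp16(FP16_MAX))  # still representable in fp16
print(fits_in_fp16(value))     # overflows fp16
print(to_bf16(value))          # finite in bf16, but with reduced precision
```

bf16 hardware support generally starts with Ampere-class GPUs, so on a V100 the fp32 default suggested above is the safer fallback.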

@liuqingli
Author

Thank you, Chia Yew!
I will try to remove the "precision" line to see what will happen.
Regarding non-convergence, I notice the loss stays between 0 and 4 and varies case by case. However, I could not see convergence after 3 epochs. I am not sure whether this is related to the precision issue above.

@chiayewken
Collaborator

It is a bit hard to judge from the loss alone. You can try inference with the model checkpoint to make sure that training was successful: https://github.com/declare-lab/flan-alpaca#inference
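
Alongside the inference check, a smoothed view of the per-step loss can make a slow trend easier to see than the raw, noisy values. A minimal sketch, assuming the losses are available as a list of floats; `moving_average` is a hypothetical helper, not something provided by this repository, and the window size is arbitrary.

```python
def moving_average(values, window):
    """Trailing moving average; early entries average over fewer points."""
    if window <= 0:
        raise ValueError("window must be positive")
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed

# Hypothetical noisy losses inside the 0-4 range mentioned earlier.
print(moving_average([4.0, 2.0, 3.0, 1.0], window=2))
```

A flat smoothed curve does not by itself prove that training failed, which is why checking generations from the checkpoint remains the more direct test.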
