Loss value is NaN #19

liuqingli · 2023-04-19T18:03:04Z

I am trying to finetune the model on a V100 with an older version of torch (1.13.0).

After removing "--use_compile" and updating "precision" to "16-mixed", I got a [nan.0] value at each step during training.
When I set "precision" to "32-true" instead, I get finite loss values. However, I do not see any convergence after the first epoch.

All other settings are the same as in the README. Could anyone give me some suggestions on this? Thanks very much!

chiayewken · 2023-04-20T15:18:41Z

Hi, the T5 model was pretrained in bf16, so it may not work well with fp16.
Could you try removing the "precision" line and try training again? This should lead to fp32 training by default.
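As a rough sketch of how the precision could be picked per GPU rather than hard-coded: bf16 is natively supported from compute capability 8.0 (Ampere) onward, while the V100 is 7.0. The helper names below are hypothetical, and it is an assumption that the "precision" value is passed through to a Lightning-style trainer that accepts the "bf16-mixed" and "32-true" strings.

```python
from typing import Optional


def choose_precision(compute_capability_major: Optional[int]) -> str:
    """Return "bf16-mixed" on GPUs with native bf16 (major >= 8), else "32-true".

    fp16 ("16-mixed") is deliberately never returned, since this thread
    reports NaN losses with it.
    """
    if compute_capability_major is not None and compute_capability_major >= 8:
        return "bf16-mixed"
    return "32-true"


def detect_compute_capability_major() -> Optional[int]:
    """Best-effort lookup of the current GPU's major compute capability."""
    try:
        import torch
    except ImportError:
        return None  # torch not installed: fall back to fp32
    if not torch.cuda.is_available():
        return None  # no GPU visible: fall back to fp32
    major, _minor = torch.cuda.get_device_capability()
    return major


precision = choose_precision(detect_compute_capability_major())
print(precision)
```

On a V100 this would select "32-true", which matches the fp32 default described above.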
Could you also explain more about the non-convergence when using precision="32-true"?
During normal training, the loss may fluctuate, but it should stay within the 0-4 range.
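To illustrate why a bf16-pretrained model can produce NaN under fp16, here is a minimal standard-library sketch of the range difference between the two formats. fp16 has a largest finite value of 65504, while bf16 keeps the 8-bit exponent of fp32 and trades away mantissa bits instead. The activation value of 70000.0 is a made-up number for illustration, not something measured from this model, and overflow is only one plausible cause of the NaN reported here.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE half-precision value


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision; values out of range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack out-of-range values; hardware would give inf.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Truncate x to bfloat16 by keeping the top 16 bits of its float32 form."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


activation = 70000.0  # hypothetical magnitude, just above FP16_MAX

fp16_value = to_fp16(activation)     # overflows to inf
bf16_value = to_bf16(activation)     # stays finite, with reduced precision
fp16_diff = fp16_value - fp16_value  # inf - inf is NaN

print("fp16:", fp16_value, "bf16:", bf16_value, "fp16 diff:", fp16_diff)
```

Once a single intermediate value becomes inf, operations such as inf - inf yield NaN, and the NaN then propagates into the loss.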

liuqingli · 2023-04-25T00:51:34Z

Thank you, Chia Yew!
I will try removing the "precision" line to see what happens.
Regarding the non-convergence: the loss stays between 0 and 4 and varies from case to case. However, I do not see it converge after 3 epochs. I am not sure whether this is related to the precision issue above.

chiayewken · 2023-04-28T10:03:27Z

It is a bit hard to judge from the loss alone. You can try inference with the model checkpoint to check that the training was successful: https://github.com/declare-lab/flan-alpaca#inference
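If you still want a rough look at the loss trend, one common approach is to smooth the noisy per-step values with an exponential moving average. This is only a sketch: the `ema` helper and the loss values below are made up for illustration, and a flat or falling smoothed curve is not a substitute for checking the checkpoint with inference.

```python
from typing import List, Sequence


def ema(values: Sequence[float], alpha: float = 0.1) -> List[float]:
    """Exponential moving average; smaller alpha means heavier smoothing."""
    smoothed: List[float] = []
    average = None
    for value in values:
        # Seed with the first value, then blend each new value into the average.
        average = value if average is None else alpha * value + (1 - alpha) * average
        smoothed.append(average)
    return smoothed


# Hypothetical per-step losses that fluctuate within the 0-4 range.
step_losses = [3.8, 2.1, 3.5, 1.9, 3.0, 1.7, 2.6, 1.5, 2.2, 1.2]
smoothed_losses = ema(step_losses, alpha=0.2)

print([round(loss, 3) for loss in smoothed_losses])
```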
