Hi, the T5 model was pretrained in bf16, so it may not work well with fp16.
Could you try removing the "precision" line and training again? That should fall back to fp32 training by default.
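For reference, here is a minimal sketch of what that change looks like, assuming the training script uses a PyTorch Lightning 2.x `Trainer` (the `precision="32-true"` flag suggests so); the specific arguments below are placeholders, not this repo's actual config:

```python
from lightning.pytorch import Trainer

# fp16 has a much narrower dynamic range than bf16, so a model pretrained
# in bf16 (like T5) can overflow to NaN/inf losses under fp16.
# Omitting the precision argument falls back to plain fp32 training:
trainer = Trainer(
    max_epochs=3,          # placeholder; keep your existing settings
    accelerator="gpu",
    devices=1,
    # precision="16-mixed",  # avoid: fp16 mixed precision can be unstable for T5
    # precision="32-true",   # explicit fp32; equivalent to the default
)
```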
Could you also say more about the non-convergence you see with precision="32-true"?
During normal training the loss may fluctuate, but it should still stay in the 0-4 range.
Thank you, Chia Yew!
I will try removing the "precision" line and see what happens.
Regarding non-convergence: I notice the loss stays between 0 and 4 but varies from case to case. However, I could not see any convergence after 3 epochs, and I am not sure whether this is related to the precision issue above.
I am trying to finetune the model on a V100 with an older version of torch (1.13.0).
All the other settings are the same as in the README. Could anyone give me some suggestions? Thanks very much!
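For what it's worth, here is a minimal sketch of how to check bf16 support on the GPU. Note that the V100 (compute capability 7.0) predates native bf16 support, which arrived with Ampere (8.0+), so bf16 training is not an option on this card in any case:

```python
import torch

# bf16 requires an Ampere-or-newer GPU (compute capability >= 8.0);
# a V100 reports capability (7, 0) and no native bf16 support.
print(torch.cuda.get_device_name(0))        # e.g. "Tesla V100-SXM2-16GB"
print(torch.cuda.get_device_capability(0))  # (7, 0) on a V100
print(torch.cuda.is_bf16_supported())       # False on a V100
```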