[NaN check] Add NaN check to support bfloat16. #5879
ys950902 wants to merge 4 commits into deepspeedai:master
Sorry for the late response; I have been busy with other work these days. My understanding is that the NaN check applies not only to float16 but also to bfloat16. This change won't add any extra log info on the DeepSpeed side, so users can only call the was_step_applied() API to check whether the update succeeded.
@ys950902 This check only lets training continue without an error, but grad_norm and loss stop decreasing. Would it be better to turn this into an error, or to provide a way to resolve the overflow problem, either in code or through a warning?
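
To make the trade-off discussed here concrete, the following is a minimal sketch, not DeepSpeed code. `ToyEngine`, `train`, and `max_skipped` are hypothetical names invented for illustration; only the method name `was_step_applied()` mirrors the API mentioned in this thread. It assumes a scalar gradient and shows a step being skipped when the gradient is NaN or Inf, with the caller escalating to an error once too many consecutive steps are skipped.

```python
import math


class ToyEngine:
    """Hypothetical stand-in for a training engine (not DeepSpeed's implementation)."""

    def __init__(self, weight=1.0, lr=0.1):
        self.weight = weight
        self.lr = lr
        self._step_applied = False

    def step(self, grad):
        # Skip the update when the gradient is NaN or Inf, whatever the dtype.
        if not math.isfinite(grad):
            self._step_applied = False
            return
        self.weight -= self.lr * grad
        self._step_applied = True

    def was_step_applied(self):
        # Reports whether the most recent step() actually updated the weight.
        return self._step_applied


def train(engine, grads, max_skipped=3):
    """Run one step per gradient; raise if too many consecutive steps are skipped."""
    skipped = 0
    for i, grad in enumerate(grads):
        engine.step(grad)
        if engine.was_step_applied():
            skipped = 0
        else:
            skipped += 1
            if skipped >= max_skipped:
                raise RuntimeError(
                    f"{skipped} consecutive steps skipped, last at step {i}"
                )
    return engine.weight
```

With this pattern a single bad gradient is tolerated silently, while a persistent overflow surfaces as an error instead of a run whose loss quietly stops improving. The threshold is an arbitrary choice for the sketch.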
Closing in place of #6976
No description provided.