
[NaN check] Add NaN check to support bfloat16.#5879

Closed
ys950902 wants to merge 4 commits into deepspeedai:master from ys950902:nan_check

Conversation

@ys950902
Contributor

@ys950902 ys950902 commented Aug 8, 2024

No description provided.

@tjruwase
Contributor

tjruwase commented Aug 9, 2024

@ys950902, thanks for helping with this. The problem is a bit more involved and there was a previous attempt that was abandoned. Can you please take a look at #5252

Do you think you can incorporate the learnings into your PR?

@ys950902
Contributor Author

> @ys950902, thanks for helping with this. The problem is a bit more involved and there was a previous attempt that was abandoned. Can you please take a look at #5252
>
> Do you think you can incorporate the learnings into your PR?

Sorry for the late response; I have been busy with other work these days. My understanding is that the NaN check should cover not only float16 but also bfloat16, and it should not add extra log output on the DeepSpeed side; users can call the API was_step_applied() to check whether the update was applied successfully.
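For context, extending the NaN check to bfloat16 is natural because bfloat16 uses the same exponent layout as float32 (it is simply the top 16 bits of a float32), so the usual "exponent all ones, mantissa nonzero" test applies unchanged. A minimal dependency-free sketch (the helper names here are illustrative, not DeepSpeed's actual API):

```python
import struct

def float_to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 to its top 16 bits, the bfloat16 representation."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bfloat16_is_nan(bits: int) -> bool:
    # NaN in bfloat16: exponent field all ones (mask 0x7F80, ignoring the
    # sign bit) and a nonzero mantissa (low 7 bits). Infinity has the same
    # exponent but a zero mantissa, so it is correctly excluded.
    return (bits & 0x7F80) == 0x7F80 and (bits & 0x007F) != 0

assert bfloat16_is_nan(float_to_bfloat16_bits(float("nan")))
assert not bfloat16_is_nan(float_to_bfloat16_bits(float("inf")))
assert not bfloat16_is_nan(float_to_bfloat16_bits(1.0))
```

In practice a framework would use its tensor-level primitives (e.g. an isnan/isinf reduction over gradients) rather than bit manipulation; the sketch just shows why the same check logic works for both float16-style and bfloat16 values.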

@QingtaoLi1

@ys950902 This check only lets training continue without error, but the grad_norm and loss stop descending. Would it be better to turn this into an error, or to provide a way to resolve the overflow problem in code or via a warning?

@tjruwase
Contributor

Closing in place of #6976

@tjruwase tjruwase closed this Jan 30, 2025