port OVERFLOW log to ZeRO-2 by stas00 · Pull Request #1593 · deepspeedai/DeepSpeed

stas00 · 2021-11-27T06:14:46Z

This PR just ports to ZeRO2:

[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 65536

to be in sync with ZeRO3.

The only issue here is that, the way ZeRO-2 is written, it doesn't seem to skip one step at a time: it keeps scaling down within the same step until it runs out of loss scale. Perhaps that's why it wasn't logging that info in the first place.

The problem is that after I added this change, the user issue I'm trying to debug (huggingface/transformers#14531) gets:

OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1

which means it tried all of 2**16 -> 2**0 and failed. My first suspicion was that the user had "initial_scale_power": 0 set in the config file, but that wasn't the case: it was 16.
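For reference, here is a minimal sketch of how the starting scale follows from the config. The dict mirrors the `fp16` section of a DeepSpeed config file; `initial_loss_scale` is a hypothetical helper written for this illustration, not a DeepSpeed function.

```python
# Illustrative config fragment, written as a Python dict.
ds_config = {
    "fp16": {
        "enabled": True,
        "loss_scale": 0,            # 0 selects dynamic loss scaling
        "initial_scale_power": 16,  # starting scale is 2**16
    }
}


def initial_loss_scale(config):
    """Hypothetical helper: starting dynamic loss scale implied by the config."""
    return 2 ** config["fp16"]["initial_scale_power"]


print(initial_loss_scale(ds_config))  # 65536
```

With "initial_scale_power": 0 the same helper returns 1, which is what the log line above would suggest if read on its own.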

Thus I'm not sure how to best flag to the user that the loss scaling started from 2**initial_scale_power - i.e. this logging I added has an issue.

Actually, I tried to find where the code goes through 2**16 -> 2**0 in a single step and I can't find it; it appears to be doing it one step at a time. So I can't figure out how the first OVERFLOW report can be Attempted loss scale: 1, reducing to 1

p.s. zero3 does report a single step down in scale per overflow, so it takes 17 steps to get to Attempted loss scale: 1
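The count of 17 can be checked with a small simulation. This is only a sketch under simplifying assumptions: every step overflows, the scale is halved after each overflow, and settings such as `hysteresis` are ignored. `attempted_scales` is a hypothetical name, not part of DeepSpeed.

```python
def attempted_scales(initial_scale_power=16, min_scale=1, factor=2):
    """List the loss scale attempted at each consecutive overflowing step.

    Assumes one reduction by `factor` per overflow, floored at `min_scale`.
    """
    scale = 2 ** initial_scale_power
    attempted = []
    while True:
        attempted.append(scale)
        if scale <= min_scale:
            break  # floor reached, no further reduction is possible
        scale = max(scale // factor, min_scale)
    return attempted


scales = attempted_scales()
print(len(scales), scales[0], scales[-1])  # 17 65536 1
```

Under these assumptions the scale 1 is first attempted on the 17th overflowing step, so seeing it in the very first OVERFLOW report is unexpected.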

The rest probably belongs to a new Issue, but it's related to this PR.

I'm trying to debug this Issue: huggingface/transformers#14531
where the training works fine with t5-small or t5-base, but switching to t5-large or higher leads to an OVERFLOW on the first step from which it never recovers. And there is no diagnostic whatsoever.

So I think besides this PR additional logging is needed to tell the user that the training is not happening since:

OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1

is an impasse from which DeepSpeed can't recover, i.e. perhaps it should assert if that's the case.

but that's an additional feature.
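A rough sketch of what such a guard might look like. All names here (`UnrecoverableOverflowError`, `check_overflow`) are hypothetical and do not exist in DeepSpeed; the sketch only assumes that the optimizer knows whether the step overflowed and which scale it attempted.

```python
class UnrecoverableOverflowError(RuntimeError):
    """Hypothetical error: overflow at the minimum loss scale."""


def check_overflow(overflow, attempted_scale, min_scale=1):
    """Raise instead of silently skipping once the scale cannot be reduced."""
    if overflow and attempted_scale <= min_scale:
        raise UnrecoverableOverflowError(
            f"OVERFLOW at loss scale {attempted_scale}, which is already the "
            f"minimum ({min_scale}); training cannot make progress."
        )
    return overflow


# An overflow above the minimum scale is still recoverable: the step is skipped.
print(check_overflow(True, 65536))  # True
```

Raising a dedicated exception rather than only logging would stop the run at the first step that can no longer recover.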

and while at it 2 more related issues:

note that there is a one-off problem: Attempted loss scale: 65536, reducing to 65536 - it's not scaling the first time around.
here is another: DeepSpeed can't recover from loss="nan", where it should probably assert. See: should dynamic scaling and overflow check happen only at the beginning? #931

Apologies for such a huge info dump. It's just all related so I wasn't sure how to best communicate this clearly.

@tjruwase

port OVERFLOW log to ZeRO-2

fb65c0b

stas00 requested review from RezaYazdaniAminabadi, ShadenSmith, awan-10, cli99, conglongli, eltonzheng, jeffra, minjiaz, niumanar, samyam and tjruwase as code owners November 27, 2021 06:14

stas00 mentioned this pull request Nov 27, 2021

Deepspeed and T5-11B for multitask training huggingface/transformers#14531

Closed

tjruwase approved these changes Nov 27, 2021

tjruwase merged commit 7a132a9 into deepspeedai:master Nov 27, 2021

stas00 deleted the z2-log-overflow branch November 28, 2021 00:54

stas00 mentioned this pull request Nov 29, 2021

[loss OVERFLOW] Several Issues #1599

Closed