
port OVERFLOW log to ZeRO-2#1593

Merged
tjruwase merged 1 commit into deepspeedai:master from stas00:z2-log-overflow
Nov 27, 2021

Conversation


@stas00 stas00 commented Nov 27, 2021

This PR just ports to ZeRO2:

[deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 65536

to be in sync with ZeRO3.

The only issue here is that zero2 is written so that it doesn't skip one step at a time - it continues to scale down until it runs out of loss scale within the same step. Perhaps that's why it wasn't logging this info in the first place.
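To make the mechanics concrete, here is a minimal sketch of a dynamic loss scaler that halves on overflow, clamped at a floor. All names here are hypothetical illustrations, not DeepSpeed's actual classes:

```python
# Hedged sketch of dynamic loss scaling on overflow.
# DynamicLossScaler, initial_scale_power, and min_scale are illustrative
# names, assumed for this example - not DeepSpeed's actual implementation.
class DynamicLossScaler:
    def __init__(self, initial_scale_power=16, min_scale=1):
        self.cur_scale = 2.0 ** initial_scale_power
        self.min_scale = min_scale

    def update_scale(self, overflow):
        if overflow:
            prev = self.cur_scale
            # Halve the scale, but never drop below the floor.
            self.cur_scale = max(self.cur_scale / 2, self.min_scale)
            print(f"OVERFLOW! Skipping step. Attempted loss scale: "
                  f"{int(prev)}, reducing to {int(self.cur_scale)}")
```

Under this model, once the scale reaches the floor every subsequent overflow prints "Attempted loss scale: 1, reducing to 1", which matches the repeated message described below.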

The problem is that after I added this change, the user Issue I'm trying to debug (huggingface/transformers#14531) gets:

OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1

which means it tried every scale from 2**16 down to 2**0 and failed. My first suspicion was that the user had "initial_scale_power": 0 set in the config file, but that wasn't the case - it was 16.

Thus I'm not sure how best to flag to the user that the loss scaling started from 2**initial_scale_power - i.e. the logging I added has an issue.

Actually, I tried to find where in the code it goes through 2**16 -> 2**0 in a single step and I can't find it; it appears to do it one step at a time. Then I can't figure out how the first OVERFLOW report is Attempted loss scale: 1, reducing to 1

p.s. zero3 does report a single step down in scale - so it takes 17 steps to get to Attempted loss scale: 1
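The 17-step count checks out arithmetically: halving one step at a time from 2**16 to 2**0 visits 17 distinct scales. A quick sanity check:

```python
# Enumerate the scales visited when halving 2**16 down to 2**0
# one step at a time: exponents 16, 15, ..., 0 give 17 distinct values.
scales = []
scale = 2 ** 16
while scale >= 1:
    scales.append(scale)
    scale //= 2
print(len(scales))  # 17 scales, from 65536 down to 1
```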


The rest probably belongs in a new Issue, but it's related to this PR.

I'm trying to debug this Issue: huggingface/transformers#14531
where training works fine with t5-small or t5-base, but switching to t5-large or higher leads to an OVERFLOW on the first step from which it never recovers. And there is no diagnostic whatsoever.

So besides this PR, I think additional logging is needed to tell the user that the training is not happening, since:

OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1

is an impasse from which deepspeed can't recover. Perhaps it should assert if that's the case.
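A hypothetical guard for this (not in DeepSpeed - an illustration of the proposed assert, with an assumed function name and floor of 1) might look like:

```python
# Hypothetical guard: fail loudly when the loss scale has bottomed out,
# since every subsequent step overflows and is silently skipped.
# check_loss_scale_floor and min_scale are assumed names for illustration.
def check_loss_scale_floor(cur_scale, min_scale=1):
    if cur_scale <= min_scale:
        raise RuntimeError(
            "Loss scale is already at its minimum - cannot decrease it "
            "further. Every step overflows and is being skipped, so "
            "training is making no progress."
        )
```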

but that's an additional feature.


And while at it, 2 more related issues:


Apologies for such a huge info dump. It's just all related so I wasn't sure how to best communicate this clearly.

@tjruwase

@tjruwase tjruwase merged commit 7a132a9 into deepspeedai:master Nov 27, 2021
@stas00 stas00 deleted the z2-log-overflow branch November 28, 2021 00:54