Closed
Labels
bug (Something isn't working)
Description
Describe the bug
Splitting off from #1593.
This issue documents multiple problems with the OVERFLOW event:
- ZeRO2 OVERFLOW jumps in a single step from the `initial_scale_power` set in the config to the scale at which the overflow is resolved, or at which it can no longer be resolved. ZeRO3, by comparison, reduces the power one step at a time and reports each decrement. As you can see in port OVERFLOW log to ZeRO-2 #1593 (comment), I get the report of a loss scale of 1 right away when `initial_scale_power=16`:
  OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
- ZeRO3 OVERFLOW has an off-by-one issue: `Attempted loss scale: 65536, reducing to 65536`, i.e. it's not scaling the first time around.
- Both ZeRO2 and ZeRO3 should probably assert if:
  a. the loss is NaN (related: should dynamic scaling and overflow check happen only at the beginning? #931)
  b. `Attempted loss scale: 1, reducing to 1` happens, which is clearly broken
  since it's impossible to recover from either. But the DeepSpeed optimizer skips the step and gets stuck in the above scaling.
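
To make the requested behavior concrete, here is a minimal toy sketch of a dynamic loss scaler that halves the scale once per overflow, reports the old and new values correctly, and raises instead of looping when recovery is impossible. It is not DeepSpeed's implementation: `SketchLossScaler`, `LossScaleStuckError`, and `update` are hypothetical names, and it ignores hysteresis, the scale window, and per-rank handling.

```python
import math


class LossScaleStuckError(RuntimeError):
    """Hypothetical error type: training cannot recover by rescaling."""


class SketchLossScaler:
    """Toy dynamic loss scaler (an illustration, not DeepSpeed code)."""

    def __init__(self, initial_scale_power=16, min_scale=1):
        # The starting scale is 2 ** initial_scale_power, e.g. 65536 for 16.
        self.scale = 2 ** initial_scale_power
        self.min_scale = min_scale

    def update(self, loss, overflow):
        """Return a log message if the step is skipped, otherwise None."""
        if math.isnan(loss):
            # Case (a): a NaN loss cannot be fixed by lowering the scale.
            raise LossScaleStuckError("loss is NaN")
        if not overflow:
            return None
        attempted = self.scale
        if attempted <= self.min_scale:
            # Case (b): "Attempted loss scale: 1, reducing to 1".
            raise LossScaleStuckError(
                f"overflow at minimum loss scale {attempted}"
            )
        # Halve once per overflow, so each decrement gets reported.
        self.scale = max(attempted // 2, self.min_scale)
        return (
            "OVERFLOW! Skipping step. "
            f"Attempted loss scale: {attempted}, reducing to {self.scale}"
        )
```

With `initial_scale_power=16`, the first overflow in this sketch reports a reduction from 65536 to 32768 rather than 65536 to 65536, and a further overflow once the scale has reached 1 raises instead of skipping the step indefinitely.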