
[loss OVERFLOW] Several Issues #1599

@stas00


Describe the bug

Split off from #1593

This issue documents several problems with the OVERFLOW event:

  1. ZeRO2 OVERFLOW jumps in a single step from the initial_scale_power set in the config to the scale where the overflow either gets resolved or can't be resolved (compared to ZeRO3, which reduces the power one step at a time and reports each decrement). As you can see in "port OVERFLOW log to ZeRO-2" #1593 (comment), I get the report of a loss scale of 1 right away when initial_scale_power=16:

     OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1

  2. ZeRO3 OVERFLOW has an off-by-one issue: "Attempted loss scale: 65536, reducing to 65536", i.e. it's not reducing the scale the first time around.

  3. Both ZeRO2 and ZeRO3 should probably assert if:
    a. loss is nan (related: "should dynamic scaling and overflow check happen only at the beginning?" #931)
    b. "Attempted loss scale: 1, reducing to 1" happens, since clearly this is broken.

    It's impossible to recover from either, but the DeepSpeed optimizer just skips the step and gets stuck at the above scale.
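
To make the requested behaviour concrete, here is a rough sketch of a toy dynamic loss scaler. It is not DeepSpeed's code: the names SketchLossScaler and UnrecoverableOverflow are made up for illustration, and halving the scale on each overflow is an assumption. It reports every decrement, actually reduces the scale on the first overflow, and raises instead of looping once recovery is impossible.

```python
import math


class UnrecoverableOverflow(RuntimeError):
    """Hypothetical error: skipping the step cannot fix the problem."""


class SketchLossScaler:
    """Toy dynamic loss scaler (illustration only, not DeepSpeed's implementation)."""

    def __init__(self, initial_scale_power=16, min_scale=1.0):
        # Mirrors the meaning of initial_scale_power: scale = 2 ** power.
        self.scale = 2.0 ** initial_scale_power
        self.min_scale = min_scale

    def update(self, loss, overflow, rank=0):
        """Return True if the optimizer step should be skipped."""
        # A NaN loss cannot be repaired by lowering the scale, so fail loudly.
        if math.isnan(loss):
            raise UnrecoverableOverflow("loss is NaN; rescaling cannot recover")
        if not overflow:
            return False
        # Already at the floor: "Attempted loss scale: 1, reducing to 1"
        # would repeat forever, so raise instead of skipping again.
        if self.scale <= self.min_scale:
            raise UnrecoverableOverflow(
                f"overflow at minimum loss scale {int(self.scale)}"
            )
        attempted = self.scale
        # Assumed reduction factor of 2, applied on the very first overflow.
        self.scale = max(self.scale / 2.0, self.min_scale)
        print(
            f"OVERFLOW! Rank {rank} Skipping step. "
            f"Attempted loss scale: {int(attempted)}, reducing to {int(self.scale)}"
        )
        return True
```

With initial_scale_power=16 this would log 65536 to 32768 on the first overflow, then one line per halving, and raise on the overflow that arrives once the scale has reached 1.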


Labels: bug
