
should dynamic scaling and overflow check happen only at the beginning? #931

@stas00

So `fp16.initial_scale_power` enables dynamic loss scaling, but it seems the scaling should only run until the right range is found, and never check or re-scale again once a working scale has been established.
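For context, dynamic loss scaling is configured through the `fp16` block of the DeepSpeed config; a typical setup looks something like this (the values here are illustrative defaults, not the exact config used in this run):

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
```

`"loss_scale": 0` selects dynamic scaling, starting from `2^initial_scale_power` and halving on overflow.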

Observe this:

[2021-04-06 21:22:36,418] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
  9% | 16/174 [01:20<13:49,  
{'loss': 3.2588, 'learning_rate': 0, 'epoch': 0.09}                                                                                                               
  9%| | 16/174 [01:20<13:49,  5.25s/it]

  [2021-04-06 21:22:40,973] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
 10%| | 17/174 [01:25<13:11,  
{'loss': 2.5342, 'learning_rate': 0, 'epoch': 0.1}                                                                                                                
{'loss': 3.0586, 'learning_rate': 0.0, 'epoch': 0.1}                                                                                                              
{'loss': 2.8711, 'learning_rate': 1.354634980487915e-06, 'epoch': 0.11}                                                                                           
{'loss': 2.875, 'learning_rate': 2.1470456462384806e-06, 'epoch': 0.11}                                                                                           
{'loss': 3.1064, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.12}     


# XXX: it resumed trying to scale here 2nd time:                                                                                       
 12%| | 21/174 [01:48<14:18,  5.61s/it]
 [2021-04-06 21:23:09,319] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
{'loss': nan, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.13}                                                                                               
 13%| | 22/174 [01:53<13:28,  5.32s/it]
 [2021-04-06 21:23:13,653] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
{'loss': nan, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.13}                                                                                               
 13%|

So the optimizer kicked in on step 17 as there was no more overflow, and then a few steps later the model overflowed for a totally different reason (it was bfloat16-pretrained), but `_overflow_clean_up` kicks back in and tries to scale further, which is pointless since the model is done for - it never recovers.

I mean, this doesn't make things worse; it's just confusing to the user that deepspeed is trying to recover from something it can't recover from - and it's not deepspeed's fault either.

So my thinking is that perhaps, once a good scaling factor has been reached, the check can be stopped?
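To make the suggestion concrete, here is a minimal sketch of dynamic loss scaling with a hypothetical one-way `freeze_after_stable` flag. This is not DeepSpeed's actual implementation; the class and parameter names are made up for illustration. The idea: once the scale survives a full stable window of steps, lock it, so a later overflow from an unrelated cause (like the bfloat16 issue above) no longer shrinks the scale.

```python
class DynamicLossScaler:
    """Illustrative dynamic loss scaler with an optional one-way freeze.

    Hypothetical sketch, not DeepSpeed's code: once the scale survives
    `stable_window` consecutive non-overflow steps, adjustment stops,
    so subsequent overflows no longer halve the scale.
    """

    def __init__(self, initial_scale_power=16, stable_window=1000,
                 freeze_after_stable=True):
        self.scale = 2.0 ** initial_scale_power
        self.stable_window = stable_window
        self.freeze_after_stable = freeze_after_stable
        self.steps_since_overflow = 0
        self.frozen = False

    def update(self, overflow):
        """Call once per optimizer step with the overflow check result."""
        if self.frozen:
            return  # scale is locked; stop reacting to overflows
        if overflow:
            self.scale /= 2.0  # back off and skip this step
            self.steps_since_overflow = 0
        else:
            self.steps_since_overflow += 1
            if (self.freeze_after_stable
                    and self.steps_since_overflow >= self.stable_window):
                self.frozen = True  # right range found; stop checking
```

Replaying the log above with this sketch: starting at `2^18 = 262144`, two overflows bring the scale to 65536; after a stable window the scaler freezes, so the later NaN-driven overflows would be reported but would no longer reduce the scale.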

I hope I was able to convey the issue clearly.
