Description
So `fp16.initial_scale_power` enables dynamic loss scaling, but it should probably adjust only until the right range has been found, and never check or re-scale again once a working scale has been reached.
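For reference, this is roughly how dynamic loss scaling is configured in the DeepSpeed config (`loss_scale: 0` selects dynamic scaling; the specific values here are illustrative, not the ones from my run):

```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 18,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}
```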
Observe this:
```
[2021-04-06 21:22:36,418] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
  9%| | 16/174 [01:20<13:49,
{'loss': 3.2588, 'learning_rate': 0, 'epoch': 0.09}
  9%| | 16/174 [01:20<13:49, 5.25s/it]
[2021-04-06 21:22:40,973] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
 10%| | 17/174 [01:25<13:11,
{'loss': 2.5342, 'learning_rate': 0, 'epoch': 0.1}
{'loss': 3.0586, 'learning_rate': 0.0, 'epoch': 0.1}
{'loss': 2.8711, 'learning_rate': 1.354634980487915e-06, 'epoch': 0.11}
{'loss': 2.875, 'learning_rate': 2.1470456462384806e-06, 'epoch': 0.11}
{'loss': 3.1064, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.12}
# XXX: it resumed trying to scale here a 2nd time:
 12%| | 21/174 [01:48<14:18, 5.61s/it]
[2021-04-06 21:23:09,319] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
{'loss': nan, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.13}
 13%| | 22/174 [01:53<13:28, 5.32s/it]
[2021-04-06 21:23:13,653] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
{'loss': nan, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.13}
 13%|
```
So the optimizer kicked in on step 17 once there was no more overflow, and then a few steps later the model overflowed for a totally different reason (it was bfloat16-pretrained), but `_overflow_clean_up` kicked back in and kept reducing the scale, which is pointless since the model is done for - it never recovers.
I mean, this doesn't make things worse; it's just confusing to the user that DeepSpeed keeps trying to recover from something it can't recover from - and it's not DeepSpeed's fault either.
So my thinking is that perhaps, once a good scaling factor has been found, the overflow check could be stopped?
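To make the proposal concrete, here is a minimal sketch of a dynamic loss scaler with the suggested behavior: after some number of consecutive overflow-free steps the scale is considered stable and is frozen, so later overflows (from an unrelated cause, like the bfloat16 divergence above) no longer trigger rescaling. This is a hypothetical illustration, not DeepSpeed's actual implementation; the class, its parameters, and the `stable_steps` threshold are all made up for this example.

```python
class FrozenAfterStableLossScaler:
    """Sketch of a dynamic loss scaler that stops adjusting once stable.

    Halves the scale on overflow, as usual, but after `stable_steps`
    consecutive overflow-free steps it locks the scale permanently.
    """

    def __init__(self, initial_scale_power=16, stable_steps=100):
        self.scale = 2.0 ** initial_scale_power
        self.stable_steps = stable_steps
        self.good_steps = 0
        self.locked = False  # proposed: freeze once a good scale is found

    def update(self, overflow: bool) -> None:
        if self.locked:
            # Proposed behavior: a later overflow is treated as the
            # model's problem, not a scaling problem - do nothing.
            return
        if overflow:
            self.scale /= 2
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.stable_steps:
                self.locked = True
```

With this, the run in the log above would have reduced 262144 → 65536 during the initial search, then locked; the second wave of overflows at step 21 would simply skip steps without shrinking the scale further.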
I hope I was able to convey the issue clearly.