Description
So `fp16.initial_scale_power` enables dynamic loss scaling, but it should probably adjust only until the right range has been found, and never check or re-scale again once a working scale has been reached.
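For reference, this is roughly how dynamic loss scaling is configured in the DeepSpeed config (`loss_scale: 0` selects dynamic scaling; the specific values here are illustrative, not the ones from my run):

```json
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "initial_scale_power": 18,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}
```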
Observe this:
```
[2021-04-06 21:22:36,418] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
  9%| | 16/174 [01:20<13:49,
{'loss': 3.2588, 'learning_rate': 0, 'epoch': 0.09}
  9%| | 16/174 [01:20<13:49, 5.25s/it]
[2021-04-06 21:22:40,973] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
 10%| | 17/174 [01:25<13:11,
{'loss': 2.5342, 'learning_rate': 0, 'epoch': 0.1}
{'loss': 3.0586, 'learning_rate': 0.0, 'epoch': 0.1}
{'loss': 2.8711, 'learning_rate': 1.354634980487915e-06, 'epoch': 0.11}
{'loss': 2.875, 'learning_rate': 2.1470456462384806e-06, 'epoch': 0.11}
{'loss': 3.1064, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.12}
# XXX: it resumed trying to scale here a 2nd time:
 12%| | 21/174 [01:48<14:18, 5.61s/it]
[2021-04-06 21:23:09,319] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
{'loss': nan, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.13}
 13%| | 22/174 [01:53<13:28, 5.32s/it]
[2021-04-06 21:23:13,653] [INFO] [stage3.py:2326:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768.0, reducing to 16384.0
{'loss': nan, 'learning_rate': 2.70926996097583e-06, 'epoch': 0.13}
 13%|
```
So the optimizer kicked in on step 17 once there was no more overflow, and then a few steps later the model overflowed for a totally different reason (it was bfloat16-pretrained), but `_overflow_clean_up` kicked back in and kept reducing the scale, which is pointless since the model is done for - it never recovers.
I mean, this doesn't make things worse; it's just confusing to the user that DeepSpeed keeps trying to recover from something it can't recover from - and it's not DeepSpeed's fault either.
So my thinking is that perhaps, once a good scaling factor has been found, the overflow check could be stopped?
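To make the proposal concrete, here is a minimal sketch of a dynamic loss scaler with the suggested behavior: after some number of consecutive overflow-free steps the scale is considered stable and is frozen, so later overflows (from an unrelated cause, like the bfloat16 divergence above) no longer trigger rescaling. This is a hypothetical illustration, not DeepSpeed's actual implementation; the class, its parameters, and the `stable_steps` threshold are all made up for this example.

```python
class FrozenAfterStableLossScaler:
    """Sketch of a dynamic loss scaler that stops adjusting once stable.

    Halves the scale on overflow, as usual, but after `stable_steps`
    consecutive overflow-free steps it locks the scale permanently.
    """

    def __init__(self, initial_scale_power=16, stable_steps=100):
        self.scale = 2.0 ** initial_scale_power
        self.stable_steps = stable_steps
        self.good_steps = 0
        self.locked = False  # proposed: freeze once a good scale is found

    def update(self, overflow: bool) -> None:
        if self.locked:
            # Proposed behavior: a later overflow is treated as the
            # model's problem, not a scaling problem - do nothing.
            return
        if overflow:
            self.scale /= 2
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.stable_steps:
                self.locked = True
```

With this, the run in the log above would have reduced 262144 → 65536 during the initial search, then locked; the second wave of overflows at step 21 would simply skip steps without shrinking the scale further.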
I hope I was able to convey the issue clearly.