🐛 Bug
When using native AMP, gradients are clipped while they are still scaled.
Currently, grads are clipped here: https://github.com/PyTorchLightning/pytorch-lightning/blob/37469cd3e85bd203cf32d8f888edbc713a7bce09/pytorch_lightning/loops/optimization/optimizer_loop.py#L244
The grads are unscaled here: https://github.com/PyTorchLightning/pytorch-lightning/blob/25bfd06f33671ff6538a0aa5086534a875a4d478/pytorch_lightning/plugins/precision/native_amp.py#L58
There is a PTL hook, https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#on-before-optimizer-step, which links to the PyTorch documentation explaining why the gradients must be unscaled before clipping when using AMP.
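For reference, the ordering the linked PyTorch documentation calls for is unscale first, then clip. A minimal sketch in plain PyTorch (no Lightning); the model, optimizer, data loader, and max_norm value are placeholders, not part of this issue:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Placeholder model, optimizer, and data; only the ordering below matters.
model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()
data_loader = [(torch.randn(8, 10).cuda(), torch.randn(8, 10).cuda()) for _ in range(10)]

for x, y in data_loader:
    optimizer.zero_grad()
    with autocast():
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()

    # Unscale the gradients in-place so clipping sees their real magnitudes.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # step() skips the update if any unscaled grad is inf/NaN; update() adjusts the scale.
    scaler.step(optimizer)
    scaler.update()
```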
We were able to solve this problem by clipping grads in the on_before_optimizer_step hook and overriding the NativeMixedPrecisionPlugin so that clip_gradients does nothing; a sketch of this workaround follows.
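A minimal sketch of that workaround, assuming a PTL 1.4.x-style API (the plugin constructor arguments and hook signature differ across versions); the class names, clip value, and Trainer arguments are illustrative, and it relies on the gradients already being unscaled by the precision plugin when on_before_optimizer_step fires, as described above:

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.plugins import NativeMixedPrecisionPlugin


class NoClipNativeMixedPrecisionPlugin(NativeMixedPrecisionPlugin):
    """Hypothetical subclass that disables Lightning's built-in clipping."""

    def clip_gradients(self, *args, **kwargs) -> None:
        # Lightning would otherwise clip the still-scaled gradients here.
        pass


class MyGPTModule(pl.LightningModule):  # hypothetical LightningModule
    def on_before_optimizer_step(self, optimizer, optimizer_idx):
        # Clip here instead; by this point the precision plugin has already
        # unscaled the gradients (see the native_amp.py link above).
        torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=1.0)


# gradient_clip_val can stay in the config; the overridden plugin ignores it.
trainer = pl.Trainer(
    gpus=1,
    precision=16,
    gradient_clip_val=1.0,
    plugins=[NoClipNativeMixedPrecisionPlugin()],
)
```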
Expected behavior
Native AMP loss should be similar to the fp32 loss. Instead, we found that when using native AMP with grad clipping, our loss was much worse. This makes sense, as grads would be clipped too much if the scaled grads are used for clipping.
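To make the "clipped too much" point concrete, a toy calculation with an illustrative loss scale of 2**16 and a clip value of 1.0 (numbers are examples only):

```python
import torch

scale = 2.0 ** 16                    # illustrative GradScaler loss scale
true_grad = torch.ones(4) / 2.0      # "real" gradient, norm = 1.0
scaled_grad = true_grad * scale      # what .grad holds under AMP before unscaling

# Clipping the *scaled* gradient to max_norm=1.0 rescales it by ~1/65536 ...
clipped = scaled_grad * (1.0 / scaled_grad.norm())
# ... so after the scaler unscales it, the effective gradient norm is ~1.5e-5
# instead of the intended 1.0 (which would not have been clipped at all).
print((clipped / scale).norm())
```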
The image below shows GPT training with NeMo and PTL, and with Megatron-LM (no PTL). All configurations have roughly the same loss curve except the combination of NeMo and PTL with native AMP and grad clipping.
Environment
PyTorch Lightning Version (e.g., 1.3.0): 1.4.8
PyTorch Version (e.g., 1.8): 1.9
Python version: 3.8
OS (e.g., Linux): Linux
CUDA/cuDNN version:
GPU models and configuration: A100
How you installed PyTorch (conda, pip, source): pip
If compiling from source, the output of torch.__config__.show():
Any other relevant information:

Additional context