🐛 Bug
When using native AMP, gradients are clipped while they are still scaled.
Currently, grads are clipped here: https://github.com/PyTorchLightning/pytorch-lightning/blob/37469cd3e85bd203cf32d8f888edbc713a7bce09/pytorch_lightning/loops/optimization/optimizer_loop.py#L244
The grads are unscaled here: https://github.com/PyTorchLightning/pytorch-lightning/blob/25bfd06f33671ff6538a0aa5086534a875a4d478/pytorch_lightning/plugins/precision/native_amp.py#L58
There is a PTL hook, https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#on-before-optimizer-step, which links to the PyTorch documentation explaining why the gradients must be unscaled before clipping when using AMP.
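For reference, the ordering the linked PyTorch documentation calls for is unscale first, then clip. A minimal sketch in plain PyTorch (no Lightning); the model, optimizer, data loader, and max_norm value are placeholders, not part of this issue:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Placeholder model, optimizer, and data; only the ordering below matters.
model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = GradScaler()
data_loader = [(torch.randn(8, 10).cuda(), torch.randn(8, 10).cuda()) for _ in range(10)]

for x, y in data_loader:
    optimizer.zero_grad()
    with autocast():
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()

    # Unscale the gradients in-place so clipping sees their real magnitudes.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # step() skips the update if any unscaled grad is inf/NaN; update() adjusts the scale.
    scaler.step(optimizer)
    scaler.update()
```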
We were able to solve this problem by clipping grads in the on_before_optimizer_step hook and overriding the NativeMixedPrecisionPlugin so that clip_gradients does nothing; a sketch of this workaround follows.
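A minimal sketch of that workaround, assuming a PTL 1.4.x-style API (the plugin constructor arguments and hook signature differ across versions); the class names, clip value, and Trainer arguments are illustrative, and it relies on the gradients already being unscaled by the precision plugin when on_before_optimizer_step fires, as described above:

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.plugins import NativeMixedPrecisionPlugin


class NoClipNativeMixedPrecisionPlugin(NativeMixedPrecisionPlugin):
    """Hypothetical subclass that disables Lightning's built-in clipping."""

    def clip_gradients(self, *args, **kwargs) -> None:
        # Lightning would otherwise clip the still-scaled gradients here.
        pass


class MyGPTModule(pl.LightningModule):  # hypothetical LightningModule
    def on_before_optimizer_step(self, optimizer, optimizer_idx):
        # Clip here instead; by this point the precision plugin has already
        # unscaled the gradients (see the native_amp.py link above).
        torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=1.0)


# gradient_clip_val can stay in the config; the overridden plugin ignores it.
trainer = pl.Trainer(
    gpus=1,
    precision=16,
    gradient_clip_val=1.0,
    plugins=[NoClipNativeMixedPrecisionPlugin()],
)
```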
Expected behavior
Native AMP loss should be similar to the fp32 loss. Instead, we found that when using native AMP with grad clipping, our loss was much worse. This makes sense, as grads would be clipped too much if the scaled grads are used for clipping.
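To make the "clipped too much" point concrete, a toy calculation with an illustrative loss scale of 2**16 and a clip value of 1.0 (numbers are examples only):

```python
import torch

scale = 2.0 ** 16                    # illustrative GradScaler loss scale
true_grad = torch.ones(4) / 2.0      # "real" gradient, norm = 1.0
scaled_grad = true_grad * scale      # what .grad holds under AMP before unscaling

# Clipping the *scaled* gradient to max_norm=1.0 rescales it by ~1/65536 ...
clipped = scaled_grad * (1.0 / scaled_grad.norm())
# ... so after the scaler unscales it, the effective gradient norm is ~1.5e-5
# instead of the intended 1.0 (which would not have been clipped at all).
print((clipped / scale).norm())
```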
The image below shows GPT training with NeMo and PTL, and with Megatron-LM (no PTL). All configurations have roughly the same loss curve except the combination of NeMo and PTL with native AMP and grad clipping.
Environment
PyTorch Lightning Version (e.g., 1.3.0): 1.4.8
PyTorch Version (e.g., 1.8): 1.9
Python version: 3.8
OS (e.g., Linux): Linux
CUDA/cuDNN version:
GPU models and configuration: A100
How you installed PyTorch (conda, pip, source): pip
If compiling from source, the output of torch.__config__.show():
Any other relevant information:

Additional context