
Grad clipping with native amp occurs while grads are scaled #9694

Closed

ericharper opened this issue Sep 24, 2021 · 1 comment
Labels
bug (Something isn't working), help wanted (Open to be worked on), won't fix (This will not be worked on)

Comments

@ericharper
Contributor

🐛 Bug

When using native amp, grads are clipped while they are still scaled.

Currently grads are clipped here: https://github.com/PyTorchLightning/pytorch-lightning/blob/37469cd3e85bd203cf32d8f888edbc713a7bce09/pytorch_lightning/loops/optimization/optimizer_loop.py#L244

The grads are unscaled here: https://github.com/PyTorchLightning/pytorch-lightning/blob/25bfd06f33671ff6538a0aa5086534a875a4d478/pytorch_lightning/plugins/precision/native_amp.py#L58

There is a PTL hook: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#on-before-optimizer-step which links to the PyTorch documentation explaining why the gradients must be unscaled before clipping when using amp.
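
For reference, the ordering the PyTorch docs describe looks roughly like this in plain PyTorch (a minimal sketch, assuming `model`, `optimizer`, and `dataloader` are already set up; `max_norm=1.0` is just a placeholder value):

```python
import torch

scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch)
    scaler.scale(loss).backward()

    # Grads must be unscaled *before* clipping, otherwise the clip
    # threshold is compared against the scaled gradient values.
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    scaler.step(optimizer)  # skips the step if inf/nan grads were found
    scaler.update()
```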

We were able to solve this problem by clipping grads in the on_before_optimizer_step hook and overriding the NativeMixedPrecisionPlugin so that clip_gradients does nothing.
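
Roughly, the workaround looks like the sketch below (hedged against PL 1.4.x; the exact clip_gradients signature and hook signature vary between versions, so the override just swallows whatever it gets, and max_norm=1.0 is a placeholder):

```python
import torch
import pytorch_lightning as pl
from pytorch_lightning.plugins import NativeMixedPrecisionPlugin


class NoClipNativeMixedPrecisionPlugin(NativeMixedPrecisionPlugin):
    def clip_gradients(self, *args, **kwargs):
        # Disable Lightning's built-in clipping, which runs on scaled grads.
        pass


class MyGPTModule(pl.LightningModule):  # hypothetical module name
    def on_before_optimizer_step(self, optimizer, optimizer_idx):
        # By this point the precision plugin has already unscaled the grads,
        # so clipping here operates on the true gradient values.
        torch.nn.utils.clip_grad_norm_(self.parameters(), max_norm=1.0)
```

The custom plugin is then passed to the Trainer via its plugins argument (constructor arguments differ across PL versions).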

Expected behavior

Native amp loss should be similar to fp32 loss. Instead we found that when using native amp with grad clipping, our loss was much worse. This makes sense, as grads would be clipped too much if scaled grads are used for clipping.
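
As a toy illustration of the effect (not our actual training code; the scale factor is just GradScaler's default initial value):

```python
import torch

max_norm = 1.0
loss_scale = 65536.0                      # GradScaler's default initial scale

grad = torch.ones(4)                      # "true" grad, norm == 2.0
scaled = grad * loss_scale                # what actually sits in .grad under native amp

# Clipping the scaled grads (the buggy path), then unscaling:
clipped = scaled * (max_norm / scaled.norm())
print((clipped / loss_scale).norm())      # ~1.5e-05 instead of the intended 1.0
```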

The image below shows GPT training with NeMo and PTL and with Megatron-LM (no PTL). All configurations have roughly the same loss curve except the combination of NeMo and PTL with native amp and grad clipping.

[image: GPT loss curves across configurations]

Environment

  • PyTorch Lightning Version (e.g., 1.3.0): 1.4.8
  • PyTorch Version (e.g., 1.8): 1.9
  • Python version: 3.8
  • OS (e.g., Linux): Linux
  • CUDA/cuDNN version:
  • GPU models and configuration: A100
  • How you installed PyTorch (conda, pip, source): pip
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

@stale

stale bot commented Oct 25, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

The stale bot added the won't fix label on Oct 25, 2021.