Loss not on correct device when training with move_metrics_to_cpu=True #9296

Closed
mibaumgartner opened this issue Sep 2, 2021 · 3 comments · Fixed by #9308
Labels: bug, help wanted, priority: 1

Comments

@mibaumgartner
Contributor

🐛 Bug

move_metrics_to_cpu=True seems to move the loss to the CPU, which results in an error when training with native mixed precision.
This is related to the original issue reported in MIC-DKFZ/nnDetection#25. It did not occur with other Lightning versions (an older version and/or move_metrics_to_cpu=False works fine).

Error when training with mixed precision and move_metrics_to_cpu=True:

/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/grad_scaler.py in scale(self, outputs)
    159         # Short-circuit for the common case.
    160         if isinstance(outputs, torch.Tensor):
--> 161             assert outputs.is_cuda or outputs.device.type == 'xla'
    162             if self._scale is None:
    163                 self._lazy_init_scale_growth_tracker(outputs.device)
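
The assertion in the traceback can be illustrated without Lightning. The following is a minimal sketch, assuming only plain PyTorch; `scaler_accepts` is a hypothetical helper that mirrors the condition on line 161 of `grad_scaler.py` shown above, and is not part of PyTorch or Lightning. It shows why a loss that ends up on the CPU would be rejected by `GradScaler.scale`:

```python
import torch


def scaler_accepts(outputs: torch.Tensor) -> bool:
    """Hypothetical helper: same condition as the assert in GradScaler.scale."""
    return outputs.is_cuda or outputs.device.type == "xla"


# A loss living on the CPU, as appears to happen with move_metrics_to_cpu=True.
cpu_loss = torch.tensor(1.0, requires_grad=True)
print(scaler_accepts(cpu_loss))  # the assert would fail for this tensor

if torch.cuda.is_available():
    # Moving the loss back to the GPU satisfies the condition again.
    cuda_loss = cpu_loss.to("cuda")
    print(scaler_accepts(cuda_loss))
```

This only reproduces the check itself, not where Lightning moves the loss.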

To Reproduce

This can be reproduced with the Boring Model in Colab by passing the following flags to the Trainer:

        precision=16,  # native mixed precision
        move_metrics_to_cpu=True,
        gpus=[0],  # use GPU

Expected behavior

No error :)

Environment

  • PyTorch Lightning Version (e.g., 1.3.0):
  • PyTorch Version (e.g., 1.8)
  • Python version:
  • OS (e.g., Linux):
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • How you installed PyTorch (conda, pip, source):
  • If compiling from source, the output of torch.__config__.show():
  • Any other relevant information:

Additional context

@mibaumgartner mibaumgartner added bug Something isn't working help wanted Open to be worked on labels Sep 2, 2021
@tchaton tchaton added the priority: 1 Medium priority task label Sep 3, 2021
@tchaton
Contributor

tchaton commented Sep 3, 2021

Hey @mibaumgartner,

Would it be possible for you to share a reproducible script?

Best,
T.C

@mibaumgartner
Contributor Author

Hi @tchaton,

here is the Boring Model Colab Notebook (latest lightning pypi release 1.4.5):
https://colab.research.google.com/drive/1pvHtF7Zor2OOLjE6NFjxxHTAassqdidb?usp=sharing

Best,
Michael

@tchaton tchaton self-assigned this Sep 3, 2021
@tchaton
Contributor

tchaton commented Sep 3, 2021

Thanks @mibaumgartner,

I can reproduce the bug locally.

Best,
T.C
