Native mixed precision OOM Val Step #6566

@mibaumgartner

🐛 Bug

NativeMixedPrecisionPlugin only implements train_step_context, which can lead to out-of-memory (OOM) errors during validation/testing, presumably because those steps then run outside the mixed-precision context.
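
To make the suspected mechanism concrete, here is a minimal, library-free sketch. It is not the Lightning implementation: `TrainOnlyPrecision`, `AllStepsPrecision`, `RecordingAutocast` and `autocast_active_during` are hypothetical names, and the hook names `val_step_context`, `test_step_context` and `predict_context` are assumed by analogy with `train_step_context`. The autocast context is passed in as a factory so the pattern can be checked without a GPU.

```python
from contextlib import contextmanager


class RecordingAutocast:
    """Stand-in for an autocast context manager; records whether it is active."""

    active = False

    def __enter__(self):
        RecordingAutocast.active = True
        return self

    def __exit__(self, exc_type, exc, tb):
        RecordingAutocast.active = False
        return False  # do not swallow exceptions


class TrainOnlyPrecision:
    """Hypothetical plugin mirroring the report: only training enters autocast."""

    def __init__(self, autocast_factory):
        self.autocast_factory = autocast_factory

    @contextmanager
    def train_step_context(self):
        with self.autocast_factory():
            yield

    @contextmanager
    def val_step_context(self):
        # No autocast here, so the step would run in full precision.
        yield

    # Assumed hook names; they reuse the no-op context above.
    test_step_context = val_step_context
    predict_context = val_step_context


class AllStepsPrecision(TrainOnlyPrecision):
    """Hypothetical variant that enters autocast for every kind of step."""

    @contextmanager
    def val_step_context(self):
        with self.autocast_factory():
            yield

    test_step_context = val_step_context
    predict_context = val_step_context


def autocast_active_during(plugin, hook_name):
    """Return whether the stand-in autocast is active inside the given hook."""
    with getattr(plugin, hook_name)():
        return RecordingAutocast.active


HOOKS = ["train_step_context", "val_step_context", "test_step_context", "predict_context"]

for plugin_cls in (TrainOnlyPrecision, AllStepsPrecision):
    plugin = plugin_cls(RecordingAutocast)
    for hook in HOOKS:
        print(plugin_cls.__name__, hook, autocast_active_during(plugin, hook))
```

With the first class, only the training hook reports an active autocast context; with the second, all four do. In a real setup the factory would be something like `torch.cuda.amp.autocast` rather than the recording stand-in, and whether this accounts for the memory difference reported here would still need to be confirmed with the BoringModel reproduction.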

Please reproduce using the BoringModel

To Reproduce

Use the following BoringModel and post here

I performed a second check with the Apex backend (level O1), and it does not run out of memory.

Expected behavior

Validation/testing should not run out of memory.

Environment

Note: Bugs with code are solved faster! The Colab Notebook should be made public!

You can get the script and run it with:

wget https://raw.githubusercontent.com/PyTorchLightning/pytorch-lightning/master/tests/collect_env_details.py
# For security purposes, please check the contents of collect_env_details.py before running it.
python collect_env_details.py

  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

@justusschock

Metadata

Labels

bug (Something isn't working), help wanted (Open to be worked on)