Description
🐛 Bug
After updating pytorch-lightning from 1.0.8 to 1.1.0/1.1.1, using 16-bit precision destroys performance.
In my actual object-detection code, the losses start out roughly a factor of 4 larger than with 32-bit precision, or with 16-bit precision under pl 1.0.8.
They converge to a much higher value, and the resulting model loses its detection capabilities completely.
To replicate this, I tested the pl notebooks: 06-cifar10-baseline.ipynb shows the same behavior, with the classification accuracy dropping to chance level when switching from 32-bit to 16-bit precision.
I also integrated it into the BoringModel notebook, and the problem occurs in Google Colab as well.
Please reproduce using the BoringModel and post here
https://colab.research.google.com/drive/1FqXG9Xw9gVZxnwiGnjsHpAtb-vUqFaob?usp=sharing
To Reproduce
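The Colab notebook linked above contains the full reproduction. Below is a minimal sketch of the comparison, assuming a single GPU; the model and dataset are stand-ins in the spirit of the BoringModel, not the exact notebook code.

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.layer.parameters(), lr=0.1)


def run(precision):
    model = BoringModel()
    train_loader = DataLoader(RandomDataset(32, 64), batch_size=2)
    trainer = pl.Trainer(
        gpus=1,
        precision=precision,  # 32 behaves as before; 16 shows the degraded losses on 1.1.x
        max_epochs=1,
    )
    trainer.fit(model, train_loader)


# Compare the logged train_loss between the two runs.
run(precision=32)
run(precision=16)
```

With pytorch-lightning 1.0.8 both runs behave comparably; with 1.1.0/1.1.1 the `precision=16` run shows the inflated losses described above.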
Expected behavior
Same performance for 32 and 16 bit.
Environment
- CUDA:
- GPU:
- Tesla P100-PCIE-16GB
- available: True
- version: 10.1
- Packages:
- numpy: 1.18.5
- pyTorch_debug: True
- pyTorch_version: 1.7.0+cu101
- pytorch-lightning: 1.1.1
- tqdm: 4.41.1
- System:
- OS: Linux
- architecture:
- 64bit
- processor: x86_64
- python: 3.6.9
- version: #1 SMP Thu Jul 23 08:00:38 PDT 2020