performance loss from 1.0.8 to 1.1.* when using 16 bit precision #5159

Closed
@immanuelweber

Description

🐛 Bug

After updating pytorch-lightning from 1.0.8 to 1.1.0/1.1.1, using 16 bit precision destroys performance.
In my object detection code, the initial losses are about a factor of 4 larger than with 32 bit precision, or with 16 bit precision on pl 1.0.8.
They converge to a much higher value, and the resulting model completely loses its detection capability.
To replicate this, I tested the pl notebooks: 06-cifar10-baseline.ipynb shows the same problem, with classification accuracy dropping to chance level when switching from 32 to 16 bit.
I also integrated it into the BoringModel notebook, and the problem reproduces on Google Colab.

Please reproduce using the BoringModel and post here

https://colab.research.google.com/drive/1FqXG9Xw9gVZxnwiGnjsHpAtb-vUqFaob?usp=sharing

To Reproduce
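
A minimal sketch in the spirit of the BoringModel (the model and dataset below are illustrative placeholders, not the exact code from the linked Colab): the only relevant change is `precision=16` vs `precision=32` in the `Trainer`.

```python
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    # toy dataset standing in for the real data
    def __init__(self, size, length):
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    model = BoringModel()
    train_loader = DataLoader(RandomDataset(32, 64), batch_size=2)
    # precision=16 shows the degraded losses on 1.1.0/1.1.1;
    # precision=32 (or 16 bit on 1.0.8) behaves as expected
    trainer = pl.Trainer(gpus=1, precision=16, max_epochs=1)
    trainer.fit(model, train_loader)
```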

Expected behavior

Same performance with 32 bit and 16 bit precision.

Environment

  • CUDA:
    • GPU:
      • Tesla P100-PCIE-16GB
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.1.1
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
• version: #1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

Metadata

Labels

bug (Something isn't working), help wanted (Open to be worked on), priority: 0 (High priority task)
