Hi! Thanks for your contribution, great first issue!
josiahbjorgaard changed the title from "PearsonCorrCoeff returns nan when on GPU and input is of type torch.float16 or torch.bfloat16." to "PearsonCorrCoeff returns nan when input is of type torch.float16 or torch.bfloat16." on Apr 21, 2023.
Hi @josiahbjorgaard,
I created PR #1729, which tries to prevent ending up with nan in the PearsonCorrCoeff calculation by improving robustness when you evaluate on a single batch, as in your example. That is the good news.
However, that does not mean that the metric is correct in half precision for your example. When running your example with the new implementation I get:
which is a large difference. However, because the input has both very few samples and very similar values, the metric will not work in low precision in this case. If we just look at the mean and variance of the pred tensor:
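The printed comparison from the original comment is not preserved in this capture. A sketch of the kind of check being described, using illustrative near-constant stand-in values rather than the actual pred tensor from the issue, and assuming a CUDA device as in the reported environment, would be:

```python
import torch

# Illustrative stand-in for a pred tensor whose values are all very close
# together (the actual tensor from the issue is not preserved).
pred_fp32 = torch.tensor([0.4984, 0.4987, 0.4985, 0.4986], device="cuda")
pred_fp16 = pred_fp32.to(torch.float16)

# float16 carries only ~3 significant decimal digits, so values this close lose
# most of their distinguishing bits; the squared deviations that feed the
# variance can even underflow to zero, which later turns into nan in the metric.
print(pred_fp32.mean(), pred_fp32.var())
print(pred_fp16.mean(), pred_fp16.var())
```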
as you can see, even the mean and variance are not equal in half and float precision. In PearsonCorrCoeff we end up multiplying several of these quantities together, which in the end leads to the accumulation of such errors in the final metric.
Since PearsonCorrCoeff is a global metric, I highly recommend only calculating it at the end of an epoch (or at least over a large number of samples) and in float precision.
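One way to follow that recommendation (an illustrative sketch, not code from the original discussion) is to keep the metric itself in float32, cast the half-precision outputs before each update, and call compute() only once the epoch is done:

```python
import torch
from torchmetrics import PearsonCorrCoef

# Keep the metric states in float32 even if the model runs in half precision.
metric = PearsonCorrCoef()

# Hypothetical loop standing in for an epoch of half-precision batch outputs.
for _ in range(10):
    pred = torch.randn(32).half()    # stand-in for half-precision model output
    target = torch.randn(32).half()
    # Cast to float32 before updating so the accumulated statistics stay precise.
    metric.update(pred.float(), target.float())

# Compute once at the end of the epoch, over all accumulated samples.
print(metric.compute())
```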
🐛 Bug
PearsonCorrCoeff returns nan when input is of type torch.float16 or torch.bfloat16 and all values are close.
To Reproduce
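The original reproduction snippet is not preserved in this capture. A minimal sketch of the setup described in the report (a handful of near-identical predictions, half precision, on GPU) might look like the following; the tensor values are illustrative stand-ins rather than the reporter's actual data, and the class is spelled `PearsonCorrCoef` in torchmetrics itself:

```python
import torch
from torchmetrics import PearsonCorrCoef

# Illustrative stand-ins: the predictions are all very close together, which is
# the condition described in the report.
pred = torch.tensor([0.4984, 0.4987, 0.4985, 0.4986])
target = torch.tensor([0.51, 0.54, 0.52, 0.53])

# Full (float32) precision on GPU: returns a finite correlation coefficient.
metric = PearsonCorrCoef().to("cuda")
print(metric(pred.cuda(), target.cuda()))

# Half precision on GPU: reported to return nan for inputs like these.
metric_half = PearsonCorrCoef().to(device="cuda", dtype=torch.float16)
print(metric_half(pred.to(device="cuda", dtype=torch.float16),
                  target.to(device="cuda", dtype=torch.float16)))
```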
Expected behavior
Environment
Python version: 3.10.9
Torch version: 1.12.1
TorchMetrics version: 0.11.1
GPU device name: Tesla T4
CUDA Version: 11.4
Additional context
When running in a training loop, I found that some fraction (~30%) of steps would not produce a nan when using torch.float16 or bfloat16, while the other ~70% would.
This seems to occur because the values in pred above are not very different (changing one value of pred to be more different from the rest produces a correct PCC); however, I think the metric should still be computable in half precision given the standard deviation of pred shown above.