PearsonCorrCoeff returns nan when input is of type torch.float16 or torch.bfloat16. #1725

Status: Closed · Fixed by #1729
josiahbjorgaard opened this issue Apr 21, 2023 · 2 comments
Labels: bug / fix (Something isn't working), help wanted (Extra attention is needed)
Milestone: v1.0.0

Comments

josiahbjorgaard commented Apr 21, 2023

🐛 Bug

PearsonCorrCoef returns nan when the input is of type torch.float16 or torch.bfloat16 and all input values are close to one another.

To Reproduce

import torch
import torchmetrics as tm

pcc = tm.regression.PearsonCorrCoef().to("cuda")
pred = torch.tensor([0.4746, 0.4805, 0.4766, 0.4805, 0.4766, 0.4805, 0.4785, 0.4824, 0.4805], dtype=torch.float16).to("cuda")
target = torch.tensor([0.0336, 0.3676, 0.6302, 0.7192, 0.2295, 0.2886, 0.6302, 0.7096, 0.0208], dtype=torch.float16).to("cuda")
print(pcc(pred, target))                                      # tensor(nan, device='cuda:0')
print(pcc(pred.to(torch.float32), target.to(torch.float32)))  # tensor(0.3720, device='cuda:0')

Expected behavior

The half-precision call should return a finite correlation close to the float32 result (≈0.3720) rather than nan.

Environment

Python version: 3.10.9
Torch version: 1.12.1
TorchMetrics version: 0.11.1
GPU device name: Tesla T4
CUDA Version: 11.4

Additional context

When running in a training loop I found that roughly 30% of steps produced a valid number with torch.float16 or torch.bfloat16, while the other ~70% returned nan.
This seems to happen because the values in pred above are very close to one another (making a single value of pred clearly different from the rest yields a correct PCC). Even so, I would expect the metric to be computable in half precision given the standard deviation of the pred shown above.
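
For illustration, a small sketch along those lines (the value 0.25 for pred_spread[0] is just an arbitrary choice to widen the spread, and it assumes the same CUDA setup as above):

import torch
import torchmetrics as tm

pcc = tm.regression.PearsonCorrCoef().to("cuda")
pred = torch.tensor([0.4746, 0.4805, 0.4766, 0.4805, 0.4766, 0.4805, 0.4785, 0.4824, 0.4805], dtype=torch.float16).to("cuda")
target = torch.tensor([0.0336, 0.3676, 0.6302, 0.7192, 0.2295, 0.2886, 0.6302, 0.7096, 0.0208], dtype=torch.float16).to("cuda")

# Widen the spread of pred by a single element, as described above.
pred_spread = pred.clone()
pred_spread[0] = 0.25  # clearly separated from the remaining ~0.48 values

print(pcc(pred_spread, target))  # expected to be finite even in float16, since the variance
                                 # of pred_spread is now well above half-precision rounding error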

josiahbjorgaard added the bug / fix (Something isn't working) and help wanted (Extra attention is needed) labels on Apr 21, 2023
github-actions commented

Hi! Thanks for your contribution, great first issue!

josiahbjorgaard changed the title from "PearsonCorrCoeff returns nan when on GPU and input is of type torch.float16 or torch.bfloat16." to "PearsonCorrCoeff returns nan when input is of type torch.float16 or torch.bfloat16." on Apr 21, 2023
SkafteNicki added this to the v1.0.0 milestone on Apr 23, 2023
SkafteNicki (Member) commented

Hi @josiahbjorgaard,
I created PR #1729, which tries to prevent ending up with nan in the PearsonCorrCoef calculation by improving robustness when you evaluate on a single batch, as in your example. That is the good news.

However, that does not mean that the metric is correct in half precision for your example. When running your example with the new implementation I get:

tensor(0.3378, device='cuda:0')  # torch.float16
tensor(0.3718, device='cuda:0')  # torch.float32

which is a large difference. Because the input consists of very few samples whose values are very similar, the metric will not be accurate in low precision in this case. If we just look at the mean and variance of the pred tensor:

pred.half().mean()  # tensor(0.4790, device='cuda:0', dtype=torch.float16)
pred.float().mean()  # tensor(0.4789, device='cuda:0')

pred.half().var()  # tensor(6.4373e-06, device='cuda:0', dtype=torch.float16)
pred.float().var()  # tensor(6.4638e-06, device='cuda:0')

as you can see, even the mean and variance are not equal in half and float precision. In PearsonCorrCoef we end up multiplying several of these quantities together, which in the end leads to an accumulation of such errors in the final metric.
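
To illustrate where the nan itself comes from, here is a simplified sketch (not the exact torchmetrics update code): the variance is effectively formed as a difference of two accumulated sums that are nearly equal for your pred tensor, and at float16 resolution that difference can round to zero or below, so the sqrt in the denominator of the correlation produces nan.

import torch

# Values from the reproduction above; move the tensors to .cuda() if some
# float16 ops are unavailable on your CPU build.
pred = torch.tensor([0.4746, 0.4805, 0.4766, 0.4805, 0.4766, 0.4805,
                     0.4785, 0.4824, 0.4805])

def n_times_var(x):
    # n * variance written as a difference of two accumulated sums -- the
    # cancellation-prone form that running-statistics updates reduce to.
    n = x.numel()
    mean = x.sum() / n
    return (x * x).sum() - n * mean * mean

print(n_times_var(pred.float()))  # small but positive, on the order of 5e-5
print(n_times_var(pred.half()))   # at float16 resolution this can collapse to 0.0 or even
                                  # go negative; a sqrt of that in the denominator gives nan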

Since PearsonCorrCoef is a global metric, I highly recommend only computing it at the end of an epoch (or at least over a large number of samples) and in float precision.
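
In practice that pattern looks roughly like the following (a usage sketch, not code from the PR; the random batches just stand in for half-precision training output):

import torch
import torchmetrics as tm

device = "cuda" if torch.cuda.is_available() else "cpu"
pcc = tm.regression.PearsonCorrCoef().to(device)

# Stand-in for one epoch of half-precision training output: a few float16 batches.
batches = [(torch.rand(32, device=device).half(), torch.rand(32, device=device).half())
           for _ in range(10)]

for pred, target in batches:
    # Upcast before updating so the metric state is accumulated in float32.
    pcc.update(pred.float(), target.float())

print(pcc.compute())  # a single global correlation over every sample seen this epoch
pcc.reset()           # clear the accumulated state before the next epoch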
