PearsonCorrCoeff returns nan when input is of type torch.float16 or torch.bfloat16. #1725

Status: Closed · Fixed by #1729
josiahbjorgaard opened this issue Apr 21, 2023 · 2 comments
Labels: bug / fix (Something isn't working), help wanted (Extra attention is needed)
Milestone: v1.0.0

Comments

josiahbjorgaard commented Apr 21, 2023

🐛 Bug

PearsonCorrCoef returns nan when the input is of type torch.float16 or torch.bfloat16 and all input values are close to one another.

To Reproduce

import torch
import torchmetrics as tm

pcc = tm.regression.PearsonCorrCoef().to("cuda")
pred = torch.tensor([0.4746, 0.4805, 0.4766, 0.4805, 0.4766, 0.4805, 0.4785, 0.4824, 0.4805], dtype=torch.float16).to("cuda")
target = torch.tensor([0.0336, 0.3676, 0.6302, 0.7192, 0.2295, 0.2886, 0.6302, 0.7096, 0.0208], dtype=torch.float16).to("cuda")
print(pcc(pred, target))                                      # tensor(nan, device='cuda:0')
print(pcc(pred.to(torch.float32), target.to(torch.float32)))  # tensor(0.3720, device='cuda:0')

Expected behavior

The half-precision call should return a finite correlation close to the float32 result (≈0.3720) rather than nan.

Environment

Python version: 3.10.9
Torch version: 1.12.1
TorchMetrics version: 0.11.1
GPU device name: Tesla T4
CUDA Version: 11.4

Additional context

When running in a training loop I found that roughly 30% of steps produced a valid number with torch.float16 or torch.bfloat16, while the other ~70% returned nan.
This seems to happen because the values in pred above are very close to one another (making a single value of pred clearly different from the rest yields a correct PCC). Even so, I would expect the metric to be computable in half precision given the standard deviation of the pred shown above.
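
For illustration, a small sketch along those lines (the value 0.25 for pred_spread[0] is just an arbitrary choice to widen the spread, and it assumes the same CUDA setup as above):

import torch
import torchmetrics as tm

pcc = tm.regression.PearsonCorrCoef().to("cuda")
pred = torch.tensor([0.4746, 0.4805, 0.4766, 0.4805, 0.4766, 0.4805, 0.4785, 0.4824, 0.4805], dtype=torch.float16).to("cuda")
target = torch.tensor([0.0336, 0.3676, 0.6302, 0.7192, 0.2295, 0.2886, 0.6302, 0.7096, 0.0208], dtype=torch.float16).to("cuda")

# Widen the spread of pred by a single element, as described above.
pred_spread = pred.clone()
pred_spread[0] = 0.25  # clearly separated from the remaining ~0.48 values

print(pcc(pred_spread, target))  # expected to be finite even in float16, since the variance
                                 # of pred_spread is now well above half-precision rounding error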

josiahbjorgaard added the bug / fix (Something isn't working) and help wanted (Extra attention is needed) labels on Apr 21, 2023
github-actions commented

Hi! Thanks for your contribution, great first issue!

josiahbjorgaard changed the title from "PearsonCorrCoeff returns nan when on GPU and input is of type torch.float16 or torch.bfloat16." to "PearsonCorrCoeff returns nan when input is of type torch.float16 or torch.bfloat16." on Apr 21, 2023
SkafteNicki added this to the v1.0.0 milestone on Apr 23, 2023
SkafteNicki (Member) commented

Hi @josiahbjorgaard,
I created PR #1729, which tries to prevent ending up with nan in the PearsonCorrCoef calculation by improving robustness when you evaluate on a single batch, as in your example. That is the good news.

However, that does not mean that the metric is correct in half precision for your example. When running your example with the new implementation I get:

tensor(0.3378, device='cuda:0')  # torch.float16
tensor(0.3718, device='cuda:0')  # torch.float32

which is a large difference. Because the input consists of very few samples whose values are very similar, the metric will not be accurate in low precision in this case. If we just look at the mean and variance of the pred tensor:

pred.half().mean()  # tensor(0.4790, device='cuda:0', dtype=torch.float16)
pred.float().mean()  # tensor(0.4789, device='cuda:0')

pred.half().var()  # tensor(6.4373e-06, device='cuda:0', dtype=torch.float16)
pred.float().var()  # tensor(6.4638e-06, device='cuda:0')

as you can see, even the mean and variance are not equal in half and float precision. In PearsonCorrCoef we end up multiplying several of these quantities together, which in the end leads to an accumulation of such errors in the final metric.
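
To illustrate where the nan itself comes from, here is a simplified sketch (not the exact torchmetrics update code): the variance is effectively formed as a difference of two accumulated sums that are nearly equal for your pred tensor, and at float16 resolution that difference can round to zero or below, so the sqrt in the denominator of the correlation produces nan.

import torch

# Values from the reproduction above; move the tensors to .cuda() if some
# float16 ops are unavailable on your CPU build.
pred = torch.tensor([0.4746, 0.4805, 0.4766, 0.4805, 0.4766, 0.4805,
                     0.4785, 0.4824, 0.4805])

def n_times_var(x):
    # n * variance written as a difference of two accumulated sums -- the
    # cancellation-prone form that running-statistics updates reduce to.
    n = x.numel()
    mean = x.sum() / n
    return (x * x).sum() - n * mean * mean

print(n_times_var(pred.float()))  # small but positive, on the order of 5e-5
print(n_times_var(pred.half()))   # at float16 resolution this can collapse to 0.0 or even
                                  # go negative; a sqrt of that in the denominator gives nan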

Since PearsonCorrCoef is a global metric, I highly recommend only computing it at the end of an epoch (or at least over a large number of samples) and in float precision.
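
In practice that pattern looks roughly like the following (a usage sketch, not code from the PR; the random batches just stand in for half-precision training output):

import torch
import torchmetrics as tm

device = "cuda" if torch.cuda.is_available() else "cpu"
pcc = tm.regression.PearsonCorrCoef().to(device)

# Stand-in for one epoch of half-precision training output: a few float16 batches.
batches = [(torch.rand(32, device=device).half(), torch.rand(32, device=device).half())
           for _ in range(10)]

for pred, target in batches:
    # Upcast before updating so the metric state is accumulated in float32.
    pcc.update(pred.float(), target.float())

print(pcc.compute())  # a single global correlation over every sample seen this epoch
pcc.reset()           # clear the accumulated state before the next epoch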
