[WIP]Use float64 to reduce differences between cpu and gpu result in model test #7114

Conversation

@YosuaMichael (Contributor) commented Jan 19, 2023

As noted in #7098, some model tests are currently failing due to precision issues. I tried using a real image and real weights to resolve this (see https://gist.github.com/YosuaMichael/7f542cdf5e051db758e462a2a8cc50fd), but due to floating-point precision the CPU and GPU results still show a fairly significant gap.

Since the root problem is floating-point precision rather than randomness, this PR tries to resolve the issue by using float64 instead.

Note: credit to @lezcano, who gave me the idea of using float64 :)

cc @pmeier

@YosuaMichael self-assigned this Jan 19, 2023
@YosuaMichael changed the title from "[WIP]Use float64 to reduce differences between cpu and gpu result" to "[WIP]Use float64 to reduce differences between cpu and gpu result in model test" on Jan 19, 2023

@lezcano (Contributor) left a comment

Let's also change the device and the dtype at the same time to elide one unnecessary copy :)

model.eval().to(device=dev)
x = _get_image(input_shape=input_shape, real_image=real_image, device=dev)
# We use float64 (.double()) to reduce differences between cpu and gpu result
model.eval().to(device=dev).double()

Suggested change:
- model.eval().to(device=dev).double()
+ model.eval().to(device=dev, dtype=torch.double)
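
For illustration, a minimal sketch of the pattern this suggestion describes (the model and input shape are placeholders, not the actual test code): passing device and dtype to a single .to() call converts the model in one copy instead of two.

import torch
import torchvision.models as models

dev = "cuda" if torch.cuda.is_available() else "cpu"

# One conversion instead of .to(device=dev) followed by .double():
model = models.resnet18(weights=None).eval().to(device=dev, dtype=torch.double)
x = torch.rand(1, 3, 224, 224, device=dev, dtype=torch.double)
out = model(x)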

x = _get_image(input_shape=input_shape, real_image=real_image, device=dev)
# We use float64 (.double()) to reduce differences between cpu and gpu result
model.eval().to(device=dev).double()
x = _get_image(input_shape=input_shape, real_image=real_image, device=dev).double()

Extend _get_image to also get a dtype arg.
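
A rough, hypothetical sketch of that extension (the actual _get_image helper in the test suite also handles the real-image path, which is omitted here):

import torch

def _get_image(input_shape, real_image, device, dtype=torch.float32):
    # real_image branch omitted in this sketch; the actual helper would load
    # a bundled test image instead of generating random data.
    return torch.rand(input_shape, device=device, dtype=dtype)

# Callers could then drop the extra .double() call:
# x = _get_image(input_shape=input_shape, real_image=real_image, device=dev, dtype=torch.float64)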

@NicolasHug (Member) commented Jan 20, 2023

Thanks for the PR @YosuaMichael

"since the root problem is floating-point precision"

Are we 100% sure about that? IIUC our tests started failing in early December with a CPU/GPU discrepancy in results. But there was no change that we could identify in torchvision that could cause such a difference. So, while increasing the precision to float64 might make our tests green again:

  • is this actually addressing the core underlying issue?
  • are we still testing our models with the settings users will actually use? i.e., users mostly rely on float32 or even lower precision rather than float64

I fear that by switching to float64, we might just be silencing the symptoms rather than addressing the actual disease.

@YosuaMichael (Contributor, Author) commented:

@NicolasHug I think if float64 passes the test while float32 does not, then it is definitely a precision problem (most likely due to different floating-point implementations on CPU and GPU).
As of now our model tests are very sensitive: they might pass with one seed but fail with another. I think this sensitivity sometimes lets a benign change in core break our tests.

Another thing we could do is use real weights and a real image for the model tests. However, from my experiment, although this approach is not sensitive to randomness, it requires increasing the GPU tolerance to around 3e-2 for all current model tests to pass, which I think is a pretty big tolerance. It does have the advantage of running faster (since float64 will make the run slower).
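
For context, the check under discussion is roughly of this shape (a simplified sketch, not the actual test_models.py code; the model, input, and tolerances are placeholders): the same model is run on CPU and GPU and the outputs are compared with a tolerance, which is exactly the knob being debated.

import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
x = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    out_cpu = model(x)
    if torch.cuda.is_available():
        out_gpu = model.to("cuda")(x.to("cuda")).cpu()
        # A tolerance around 1e-3 is what currently fails; real weights and
        # images in float32 would need roughly 3e-2 to pass everywhere.
        torch.testing.assert_close(out_cpu, out_gpu, rtol=1e-3, atol=1e-3)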

@YosuaMichael (Contributor, Author) commented:

Note: the tests are currently still failing because:

  1. Some of the bigger models run out of memory when we use float64.
  2. We haven't used float64 for the video models yet (so they still hit the precision problem).

(2) is easy to resolve, although some video models will then OOM as well.
For (1), I don't have a good solution in mind yet; I'll keep thinking about it, and any ideas are welcome.

@NicolasHug (Member) commented:

"@NicolasHug I think if float64 passes the test while float32 does not, then it is definitely a precision problem"

Yes, we have a precision problem. But switching to float64 is akin to increasing the tolerance on a flaky test: we're masking the flakiness, not addressing it at its root. In some cases that's fine, if what we're testing is inherently flaky, but I'm not sure that's the case here. While these model tests have always had some degree of flakiness, they started being consistently red sometime in December (according to our offline discussions). That means something happened back then, and we should try to find and address it.

@YosuaMichael (Contributor, Author) commented:

"While these model tests have always had some degree of flakiness, they started being consistently red sometime in December (according to our offline discussions). That means something happened back then, and we should try to find and address it."

Okay, let me try to find out what happened in December before going forward with this PR. My current hypothesis is that some change definitely caused the tests to go consistently red, but it might be a benign change (one that should be tolerated, but our model tests are too sensitive for that). To make sure, I agree we should first understand what changed; I'll dig more into that.

@YosuaMichael (Contributor, Author) commented Jan 23, 2023

@NicolasHug @pmeier I have investigated using an AWS cluster. TL;DR: on the AWS cluster, the test starts failing after we made the classification test precision stricter in #6380. But this might not correspond to the failing tests on CI, which I think might be caused by the change of machines when we moved to GHA (I can't really reproduce the same error on the AWS cluster).

Varying the PyTorch core version, then the torchvision version

First of all, my prime suspect was the PyTorch core version, since we haven't really changed anything in torchvision since December for the models or model tests. To test this, I manually downloaded .whl files from S3 (sample URL: https://s3.console.aws.amazon.com/s3/object/pytorch?region=us-east-1&prefix=whl/nightly/cu116/torch-1.14.0.dev20221208%2Bcu116-cp38-cp38-linux_x86_64.whl) and, if needed, unzipped them, changed the torchtriton dependency to pytorch-triton, and re-zipped them (the corresponding pytorch-triton for this example can be downloaded from https://s3.console.aws.amazon.com/s3/object/pytorch?region=us-east-1&prefix=whl/nightly/pytorch_triton-2.0.0%2B0d7e753227-cp38-cp38-linux_x86_64.whl).

I tried PyTorch core from the following dates: 20221214, 20221208, 20221119, 20221007, 20220626, and they all give the same test failure:

Mismatched elements: 8 / 50 (16.0%)
Greatest absolute difference: 0.011041641235351562 at index (0, 39) (up to 0.001 allowed)
Greatest relative difference: 0.0031437701451340443 at index (0, 45) (up to 0.001 allowed)

Based on this, I varied the torchvision version while keeping the PyTorch core version fixed at 20220626. I found that the test passes before PR #6380, which made the test stricter. This investigation actually shows that on the AWS cluster our model tests have been failing since August, but this failure didn't happen on CI, maybe due to different machines.

As of now, I still don't have a clear picture of what happened in December. My main suspect is the change of machines, which changes the model output since it is quite sensitive. I will investigate further (time-boxed to 1 day), but I think it will be difficult, since the issue can't be reproduced on the AWS cluster and experimentation will be much slower if we have to do it on CI.

@pmeier (Collaborator) commented Jan 24, 2023

Thanks for the investigation @YosuaMichael! So far it seems we tightened the tolerances too much in #6380. @NicolasHug suspected as much in #6380 (comment). Unless @YosuaMichael's upcoming investigation turns up conflicting results, it seems to me that we shouldn't go to float64 but just relax the tolerances a bit. If we are sure that our models and tests are fine and the failures actually come from different hardware, there should be no harm in doing so, and it circumvents the issues listed in #7114 (comment).

@NicolasHug (Member) commented:

Thanks a lot for the investigation, Yosua. It seems like:

a) we should revert #6380, which will hopefully "fix" the internal tests as well
b) our model tests are even worse than we thought, since they seem to be very machine-dependent (😢)

@YosuaMichael (Contributor, Author) commented:

An update on this issue: when I ran the same test on the latest nightly installed with conda, I got a slightly different error.
Here is the error using the latest nightly (20230124):

Mismatched elements: 10 / 50 (20.0%)
Greatest absolute difference: 0.01226806640625 at index (0, 39) (up to 0.001 allowed)
Greatest relative difference: 0.003982544439953874 at index (0, 49) (up to 0.001 allowed)

And here is the error using the nightly wheels from 20220626 to 20221214:

Mismatched elements: 8 / 50 (16.0%)
Greatest absolute difference: 0.011041641235351562 at index (0, 39) (up to 0.001 allowed)
Greatest relative difference: 0.0031437701451340443 at index (0, 45) (up to 0.001 allowed)

So there might be some change that caused the differences to shift.

Unfortunately, however, the wheel packages are broken after 20221215 (there is a detailed issue on this), and the conda packages from December are no longer available (the earliest I can get is 20230120: https://anaconda.org/pytorch-nightly/pytorch/files). Hence, for now it is quite difficult to know exactly what caused the change in the differences...

As suggested, for now I will try to revert #6380 and close this PR.

@YosuaMichael (Contributor, Author) commented:

I'm reopening this just to test whether using torch.no_grad can reduce the memory usage of the model test in test_classification_model.
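
For reference, a minimal sketch (with a placeholder model, not the actual test code) of the change being tested: running the forward pass under torch.no_grad so intermediate activations are not kept for autograd, which should lower peak memory.

import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval().to(dtype=torch.double)
x = torch.rand(1, 3, 224, 224, dtype=torch.double)

# Without no_grad, autograd keeps intermediate activations alive for a
# potential backward pass; disabling it frees them and lowers peak memory.
with torch.no_grad():
    out = model(x)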

@YosuaMichael reopened this Jan 26, 2023