[WIP]Use float64 to reduce differences between cpu and gpu result in model test #7114

Conversation

@YosuaMichael (Contributor) commented Jan 19, 2023

As noted in #7098, some model tests are currently failing due to precision issues. I tried using a real image and real weights to resolve this (see https://gist.github.com/YosuaMichael/7f542cdf5e051db758e462a2a8cc50fd), but due to floating-point precision the CPU and GPU results still show a fairly significant gap.

Since the root problem is floating-point precision rather than randomness, this PR tries to resolve the issue by using float64 instead.

Note: credit to @lezcano, who gave me the idea of using float64 :)

cc @pmeier

@YosuaMichael self-assigned this Jan 19, 2023
@YosuaMichael changed the title from "[WIP]Use float64 to reduce differences between cpu and gpu result" to "[WIP]Use float64 to reduce differences between cpu and gpu result in model test" on Jan 19, 2023

@lezcano (Contributor) left a comment

Let's also change the device and the dtype at the same time to elide one unnecessary copy :)

model.eval().to(device=dev)
x = _get_image(input_shape=input_shape, real_image=real_image, device=dev)
# We use float64 (.double()) to reduce differences between cpu and gpu result
model.eval().to(device=dev).double()

Suggested change:
- model.eval().to(device=dev).double()
+ model.eval().to(device=dev, dtype=torch.double)
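
For illustration, a minimal sketch of the pattern this suggestion describes (the model and input shape are placeholders, not the actual test code): passing device and dtype to a single .to() call converts the model in one copy instead of two.

import torch
import torchvision.models as models

dev = "cuda" if torch.cuda.is_available() else "cpu"

# One conversion instead of .to(device=dev) followed by .double():
model = models.resnet18(weights=None).eval().to(device=dev, dtype=torch.double)
x = torch.rand(1, 3, 224, 224, device=dev, dtype=torch.double)
out = model(x)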

x = _get_image(input_shape=input_shape, real_image=real_image, device=dev)
# We use float64 (.double()) to reduce differences between cpu and gpu result
model.eval().to(device=dev).double()
x = _get_image(input_shape=input_shape, real_image=real_image, device=dev).double()

Extend _get_image to also get a dtype arg.
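
A rough, hypothetical sketch of that extension (the actual _get_image helper in the test suite also handles the real-image path, which is omitted here):

import torch

def _get_image(input_shape, real_image, device, dtype=torch.float32):
    # real_image branch omitted in this sketch; the actual helper would load
    # a bundled test image instead of generating random data.
    return torch.rand(input_shape, device=device, dtype=dtype)

# Callers could then drop the extra .double() call:
# x = _get_image(input_shape=input_shape, real_image=real_image, device=dev, dtype=torch.float64)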

@NicolasHug (Member) commented Jan 20, 2023

Thanks for the PR @YosuaMichael

"since the root problem is floating-point precision"

Are we 100% sure about that? IIUC our tests started failing in early December with a CPU/GPU discrepancy in results. But there was no change that we could identify in torchvision that could cause such a difference. So, while increasing the precision to float64 might make our tests green again:

  • is this actually addressing the core underlying issue?
  • are we still testing our models with the settings users will actually use? i.e., users mostly rely on float32 or even lower precision rather than float64

I fear that by switching to float64, we might just be silencing the symptoms rather than addressing the actual disease.

@YosuaMichael (Contributor, Author) commented:

@NicolasHug I think if float64 passes the test while float32 does not, then it is definitely a precision problem (most likely due to different floating-point implementations on CPU and GPU).
As of now our model tests are very sensitive: they might pass with one seed but fail with another. I think this sensitivity sometimes lets a benign change in core break our tests.

Another thing we could do is use real weights and a real image for the model tests. However, from my experiment, although this approach is not sensitive to randomness, it requires increasing the GPU tolerance to around 3e-2 for all current model tests to pass, which I think is a pretty big tolerance. It does have the advantage of running faster (since float64 will make the run slower).
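
For context, the check under discussion is roughly of this shape (a simplified sketch, not the actual test_models.py code; the model, input, and tolerances are placeholders): the same model is run on CPU and GPU and the outputs are compared with a tolerance, which is exactly the knob being debated.

import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
x = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    out_cpu = model(x)
    if torch.cuda.is_available():
        out_gpu = model.to("cuda")(x.to("cuda")).cpu()
        # A tolerance around 1e-3 is what currently fails; real weights and
        # images in float32 would need roughly 3e-2 to pass everywhere.
        torch.testing.assert_close(out_cpu, out_gpu, rtol=1e-3, atol=1e-3)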

@YosuaMichael (Contributor, Author) commented:

Note: the tests are currently still failing because:

  1. Some of the bigger models run out of memory when we use float64.
  2. We haven't used float64 for the video models yet (so they still hit the precision problem).

(2) is easy to resolve, although some video models will then OOM as well.
For (1), I don't have a good solution in mind yet; I'll keep thinking about it, and any ideas are welcome.

@NicolasHug (Member) commented:

"@NicolasHug I think if float64 passes the test while float32 does not, then it is definitely a precision problem"

Yes, we have a precision problem. But switching to float64 is akin to increasing the tolerance on a flaky test: we're masking the flakiness, not addressing it at its root. In some cases that's fine, if what we're testing is inherently flaky, but I'm not sure that's the case here. While these model tests have always had some degree of flakiness, they started being consistently red sometime in December (according to our offline discussions). That means something happened back then, and we should try to find and address it.

@YosuaMichael (Contributor, Author) commented:

"While these model tests have always had some degree of flakiness, they started being consistently red sometime in December (according to our offline discussions). That means something happened back then, and we should try to find and address it."

Okay, let me try to find out what happened in December before going forward with this PR. My current hypothesis is that some change definitely caused the tests to go consistently red, but it might be a benign change (one that should be tolerated, but our model tests are too sensitive for that). To make sure, I agree we should first understand what changed; I'll dig more into that.

@YosuaMichael (Contributor, Author) commented Jan 23, 2023

@NicolasHug @pmeier I have investigated using an AWS cluster. TL;DR: on the AWS cluster, the test starts failing after we made the classification test precision stricter in #6380. But this might not correspond to the failing tests on CI, which I think might be caused by the change of machines when we moved to GHA (I can't really reproduce the same error on the AWS cluster).

Varying the PyTorch core version, then the torchvision version

First of all, my prime suspect was the PyTorch core version, since we haven't really changed anything in torchvision since December for the models or model tests. To test this, I manually downloaded .whl files from S3 (sample URL: https://s3.console.aws.amazon.com/s3/object/pytorch?region=us-east-1&prefix=whl/nightly/cu116/torch-1.14.0.dev20221208%2Bcu116-cp38-cp38-linux_x86_64.whl) and, if needed, unzipped them, changed the torchtriton dependency to pytorch-triton, and re-zipped them (the corresponding pytorch-triton for this example can be downloaded from https://s3.console.aws.amazon.com/s3/object/pytorch?region=us-east-1&prefix=whl/nightly/pytorch_triton-2.0.0%2B0d7e753227-cp38-cp38-linux_x86_64.whl).

I tried PyTorch core from the following dates: 20221214, 20221208, 20221119, 20221007, 20220626, and they all give the same test failure:

Mismatched elements: 8 / 50 (16.0%)
Greatest absolute difference: 0.011041641235351562 at index (0, 39) (up to 0.001 allowed)
Greatest relative difference: 0.0031437701451340443 at index (0, 45) (up to 0.001 allowed)

Based on this, I varied the torchvision version while keeping the PyTorch core version fixed at 20220626. I found that the test passes before PR #6380, which made the test stricter. This investigation actually shows that on the AWS cluster our model tests have been failing since August, but this failure didn't happen on CI, maybe due to different machines.

As of now, I still don't have a clear picture of what happened in December. My main suspect is the change of machines, which changes the model output since it is quite sensitive. I will investigate further (time-boxed to 1 day), but I think it will be difficult, since the issue can't be reproduced on the AWS cluster and experimentation will be much slower if we have to do it on CI.

@pmeier (Collaborator) commented Jan 24, 2023

Thanks for the investigation @YosuaMichael! So far it seems we tightened the tolerances too much in #6380. @NicolasHug suspected as much in #6380 (comment). Unless @YosuaMichael's upcoming investigation turns up conflicting results, it seems to me that we shouldn't go to float64 but just relax the tolerances a bit. If we are sure that our models and tests are fine and the failures actually come from different hardware, there should be no harm in doing so, and it circumvents the issues listed in #7114 (comment).

@NicolasHug (Member) commented:

Thanks a lot for the investigation, Yosua. It seems like:

a) we should revert #6380, which will hopefully "fix" the internal tests as well
b) our model tests are even worse than we thought, since they seem to be very machine-dependent (😢)

@YosuaMichael (Contributor, Author) commented:

An update on this issue: when I ran the same test on the latest nightly installed with conda, I got a slightly different error.
Here is the error using the latest nightly (20230124):

Mismatched elements: 10 / 50 (20.0%)
Greatest absolute difference: 0.01226806640625 at index (0, 39) (up to 0.001 allowed)
Greatest relative difference: 0.003982544439953874 at index (0, 49) (up to 0.001 allowed)

And here is the error using the nightly wheels from 20220626 to 20221214:

Mismatched elements: 8 / 50 (16.0%)
Greatest absolute difference: 0.011041641235351562 at index (0, 39) (up to 0.001 allowed)
Greatest relative difference: 0.0031437701451340443 at index (0, 45) (up to 0.001 allowed)

So there might be some change that caused the differences to shift.

Unfortunately, however, the wheel packages are broken after 20221215 (there is a detailed issue on this), and the conda packages from December are no longer available (the earliest I can get is 20230120: https://anaconda.org/pytorch-nightly/pytorch/files). Hence, for now it is quite difficult to know exactly what caused the change in the differences...

As suggested, for now I will try to revert #6380 and close this PR.

@YosuaMichael (Contributor, Author) commented:

I'm reopening this just to test whether using torch.no_grad can reduce the memory usage of the model test in test_classification_model.
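
For reference, a minimal sketch (with a placeholder model, not the actual test code) of the change being tested: running the forward pass under torch.no_grad so intermediate activations are not kept for autograd, which should lower peak memory.

import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval().to(dtype=torch.double)
x = torch.rand(1, 3, 224, 224, dtype=torch.double)

# Without no_grad, autograd keeps intermediate activations alive for a
# potential backward pass; disabling it frees them and lowers peak memory.
with torch.no_grad():
    out = model(x)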

@YosuaMichael reopened this Jan 26, 2023