[WIP] Use float64 to reduce differences between cpu and gpu result in model test #7114
Conversation
Let's also change the device and the dtype at the same time to elide one unnecessary copy :)
model.eval().to(device=dev)
x = _get_image(input_shape=input_shape, real_image=real_image, device=dev)
# We use float64 (.double()) to reduce differences between cpu and gpu result
model.eval().to(device=dev).double()
Suggested change:
-model.eval().to(device=dev).double()
+model.eval().to(device=dev, dtype=torch.double)
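For context, a minimal sketch of why the combined call is preferable (not the actual test code; the nn.Linear model is just a stand-in): chaining .to(device=dev).double() moves the float32 parameters first and converts them to float64 in a second pass, while a single .to(device=dev, dtype=torch.double) does both at once and elides the intermediate copy.

```python
import torch
import torch.nn as nn

dev = "cuda" if torch.cuda.is_available() else "cpu"

# Two-step version: moves float32 parameters to `dev`, then converts them to
# float64 in a second pass, materializing an intermediate float32 copy.
m1 = nn.Linear(4, 2).eval().to(device=dev).double()

# Combined version from the suggestion: device and dtype change in a single
# .to() call, so there is no intermediate float32 copy on `dev`.
m2 = nn.Linear(4, 2).eval().to(device=dev, dtype=torch.double)

assert next(m2.parameters()).dtype == torch.float64
```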
x = _get_image(input_shape=input_shape, real_image=real_image, device=dev)
# We use float64 (.double()) to reduce differences between cpu and gpu result
model.eval().to(device=dev).double()
x = _get_image(input_shape=input_shape, real_image=real_image, device=dev).double()
Extend _get_image to also get a dtype arg.
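A rough sketch of what that could look like; the real _get_image helper in torchvision's model tests has a different body (for example, it can load an actual image when real_image=True), so treat everything below except the added dtype parameter as a placeholder.

```python
import torch

# Sketch only: the real helper's body differs; the point here is threading a
# `dtype` argument through to the returned tensor.
def _get_image(input_shape, real_image, device, dtype=None):
    # Placeholder for the real_image branch of the actual helper.
    img = torch.rand(input_shape)
    return img.to(device=device, dtype=dtype)

x = _get_image(input_shape=(1, 3, 224, 224), real_image=False,
               device="cpu", dtype=torch.float64)
assert x.dtype == torch.float64
```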
Thanks for the PR @YosuaMichael
Are we 100% sure about that? IIUC our tests started failing in early December with a CPU/GPU discrepancy in the results, but there was no change that we could identify in torchvision that could cause such a difference. So, while increasing the precision to float64 might make our tests green again:
I fear that by switching to float64, we might just be silencing the symptoms rather than addressing the actual disease.
@NicolasHug I think if … Another thing we can do is to use real weights and a real image for the model tests. However, from my experiments, although this approach is not sensitive to randomness, it requires increasing the GPU tolerance to around 3e-2 to make all of the current model tests pass, which I think is a pretty big tolerance. This approach does have the advantage of running faster (since I think …)
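For illustration, a hedged sketch of that alternative: run the same pretrained model on CPU and GPU and compare the outputs with roughly the 3e-2 tolerance mentioned above. The model choice and the random input are arbitrary stand-ins, not the actual test code.

```python
import torch
import torchvision

# Real pretrained weights, as in the "real weight and image" experiment.
weights = torchvision.models.ResNet18_Weights.DEFAULT
model_cpu = torchvision.models.resnet18(weights=weights).eval()
x = torch.rand(1, 3, 224, 224)  # stand-in for a real preprocessed image

with torch.no_grad():
    out_cpu = model_cpu(x)
    if torch.cuda.is_available():
        model_gpu = torchvision.models.resnet18(weights=weights).eval().cuda()
        out_gpu = model_gpu(x.cuda())
        # ~3e-2 is the tolerance level mentioned in the comment above.
        torch.testing.assert_close(out_gpu.cpu(), out_cpu, rtol=3e-2, atol=3e-2)
```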
Notes: currently the tests are still failing because:
(2) is easy to resolve, although some video models will cause OOM as well after that.
Yes, we have a precision problem. But switching to float64 is akin to increasing the tolerance on a flaky test: we're masking the flakiness, but we're not addressing it at its root. In some cases that's fine, if what we're testing is inherently flaky, but I'm not sure that's our case here. While these model tests have always had some degree of flakiness, they started being consistently red at some point in December (according to our offline discussions). That means something happened back then, and we should try to find and address it.
Ok, let me try to find out what happened in December before going forward with this PR then. My current hypothesis is that there is definitely some change that caused the tests to become consistently red, but it might be a benign change (a change that should be tolerated, but our model tests are too sensitive to it). To make sure of this, I agree we should first understand what changed; I will dig more into that.
@NicolasHug @pmeier I have made an investigation using the AWS cluster. The TLDR for the AWS cluster: the tests start failing after we made the classification test precision stricter in this commit: #6380. But this might not correspond to the failing tests on CI, which I think might be caused by the change of machines when we moved to GHA (I can't really reproduce the same error on the AWS cluster).
Varying the pytorch core version, then the torchvision version
First of all, my prime suspect is the pytorch core version, since we haven't really changed anything in torchvision for the models or the model tests since December. To check this, I manually downloaded the nightly wheels and tried pytorch core from the following dates: 20221214, 20221208, 20221119, 20221007, 20220626, and they all give the same test failure:
Looking at this, I tried to vary the torchvision version while keeping the pytorch core version fixed at 20220626. From this, I found that the tests pass before PR #6380, which made the tests stricter. This investigation shows that on the AWS cluster, our model tests have been failing since August, but this failure didn't happen on CI, maybe due to the different machine. As of now, I still don't have a clear picture of what happened in December. My main suspect is the change of machine, which changes the model output, since the tests are quite sensitive. I will try a further investigation (but will time-box it to 1 day); however, I think it will be difficult, since the failure can't be reproduced on the AWS cluster and running the experiments on CI would slow things down a lot.
Thanks for the investigation @YosuaMichael! So far it seems we tightened the tolerances too much in #6380. @NicolasHug suspected as much in #6380 (comment). Conflicting results from @YosuaMichael's upcoming investigations aside, it seems to me that we shouldn't go to float64, but just relax the tolerances a bit. If we are sure that our models and tests are fine and the failures actually come from different hardware, there should be no harm in doing so, and it circumvents the issues listed in #7114 (comment).
Thanks a lot for the investigation Yosua. It seems like:
a) we should revert #6380, which will hopefully "fix" the internal tests as well
An update on this issue: when I ran the same test on the latest nightly version installed with conda, I got a slightly different error.
Meanwhile, here is the error using the nightly wheels from 20220626 to 20221214:
So there might be some change that caused the difference to shift. Unfortunately, however, the wheel packages are broken after 20221215 (here is the detailed issue on this), and the conda nightly packages are not archived back to December (the earliest I can get is 20230120: https://anaconda.org/pytorch-nightly/pytorch/files). Hence, for now it is quite difficult to know what exactly caused the change in the differences... As suggested, for now I will try to revert #6380 and close this PR.
I reopen just to test if using …
As specified in #7098, currently some model tests are failing due to precision. I have tried to use a real image and real weights to resolve this issue (see https://gist.github.com/YosuaMichael/7f542cdf5e051db758e462a2a8cc50fd); however, due to floating point precision, the CPU and GPU results still have a quite significant gap.
And since the root problem is floating point precision rather than randomness, in this PR we try to resolve the issue by using float64 instead.
Note: credits to @lezcano, who gave me the insight to use float64 :)
cc @pmeier
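For reference, a minimal sketch of the approach this PR takes (not the actual test code; the model, input, and tolerances are illustrative): cast both the model and the input to float64 before running on CPU and GPU, so the remaining gap mostly reflects genuine numerical differences rather than float32 rounding.

```python
import copy

import torch
import torchvision

# Build a double-precision model and input on CPU.
model_cpu = torchvision.models.resnet18(weights=None).eval().to(dtype=torch.double)
x = torch.rand(1, 3, 224, 224, dtype=torch.double)

with torch.no_grad():
    out_cpu = model_cpu(x)
    if torch.cuda.is_available():
        # Copy the model so both runs use identical double-precision weights.
        model_gpu = copy.deepcopy(model_cpu).to(device="cuda")
        out_gpu = model_gpu(x.to("cuda"))
        # float64 typically permits a much tighter tolerance than float32.
        torch.testing.assert_close(out_gpu.cpu(), out_cpu, rtol=1e-5, atol=1e-5)
```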