test_nms_cuda is flaky #2035
@fmassa Do you know what happened there?
I have seen this error before, and I thought I had fixed it in #1556, but maybe a corner case is still slipping through sometimes...
Happened to be looking at this today. The mismatch occurs because sometimes one of the CPU or CUDA results includes box 0 or box 999 (depending on their relative scores) while the other does not. I see the problem on both x86 and Power, and can force it consistently across them by seeding the torch RNG with specific values; for example, I'll see a failure with particular seed values.
The problem seems to be in the new test-setup code that constructs the first and last boxes with an IoU right at the test threshold. I'm able to avoid the problem by adjusting that so the 1st/last box IoU is always a bit above the IoU test threshold, by introducing an epsilon value:
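A sketch of that kind of adjustment (the helper name and the box construction below are illustrative, not the actual torchvision test code):

```python
import torch

def boxes_with_iou_above(iou_thresh, n=1000, eps=1e-5):
    # n random, well-formed boxes in (x1, y1, x2, y2) format
    boxes = torch.rand(n, 4) * 100
    boxes[:, 2:] += boxes[:, :2]
    # Rebuild the last box as a horizontal shift of the first box so that their
    # IoU lands slightly ABOVE the threshold (iou_thresh + eps) rather than
    # exactly at it; that way a tiny CPU/CUDA rounding difference can't flip
    # the keep/suppress decision for this pair.
    x1, y1, x2, y2 = boxes[0]
    w = x2 - x1
    target = iou_thresh + eps
    # For a box and its horizontal shift by d: IoU = (w - d) / (w + d),
    # so d = w * (1 - target) / (1 + target) hits the target IoU.
    d = w * (1 - target) / (1 + target)
    boxes[-1] = torch.stack([x1 + d, y1, x2 + d, y2])
    return boxes
```

Targeting `iou_thresh + eps` instead of the threshold itself leaves a little margin, so float32 rounding differences between the CPU and CUDA IoU computations can't land this pair on opposite sides of the threshold.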
That reliably gets rid of the tensor-length-difference complaints, but the test still sometimes asserts; I'm looking at that now. You can get a better look by driving the test routines with something like:
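A sketch of such a harness (an illustration, not the original snippet; it assumes the test boils down to comparing torchvision.ops.nms on CPU and CUDA across many seeds):

```python
import torch
import torchvision

def make_boxes(n=1000):
    # random, well-formed boxes in (x1, y1, x2, y2) format plus random scores
    boxes = torch.rand(n, 4) * 100
    boxes[:, 2:] += boxes[:, :2]
    return boxes, torch.rand(n)

for seed in range(1000):
    torch.manual_seed(seed)
    boxes, scores = make_boxes()
    keep_cpu = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
    keep_gpu = torchvision.ops.nms(boxes.cuda(), scores.cuda(), iou_threshold=0.5).cpu()
    if keep_cpu.numel() != keep_gpu.numel():
        print(f"seed {seed}: CPU kept {keep_cpu.numel()} boxes, CUDA kept {keep_gpu.numel()}")
```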
And it looks like the remaining assert I'm seeing is an ordering problem when two of the boxes which make the cut have very similar scores.
In the failing cases, the scores of the two boxes involved are seemingly identical.
Not sure of the best way to avoid that. We could compare the results without respect to the original ordering (e.g. by sorting, or by converting to sets), but maybe checking the ordering is nice to have. I'm also not sure of a sensible way to ensure all the boxes have distinct scores, short of assigning them incrementally, and that doesn't seem ideal.
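For illustration, an order-insensitive comparison could look like this (a sketch of the idea, not the code from the eventual PR):

```python
import torch
import torchvision

torch.manual_seed(0)  # arbitrary seed, just for the example
boxes = torch.rand(1000, 4) * 100
boxes[:, 2:] += boxes[:, :2]          # well-formed (x1, y1, x2, y2) boxes
scores = torch.rand(1000)

keep_cpu = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
keep_gpu = torchvision.ops.nms(boxes.cuda(), scores.cuda(), iou_threshold=0.5).cpu()

# Comparing the surviving indices as sets ignores reorderings caused by boxes
# with (near-)identical scores, while still requiring the same boxes to survive.
assert set(keep_cpu.tolist()) == set(keep_gpu.tolist())
```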
Oh, and I'm testing on Linux on both IBM Power and x86, so this isn't a Windows-specific problem, despite where the CI failure arose.
Put up a PR that adjusts the 1st/last box IoU construction (to ensure it's slightly over the threshold, so the pair should be consistently suppressed by both the CPU and CUDA implementations) and compares the surviving box lists as sets, without regard for ordering. Happy to rework either of those if other solutions are preferred.
Hmmm. It occurs to me now that the 2nd issue (ordering of boxes with similar scores) might be caused by the CUDA implementation not using a stable sort.
@hartb thanks for the PR! I think we should not try to guarantee that the sorting returns the same results between CPU and CUDA (PyTorch doesn't enforce this), but we should enforce that the number of returned elements is the same. I commented on the PR; let me know what you think.
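A minimal sketch of that weaker check (my reading of the suggestion, not necessarily how it was implemented in #2044): only the number of surviving boxes is compared.

```python
import torch
import torchvision

torch.manual_seed(0)  # arbitrary seed, just for the example
boxes = torch.rand(1000, 4) * 100
boxes[:, 2:] += boxes[:, :2]          # well-formed (x1, y1, x2, y2) boxes
scores = torch.rand(1000)

keep_cpu = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
keep_gpu = torchvision.ops.nms(boxes.cuda(), scores.cuda(), iou_threshold=0.5)

# Don't require identical ordering (tie order isn't guaranteed to match across
# devices); just require that the same number of boxes survive.
assert keep_cpu.numel() == keep_gpu.numel()
```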
This has been mitigated in #2044, but it would be great if we could understand more deeply why those discrepancies occur (and fix them if possible).
I looked at this some more and compared the overlap (IoU) values computed by the CPU and CUDA implementations directly. They usually agree exactly, but they can (very rarely) differ from one another by up to 4 ULP; I gathered statistics across 1000 seeds.
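For reference, a small sketch (mine, not from the comment) of measuring such a difference in ULPs and of how even a 1-ULP difference can flip the keep/suppress decision, assuming NMS suppresses only when the IoU strictly exceeds the threshold:

```python
import numpy as np

def ulp_diff(a, b):
    # Reinterpret the float32 bit patterns as int32; for finite values of the
    # same sign, the difference counts representable floats between a and b.
    ia = np.array(a, dtype=np.float32).view(np.int32)
    ib = np.array(b, dtype=np.float32).view(np.int32)
    return abs(int(ia) - int(ib))

thresh = np.float32(0.5)
iou_cpu = np.float32(0.5)                         # at the threshold: box survives
iou_gpu = np.nextafter(iou_cpu, np.float32(1.0))  # one ULP higher: box is suppressed
print(ulp_diff(iou_cpu, iou_gpu))                 # 1
print(iou_cpu > thresh, iou_gpu > thresh)         # False True -> CPU and CUDA disagree
```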
Even when the calculated overlaps differ, the testcase will still pass unless the two overlap values straddle the threshold (i.e. one overlap is greater than the threshold and the other is not). I also experimented with disabling NVCC's fused multiply-add optimization and its precise division optimization (via the corresponding nvcc compiler flags) to see how each affects the agreement. To summarize: the remaining CPU/CUDA differences come down to those code-generation choices.
So maybe the right move is to disable the offending optimization(s) when building the kernel...
If that sounds OK, I can put up a PR for it.
@hartb thanks a lot for getting to the bottom of this! Your proposal sounds good to me; I agree that disabling that optimization is a reasonable way to go.
Submitted: #2072
Thanks @hartb! This has been fixed now.
🐛 Bug
https://app.circleci.com/pipelines/github/pytorch/vision/2097/workflows/661fd235-202a-4c88-be4d-f8af378c195f/jobs/110511