[TEST][FLAKY] test_detection_models #7363
Comments
So is this a flaky segfault? I'm not sure how to reproduce this. Are there common characteristics among the nodes that failed (which GPU, which CUDA version, etc.)? If one node fails, does it fail consistently?
Based on my current read, it could also be a timeout. In that case we need to look into whether the detection model itself runs too long and whether we could build a faster unit test.
It shouldn't take more than a few minutes. It should run much faster than the TF SSD test, which takes about 20 minutes (run in a separate thread, see tvm/tests/python/frontend/tensorflow/test_forward.py lines 3090 to 3095 at dda8f5d).
Disabled one of the rewrites in #7365. The other rewrite added in #7346 should be harmless. We'll see.
Would this also affect the production test cases? I haven't really seen it there, though.
Closed by #7365.
I ran MaskRCNN with that rewrite countless times and didn't see any problem. Weird.
Before we move too far on from this, can someone compare with our experience fixing this bug: #7010? If there is some hard-to-find libomp conflict popping up again, it will crash with "Aborted" as seen in the first test.
Hmm, what I see in common is the use of PyTorch and the way CI dies. Other than that, I don't know if the two flaky issues are related.
It seems #7365 didn't solve the problem: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/main/503/pipeline
Looking at https://ci.tlcpack.ai/job/tvm/job/main/, after #7346 was merged things had been stable for some time (https://ci.tlcpack.ai/job/tvm/job/main/487/ to https://ci.tlcpack.ai/job/tvm/job/main/491/). The crash began to happen after #7354, which added const folding on …. The other rewrite in #7346 does involve ….
One thing I'm curious about: the crash at the CI https://ci.tlcpack.ai/job/tvm/job/main/ is very frequent, while CI runs for open PRs don't crash (except https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-7360/1/pipeline/, which was a timeout), as far as I can see. I wonder why.
In the meantime, is there any way to disable pytest capturing while we debug the CI runs? (#7017) This might give more information and confirm or rule out libomp as the cause.
Not sure if I understand the question. Do you want to just run the specific test in CI? If so, we can check out the docker image and run pytest -v tests/python/frontend/pytorch/test_object_detection.py
Essentially yes, I think checking out the CI image and running that would help.
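For reference, a minimal sketch of running that test with pytest output capturing disabled (the -s / --capture=no flag), assuming it is executed from the repo root inside the checked-out CI docker image; the test path is the one from the comment above:

```python
# Minimal sketch: run the flaky test with output capturing disabled so that a
# libomp "Aborted" message (if any) shows up directly in the log.
import pytest

pytest.main([
    "-s",   # disable pytest output capturing (same as --capture=no)
    "-v",   # verbose test names, as in the command suggested above
    "tests/python/frontend/pytorch/test_object_detection.py",
])
```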
@tqchen @zhiics I investigated this issue and I think I have a solution. Running one script reproduces the segfault, while running another does not. I wondered why one script results in the segfault while the other doesn't, when the two scripts do essentially the same thing. And then I remembered the ONNX + PyTorch segfault problem, which was caused by PyTorch being imported earlier than ONNX; see onnx/onnx#2394 (comment).
Conclusion: when using PyTorch 1.7 + CUDA, always import TVM first, before torch. Fix in PR #7380.
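To illustrate the workaround, a minimal sketch of the intended import order; the relay and torchvision imports are assumptions, shown only because the flaky test exercises torchvision detection models such as Mask R-CNN:

```python
# Import-order workaround for the PyTorch 1.7 + CUDA segfault described above:
# import TVM first so its bundled runtime libraries are loaded before PyTorch's.
import tvm
from tvm import relay

import torch          # imported only after TVM
import torchvision    # detection models (e.g. MaskRCNN) used by the test
```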
Hmm, still failing. But now it clearly says all tests have passed, and then it died: https://ci.tlcpack.ai/job/tvm/job/main/506/console. So it is an improvement over the previous situation. I wonder if this is a driver problem on the post-merge CI node, since I don't see this error happening on CI runs for the open PRs.
This can be closed now: https://ci.tlcpack.ai/job/tvm/job/main/
Seems to be quite frequent in recent PRs, might be related to #7346
https://ci.tlcpack.ai/job/tvm/job/main/495/execution/node/449/log/
https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-7360/1/pipeline (this one is a timeout)