[TEST][FLAKY] test_op_grad_level2.py::test_conv2d_grad #7010

Closed
tqchen opened this issue Dec 1, 2020 · 12 comments

Comments

@tqchen
Member

tqchen commented Dec 1, 2020

https://ci.tlcpack.ai/job/tvm/job/main/245/execution/node/218/log/

@tqchen
Member Author

tqchen commented Dec 1, 2020

cc @altanh @jroesch @antinucleon: it would be great if you could take a look.

@altanh
Contributor

altanh commented Dec 1, 2020

I suspect a recent PR broke something. This is the error: tests/python/relay/test_op_grad_level2.py::test_conv2d_grad Fatal Python error: Aborted.

It doesn't look to me like a numerical issue with the gradient.

@tqchen
Member Author

tqchen commented Dec 2, 2020

@altanh
Contributor

altanh commented Dec 2, 2020

I can't reproduce this locally on the current main branch.

@altanh
Contributor

altanh commented Dec 2, 2020

Per discussion with @tkonolige, we're pretty sure the abort is caused by libomp conflicts between different third-party libraries (e.g. PyTorch and ONNX).

@tkonolige
Contributor

tkonolige commented Dec 2, 2020

The error message is:

OMP: Error #15: Initializing libomp.dylib, but found libiomp5.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/

PyTorch loading:

dyld: loaded: <F7FFAF24-7A9F-35EA-B715-F2A2F250F575> /Users/tristan/Library/Python/3.8/lib/python/site-packages/torch/lib/libtorch_global_deps.dylib
dyld: loaded: <52F67CC7-A4B0-3F4D-A80D-7DC28D4A776A> /Users/tristan/Library/Python/3.8/lib/python/site-packages/torch/lib/../.dylibs/libiomp5.dylib

ONNX loading:

dyld: loaded: <C903042A-EFCF-3557-AB7C-155BA03165D0> /usr/local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.so
dyld: loaded: <FF7BABED-D8CA-3F78-BCE2-F0C293919D70> /usr/local/opt/libomp/lib/libomp.dylib
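The two load logs above show two different OpenMP runtimes (libiomp5 and libomp) ending up in one process. A minimal sketch of checking such a log automatically follows. It assumes the log lines have the `dyld: loaded: <UUID> /path` shape shown above (on macOS this kind of output can be produced by setting `DYLD_PRINT_LIBRARIES=1`, though the thread does not say how it was captured here). The function names and the list of runtime names are illustrative, not part of TVM or any library.

```python
import re
from pathlib import PurePosixPath

# Basenames of common OpenMP runtimes: LLVM (libomp), Intel (libiomp5), GNU (libgomp).
OPENMP_RUNTIME = re.compile(r"^lib(omp|iomp5|gomp)(\.|$)")
DYLD_LOADED = re.compile(r"^dyld: loaded: <[0-9A-Fa-f-]+> (.+)$")


def paths_from_dyld_log(lines):
    """Extract library paths from 'dyld: loaded: <UUID> /path' log lines."""
    paths = []
    for line in lines:
        match = DYLD_LOADED.match(line.strip())
        if match:
            paths.append(match.group(1))
    return paths


def openmp_runtimes(loaded_paths):
    """Map each distinct OpenMP runtime name to the first path that provides it."""
    found = {}
    for path in loaded_paths:
        match = OPENMP_RUNTIME.match(PurePosixPath(path).name)
        if match:
            found.setdefault(match.group(1), path)
    return found


log = [
    "dyld: loaded: <F7FFAF24-7A9F-35EA-B715-F2A2F250F575> /Users/tristan/Library/Python/3.8/lib/python/site-packages/torch/lib/libtorch_global_deps.dylib",
    "dyld: loaded: <52F67CC7-A4B0-3F4D-A80D-7DC28D4A776A> /Users/tristan/Library/Python/3.8/lib/python/site-packages/torch/lib/../.dylibs/libiomp5.dylib",
    "dyld: loaded: <C903042A-EFCF-3557-AB7C-155BA03165D0> /usr/local/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_pybind11_state.so",
    "dyld: loaded: <FF7BABED-D8CA-3F78-BCE2-F0C293919D70> /usr/local/opt/libomp/lib/libomp.dylib",
]
runtimes = openmp_runtimes(paths_from_dyld_log(log))
if len(runtimes) > 1:
    print("Conflicting OpenMP runtimes:", sorted(runtimes))
```

More than one key in the result means more than one OpenMP runtime was loaded, which is the condition OMP Error #15 complains about.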

@altanh
Contributor

altanh commented Dec 2, 2020

Relevant issue on onnxruntime GitHub: microsoft/onnxruntime#5369

@tqchen
Member Author

tqchen commented Dec 3, 2020

It would be great to propose a fix, given that the flaky error happens quite frequently.

Is this related to the fact that we are using PyTorch for gradient testing? Ideally we should move that to a separate test suite. By default, we should use numerical gradient checking that is independent of other frameworks.
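As an illustration of framework-independent gradient checking, here is a minimal sketch using central finite differences in plain Python. It is not TVM's test utility; the function names, step size, and tolerance are assumptions chosen for the example.

```python
def numerical_grad(f, x, eps=1e-6):
    """Approximate the gradient of scalar function f at point x (a list of floats)."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        # Central difference: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def check_grad(f, analytic_grad, x, tol=1e-4):
    """Return True if analytic_grad(x) agrees with the numerical gradient of f at x."""
    numeric = numerical_grad(f, x)
    analytic = analytic_grad(x)
    return all(
        abs(a - n) <= tol * max(1.0, abs(a), abs(n))
        for a, n in zip(analytic, numeric)
    )


def sum_of_squares(x):
    return sum(v * v for v in x)


# The analytic gradient of sum(x_i^2) is 2*x_i.
print(check_grad(sum_of_squares, lambda x: [2 * v for v in x], [1.0, -2.0, 0.5]))
```

A check like this needs no reference framework, so it does not pull a second OpenMP runtime into the test process.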

@altanh
Contributor

altanh commented Dec 3, 2020

I agree. I think we should first address #7017 to confirm that it's the same failure happening on CI, and then look into removing the dependencies. Where we can't remove a dependency (as with test_onnx.py and test_dlpack.py), I propose sandboxing by dependency, so that files with conflicting dependencies always run in separate pytest processes. If a single file uses two conflicting dependencies, I'm not sure how to proceed; we may need to build the dependencies with a special libomp configuration on the CI machine (at least we can cache this?)
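The sandboxing idea can be sketched as one pytest process per dependency group. The grouping below is hypothetical (the thread does not say which file depends on which library, and the bare file names stand in for their real paths), and `pytest_commands` and `run_sandboxed` are illustrative names, not existing TVM scripts.

```python
import subprocess
import sys


def pytest_commands(groups):
    """Build one 'python -m pytest <files...>' command per dependency group."""
    return {
        name: [sys.executable, "-m", "pytest", *files]
        for name, files in groups.items()
    }


def run_sandboxed(groups):
    """Run each group in its own process, so libraries loaded by one group
    never share an address space with those loaded by another."""
    return {
        name: subprocess.run(cmd).returncode
        for name, cmd in pytest_commands(groups).items()
    }


# Hypothetical grouping, for illustration only.
groups = {
    "onnx": ["test_onnx.py"],
    "torch": ["test_dlpack.py", "tests/python/relay/test_op_grad_level2.py"],
}
for name, cmd in pytest_commands(groups).items():
    print(name, " ".join(cmd[1:]))
```

Calling `run_sandboxed(groups)` would return one exit code per group. This does not help a single file that imports two conflicting libraries, which is the open case mentioned above.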

altanh added a commit to altanh/tvm that referenced this issue Dec 3, 2020
@altanh
Contributor

altanh commented Dec 3, 2020

@tkonolige found that the pytest-xdist package supports passing a --forked argument to pytest. This seems to fix the problem when running the contrib tests.
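To illustrate why forking helps, here is a conceptual sketch of running a callable in a forked child process on a POSIX system. It is not how the `--forked` option is implemented, and `run_forked` is a hypothetical helper. The point is that an abort in the child, such as the one triggered by the OpenMP runtime, is reported to the parent instead of killing it.

```python
import os
import signal


def run_forked(fn):
    """Run fn() in a forked child; return (exit_code, signal_number).

    Exactly one element of the pair is None. POSIX only (uses os.fork).
    """
    pid = os.fork()
    if pid == 0:
        # Child process: run the callable, then exit immediately so that
        # control never returns into the parent's code path.
        try:
            fn()
            os._exit(0)
        except BaseException:
            os._exit(1)
    # Parent process: wait for the child and decode how it ended.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return None, os.WTERMSIG(status)
    return os.WEXITSTATUS(status), None


ok = run_forked(lambda: None)    # child exits normally
crashed = run_forked(os.abort)   # child dies with SIGABRT; parent keeps running
print(ok, crashed)
```

Each forked test gets its own process, so a library that one test loads or initializes after the fork does not affect the others.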

tqchen pushed a commit that referenced this issue Dec 3, 2020
@altanh
Contributor

altanh commented Dec 3, 2020

I think we should keep this issue but rename it to something like "dependency libomp conflict" (or open a new one), since the conflict might arise again in the future.

trevor-m pushed a commit to trevor-m/tvm that referenced this issue Dec 4, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this issue Dec 4, 2020
electriclilies pushed a commit to electriclilies/tvm that referenced this issue Feb 18, 2021
tqchen closed this as completed Apr 26, 2021
@tqchen
Member Author

tqchen commented Apr 26, 2021

Closing for now, as the original flaky issue is fixed.
