
🐛 [Bug] Tests are not being linked properly, fail with 'symbol lookup error' #408

Closed
borisfom opened this issue Mar 20, 2021 · 8 comments
Labels
question Further information is requested

Comments

@borisfom
Collaborator

Bug Description

To Reproduce

Steps to reproduce the behavior:

  1. bazel test //tests --compilation_mode=dbg --test_output=errors --jobs=4 --runs_per_test=5

You will see all the tests fail. I am using stock PyTorch 1.7.1.

boris@snikolaev-DGXStation:/git/TRTorch$ /home/boris/.cache/bazel/_bazel_boris/c6ee020343103959b26b654eb14e89ac/execroot/TRTorch/bazel-out/k8-dbg/bin/tests/core/conversion/converters/test_linear.runfiles/TRTorch/tests/core/conversion/converters/test_linear
/home/boris/.cache/bazel/_bazel_boris/c6ee020343103959b26b654eb14e89ac/execroot/TRTorch/bazel-out/k8-dbg/bin/tests/core/conversion/converters/test_linear.runfiles/TRTorch/tests/core/conversion/converters/test_linear: symbol lookup error: /home/boris/.cache/bazel/_bazel_boris/c6ee020343103959b26b654eb14e89ac/execroot/TRTorch/bazel-out/k8-dbg/bin/tests/core/conversion/converters/../../../../_solib_k8/libcore_Sutil_Slibtrt_Uutil.so: undefined symbol: _ZN3c105ErrorC1ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
boris@snikolaev-DGXStation:/git/TRTorch$ nm /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so | grep _ZN3c105ErrorC1ENS_14SourceLocationENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
boris@snikolaev-DGXStation:~/git/TRTorch$ nm /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so | grep SourceLocation
000000000004f130 T _ZN3c1014WarningHandler7processERKNS_14SourceLocationERKSsb
0000000000051870 T _ZN3c105ErrorC1ENS_14SourceLocationESs
0000000000051870 T _ZN3c105ErrorC2ENS_14SourceLocationESs
000000000004f210 T _ZN3c107Warning4warnENS_14SourceLocationERKSsb
00000000000527c0 t _ZN3c10lsERSoRKNS_14SourceLocationE

Expected behavior

Tests run (or at least start up) successfully.

Environment

Build information about the TRTorch compiler can be found by turning on debug messages

  • PyTorch Version (e.g., 1.0): 1.7.1
  • CPU Architecture:
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, libtorch, source): pip
  • Build command you used (if compiling from source): bazel test //tests --compilation_mode=dbg --test_output=errors --jobs=4 --runs_per_test=5
  • Are you using local sources or building from archives: local
  • Python version: 3.6
  • CUDA version: 11
  • GPU models and configuration:
  • Any other relevant information:

Additional context

@borisfom borisfom added the bug Something isn't working label Mar 20, 2021
@narendasan
Collaborator

This is probably an ABI issue. Are you putting the Python-package PyTorch libs in your LD_LIBRARY_PATH? Try using the libtorch distribution downloaded by Bazel. Usually I do something like this:

export LD_LIBRARY_PATH=$(pwd)/bazel-TRTorch/external/libtorch/lib/:$(pwd)/bazel-TRTorch/external/cudnn/lib64/:$(pwd)/bazel-TRTorch/external/tensorrt/lib/:/usr/local/cuda/lib64/:$LD_LIBRARY_PATH
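To confirm the mismatch, the same nm check from the report above can be run against both libraries (the paths below are the ones from this thread and may differ on your machine). The cxx11-ABI libtorch exports the std::__cxx11::basic_string variant of the c10::Error constructor, while the pip-installed pre-cxx11 build only exports the old std::string variant:

# check which c10::Error constructor symbols each libc10.so exports
nm -D /usr/local/lib/python3.6/dist-packages/torch/lib/libc10.so | grep _ZN3c105ErrorC1
nm -D bazel-TRTorch/external/libtorch/lib/libc10.so | grep _ZN3c105ErrorC1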

@narendasan
Collaborator

You could also throw in the compile flag --config=pre_cxx11_abi if you need to use the Python PyTorch distribution.
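For example, the repro command from above would then become (assuming the pre_cxx11_abi config defined in the repo's .bazelrc):

bazel test //tests --compilation_mode=dbg --test_output=errors --jobs=4 --runs_per_test=5 --config=pre_cxx11_abi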

@borisfom
Collaborator Author

Yes, I was using libtorch from the installed package. I pointed to the downloaded distribution instead. It starts up now, but crashes later. Does that ring any bells? The active CUDA on my box is 11.2, so I had to add libnvrtc from 11.1 to the path; probably that did not go well:

Running main() from gmock_main.cc
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from CompiledModuleForwardIsCloseSuite/ModuleTests
[ RUN ] CompiledModuleForwardIsCloseSuite/ModuleTests.SerializedModuleIsStillCorrect/0
error loading the model
DEBUG: [TRTorch - Debug Build] - TRTorch Version: 0.3.0
Using TensorRT Version: 7.2.2.3
PyTorch built with:

  • GCC 5.4
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.1 Product Build 20200208 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  • OpenMP 201307 (a.k.a. OpenMP 4.0)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.2
  • Built with CUDA Runtime 11.0
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80
  • CuDNN 8.0.3
  • Magma 2.5.2
  • Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=1 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

unknown file: Failure
C++ exception with description "ivalue INTERNAL ASSERT FAILED at "../torch/csrc/jit/api/object.cpp":19, please report a bug to PyTorch.
Exception raised from _ivalue at ../torch/csrc/jit/api/object.cpp:19 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x69 (0x7fef59562b89 in /home/boris/git/TRTorch/bazel-TRTorch/external/libtorch/lib/libc10.so)
frame #1: torch::jit::Object::_ivalue() const + 0x2c8 (0x7fef5c987318 in /home/boris/git/TRTorch/bazel-TRTorch/external/libtorch/lib/libtorch_cpu.so)
frame #2: trtorch::core::CompileGraph(torch::jit::Module const&, trtorch::core::CompileSpec) + 0x49 (0x7fefec9db1e2 in /home/boris/.cache/bazel/_bazel_boris/c6ee020343103959b26b654eb14e89ac/execroot/TRTorch/bazel-out/k8-dbg/bin/tests/modules/../../_solib_k8/libcore_Slibcore.so)
frame #3: + 0x3b1d9 (0x557e789111d9 in /home/boris/.cache/bazel/_bazel_boris/c6ee020343103959b26b654eb14e89ac/sandbox/linux-sandbox/798/execroot/TRTorch/bazel-out/k8-dbg/bin/tests/modules/test_serialization.runfiles/TRTorch/tests/modules/test_serialization)
fr

@narendasan
Collaborator

Hmm, yeah, with PyTorch 1.7.1 try using CUDA 11.0 libraries. You could also try an NGC container that has PyTorch built with 11.2.
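For example, something like the following pulls and starts an NGC PyTorch container (the tag and mount path here are illustrative; pick the release that matches the CUDA/cuDNN stack you need):

docker pull nvcr.io/nvidia/pytorch:21.02-py3
docker run --gpus all -it -v $(pwd)/TRTorch:/workspace/TRTorch nvcr.io/nvidia/pytorch:21.02-py3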

@narendasan narendasan added question Further information is requested and removed bug Something isn't working labels Mar 22, 2021
@borisfom
Collaborator Author

borisfom commented Mar 23, 2021

@narendasan: I have tried building and testing in a container based on 21.02; same result. I am using local cuDNN and TensorRT. I think we need to make sure that this fairly common configuration works.

@narendasan
Collaborator

narendasan commented Mar 23, 2021

Oh, I didn't realize you were running the test suite. Did you download the models for the tests? You can download them by running the hub.py script in //tests/modules.
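For example (run from the repo root; see tests/modules/README.md for the exact invocation):

cd tests/modules
python3 hub.py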

@borisfom
Collaborator Author

No, I did not; please do mention it in the README :)
It works now, thanks! I am getting timeouts on the elementwise tests; I guess TRT just takes too long to optimize.

@narendasan
Collaborator

Yeah, I guess we only mention it in https://github.com/NVIDIA/TRTorch/blob/master/tests/modules/README.md, but I'll add a note to the testing README. The timeout issue for the elementwise tests should be fixed in master; you just need to set the testing timeout to moderate, like we do here: https://github.com/NVIDIA/TRTorch/blob/d6a3c4561e62d7806b9190c935672ffeaf93e58d/tests/core/conversion/converters/converter_test.bzl#L15. We probably need to start breaking up that file.
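The change in master sets timeout = "moderate" on the test rules themselves; as a local workaround, Bazel's --test_timeout flag can also override the limit from the command line (the value is in seconds and applies to all tests in the invocation), for example:

bazel test //tests --test_output=errors --test_timeout=600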

@borisfom borisfom closed this as completed Apr 7, 2021