[Bug] PyTorch and TVM loading problem due to conflicting LLVM symbols #9362

masahi · 2021-10-25T11:18:53Z

Apparently, the new PyTorch release conflicts with symbols loaded by TVM, so the following trivial code crashes with "invalid pointer" and "Aborted (core dumped)" upon exit:

import tvm
import torch

We can work around this by swapping the import order, but as pointed out in #9349 (comment), this may not always be possible.
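To check whether a given environment is affected, the two import orders can be compared in fresh interpreter processes. This is only a minimal sketch: `run_imports` and `aborted` are hypothetical helper names, and it assumes a POSIX system, where a negative return code means the child process was killed by a signal.

```python
import signal
import subprocess
import sys


def run_imports(modules):
    """Import the given modules, in order, in a fresh interpreter.

    Returns the child's return code (0 means a clean exit).
    """
    code = "\n".join(f"import {name}" for name in modules)
    return subprocess.run([sys.executable, "-c", code]).returncode


def aborted(returncode):
    """True if the child was killed by SIGABRT (POSIX convention: -signal)."""
    return returncode == -signal.SIGABRT


# Usage, assuming both packages are installed:
#   aborted(run_imports(["tvm", "torch"]))   # order reported to crash on exit
#   aborted(run_imports(["torch", "tvm"]))   # swapped order
```

Running each order in its own process matters because the crash reported here happens at interpreter exit, so it cannot be caught from inside the same process.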

Another solution is to remove the use of RTLD_GLOBAL in

tvm/python/tvm/_ffi/base.py

Line 57 in dfe4ceb

lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_GLOBAL)

See related issues in other repos that moved away from using RTLD_GLOBAL.
dmlc/dgl#2255
pytorch/pytorch#28536
pytorch/pytorch#3059

Is there any particular reason we are using RTLD_GLOBAL? @tqchen @areusch
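For context, the difference comes down to the mode flag passed to `ctypes.CDLL`. On POSIX systems, `RTLD_GLOBAL` makes the library's symbols available for resolving symbols in libraries loaded afterwards, while `RTLD_LOCAL` keeps them private to that library. The following is a minimal sketch and not TVM's actual loader; `dlopen_mode` and `load_library` are hypothetical helper names.

```python
import ctypes


def dlopen_mode(share_symbols):
    """Pick the dlopen mode flag to pass to ctypes.CDLL."""
    return ctypes.RTLD_GLOBAL if share_symbols else ctypes.RTLD_LOCAL


def load_library(lib_path, share_symbols=False):
    """Load a shared library, keeping its symbols local unless asked otherwise."""
    return ctypes.CDLL(lib_path, dlopen_mode(share_symbols))


# Usage (the path is a placeholder, not a real location):
#   lib = load_library("/path/to/libtvm.so")                      # RTLD_LOCAL
#   lib = load_library("/path/to/libtvm.so", share_symbols=True)  # RTLD_GLOBAL
```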


tqchen · 2021-10-25T13:16:18Z

It would be good to find out which symbol conflicts (perhaps by linking things together) and resolve it (renaming the symbol on the TVM side if possible). Note that the same problem will appear in the future if we really attempt to link PyTorch for a deeper integration, so this would serve as a way to resolve that possible issue.

RTLD_GLOBAL provides some convenience: it gives plugin modules (that are loaded later) the symbols of libtvm_runtime without explicitly linking to it. We might need to rethink the plugin mechanism (e.g. vta) a bit if we decide to move away from it.

tqchen · 2021-10-25T13:22:51Z

To follow up a bit on this, we had a previous conflict with DGL which ended up being DLPack related, and we resolved it by prefixing those symbols with TVM.

Turning on https://github.com/apache/tvm/blob/main/CMakeLists.txt#L46 would also help alleviate the issue, since the visible symbols will be reduced to only those that are related to TVM_DLL.

I would watch those C symbols carefully, since most symbols are in the tvm namespace and should be fine.
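One way to look for candidate conflicts is to compare the dynamic symbols that two shared libraries export. The sketch below assumes GNU `nm` is on the PATH and uses its `-D` and `--defined-only` options; `parse_nm_output`, `exported_symbols`, and `shared_symbols` are hypothetical helper names, and the library paths in the usage comment are placeholders.

```python
import subprocess


def parse_nm_output(text):
    """Return the defined symbol names from `nm -D --defined-only` output."""
    symbols = set()
    for line in text.splitlines():
        parts = line.split()
        # Defined symbols appear as: <address> <type> <name>.
        # Undefined entries have no address column and are skipped.
        if len(parts) != 3:
            continue
        name = parts[2]
        # Drop a symbol-version suffix such as "@@GLIBCXX_3.4" if present.
        symbols.add(name.split("@")[0])
    return symbols


def exported_symbols(lib_path):
    """Run nm on a shared library and collect its defined dynamic symbols."""
    result = subprocess.run(
        ["nm", "-D", "--defined-only", lib_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_nm_output(result.stdout)


def shared_symbols(lib_a, lib_b):
    """Symbols defined by both libraries: candidates for a conflict."""
    return exported_symbols(lib_a) & exported_symbols(lib_b)


# Usage (placeholder paths):
#   shared_symbols("/path/to/libtvm.so", "/path/to/libtorch_cpu.so")
```

An overlap is not proof of a crash, since two libraries can legitimately define the same weak symbol, but it narrows down where to look.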

masahi · 2021-10-25T21:12:03Z

I can confirm that HIDE_PRIVATE_SYMBOLS=ON also fixes it. I think this is a good enough workaround for now cc @lhutton1 .

tqchen · 2021-10-25T22:00:45Z

@masahi can you also confirm which symbol it is?

masahi · 2021-10-26T02:54:25Z

I built libtvm.so with pytorch libs, no error occurred.

$ ldd libtvm.so 
	linux-vdso.so.1 (0x00007ffcd07dd000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ffb946a8000)
	libtorch_cpu.so => /home/masa/anaconda3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so (0x00007ffb7ec58000)
	libc10.so => /home/masa/anaconda3/lib/python3.8/site-packages/torch/lib/libc10.so (0x00007ffb7ebd2000)
	libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007ffb7e9f0000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ffb7e8a1000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007ffb7e886000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffb7e861000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffb7e66f000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ffb9644f000)
	libgomp-a34b3233.so.1 => /home/masa/anaconda3/lib/python3.8/site-packages/torch/lib/libgomp-a34b3233.so.1 (0x00007ffb7e445000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ffb7e43a000)

Looks like I need to dig deep. I agree that we should fix this problem for deeper PT + TVM integration in the future.

masahi · 2021-10-26T05:26:59Z

Hmm strange, on the environment where I tried HIDE_PRIVATE_SYMBOLS=ON above, I cannot reproduce the original failure anymore. And on the other environment, HIDE_PRIVATE_SYMBOLS=ON didn't fix the problem.

lhutton1 · 2021-10-27T12:56:10Z

set(HIDE_PRIVATE_SYMBOLS ON) didn't seem to work for me either :/

tqchen · 2021-10-28T14:22:36Z

It would be great to try gdb and catch the backtrace; normally it will give some evidence of where things went wrong.

lhutton1 · 2021-10-29T09:38:18Z

Here's the backtrace I receive from gdb:

(gdb) backtrace
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ffff7a22921 in __GI_abort () at abort.c:79
#2  0x00007ffff7a6b967 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7b98b0d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#3  0x00007ffff7a729da in malloc_printerr (str=str@entry=0x7ffff7b9a720 "munmap_chunk(): invalid pointer") at malloc.c:5342
#4  0x00007ffff7a79fbc in munmap_chunk (p=0x7fffffffbc18) at malloc.c:2846
#5  __GI___libc_free (mem=0x7fffffffbc28) at malloc.c:3127
#6  0x00007fff1dcafe86 in std::__detail::_Compiler<std::regex_traits<char> >::_Compiler(char const*, char const*, std::locale const&, std::regex_constants::syntax_option_type) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so
#7  0x00007fff1debc1c0 in torch::jit::SourceImporterImpl::attributeAssignmentSpecialHandlingHack(c10::QualifiedName const&, torch::jit::Assign const&) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so
#8  0x00007fff1debed4a in torch::jit::SourceImporterImpl::importClass(c10::QualifiedName const&, torch::jit::ClassDef const&, bool) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so
#9  0x00007fff1dec0313 in torch::jit::SourceImporterImpl::importNamedType(std::string const&, torch::jit::ClassDef const&) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so
#10 0x00007fff1dec08d1 in torch::jit::SourceImporterImpl::resolveType(std::string const&, torch::jit::SourceRange const&) ()
   from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so
#11 0x00007fff1dc36668 in torch::jit::ScriptTypeParser::parseTypeFromExpr(torch::jit::Expr const&) const () from /usr/local/lib/python3.6/dist-packages/torch/lib/libtorch_cpu.so

When running:

import tvm
import torch
torch.jit.load(<path-to-any-model>)

Is this of any help?

masahi · 2021-10-29T10:40:04Z

With the trivial code,

import tvm
import torch

I get this useless backtrace:

free(): invalid pointer

Thread 1 "python" received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7db7859 in __GI_abort () at abort.c:79
#2  0x00007ffff7e223ee in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7f4c285 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007ffff7e2a47c in malloc_printerr (str=str@entry=0x7ffff7f4a4ae "free(): invalid pointer") at malloc.c:5347
#4  0x00007ffff7e2bcac in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4173
#5  0x00007fffcfe92859 in ?? () from /lib/x86_64-linux-gnu/libLLVM-10.so.1
#6  0x00007ffff7ddba27 in __run_exit_handlers (status=0, listp=0x7ffff7f7d718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
    at exit.c:108
#7  0x00007ffff7ddbbe0 in __GI_exit (status=<optimized out>) at exit.c:139
#8  0x00007ffff7db90ba in __libc_start_main (main=0x55555566d460 <main>, argc=2, argv=0x7fffffffd4d8, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffd4c8) at ../csu/libc-start.c:342
#9  0x000055555573afe5 in _start () at ../sysdeps/x86_64/elf/start.S:103

tqchen · 2021-10-30T14:52:04Z

OK, dug a bit into this. I think I know the possible cause. This is because of a conflict of LLVM symbols (due to different versions of LLVM being used). PyTorch has also started to ship with LLVM. To avoid the problem, we need to do two things:

* Turn on static linking of LLVM. This will directly link LLVM code into libtvm without relying on a dynamic library (which creates global symbols).
  * set(USE_LLVM "/path/to/llvm-config --link-static")
* Turn on set(HIDE_PRIVATE_SYMBOLS ON). This will effectively hide the LLVM-related symbols when we load globally from PyTorch.

I did a quick experiment locally: when we turn both options ON, things are good, and there is a conflict with either option off.

masahi · 2021-10-30T22:03:55Z

Thanks @tqchen, I confirmed that your solution worked on both of my environments too, and that both static linking and HIDE_PRIVATE_SYMBOLS are required.

Also I realized that when I said "I cannot reproduce the original failure anymore" in #9362 (comment), my cmake config was pointing to a different, custom LLVM build that has only static libs. Moreover, these custom libs were apparently built in a way that HIDE_PRIVATE_SYMBOLS doesn't need to be enabled.

So no mystery on my end anymore.

I'm going to update the install doc to include this tip.

Jie-KUN · 2021-10-31T02:14:19Z

@tqchen I modified the CMakeLists.txt,

tvm_option(USE_LLVM "/usr/bin/llvm-config --link-static" ON)

tvm_option(HIDE_PRIVATE_SYMBOLS "Compile with -fvisibility=hidden." ON)

But I still hit the bug "free(): invalid pointer".

tqchen · 2021-10-31T14:58:10Z

@Jie-KUN you need to set those configurations in config.cmake instead of CMakeLists.txt
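A quick way to confirm that both settings are really present in the config file used for the build is to scan it for the two `set(...)` lines. This is a rough sketch that only understands simple one-line `set(NAME value)` statements; `read_cmake_settings` and `check_workaround` are hypothetical helper names, and the `build/config.cmake` path in the usage comment is an assumption about the local setup.

```python
import re

# Matches simple one-line statements such as: set(NAME value)  # comment
SET_RE = re.compile(r'^\s*set\(\s*(\w+)\s+(.*?)\s*\)\s*(?:#.*)?$')


def read_cmake_settings(text):
    """Collect NAME -> value for one-line set(...) statements; later lines win."""
    settings = {}
    for line in text.splitlines():
        match = SET_RE.match(line)
        if match:
            settings[match.group(1)] = match.group(2).strip('"')
    return settings


def check_workaround(text):
    """Report whether both settings discussed in this thread are enabled."""
    settings = read_cmake_settings(text)
    return {
        "llvm_static": "--link-static" in settings.get("USE_LLVM", ""),
        "hide_private_symbols": settings.get("HIDE_PRIVATE_SYMBOLS", "OFF").upper() == "ON",
    }


# Usage (path is an assumption):
#   with open("build/config.cmake") as f:
#       print(check_workaround(f.read()))
```

Commented-out lines are ignored, so a `# set(HIDE_PRIVATE_SYMBOLS ON)` left over in the file does not count as enabled.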

Jie-KUN · 2021-11-01T03:11:09Z

@tqchen, thank you sincerely. I still have a question: I tried the code "from_pytorch.py" from the tutorial, but I always get this message:

"One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details."

Is that normal?

masahi · 2021-11-01T03:16:18Z

Yes that's normal. Please post other questions to the discuss forum.

Jie-KUN · 2021-11-01T03:19:19Z

@masahi OK, thank you!

tqchen · 2021-11-01T14:11:51Z

cc @leandron @areusch for awareness, let us update tlcpack config

* This is to work around an issue caused by conflicting LLVM versions, first observed since we updated PyTorch in TVM
* Discussion at: apache/tvm#9362

leandron · 2021-11-01T15:52:13Z

* Turn on static linking of LLVM, this will directly link llvm code into libtvm without relying on dynamic library (that creates global symbols)
  * `set(USE_LLVM "/path/to/llvm-config --link-static")`
* Turn on `set(HIDE_PRIVATE_SYMBOLS ON)`. This will effectively hide the LLVM related symbols when we load globally from pytorch.

Thanks for letting us know. It seems that currently, --link-static is already there in tlcpack. I added tlc-pack/tlcpack#81 for the workaround discussed here.

This test was originally disabled due to the issue documented in apache#7455 affecting CI. I believe this has since been resolved by apache#9362. Note: This patch should not be merged until the changes in https://github.com/tlc-pack/tlcpack/pull/81 are reflected in CI. Change-Id: Ib918595a1d9149e3c858ca761861304450cbfe13
