Fixes for GraphRuntime destruction #5986
Conversation
Thanks @samskalicky. I agree that the destruction order would be an issue here. The fix, however, is a bit ad hoc. The root of the problem is the use of a static GraphRuntime that gets destructed. The better approach might be to ensure the graph runtime is destructed at the right point in time, and not to introduce the graph runtime as a static object. We could try allocating a raw pointer for the device API and never destroying it ourselves (the resources are de-allocated on unloading anyway, so no explicit de-allocation is needed).
Thanks for the quick reply @tqchen! Agreed, the proposed fix is ad hoc. I wanted to show a working solution to the problem as a starting point. I can try making the GraphRuntime object non-static so that it is destructed before the DeviceAPI, and see whether that avoids the problem on my side.
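For illustration, a minimal sketch of what the non-static approach could look like on the caller side: the graph runtime is held as a member of the stateful operator instead of a function-local static, so it dies with the operator rather than during static destruction at process exit. The class and member names (`SubgraphOp`, `graph_rt_`) are hypothetical, and the factory name and argument order are as I recall from the TVM runtime API of that era:

```cpp
#include <string>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

// Hypothetical stateful operator that owns its graph runtime instance.
class SubgraphOp {
 public:
  SubgraphOp(const std::string& lib_path, const std::string& graph_json,
             int device_type, int device_id) {
    tvm::runtime::Module lib = tvm::runtime::Module::LoadFromFile(lib_path);
    // Create the graph runtime through the registered factory function.
    const tvm::runtime::PackedFunc* create =
        tvm::runtime::Registry::Get("tvm.graph_runtime.create");
    graph_rt_ = (*create)(graph_json, lib, device_type, device_id);
    run_ = graph_rt_.GetFunction("run");
  }

  void Run() { run_(); }

 private:
  // Members, not statics: destructed together with the operator, well before
  // the DeviceAPI singletons are torn down at process exit.
  tvm::runtime::Module graph_rt_;
  tvm::runtime::PackedFunc run_;
};
```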
Lots of testing over the past month: making the runtime non-static definitely reduced the occurrence of the problem, but we are still seeing intermittent failures (depending on the model they can be more prevalent).
The particular error message seems to still be due to the use of global state somewhere (perhaps an NDArray, given that the graph runtime part is now resolved), possibly on the Python side.
True, I'm running TVM inside a custom subgraph operator in MXNet, so the subgraph operator is stateful and loads the GraphRuntime in its constructor. So the DeviceAPI objects will be destructed before the runtime is.
@tqchen the CPU/GPU device API classes don't seem to store any state. Can we just make these APIs static?
Unfortunately the device API encapsulation means we cannot simply make them static (the virtual methods are needed for the other device APIs). In this case I think we should update the MXNet subgraph API to avoid the static state if possible, or simply avoid de-allocating the global state (by using new instead of creating a static instance).
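A sketch of that last suggestion, using simplified stand-in types rather than the actual TVM classes: the global is allocated with `new` and intentionally never deleted, so there is no destructor that can run before the last array is freed.

```cpp
#include <cstddef>
#include <cstdlib>

// Simplified stand-in for the device API hierarchy (not the TVM code).
class DeviceAPI {
 public:
  virtual ~DeviceAPI() = default;
  virtual void* AllocDataSpace(std::size_t nbytes) = 0;
  virtual void FreeDataSpace(void* ptr) = 0;
};

class CPUDeviceAPI final : public DeviceAPI {
 public:
  void* AllocDataSpace(std::size_t nbytes) override { return std::malloc(nbytes); }
  void FreeDataSpace(void* ptr) override { std::free(ptr); }

  static DeviceAPI* Global() {
    // Leaked on purpose: the OS reclaims the memory at process exit, and
    // FreeDataSpace stays callable for as long as the process lives.
    static DeviceAPI* inst = new CPUDeviceAPI();
    return inst;
  }
};
```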
Unfortunately we're starting to see this problem in other frameworks as well. Here's PyTorch:
Maybe there's a better way to prevent the destruction of the DeviceAPI objects, using a counter to ensure that they aren't destructed before all the arrays that were allocated with them have been freed.
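One way to read that idea, sketched with simplified stand-in types (not the actual TVM NDArray or DeviceAPI): each allocation holds a `shared_ptr` to the API that produced it, so the use count plays the role of the counter and the API cannot be destroyed while buffers that still need `FreeDataSpace` are alive.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// Simplified stand-in types; not the actual TVM classes.
class DeviceAPI {
 public:
  virtual ~DeviceAPI() = default;
  virtual void* AllocDataSpace(std::size_t nbytes) = 0;
  virtual void FreeDataSpace(void* ptr) = 0;
};

class CPUDeviceAPI final : public DeviceAPI {
 public:
  void* AllocDataSpace(std::size_t nbytes) override { return std::malloc(nbytes); }
  void FreeDataSpace(void* ptr) override { std::free(ptr); }
};

// Allocate nbytes and return a handle whose deleter captures a shared_ptr to
// the API, keeping the API alive until the last buffer is released.
std::shared_ptr<void> MakeBuffer(const std::shared_ptr<DeviceAPI>& api,
                                 std::size_t nbytes) {
  void* data = api->AllocDataSpace(nbytes);
  return std::shared_ptr<void>(data, [api](void* p) { api->FreeDataSpace(p); });
}
```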
Closing this for now as there is no further actionable item at the moment.
It would be useful to do some exploration, dig further, and open a discussion thread about the details. For example, if we retain libtvm.so until PyTorch unloads, does the problem go away?
I've been getting this issue when running tests: they all pass, and then, as the process starts to exit, it fails with a core dump:
It looks like there's a race condition in the shutdown sequence in TVM. An NDArray is being destructed, but the DeviceAPI object has already been destructed, so when it calls FreeDataSpace to free the NDArray memory it runs into the “pure virtual method called” error.
I added a destructor to the CUDADeviceAPI class (https://github.com/neo-ai/tvm/blob/dev/src/runtime/cuda/cuda_device_api.cc#L37) with a print statement and confirmed that this destructor runs before the NDArray is destructed. This confirms the root cause: the CUDA DeviceAPI is destructed before all the NDArrays are destructed (and their underlying memory freed).
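A self-contained toy that reproduces the ordering with the same print-in-the-destructor trick (stand-in classes, not TVM's): the array is constructed during static initialization, the API is a function-local static constructed later, so at exit the API's destructor print appears first.

```cpp
#include <cstdio>

// Stand-in classes that only log their destruction order.
struct Api   { ~Api()   { std::puts("DeviceAPI destructed"); } };
struct Array { ~Array() { std::puts("NDArray destructed");   } };

Api* GlobalApi() {
  static Api inst;    // constructed on first call, destructed in reverse order
  return &inst;
}

Array global_array;   // constructed during static initialization, before main

int main() {
  // Constructing the API after the array means the API is destructed first,
  // so at exit "DeviceAPI destructed" prints before "NDArray destructed":
  // the same bad ordering observed with CUDADeviceAPI and real NDArrays.
  GlobalApi();
  return 0;
}
```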
Basically, the issue is that the CUDADeviceAPI singleton is destructed before all GPU NDArrays are freed. The quick fix is to allow the CUDADeviceAPI singleton to be re-constructed after it has been destructed, so that it can still be used to free the remaining GPU NDArrays.
The DeviceAPIManager class (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L91) is a singleton that maintains a map of DeviceAPI objects, one per context type (CPU, GPU, etc.). The Global function (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L107) is the static singleton “get_instance”-style accessor. The GetAPI function (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L112) returns the DeviceAPI object for a particular context type, looked up in the api_ map.
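In outline, the manager looks roughly like the following (a simplified paraphrase of the linked code, with the per-device factory lookup stubbed out):

```cpp
#include <array>
#include <mutex>

class DeviceAPI;  // per-device implementations (CPU, GPU, ...)

// Process-wide singleton that lazily fills api_ with one DeviceAPI pointer
// per device type. Paraphrased; see the linked c_runtime_api.cc for the real code.
class DeviceAPIManager {
 public:
  static constexpr int kMaxDeviceAPI = 32;

  // "get_instance"-style static accessor, i.e. Global().
  static DeviceAPIManager* Global() {
    static DeviceAPIManager inst;
    return &inst;
  }

  // GetAPI: return the DeviceAPI for a device type, creating it on first use.
  DeviceAPI* GetAPI(int dev_type) {
    if (api_[dev_type] == nullptr) {
      std::lock_guard<std::mutex> lock(mutex_);
      if (api_[dev_type] == nullptr) {
        api_[dev_type] = CreateAPI(dev_type);
      }
    }
    return api_[dev_type];
  }

 private:
  std::array<DeviceAPI*, kMaxDeviceAPI> api_{};
  std::mutex mutex_;

  DeviceAPIManager() = default;

  static DeviceAPI* CreateAPI(int dev_type) {
    // The real code dispatches to a registered per-device factory function
    // (e.g. "device_api.cpu"); omitted in this sketch.
    (void)dev_type;
    return nullptr;
  }
};
```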
Upon destruction, if we clear the api_ array entries to nullptr:
https://github.com/apache/incubator-tvm/blob/0dfadaee66de156c1cda90a3d9f160764e5538d9/src/runtime/c_runtime_api.cc#L107
each DeviceAPI object will be reconstructed on its next lookup. Upon reconstruction of the singleton CUDADeviceAPI class, we need to reset its static shared_ptr too:
https://github.com/apache/incubator-tvm/blob/0dfadaee66de156c1cda90a3d9f160764e5538d9/src/runtime/cuda/cuda_device_api.cc#L210-L215
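Put together, the proposed fix is roughly the following (a sketch of the description above; the actual diff may differ):

```cpp
#include <array>
#include <memory>

class DeviceAPI {
 public:
  virtual ~DeviceAPI() = default;
};

class DeviceAPIManager {
 public:
  static constexpr int kMaxDeviceAPI = 32;

  ~DeviceAPIManager() {
    // Clear the cached entries on destruction so a late FreeDataSpace call
    // triggers lazy reconstruction in GetAPI() instead of reaching a
    // destroyed DeviceAPI object.
    for (auto& api : api_) api = nullptr;
  }

  // Global() and GetAPI() as in the sketch above.

 private:
  std::array<DeviceAPI*, kMaxDeviceAPI> api_{};
};

class CUDADeviceAPI final : public DeviceAPI {
 public:
  static std::shared_ptr<CUDADeviceAPI>& Global() {
    // Held in a resettable static shared_ptr so the instance can be
    // re-created after the first one has been destructed during shutdown.
    static std::shared_ptr<CUDADeviceAPI> inst;
    if (inst == nullptr) inst = std::make_shared<CUDADeviceAPI>();
    return inst;
  }
};
```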