Fixes for GraphRuntime destruction #5986

Closed
wants to merge 1 commit into from

Conversation

@samskalicky commented Jul 3, 2020

I've been getting this issue when running tests: they all pass, and then as the process starts to exit, it fails with a core dump:

pure virtual method called
terminate called without an active exception
Aborted (core dumped)

#5  0x00007ffff11d9988 in __cxxabiv1::__cxa_pure_virtual ()
    at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/pure.cc:50
#6  0x00007fff45589a82 in tvm::runtime::NDArray::Internal::DefaultDeleter (ptr_obj=0x55555754ece0)
    at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/ndarray.cc:97
#7  0x00007fff4557d439 in tvm::runtime::Object::DecRef (this=0x55555754ece0)
    at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:833
#8  0x00007fff455b2815 in tvm::runtime::ObjectPtr<tvm::runtime::Object>::reset (this=0x5555571c8c00)
    at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:439
#9  0x00007fff45598698 in tvm::runtime::ObjectPtr<tvm::runtime::Object>::~ObjectPtr (this=0x5555571c8c00, 
    __in_chrg=<optimized out>) at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:388
#10 0x00007fff4557d4aa in tvm::runtime::ObjectRef::~ObjectRef (this=0x5555571c8c00, __in_chrg=<optimized out>)
    at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/object.h:511
#11 0x00007fff4557df1e in tvm::runtime::NDArray::~NDArray (this=0x5555571c8c00, __in_chrg=<optimized out>)
    at /home/ubuntu/NeoMXNet/3rdparty/tvm/include/tvm/runtime/ndarray.h:42
#12 0x00007fff455fafb3 in std::_Destroy<tvm::runtime::NDArray> (__pointer=0x5555571c8c00)
    at /usr/include/c++/5/bits/stl_construct.h:93
#13 0x00007fff455edee1 in std::_Destroy_aux<false>::__destroy<tvm::runtime::NDArray*> (__first=0x5555571c8c00, 
    __last=0x5555571c8c10) at /usr/include/c++/5/bits/stl_construct.h:103
#14 0x00007fff455dfa22 in std::_Destroy<tvm::runtime::NDArray*> (__first=0x5555571c8c00, __last=0x5555571c8c10)
    at /usr/include/c++/5/bits/stl_construct.h:126
#15 0x00007fff455cd124 in std::_Destroy<tvm::runtime::NDArray*, tvm::runtime::NDArray> (__first=0x5555571c8c00, 
    __last=0x5555571c8c10) at /usr/include/c++/5/bits/stl_construct.h:151
#16 0x00007fff455e0d81 in std::vector<tvm::runtime::NDArray, std::allocator<tvm::runtime::NDArray> >::~vector (
    this=0x55555752d2e8, __in_chrg=<optimized out>) at /usr/include/c++/5/bits/stl_vector.h:424
#17 0x00007fff455e0ec8 in tvm::runtime::GraphRuntime::~GraphRuntime (this=0x55555752d130, 
    __in_chrg=<optimized out>) at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/graph/graph_runtime.h:73
#18 0x00007fff455e0fb8 in tvm::runtime::GraphRuntime::~GraphRuntime (this=0x55555752d130, 
    __in_chrg=<optimized out>) at /home/ubuntu/NeoMXNet/3rdparty/tvm/src/runtime/graph/graph_runtime.h:73

It looks like there's a race condition in the shutdown sequence in TVM: an NDArray is being destructed after the DeviceAPI object has already been destructed, so when the deleter calls FreeDataSpace to free the NDArray's memory it runs into the “pure virtual method called” error.

I added a destructor to the CUDADeviceAPI class (https://github.com/neo-ai/tvm/blob/dev/src/runtime/cuda/cuda_device_api.cc#L37) with a print statement and was able to confirm that the destructor was being called before the NDArray was destructed. This confirms the root cause, that the CUDA DeviceAPI was destructed before all the NDArrays were destructed (and their underlying memory freed).
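
For anyone who wants to see the effect outside TVM, here is a minimal self-contained sketch (the class names are made up, this is not TVM code) of how destructor print statements expose static teardown order:

```cpp
// Standalone illustration (not TVM code): print from destructors to observe the
// order in which static singletons are torn down at process exit.
#include <iostream>

struct DeviceApiLike {
  ~DeviceApiLike() { std::cerr << "DeviceApiLike destructed\n"; }
};

struct ArrayLike {
  ~ArrayLike() { std::cerr << "ArrayLike destructed\n"; }
};

DeviceApiLike& Api() {
  static DeviceApiLike inst;  // constructed on first use
  return inst;
}

ArrayLike& Arr() {
  static ArrayLike inst;  // constructed on first use
  return inst;
}

int main() {
  // Function-local statics are destroyed in reverse order of construction:
  // Arr() is constructed last here, so it is destructed first and Api() outlives it.
  // With statics spread across translation units (as in the report above), the
  // relative order is much harder to control, which is what the print statement
  // in the CUDADeviceAPI destructor revealed.
  Api();
  Arr();
  return 0;
}
```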

Basically, the issue is that the CUDADeviceAPI singleton is destructed before all GPU NDArrays are freed. The quick fix is to make the CUDADeviceAPI singleton re-constructible after it has been destructed, so that it can still be used to free the remaining GPU NDArrays.

The DeviceAPIManager class (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L91) is a singleton that maintains a map of DeviceAPI objects for each context (CPU, GPU, etc). The Global API (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L107) is the static singleton “get_instance” function. The GetAPI API (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L112) is used to get the DeviceAPI object for a particular context type that is looked up in the api_ map.

Upon destruction, if we clear the api_ array to nullptr:
https://github.com/apache/incubator-tvm/blob/0dfadaee66de156c1cda90a3d9f160764e5538d9/src/runtime/c_runtime_api.cc#L107

then each DeviceAPI object will be reconstructed the next time it is requested through GetAPI. Upon reconstruction of the singleton CUDADeviceAPI class, we also need to reset its static shared_ptr:
https://github.com/apache/incubator-tvm/blob/0dfadaee66de156c1cda90a3d9f160764e5538d9/src/runtime/cuda/cuda_device_api.cc#L210-L215
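
To make the idea concrete, here is a heavily simplified, self-contained sketch of the pattern (illustrative names only, not the actual diff): the singleton accessor rebuilds the instance if it has already been reset, so a late caller such as an NDArray deleter still gets a usable object.

```cpp
// Simplified sketch of a re-constructible singleton (illustrative only).
#include <iostream>
#include <memory>

class DeviceApiLike {
 public:
  void FreeDataSpace(void* ptr) { std::cerr << "freeing " << ptr << "\n"; }

  // Accessor that recreates the instance if it has already been reset.
  static std::shared_ptr<DeviceApiLike>& Global() {
    static std::shared_ptr<DeviceApiLike> inst = std::make_shared<DeviceApiLike>();
    if (inst == nullptr) {
      // The previous instance was torn down (e.g. during shutdown); rebuild it
      // so remaining users can still free their memory.
      inst = std::make_shared<DeviceApiLike>();
    }
    return inst;
  }
};

int main() {
  DeviceApiLike::Global()->FreeDataSpace(nullptr);

  // Simulate the shutdown race: the singleton is reset early...
  DeviceApiLike::Global().reset();

  // ...but a late deleter can still obtain a working instance.
  DeviceApiLike::Global()->FreeDataSpace(nullptr);
  return 0;
}
```

In the actual runtime the same rebuild-on-demand check has to cover both the DeviceAPIManager's api_ array and the per-backend Global() accessors, which is what the linked lines touch.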

@samskalicky

@tqchen for review, @zhiics @trevor-m FYI

@tqchen commented Jul 3, 2020

Thanks @samskalicky .

I agree that the destruction order would be an issue here. The fix, however, is a bit ad hoc. The root of the problem is the use of a static GraphRuntime that gets destructed during static teardown.

The best approach might be to ensure the graph runtime is destructed at the right point in time, rather than introducing the graph runtime as a static object.

We could also try allocating the device API objects as raw pointers and never destroying them (the resources will be reclaimed when the library unloads, so no explicit de-allocation is needed).
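
A minimal sketch of that approach (illustrative names, not TVM code): the device API is allocated with new on first use and intentionally never deleted, so no destructor can run before late users; the OS reclaims the memory at process exit.

```cpp
// Sketch of a deliberately leaked singleton (illustrative only).
#include <iostream>

class DeviceApiLike {
 public:
  void FreeDataSpace(void* ptr) { std::cerr << "freeing " << ptr << "\n"; }

  static DeviceApiLike* Global() {
    // Allocated once with `new` and never deleted: no static destructor is
    // registered, so the instance remains valid for the whole process lifetime.
    static DeviceApiLike* inst = new DeviceApiLike();
    return inst;
  }
};

int main() {
  DeviceApiLike::Global()->FreeDataSpace(nullptr);
  return 0;  // the leaked instance is reclaimed by the OS at exit
}
```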

@tqchen added the status: need update (need update based on feedbacks) label on Jul 6, 2020
@samskalicky

Thanks for the quick reply @tqchen!

Agreed, the proposed fix is ad hoc. I wanted to show a working solution to the problem as a starting point.

I can try making the GraphRuntime object non-static so that it is destructed before the DeviceAPI, and see if that avoids the problem on my side.
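
Roughly what I have in mind (a sketch with assumed structure, not the actual MXNet operator code): the operator instance owns the runtime instead of caching it in a function-local static, so it is released when the operator is destroyed rather than during static teardown.

```cpp
// Sketch of owning the runtime per operator instead of as a static (illustrative).
#include <memory>

struct GraphRuntimeLike {  // stand-in for the TVM graph runtime module
  void Run() {}
};

// Before: a function-local static cache, destructed at an unpredictable point
// during process teardown relative to the device API singletons.
GraphRuntimeLike& StaticRuntime() {
  static GraphRuntimeLike rt;
  return rt;
}

// After: the stateful operator owns the runtime, so it is released when the
// operator is destroyed, well before static destructors run.
class SubgraphOpLike {
 public:
  SubgraphOpLike() : rt_(std::make_unique<GraphRuntimeLike>()) {}
  void Forward() { rt_->Run(); }

 private:
  std::unique_ptr<GraphRuntimeLike> rt_;
};

int main() {
  SubgraphOpLike op;
  op.Forward();
  return 0;  // rt_ is freed here, not during static teardown
}
```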

@samskalicky

After a lot of testing over the past month: making the runtime non-static definitely reduced how often the problem occurs, but I'm still seeing intermittent failures (depending on the model they can be more prevalent):

Segmentation fault: 11

*** Error in `python': double free or corruption (!prev): 0x000055becd8c4460 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777f5)[0x7fd5a64827f5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8038a)[0x7fd5a648b38a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fd5a648f58c]
/lib/x86_64-linux-gnu/libc.so.6(+0x3a035)[0x7fd5a6445035]
/lib/x86_64-linux-gnu/libc.so.6(+0x3a055)[0x7fd5a6445055]
/home/ubuntu/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x7fc3125)[0x7fd5498ea125]
/lib/x86_64-linux-gnu/libc.so.6(+0x354c0)[0x7fd5a64404c0]
/usr/local/cuda/lib64/libcudart.so.10.0(+0x1d9fe)[0x7fd4fc1909fe]
/usr/local/cuda/lib64/libcudart.so.10.0(+0x2296b)[0x7fd4fc19596b]
/usr/local/cuda/lib64/libcudart.so.10.0(cudaSetDevice+0x47)[0x7fd4fc1bd087]
/home/ubuntu/anaconda3/lib/python3.7/site-packages/neomxnet/libdlr.so(_ZN3tvm7runtime13CUDADeviceAPI13FreeDataSpaceE9DLContextPv+0x3a)[0x7fd4eda8652a]

@tqchen commented Aug 14, 2020

The particular error message seems to still be due to the use of global state (perhaps an NDArray, given that the graph runtime issue is now resolved) somewhere (perhaps in the Python).

@samskalicky commented Aug 14, 2020

> The particular error message seems to still be due to the use of global state (perhaps an NDArray, given that the graph runtime issue is now resolved) somewhere (perhaps in the Python).

True, I'm running TVM inside a custom subgraph operator in MXNet, so the subgraph operator is stateful and loads the GraphRuntime in its constructor. So the DeviceAPI objects will still be destructed before the runtime is.

@samskalicky

@tqchen the CPU/GPU device API classes don't seem to store any state. Can we just make these APIs static?

@tqchen commented Aug 17, 2020

Unfortunately the device API encapsulation means we cannot simply make them static (other device APIs need the virtual methods). In this case I think we should update the MXNet subgraph API to avoid the static state if possible, or simply avoid de-allocating the global state (by using new instead of creating a static instance).
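
A short illustration of that constraint (made-up names, not the actual classes): backends are selected at runtime through a common abstract base, so the entry points have to be virtual instance methods, and static member functions cannot be overridden per backend.

```cpp
// Why the device API entry points need virtual dispatch (illustrative only).
#include <iostream>

class DeviceApiBase {
 public:
  virtual ~DeviceApiBase() = default;
  virtual void FreeDataSpace(void* ptr) = 0;  // must be virtual to dispatch per backend
};

class CpuApiLike : public DeviceApiBase {
 public:
  void FreeDataSpace(void* ptr) override { std::cerr << "cpu free " << ptr << "\n"; }
};

class GpuApiLike : public DeviceApiBase {
 public:
  void FreeDataSpace(void* ptr) override { std::cerr << "gpu free " << ptr << "\n"; }
};

// The caller only sees the base type and picks the backend at runtime;
// a static member function could not be overridden this way.
void Release(DeviceApiBase* api, void* ptr) { api->FreeDataSpace(ptr); }

int main() {
  CpuApiLike cpu;
  GpuApiLike gpu;
  Release(&cpu, nullptr);
  Release(&gpu, nullptr);
  return 0;
}
```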

@samskalicky

Unfortunately we're starting to see this problem in other frameworks as well. Here's PyTorch:

#0  0x00007fff56b0ee60 in tvm::runtime::NDArray::Internal::DefaultDeleter(tvm::runtime::Object*) () from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtvm.so
#1  0x00007fff56983f6b in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::NDArray>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::NDArray> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() () from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtvm.so
#2  0x00007fff56afa604 in tvm::runtime::SimpleObjAllocator::Handler<tvm::runtime::MetadataModuleNode>::Deleter_(tvm::runtime::Object*) ()
   from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtvm.so
#3  0x00007fff56b78d4e in tvm::runtime::GraphRuntime::~GraphRuntime() () from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtvm.so
#4  0x00007fff56b79379 in tvm::runtime::SimpleObjAllocator::Handler<tvm::runtime::GraphRuntime>::Deleter_(tvm::runtime::Object*) ()
   from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtvm.so
#5  0x00007fff8c1fbb43 in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::Module>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::Module> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::clear() ()
   from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtorch.so
#6  0x00007fff8c1fbb5d in std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::Module>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, tvm::runtime::Module> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable()
    () from /home/ubuntu/anaconda3/envs/python3/lib/python3.6/site-packages/torch/lib/libtorch.so

Maybe there's a better way: prevent the destruction of the DeviceAPI objects with a counter, to ensure that they aren't destructed before all the arrays that were allocated with them have been freed.
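
Something along these lines (an illustrative sketch, not TVM code): each array keeps a reference-counted handle, which is effectively the counter, to the device API that allocated it, so the API cannot be destructed until the last such array has been freed.

```cpp
// Sketch of the counter idea: each array holds a reference-counted handle to the
// device API that allocated its memory, keeping the API alive until the array dies.
#include <cstddef>
#include <iostream>
#include <memory>

class DeviceApiLike {
 public:
  void* AllocDataSpace(std::size_t n) { return ::operator new(n); }
  void FreeDataSpace(void* p) { ::operator delete(p); }
  ~DeviceApiLike() { std::cerr << "DeviceApiLike destructed\n"; }

  // Returns a reference to the global handle so early teardown can be simulated.
  static std::shared_ptr<DeviceApiLike>& Global() {
    static std::shared_ptr<DeviceApiLike> inst = std::make_shared<DeviceApiLike>();
    return inst;
  }
};

class ArrayLike {
 public:
  explicit ArrayLike(std::size_t n)
      : api_(DeviceApiLike::Global()), data_(api_->AllocDataSpace(n)) {}
  ~ArrayLike() { api_->FreeDataSpace(data_); }  // api_ is guaranteed alive here

 private:
  std::shared_ptr<DeviceApiLike> api_;  // the "counter": one reference per array
  void* data_;
};

int main() {
  {
    ArrayLike arr(64);
    // Simulate the global singleton being torn down before the array:
    DeviceApiLike::Global().reset();
  }  // arr is destructed here; its own reference kept the API alive until now.
  return 0;
}
```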

@tqchen commented Aug 19, 2020

Closing this for now as there is no further actionable item at the moment.

@tqchen closed this on Aug 19, 2020
@tqchen commented Aug 19, 2020

It would be useful to do some exploration, dig further, and open a discuss thread about the details. For example, if we retain libtvm.so until PyTorch unloads, would the problem go away?
