Fixes for GraphRuntime destruction #5986
Conversation
Thanks @samskalicky. I agree that the destruction order would be an issue here. The fix, however, is a bit ad hoc. The root of the problem is the use of a static GraphRuntime that gets destructed. The better approach might be to ensure the graph runtime is destructed at the right point in time, and not to introduce the graph runtime as a static object. We could try allocating a raw pointer for the device API and never destroying it ourselves (the resources are de-allocated on unloading anyway, so no explicit de-allocation is needed).
Thanks for the quick reply @tqchen! Agreed, the proposed fix is ad hoc. I wanted to show a working solution to the problem as a starting point. I can try making the GraphRuntime object non-static so that it is destructed before the DeviceAPI, and see whether that avoids the problem on my side.
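For illustration, a minimal sketch of what the non-static approach could look like on the caller side: the graph runtime is held as a member of the stateful operator instead of a function-local static, so it dies with the operator rather than during static destruction at process exit. The class and member names (`SubgraphOp`, `graph_rt_`) are hypothetical, and the factory name and argument order are as I recall from the TVM runtime API of that era:

```cpp
#include <string>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

// Hypothetical stateful operator that owns its graph runtime instance.
class SubgraphOp {
 public:
  SubgraphOp(const std::string& lib_path, const std::string& graph_json,
             int device_type, int device_id) {
    tvm::runtime::Module lib = tvm::runtime::Module::LoadFromFile(lib_path);
    // Create the graph runtime through the registered factory function.
    const tvm::runtime::PackedFunc* create =
        tvm::runtime::Registry::Get("tvm.graph_runtime.create");
    graph_rt_ = (*create)(graph_json, lib, device_type, device_id);
    run_ = graph_rt_.GetFunction("run");
  }

  void Run() { run_(); }

 private:
  // Members, not statics: destructed together with the operator, well before
  // the DeviceAPI singletons are torn down at process exit.
  tvm::runtime::Module graph_rt_;
  tvm::runtime::PackedFunc run_;
};
```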
Lots of testing over the past month: making the runtime non-static definitely reduced the occurrence of the problem, but we are still seeing intermittent failures (depending on the model they can be more prevalent).
The particular error message seems to still be due to the use of global state somewhere (perhaps an NDArray, given that the graph runtime part is now resolved), possibly on the Python side.
True, I'm running TVM inside a custom subgraph operator in MXNet, so the subgraph operator is stateful and loads the GraphRuntime in its constructor. So the DeviceAPI objects will be destructed before the runtime is.
@tqchen the CPU/GPU device API classes don't seem to store any state. Can we just make these APIs static?
Unfortunately the device API encapsulation means we cannot simply make them static (the virtual methods are needed for the other device APIs). In this case I think we should update the MXNet subgraph API to avoid the static state if possible, or simply avoid de-allocating the global state (by using new instead of creating a static instance).
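A sketch of that last suggestion, using simplified stand-in types rather than the actual TVM classes: the global is allocated with `new` and intentionally never deleted, so there is no destructor that can run before the last array is freed.

```cpp
#include <cstddef>
#include <cstdlib>

// Simplified stand-in for the device API hierarchy (not the TVM code).
class DeviceAPI {
 public:
  virtual ~DeviceAPI() = default;
  virtual void* AllocDataSpace(std::size_t nbytes) = 0;
  virtual void FreeDataSpace(void* ptr) = 0;
};

class CPUDeviceAPI final : public DeviceAPI {
 public:
  void* AllocDataSpace(std::size_t nbytes) override { return std::malloc(nbytes); }
  void FreeDataSpace(void* ptr) override { std::free(ptr); }

  static DeviceAPI* Global() {
    // Leaked on purpose: the OS reclaims the memory at process exit, and
    // FreeDataSpace stays callable for as long as the process lives.
    static DeviceAPI* inst = new CPUDeviceAPI();
    return inst;
  }
};
```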
Unfortunately we're starting to see this problem in other frameworks as well. Here's PyTorch:
Maybe there's a better way to prevent the destruction of the DeviceAPI objects, using a counter to ensure that they aren't destructed before all the arrays that were allocated with them have been freed.
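One way to read that idea, sketched with simplified stand-in types (not the actual TVM NDArray or DeviceAPI): each allocation holds a `shared_ptr` to the API that produced it, so the use count plays the role of the counter and the API cannot be destroyed while buffers that still need `FreeDataSpace` are alive.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>

// Simplified stand-in types; not the actual TVM classes.
class DeviceAPI {
 public:
  virtual ~DeviceAPI() = default;
  virtual void* AllocDataSpace(std::size_t nbytes) = 0;
  virtual void FreeDataSpace(void* ptr) = 0;
};

class CPUDeviceAPI final : public DeviceAPI {
 public:
  void* AllocDataSpace(std::size_t nbytes) override { return std::malloc(nbytes); }
  void FreeDataSpace(void* ptr) override { std::free(ptr); }
};

// Allocate nbytes and return a handle whose deleter captures a shared_ptr to
// the API, keeping the API alive until the last buffer is released.
std::shared_ptr<void> MakeBuffer(const std::shared_ptr<DeviceAPI>& api,
                                 std::size_t nbytes) {
  void* data = api->AllocDataSpace(nbytes);
  return std::shared_ptr<void>(data, [api](void* p) { api->FreeDataSpace(p); });
}
```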
Closing this for now as there is no further actionable item at the moment.
It would be useful to do some exploration, dig further, and open a discussion thread about the details. For example, if we retain libtvm.so until PyTorch unloads, does the problem go away?
I've been getting this issue when running tests: they all pass, and then, as the process starts to exit, it fails with a core dump:
It looks like there's a race condition in the shutdown sequence in TVM. An NDArray is being destructed, but the DeviceAPI object has already been destructed, so when it calls FreeDataSpace to free the NDArray memory it runs into the “pure virtual method called” error.
I added a destructor to the CUDADeviceAPI class (https://github.com/neo-ai/tvm/blob/dev/src/runtime/cuda/cuda_device_api.cc#L37) with a print statement and confirmed that this destructor runs before the NDArray is destructed. This confirms the root cause: the CUDA DeviceAPI is destructed before all the NDArrays are destructed (and their underlying memory freed).
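A self-contained toy that reproduces the ordering with the same print-in-the-destructor trick (stand-in classes, not TVM's): the array is constructed during static initialization, the API is a function-local static constructed later, so at exit the API's destructor print appears first.

```cpp
#include <cstdio>

// Stand-in classes that only log their destruction order.
struct Api   { ~Api()   { std::puts("DeviceAPI destructed"); } };
struct Array { ~Array() { std::puts("NDArray destructed");   } };

Api* GlobalApi() {
  static Api inst;    // constructed on first call, destructed in reverse order
  return &inst;
}

Array global_array;   // constructed during static initialization, before main

int main() {
  // Constructing the API after the array means the API is destructed first,
  // so at exit "DeviceAPI destructed" prints before "NDArray destructed":
  // the same bad ordering observed with CUDADeviceAPI and real NDArrays.
  GlobalApi();
  return 0;
}
```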
Basically, the issue is that the CUDADeviceAPI singleton is destructed before all GPU NDArrays are freed. The quick fix is to allow the CUDADeviceAPI singleton to be re-constructed after it has been destructed, so that it can still be used to free the remaining GPU NDArrays.
The DeviceAPIManager class (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L91) is a singleton that maintains a map of DeviceAPI objects, one per context type (CPU, GPU, etc.). The Global function (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L107) is the static singleton “get_instance”-style accessor. The GetAPI function (https://github.com/apache/incubator-tvm/blob/579da6b771584ff320b9c7edf635b681b2abd0ef/src/runtime/c_runtime_api.cc#L112) returns the DeviceAPI object for a particular context type, looked up in the api_ map.
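In outline, the manager looks roughly like the following (a simplified paraphrase of the linked code, with the per-device factory lookup stubbed out):

```cpp
#include <array>
#include <mutex>

class DeviceAPI;  // per-device implementations (CPU, GPU, ...)

// Process-wide singleton that lazily fills api_ with one DeviceAPI pointer
// per device type. Paraphrased; see the linked c_runtime_api.cc for the real code.
class DeviceAPIManager {
 public:
  static constexpr int kMaxDeviceAPI = 32;

  // "get_instance"-style static accessor, i.e. Global().
  static DeviceAPIManager* Global() {
    static DeviceAPIManager inst;
    return &inst;
  }

  // GetAPI: return the DeviceAPI for a device type, creating it on first use.
  DeviceAPI* GetAPI(int dev_type) {
    if (api_[dev_type] == nullptr) {
      std::lock_guard<std::mutex> lock(mutex_);
      if (api_[dev_type] == nullptr) {
        api_[dev_type] = CreateAPI(dev_type);
      }
    }
    return api_[dev_type];
  }

 private:
  std::array<DeviceAPI*, kMaxDeviceAPI> api_{};
  std::mutex mutex_;

  DeviceAPIManager() = default;

  static DeviceAPI* CreateAPI(int dev_type) {
    // The real code dispatches to a registered per-device factory function
    // (e.g. "device_api.cpu"); omitted in this sketch.
    (void)dev_type;
    return nullptr;
  }
};
```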
Upon destruction, if we clear the api_ array entries to nullptr:
https://github.com/apache/incubator-tvm/blob/0dfadaee66de156c1cda90a3d9f160764e5538d9/src/runtime/c_runtime_api.cc#L107
each DeviceAPI object will be reconstructed on its next lookup. Upon reconstruction of the singleton CUDADeviceAPI class, we need to reset its static shared_ptr too:
https://github.com/apache/incubator-tvm/blob/0dfadaee66de156c1cda90a3d9f160764e5538d9/src/runtime/cuda/cuda_device_api.cc#L210-L215
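Put together, the proposed fix is roughly the following (a sketch of the description above; the actual diff may differ):

```cpp
#include <array>
#include <memory>

class DeviceAPI {
 public:
  virtual ~DeviceAPI() = default;
};

class DeviceAPIManager {
 public:
  static constexpr int kMaxDeviceAPI = 32;

  ~DeviceAPIManager() {
    // Clear the cached entries on destruction so a late FreeDataSpace call
    // triggers lazy reconstruction in GetAPI() instead of reaching a
    // destroyed DeviceAPI object.
    for (auto& api : api_) api = nullptr;
  }

  // Global() and GetAPI() as in the sketch above.

 private:
  std::array<DeviceAPI*, kMaxDeviceAPI> api_{};
};

class CUDADeviceAPI final : public DeviceAPI {
 public:
  static std::shared_ptr<CUDADeviceAPI>& Global() {
    // Held in a resettable static shared_ptr so the instance can be
    // re-created after the first one has been destructed during shutdown.
    static std::shared_ptr<CUDADeviceAPI> inst;
    if (inst == nullptr) inst = std::make_shared<CUDADeviceAPI>();
    return inst;
  }
};
```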