
[VM][PooledAllocator] try reallocation once when OOM #8285

Merged 1 commit into apache:main on Jun 21, 2021

Conversation

@ganler (Contributor) commented Jun 19, 2021

See TVM Discussion Topic.

This change aims to make TVM's behaviour more robust when an OOM error occurs and to resolve a mysterious uncaught-exception bug.

Potential reviewers: @junrushao1994 @icemelon9 @jroesch

@YuchenJin (Contributor) left a comment


LGTM! Thanks @ganler for the contribution!

LOG(WARNING) << "PooledAllocator got InternalError during allocation: " << err.message();
LOG(WARNING) << "Trying to release all unused memory and reallocate...";
ReleaseAll();
buf.data = DeviceAPI::Get(device_)->AllocDataSpace(device_, size, alignment, type_hint);
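
For context, the retried allocation sits inside a try/catch around the device allocation in PooledAllocator::Alloc. A simplified sketch of the shape of the change (pool lookup, page-size rounding, and locking omitted):

Buffer Alloc(size_t nbytes, size_t alignment, DLDataType type_hint) override {
  Buffer buf;
  buf.device = device_;
  buf.size = nbytes;
  try {
    // First attempt: allocate fresh memory from the device.
    buf.data = DeviceAPI::Get(device_)->AllocDataSpace(device_, nbytes, alignment, type_hint);
  } catch (InternalError& err) {
    LOG(WARNING) << "PooledAllocator got InternalError during allocation: " << err.message();
    LOG(WARNING) << "Trying to release all unused memory and reallocate...";
    ReleaseAll();  // return every cached-but-unused buffer to the device
    // Second and final attempt; if this also fails, the InternalError propagates to the caller.
    buf.data = DeviceAPI::Get(device_)->AllocDataSpace(device_, nbytes, alignment, type_hint);
  }
  return buf;
}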
Contributor

What should we expect if it still fails here?

@ganler (Contributor, Author)

If it still fails, an InternalError will be thrown, which surfaces as a TVMError about OOM on the Python end.

Contributor

Hmm, would it be better to let ReleaseAll return the size it released and check whether that is larger than the requested size? Then we could throw an error message including both sizes directly, without calling alloc again.
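
A rough sketch of that suggestion (hypothetical: in the current code ReleaseAll() returns void, so the return value below is assumed):

// Hypothetical variant: ReleaseAll() reports how many bytes it freed, and we fail
// fast with both sizes instead of attempting a second device allocation.
size_t released = ReleaseAll();
if (released < size) {
  LOG(FATAL) << "PooledAllocator: requested " << size << " bytes, but only " << released
             << " bytes of pooled memory could be released";
}
buf.data = DeviceAPI::Get(device_)->AllocDataSpace(device_, size, alignment, type_hint);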

@ganler (Contributor, Author) commented Jun 20, 2021

Thanks for the suggestion. But IMHO this is not robust enough.

Say we have 8 GB of GPU memory, the PooledAllocator has 4 GB cached, and we want to allocate 6 GB.

  • With your idea, ReleaseAll() returns "4 GB", which is less than "6 GB", so the allocation fails immediately.
  • Instead, if we release the unused memory and retry the allocation, the 6 GB request is very likely to succeed.

The bigger picture behind your idea becomes practical if we have APIs like "total_system_memory" and "available_system_memory", which may require introducing a series of runtime/driver libraries, e.g. cudaMemGetInfo from the CUDA runtime (user space) or NVML (if some system privilege is allowed).
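
For reference, a minimal standalone sketch of the user-space query mentioned above, using the CUDA runtime's cudaMemGetInfo (not part of this PR):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  size_t free_bytes = 0, total_bytes = 0;
  // Reports the free and total amount of device memory for the current CUDA device.
  cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("available: %zu bytes, total: %zu bytes\n", free_bytes, total_bytes);
  return 0;
}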

Contributor

Fair enough. I don't have other comments then.

@jcf94 jcf94 merged commit 5537788 into apache:main Jun 21, 2021
@ganler ganler deleted the pooled_alloc_frag branch June 21, 2021 12:56
@masahi (Member) commented Aug 23, 2021

This change doesn't solve the issue in #8233, because AllocDataSpace can be called from NDArray::Empty:

tvm/src/runtime/ndarray.cc, lines 196-197 (at commit 4d9bc9b):

DeviceAPI::Get(ret->device)
->AllocDataSpace(ret->device, shape.size(), shape.data(), ret->dtype, mem_scope);

That call is not protected by a try/catch, so if almost all memory is held by the PooledAllocator and NDArray::Empty is called, the program crashes with the following error:

terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [19:12:54] /home/masa/projects/dev/tvm/src/runtime/vulkan/vulkan_stream.cc:123: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-13: Unknown Vulkan error code
Stack trace:
  0: tvm::runtime::vulkan::VulkanStream::Synchronize()
  1: _ZN3tvm7runtime6vulkan15VulkanDeviceAPI13FreeDataSpac
  2: tvm::runtime::NDArray::Internal::DefaultDeleter(tvm::runtime::Object*)
  3: tvm::runtime::NDArray::CopyTo(DLDevice const&) const
  4: tvm::runtime::vm::CopyTo(tvm::runtime::ObjectRef, DLDevice const&)
  5: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::vm::VirtualMachine::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::$_6>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  6: TVMFuncCall

I think we need to revisit the memory release strategy of PooledAllocator.
