
[VM][PooledAllocator] try reallocation once when OOM #8285

Merged 1 commit into apache:main on Jun 21, 2021

Conversation

@ganler (Contributor) commented Jun 19, 2021

See TVM Discussion Topic.

This change aims to make TVM's behaviour more robust when an OOM error occurs and to resolve a mysterious uncaught-exception bug.

Potential reviewers: @junrushao1994 @icemelon9 @jroesch

@YuchenJin (Contributor) left a comment


LGTM! Thanks @ganler for the contribution!

LOG(WARNING) << "PooledAllocator got InternalError during allocation: " << err.message();
LOG(WARNING) << "Trying to release all unused memory and reallocate...";
ReleaseAll();
buf.data = DeviceAPI::Get(device_)->AllocDataSpace(device_, size, alignment, type_hint);
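
For context, the retried allocation sits inside a try/catch around the device allocation in PooledAllocator::Alloc. A simplified sketch of the shape of the change (pool lookup, page-size rounding, and locking omitted):

Buffer Alloc(size_t nbytes, size_t alignment, DLDataType type_hint) override {
  Buffer buf;
  buf.device = device_;
  buf.size = nbytes;
  try {
    // First attempt: allocate fresh memory from the device.
    buf.data = DeviceAPI::Get(device_)->AllocDataSpace(device_, nbytes, alignment, type_hint);
  } catch (InternalError& err) {
    LOG(WARNING) << "PooledAllocator got InternalError during allocation: " << err.message();
    LOG(WARNING) << "Trying to release all unused memory and reallocate...";
    ReleaseAll();  // return every cached-but-unused buffer to the device
    // Second and final attempt; if this also fails, the InternalError propagates to the caller.
    buf.data = DeviceAPI::Get(device_)->AllocDataSpace(device_, nbytes, alignment, type_hint);
  }
  return buf;
}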
Contributor

What should we expect if it still fails here?

@ganler (Contributor, Author)

If it still fails, an InternalError will be thrown, which surfaces as a TVMError about OOM on the Python end.

Contributor

Hmm, would it be better to let ReleaseAll return the size it released and check whether that is larger than the requested size? Then we could throw an error message including both sizes directly, without calling alloc again.
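
A rough sketch of that suggestion (hypothetical: in the current code ReleaseAll() returns void, so the return value below is assumed):

// Hypothetical variant: ReleaseAll() reports how many bytes it freed, and we fail
// fast with both sizes instead of attempting a second device allocation.
size_t released = ReleaseAll();
if (released < size) {
  LOG(FATAL) << "PooledAllocator: requested " << size << " bytes, but only " << released
             << " bytes of pooled memory could be released";
}
buf.data = DeviceAPI::Get(device_)->AllocDataSpace(device_, size, alignment, type_hint);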

@ganler (Contributor, Author) commented Jun 20, 2021

Thanks for the suggestion. But IMHO this is not robust enough.

Say we have 8 GB of GPU memory, the PooledAllocator has 4 GB cached, and we want to allocate 6 GB.

  • With your idea, ReleaseAll() returns "4 GB", which is less than "6 GB", so the allocation fails immediately.
  • Instead, if we release the unused memory and retry the allocation, the 6 GB request is very likely to succeed.

The bigger picture behind your idea becomes practical if we have APIs like "total_system_memory" and "available_system_memory", which may require introducing a series of runtime/driver libraries, e.g. cudaMemGetInfo from the CUDA runtime (user space) or NVML (if some system privilege is allowed).
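
For reference, a minimal standalone sketch of the user-space query mentioned above, using the CUDA runtime's cudaMemGetInfo (not part of this PR):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  size_t free_bytes = 0, total_bytes = 0;
  // Reports the free and total amount of device memory for the current CUDA device.
  cudaError_t err = cudaMemGetInfo(&free_bytes, &total_bytes);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  std::printf("available: %zu bytes, total: %zu bytes\n", free_bytes, total_bytes);
  return 0;
}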

Contributor

Fair enough. I don't have other comments then.

@jcf94 jcf94 merged commit 5537788 into apache:main Jun 21, 2021
@ganler ganler deleted the pooled_alloc_frag branch June 21, 2021 12:56
@masahi (Member) commented Aug 23, 2021

This change doesn't solve the issue in #8233, because AllocDataSpace can be called from NDArray::Empty:

tvm/src/runtime/ndarray.cc, lines 196-197 (at commit 4d9bc9b):

DeviceAPI::Get(ret->device)
->AllocDataSpace(ret->device, shape.size(), shape.data(), ret->dtype, mem_scope);

That call is not protected by a try/catch, so if almost all memory is held by the PooledAllocator and NDArray::Empty is called, the program crashes with the following error:

terminate called after throwing an instance of 'tvm::runtime::InternalError'
  what():  [19:12:54] /home/masa/projects/dev/tvm/src/runtime/vulkan/vulkan_stream.cc:123: 
---------------------------------------------------------------
An error occurred during the execution of TVM.
For more information, please see: https://tvm.apache.org/docs/errors.html
---------------------------------------------------------------
  Check failed: (__e == VK_SUCCESS) is false: Vulkan Error, code=-13: Unknown Vulkan error code
Stack trace:
  0: tvm::runtime::vulkan::VulkanStream::Synchronize()
  1: _ZN3tvm7runtime6vulkan15VulkanDeviceAPI13FreeDataSpac
  2: tvm::runtime::NDArray::Internal::DefaultDeleter(tvm::runtime::Object*)
  3: tvm::runtime::NDArray::CopyTo(DLDevice const&) const
  4: tvm::runtime::vm::CopyTo(tvm::runtime::ObjectRef, DLDevice const&)
  5: std::_Function_handler<void (tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*), tvm::runtime::vm::VirtualMachine::GetFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::$_6>::_M_invoke(std::_Any_data const&, tvm::runtime::TVMArgs&&, tvm::runtime::TVMRetValue*&&)
  6: TVMFuncCall

I think we need to revisit the memory release strategy of PooledAllocator.
