-
-
Notifications
You must be signed in to change notification settings - Fork 11.2k
[Bugfix] Add strong reference to CUDA pluggable allocator callbacks #23477
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses a critical memory corruption bug caused by Python's garbage collector freeing CUDA callback methods that were still in use by a C++ extension. The fix introduces strong references to these callbacks to ensure they are not prematurely collected.
The approach taken is correct, but it could be made more robust. I've added a comment suggesting a refactoring that would prevent accidental reintroduction of the bug in the future by making the API safer to use. This change would make the fix more resilient to future code modifications.
vllm/device_allocator/cumem.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this correctly creates strong references to the callbacks, it introduces new public attributes (..._ref) and leaves the original methods (python_malloc_callback, python_free_callback) accessible. If a developer accidentally uses the original methods, they will get a new temporary bound method object, reintroducing the memory corruption bug. This makes the fix fragile.
A more robust approach would be to rename the callback methods with a leading underscore (e.g., _python_malloc_callback) to indicate they are for internal use, and then create public attributes in __init__ that hold the bound methods. This way, any access to self.python_malloc_callback will return the same strongly-referenced object, making the API safer and preventing accidental misuse.
Here's how you could refactor it:
- Rename the methods in
vllm/device_allocator/cumem.py:def _python_malloc_callback(self, allocation_handle: HandleType) -> None: ... def _python_free_callback(self, ptr: int) -> HandleType: ...
- Update
__init__to create the public, persistent bound methods:# In __init__ self.python_malloc_callback = self._python_malloc_callback self.python_free_callback = self._python_free_callback
- Update the call site in
use_memory_poolto use the public attributes (which no longer need the_refsuffix):# In use_memory_pool with use_memory_pool_with_allocator( self.python_malloc_callback, self.python_free_callback) as data: ...
This refactoring would make the fix more resilient to future changes.
| self.pointer_to_data: dict[int, AllocationData] = {} | ||
| self.current_tag: str = CuMemAllocator.default_tag | ||
| self.allocator_and_pools: dict[str, Any] = {} | ||
| self.python_malloc_callback = self._python_malloc_callback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add comment to explain why we are doing this? self._python_malloc_callback is a temporary object everytime we access it.
Signed-off-by: youkaichao <youkaichao@gmail.com>
…llm-project#23477) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai> Co-authored-by: youkaichao <youkaichao@gmail.com> Signed-off-by: johnnynunez <johnnynuca14@gmail.com>
…llm-project#23477) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai> Co-authored-by: youkaichao <youkaichao@gmail.com>
…llm-project#23477) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai> Co-authored-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
…llm-project#23477) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai> Co-authored-by: youkaichao <youkaichao@gmail.com>
…llm-project#23477) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai> Co-authored-by: youkaichao <youkaichao@gmail.com>
…llm-project#23477) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai> Co-authored-by: youkaichao <youkaichao@gmail.com>
…llm-project#23477) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai> Co-authored-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
…llm-project#23477) Signed-off-by: 22quinn <33176974+22quinn@users.noreply.github.com> Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: Eric Marcus <eric.marcus@kaiko.ai> Co-authored-by: youkaichao <youkaichao@gmail.com>
|
I still have this error:[rank1]:[W1110 21:40:15.983774416 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) terminate called after throwing an instance of 'c10::Error' |
Purpose
This is the Python-only fix version of #22724. See context and discussions there.
Thanks @EricMarcus-ai for the awesome work!
Related:
vLLM: #20431
vLLM: #16993
OpenRLHF: OpenRLHF/OpenRLHF#1052
ModelScope: modelscope/ms-swift#4353
Test Plan
Below is a minor modification to the initial test added in #22724. Because this is a stress test, will not add to CI for cost/perf reason.
Test Result
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.