[core][compiled graphs] Controllably destroy CUDA events in GPUFutures #51090
Conversation
… teardown Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
CC @dengwxn
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Thanks for this fix, great find!
Hi @AndyUB, I'm trying to remove cupy for non-CUDA accelerators (#51032). About `destroy_event`: is it safe to just delete `self._event`?

```python
def destroy_event(self) -> None:
    """
    Destroy the CUDA event associated with this future.
    """
    if self._event is None:
        return
    del self._event  # this may not be necessary
    self._event = None
```

I read the source code in torch: when there are no remaining references to an event, torch destroys it in the destructor.
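To illustrate the point raised above, here is a minimal sketch (using a hypothetical `FakeEvent` stand-in, not torch's actual event class) showing that in CPython the destructor fires as soon as the last reference disappears, so assigning `None` over the attribute is already enough and the extra `del self._event` adds nothing:

```python
class FakeEvent:
    """Stand-in for a CUDA event; __del__ mimics the driver-side destroy call."""
    destroyed_count = 0

    def __del__(self):
        # A real event wrapper would call the driver's event-destroy API here.
        FakeEvent.destroyed_count += 1


class FakeFuture:
    """Minimal stand-in for a future that owns one event."""

    def __init__(self):
        self._event = FakeEvent()

    def destroy_event(self):
        if self._event is None:
            return
        # Dropping the attribute's reference is enough: CPython runs the
        # destructor as soon as the *last* reference disappears.
        self._event = None


fut = FakeFuture()
fut.destroy_event()
print(FakeEvent.destroyed_count)  # 1: the only reference was dropped

fut2 = FakeFuture()
extra_ref = fut2._event           # a second reference keeps the event alive
fut2.destroy_event()
print(FakeEvent.destroyed_count)  # still 1: extra_ref delays destruction
del extra_ref
print(FakeEvent.destroyed_count)  # 2: now the destructor has run
```

The second half of the example is the caveat: if any other code still holds a reference to the event, neither `del` nor assigning `None` destroys it immediately, which is why destroying at a controlled point matters.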
Signed-off-by: hipudding <huafengchun@gmail.com>
Why are these changes needed?
Currently, a `GPUFuture` contains a recorded CUDA event. When the `GPUFuture` is garbage-collected, its event is also garbage-collected, at which point cupy destroys that CUDA event. This is problematic because the event might be destroyed after other CUDA resources. In particular, we found that the overlapping test in `test_torch_tensor_dag.py` consistently prints an invalid CUDA memory error during DAG teardown. Our hypothesis is that, for an event, CUDA likely stores a pointer to the stream on which it was recorded. Since the CUDA streams are destroyed before the events, the stream pointer is no longer valid when the event is destroyed.

This PR fixes the issue by caching an actor's unresolved `GPUFuture`s in its serialization context. After a `GPUFuture` has been waited on, its event is manually destroyed. During teardown, the events inside all unresolved `GPUFuture`s are destroyed before other CUDA resources.

Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I have added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
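The event-lifetime fix described in the PR body can be sketched as follows. All names here (`Event`, `GPUFuture`, `SerializationContext`) are illustrative stand-ins for the idea, not Ray's actual classes or APIs:

```python
class Event:
    """Stand-in for a recorded CUDA event."""
    destroyed = False

    def destroy(self):
        # Real code would call the driver's event-destroy API here,
        # while the stream the event was recorded on is still alive.
        self.destroyed = True


class GPUFuture:
    """Owns one event and destroys it at a controlled point."""

    def __init__(self, event):
        self._event = event

    def wait(self):
        # ... synchronize on the event, then destroy it eagerly ...
        self.destroy_event()

    def destroy_event(self):
        if self._event is not None:
            self._event.destroy()
            self._event = None


class SerializationContext:
    """Tracks unresolved futures so teardown can destroy their events first."""

    def __init__(self):
        self._unresolved = set()

    def add_future(self, fut):
        self._unresolved.add(fut)

    def resolve(self, fut):
        fut.wait()
        self._unresolved.discard(fut)

    def teardown(self):
        # Destroy remaining events *before* streams and other CUDA
        # resources are torn down, so no event outlives its stream.
        for fut in self._unresolved:
            fut.destroy_event()
        self._unresolved.clear()


ctx = SerializationContext()
e1, e2 = Event(), Event()
f1, f2 = GPUFuture(e1), GPUFuture(e2)
ctx.add_future(f1)
ctx.add_future(f2)
ctx.resolve(f1)   # f1's event is destroyed right after the wait
ctx.teardown()    # f2's event is destroyed before other resources
print(e1.destroyed, e2.destroyed)  # True True
```

The key design point is the ordering guarantee: every event is destroyed either immediately after its future resolves or explicitly during teardown, so destruction is never left to the garbage collector to run after the streams are gone.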