[core][compiled graphs] Controllably destroy CUDA events in `GPUFuture`s #51090

AndyUB · 2025-03-05T04:24:31Z

Why are these changes needed?

Currently, a GPUFuture contains a recorded CUDA event. When the GPUFuture is garbage-collected, its event is also garbage-collected, at which point cupy destroys that CUDA event.

This is problematic because the event might be destroyed after other CUDA resources. In particular, we found that the overlapping test in test_torch_tensor_dag.py consistently prints an invalid CUDA memory error during DAG teardown. Our hypothesis is that CUDA likely stores, for each event, a pointer to the stream the event was recorded on. Since the CUDA streams are destroyed before the events, that stream pointer is no longer valid when the event is destroyed.

This PR fixes the issue by caching an actor's unresolved GPUFutures in its serialization context. After the GPUFuture has been waited on, its event is manually destroyed. During teardown, the events inside all unresolved GPUFutures are destroyed before other CUDA resources.
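The lifecycle described above can be sketched without a GPU. This is a minimal illustration, not the code in this PR: `FakeEvent`, `SketchFuture`, `teardown`, and the `destroyed` log are hypothetical names, and the "stream" entry only stands in for the other CUDA resources released at teardown.

```python
from typing import Dict, List, Optional

# Records destruction order so the sketch can be checked without a GPU.
destroyed: List[str] = []


class FakeEvent:
    """Hypothetical stand-in for a recorded CUDA event."""

    def __init__(self, name: str) -> None:
        self.name = name

    def destroy(self) -> None:
        destroyed.append(f"event:{self.name}")


class SketchFuture:
    """Hypothetical future that owns an event and destroys it explicitly."""

    # Cache of futures that have not been waited on yet, keyed by id.
    unresolved: Dict[int, "SketchFuture"] = {}

    def __init__(self, fut_id: int, event: FakeEvent) -> None:
        self._fut_id = fut_id
        self._event: Optional[FakeEvent] = event
        SketchFuture.unresolved[fut_id] = self

    def wait(self) -> None:
        # A real implementation would make the current stream wait on the
        # event here; the sketch only models the cleanup that follows.
        self.destroy_event()
        SketchFuture.unresolved.pop(self._fut_id, None)

    def destroy_event(self) -> None:
        if self._event is None:
            return
        self._event.destroy()
        self._event = None


def teardown() -> None:
    # Destroy the events of all unresolved futures first ...
    for fut in list(SketchFuture.unresolved.values()):
        fut.destroy_event()
    SketchFuture.unresolved.clear()
    # ... and only then release the other resources (assumed: streams).
    destroyed.append("stream")


a = SketchFuture(0, FakeEvent("a"))
b = SketchFuture(1, FakeEvent("b"))
a.wait()    # resolved: its event is destroyed right away
teardown()  # unresolved b: its event is destroyed before the stream
print(destroyed)  # ['event:a', 'event:b', 'stream']
```

The point of the explicit cache is that destruction order no longer depends on when the garbage collector happens to run.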

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a
    method in Tune, I've added it in doc/source/tune/api/ under the
    corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(


AndyUB · 2025-03-05T04:25:28Z

CC @dengwxn


stephanie-wang

Thanks for this fix, great find!

python/ray/experimental/channel/serialization_context.py

python/ray/dag/dag_operation_future.py


hipudding · 2025-03-11T07:05:36Z

Hi @AndyUB, I'm trying to remove cupy for non-CUDA accelerators (#51032). About destroy_event: is it safe to just delete self._event?

    def destroy_event(self) -> None:
        """
        Destroy the CUDA event associated with this future.
        """
        if self._event is None:
            return

        del self._event  # This may not be necessary.
        self._event = None

I read the torch source code: when there are no remaining references to an event, torch destroys the event in its destructor.
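As a rough illustration of the question, here is a CPython-only sketch of what dropping the reference does. `FakeEvent` and `Holder` are hypothetical stand-ins, not torch or cupy classes, and the immediate `__del__` call relies on CPython's reference counting.

```python
from typing import List, Optional

log: List[str] = []


class FakeEvent:
    """Hypothetical stand-in for an event whose destructor frees it."""

    def __del__(self) -> None:
        log.append("destroyed")


class Holder:
    """Hypothetical owner that only drops its reference to the event."""

    def __init__(self) -> None:
        self._event: Optional[FakeEvent] = FakeEvent()

    def destroy_event(self) -> None:
        if self._event is None:
            return
        self._event = None  # drops this holder's reference


# Case 1: the holder has the only reference, so the destructor runs at once.
h1 = Holder()
h1.destroy_event()
after_first = list(log)

# Case 2: something else still references the event, so nothing is destroyed
# until that last reference goes away.
h2 = Holder()
extra = h2._event
h2.destroy_event()
while_shared = list(log)
del extra
after_release = list(log)

print(after_first, while_shared, after_release)
```

So dropping the attribute is only equivalent to an explicit destroy if nothing else holds the event, which is the condition to check before relying on the destructor.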


AndyUB added 4 commits March 4, 2025 14:37

fix: Destroy CUDA event in GPUFuture before other CUDA resources upon teardown

c49dc9d

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>

fix: Circular import

cfa2ca3

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>

fix: Free unresolved future from dag exception

b7d7168

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>

fix: Add type hints

f0ab691

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>

jcotant1 added the core label (Issues that should be addressed in Ray Core) Mar 5, 2025

chore: comments

cf4db9e

Signed-off-by: Weixin Deng <weixin@cs.washington.edu>

stephanie-wang reviewed Mar 8, 2025

python/ray/experimental/channel/serialization_context.py (outdated, resolved)

python/ray/experimental/channel/serialization_context.py (outdated, resolved)

python/ray/dag/dag_operation_future.py (outdated, resolved)

AndyUB added 2 commits March 8, 2025 00:13

refactor: Cache futures in GPUFuture instead of serialization context

e1e79e6

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>

format: Remove unused

ac2ff55

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>

stephanie-wang approved these changes Mar 10, 2025


stephanie-wang added the go label (add ONLY when ready to merge, run all tests) Mar 10, 2025

stephanie-wang enabled auto-merge (squash) March 10, 2025 03:28

stephanie-wang merged commit 4503d6e into ray-project:master Mar 10, 2025
6 of 7 checks passed

stephanie-wang deleted the gpufut-fix-0304 branch March 11, 2025 00:19

hipudding added a commit to hipudding/ray that referenced this pull request Mar 11, 2025

resolve conflicts with ray-project#51090

034fd1b

Signed-off-by: hipudding <huafengchun@gmail.com>

hipudding added a commit to hipudding/ray that referenced this pull request Mar 11, 2025

resolve conflicts with ray-project#51090

5314927

Signed-off-by: hipudding <huafengchun@gmail.com>
