Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core][compiled graphs] Controllably destroy CUDA events in GPUFutures #51090

Merged
merged 7 commits into from
Mar 10, 2025

Conversation

AndyUB
Copy link
Contributor

@AndyUB AndyUB commented Mar 5, 2025

Why are these changes needed?

Currently, a GPUFuture contains a recorded CUDA event. When the GPUFuture is garbage-collected, its event is also garbage-collected, at which point cupy destroys that CUDA event.

This is problematic because the event might be destroyed after other CUDA resources. In particular, we found that the overlapping test in test_torch_tensor_dag.py consistently prints out an invalid cuda memory error that occurs during dag teardown. Our hypothesis is that: For an event, CUDA likely stores a pointer to the stream it recorded. Since the CUDA streams are destroyed before the events, the stream pointer is no longer valid when the event is destroyed.

This PR fixes the issue by caching an actor's unresolved GPUFutures in its serialization context. After the GPUFuture has been waited on, its event is manually destroyed. During teardown, the events inside all unresolved GPUFutures are destroyed before other CUDA resources.

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

AndyUB added 4 commits March 4, 2025 14:37
… teardown

Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
@AndyUB
Copy link
Contributor Author

AndyUB commented Mar 5, 2025

CC @dengwxn

@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Mar 5, 2025
Signed-off-by: Weixin Deng <weixin@cs.washington.edu>
Copy link
Contributor

@stephanie-wang stephanie-wang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this fix, great find!

AndyUB added 2 commits March 8, 2025 00:13
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
Signed-off-by: Yuhan Ruan <andyubryh@gmail.com>
@stephanie-wang stephanie-wang added the go add ONLY when ready to merge, run all tests label Mar 10, 2025
@stephanie-wang stephanie-wang enabled auto-merge (squash) March 10, 2025 03:28
@stephanie-wang stephanie-wang merged commit 4503d6e into ray-project:master Mar 10, 2025
6 of 7 checks passed
@stephanie-wang stephanie-wang deleted the gpufut-fix-0304 branch March 11, 2025 00:19
@hipudding
Copy link

hipudding commented Mar 11, 2025

Hi @AndyUB I'm trying to remove cupy for non-cuda accelerator(#51032 ). About destory_event, Is it safe for just deleting self._event?

    def destroy_event(self) -> None:
        """
        Destroy the CUDA event associated with this future.
        """
        if self._event is None:
            return

        del self._event # it may not necessary.
        self._event = None

I read source code in torch. When there's no reference to event. torch will destory event by destructor.

hipudding added a commit to hipudding/ray that referenced this pull request Mar 11, 2025
Signed-off-by: hipudding <huafengchun@gmail.com>
hipudding added a commit to hipudding/ray that referenced this pull request Mar 11, 2025
Signed-off-by: hipudding <huafengchun@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Issues that should be addressed in Ray Core go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants