
Stream synchronize before deallocating SAM #1655

Merged: 6 commits merged into rapidsai:branch-24.10 on Aug 26, 2024

Conversation

rongou (Contributor) commented Aug 20, 2024

Description

While investigating cuML benchmarks, I found an issue with the current system_memory_resource that causes a segfault. Roughly, it's in code like this:

void foo(...) {
  rmm::device_uvector<T> tmp(bufferSize, stream);
  // launch CUDA kernels on `stream` that make use of tmp
}  // tmp goes out of scope here; under SAM its memory is freed immediately

When the function returns, the device_uvector goes out of scope and gets deleted, while the CUDA kernel might still be in flight. With cudaFree, the CUDA runtime performs implicit synchronization to make sure the kernel finishes before actually freeing the memory, but with SAM we don't have that guarantee, which leads to use-after-free errors.

This is a rather simple fix. In the future we may want to use CUDA events to make this less blocking.
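
For illustration, a minimal sketch of the approach, assuming a free-standing helper; the function name and the use of std::free are illustrative, not the exact code in this PR:

#include <rmm/cuda_stream_view.hpp>

#include <cstdlib>

// With SAM the underlying free() returns immediately, so deallocation must first
// wait for any in-flight work on the stream that may still be using the buffer.
void example_deallocate(void* ptr, rmm::cuda_stream_view stream)
{
  stream.synchronize();  // wait for kernels that may still read or write ptr
  std::free(ptr);        // only now is it safe to return the memory to the system
}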

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Closes #1656 ([BUG] System MR causes segfault)

rongou requested a review from a team as a code owner on August 20, 2024
rongou requested reviews from wence- and vyasr on August 20, 2024
rongou self-assigned this on Aug 20, 2024
The github-actions bot added the cpp (Pertains to C++ code) label on Aug 20, 2024
rongou added the bug (Something isn't working), 3 - Ready for review (Ready for review by team), and non-breaking (Non-breaking change) labels on Aug 20, 2024
rongou requested reviews from harrism and leofang on August 20, 2024
rongou (Contributor Author) commented Aug 20, 2024

This is similar to what we are trying to do in cuPy on the Python side: cupy/cupy#8442

Inline review thread (diff context):

// synchronization. However, with SAM, since `free` is immediate, we need to wait for in-flight
// CUDA operations to finish before freeing the memory, to avoid potential use-after-free errors
// or race conditions.
stream.synchronize();
Member commented:

This is now a synchronous deallocation, as in the cuda::mr::memory_resource concept (but not the cuda::mr::async_memory_resource concept). As we continue refactoring toward those concepts, there will be two functions: deallocate_async which takes a stream, and deallocate which does not.

rongou (Contributor Author) replied:

Hmm, I think it's the other way around. The async version assumes everything is stream ordered, but since we are dealing with malloc/free, which doesn't understand CUDA streams, we have to synchronize here. The synchronous version would leave the synchronization to the caller, so we don't need to sync here.

Member replied:

I get what you are saying. I'm talking about the fact that deallocate_async is named "async" but will have to synchronize. So that should be documented for users.
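
A sketch of how the two forms discussed above could look for a system memory resource under this design; illustrative names and simplified signatures (alignment omitted), not the final RMM code:

#include <rmm/cuda_stream_view.hpp>

#include <cstdlib>

struct example_system_resource {
  // Synchronous form: the caller guarantees that no in-flight work still uses ptr.
  void deallocate(void* ptr, std::size_t /*bytes*/) { std::free(ptr); }

  // Stream-ordered form: despite the name, it must block on the stream here,
  // because plain free() is not stream ordered.
  void deallocate_async(void* ptr, std::size_t bytes, rmm::cuda_stream_view stream)
  {
    stream.synchronize();
    deallocate(ptr, bytes);
  }
};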

rongou (Contributor Author) replied:

I updated the doc; is there anything else you want me to do here?

Member replied:

Waiting on second approval.

harrism (Member) commented Aug 21, 2024

Reminder: we ask that every PR be associated with an issue. The PR template says this.

rongou (Contributor Author) commented Aug 21, 2024

> Reminder: we ask that every PR be associated with an issue. The PR template says this.

Done.

Inline review thread (diff context):

{
// With `cudaFree`, the CUDA runtime keeps track of dependent operations and does implicit

Member commented:

Please mention in the docstring of this function that it synchronizes stream.

rongou (Contributor Author) replied:

Done.

harrism (Member) commented Aug 22, 2024

I spoke with Vivek, and he advised that we not rely on cudaFreeHost being synchronized either (though it currently is). So we should duplicate this PR for some other host memory MRs.
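
A sketch of what that could look like for a pinned host memory resource; illustrative only, not part of this PR:

#include <rmm/cuda_stream_view.hpp>

#include <cuda_runtime_api.h>

// Do not rely on cudaFreeHost synchronizing: wait on the stream explicitly
// before releasing the pinned host allocation.
void example_pinned_deallocate(void* ptr, rmm::cuda_stream_view stream)
{
  stream.synchronize();  // ensure work enqueued on stream no longer uses ptr
  cudaFreeHost(ptr);     // release the pinned host memory
}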

wence- (Contributor) left a review comment:

This makes sense to me, thanks.

rongou (Contributor Author) commented Aug 23, 2024

The test failure seems to be unrelated; I can't reproduce it on my machine.

rongou removed the request for review from leofang on August 26, 2024
rongou (Contributor Author) commented Aug 26, 2024

Can this be merged? Thanks!

harrism (Member) commented Aug 26, 2024

Yeah, I reran the failed test last night. It must have been a flaky GPU?

harrism (Member) commented Aug 26, 2024

/merge

rapids-bot merged commit 0a8ae04 into rapidsai:branch-24.10 on Aug 26, 2024
50 checks passed