Stream synchronize before deallocating SAM #1655
Conversation
This is similar to what we are trying to do in cuPy on the Python side: cupy/cupy#8442
// synchronization. However, with SAM, since `free` is immediate, we need to wait for in-flight
// CUDA operations to finish before freeing the memory, to avoid potential use-after-free errors
// or race conditions.
stream.synchronize();
This is now a synchronous deallocation, as in the `cuda::mr::memory_resource` concept (but not the `cuda::mr::async_memory_resource` concept). As we continue refactoring toward those concepts, there will be two functions: `deallocate_async`, which takes a stream, and `deallocate`, which does not.
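For reference, a rough sketch of what those two entry points could look like under the `cuda::mr` concepts (signatures approximate, not taken from the final API):

```cpp
#include <cuda/stream_ref>
#include <cstddef>

// Synchronous deallocation: the caller guarantees no in-flight work still uses `ptr`.
void deallocate(void* ptr, std::size_t bytes, std::size_t alignment);

// Stream-ordered deallocation: also takes the stream the work was submitted on.
void deallocate_async(void* ptr, std::size_t bytes, std::size_t alignment, cuda::stream_ref stream);
```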
Hmm, I think it's the other way around. The async version assumes everything is stream-ordered, but since we are dealing with `malloc`/`free`, which don't understand CUDA streams, we have to synchronize here. The synchronous version would leave the synchronization to the caller, so we don't need to sync there.
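To make that concrete, a minimal sketch of the idea (illustrative only, not the actual rmm implementation):

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <cstddef>
#include <cstdlib>

// Stream-ordered path: free() is not stream-ordered, so drain the stream first.
void deallocate_async(void* ptr, std::size_t bytes, rmm::cuda_stream_view stream)
{
  stream.synchronize();  // wait for in-flight kernels that may still be using `ptr`
  std::free(ptr);        // SAM memory is released with a plain free
}

// Synchronous path: the caller has already ensured no in-flight work uses `ptr`.
void deallocate(void* ptr, std::size_t bytes) { std::free(ptr); }
```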
I get what you are saying. I'm talking about the fact that `deallocate_async` is named "async" but will have to synchronize. So that should be documented for users.
I updated the doc; is there anything else you want me to do here?
Waiting on second approval.
Reminder: we ask that every PR be associated with an issue. The PR template says this.
Done.
{
// With `cudaFree`, the CUDA runtime keeps track of dependent operations and does implicit
Please mention in the docstring of this function that it synchronizes `stream`.
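One possible form of that note (wording illustrative, not the final docstring):

```cpp
/**
 * @brief Deallocate memory pointed to by `ptr`.
 *
 * @note This call synchronizes `stream` before freeing, because system-allocated
 *       memory is released with `free()`, which is not stream-ordered.
 */
```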
Done.
I spoke to Vivek, and he advised that we not rely on cudaHostFree being synchronised either (though it is). So we should duplicate this PR for some other host memory MRs.
This makes sense to me, thanks.
The test failure seems to be unrelated; I can't reproduce it on my machine.
Can this be merged? Thanks!
Yeah, I reran the failed test last night. It must have been a flaky GPU?
/merge
Description
While investigating cuml benchmarks, I found an issue with the current `system_memory_resource` that causes a segfault. Roughly, it's in code like this (the snippet below is an illustrative sketch; `foo` and `some_kernel` are placeholder names, not the actual benchmark code):
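```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <cstddef>

// Placeholder kernel standing in for whatever the benchmark actually launches.
__global__ void some_kernel(int* data, std::size_t n)
{
  if (threadIdx.x == 0 && n > 0) { data[0] = 1; }
}

void foo(rmm::cuda_stream_view stream)
{
  // Allocated from the current device resource, here assumed to be system_memory_resource (SAM).
  rmm::device_uvector<int> vec(1000, stream);
  some_kernel<<<256, 256, 0, stream.value()>>>(vec.data(), vec.size());
  // No synchronization before returning: `vec` is destroyed when `foo` returns, and with SAM
  // its memory is freed immediately, even though `some_kernel` may still be running.
}
```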
When the function returns, the `device_uvector` goes out of scope and gets deleted while the CUDA kernel might still be in flight. With `cudaFree`, the CUDA runtime would perform implicit synchronization to make sure the kernel finishes before actually freeing the memory, but with SAM we don't have that guarantee, which causes use-after-free errors.

This is a rather simple fix. In the future we may want to use CUDA events to make this less blocking.
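A rough sketch of what that event-based approach could look like (purely illustrative; `deferred_free`, `pending`, and `reclaim_completed` are made-up names, and a real implementation would need thread safety and a policy for when to scan the pending list):

```cpp
#include <cuda_runtime.h>
#include <cstdlib>
#include <vector>

// Record an event at deallocation time and defer the free until that event has completed,
// instead of blocking on stream synchronization.
struct deferred_free {
  void* ptr;
  cudaEvent_t event;
};

std::vector<deferred_free> pending;

void deallocate_async(void* ptr, cudaStream_t stream)
{
  cudaEvent_t ev;
  cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);
  cudaEventRecord(ev, stream);
  pending.push_back({ptr, ev});  // defer the free rather than synchronizing here
}

void reclaim_completed()
{
  for (auto it = pending.begin(); it != pending.end();) {
    if (cudaEventQuery(it->event) == cudaSuccess) {  // all work that may use it->ptr has finished
      cudaEventDestroy(it->event);
      std::free(it->ptr);
      it = pending.erase(it);
    } else {
      ++it;
    }
  }
}
```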
Checklist
Closes #1656