[Bug] Fix Negative Cuda Memory Usage #25683
Code Review
This pull request addresses a bug where CUDA graph memory usage was reported as a negative value. The root cause was asynchronous CUDA operations leading to inaccurate memory measurements. The fix correctly moves the memory measurement logic inside the graph_capture context and, crucially, adds a torch.cuda.synchronize() call before measuring the final memory usage. This ensures that all graph capture operations are complete before the memory is queried, providing an accurate measurement. The changes are logical, well-targeted, and should resolve the issue. The implementation is correct.
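The measurement pattern described in this review can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: the helper names `measure_memory_used` and `torch_measure` are hypothetical, and the extra synchronize before the first reading is an added precaution that the review does not mention. Only `torch.cuda.synchronize()` and `torch.cuda.mem_get_info()` are real PyTorch calls.

```python
from typing import Callable


def measure_memory_used(
    work: Callable[[], None],
    free_bytes: Callable[[], int],
    synchronize: Callable[[], None],
) -> int:
    """Return how many bytes of device memory `work` consumed.

    Hypothetical helper: the device queries are injected so the
    pattern can be exercised without a GPU.
    """
    # Assumption (not from the PR): drain pending work first so the
    # baseline reading is not affected by earlier async operations.
    synchronize()
    start_free = free_bytes()

    work()

    # The step the review highlights: wait for all queued operations
    # to finish BEFORE taking the final reading. Without this, the
    # final reading can race with in-flight work and the difference
    # can be wrong, or even negative.
    synchronize()
    end_free = free_bytes()

    return start_free - end_free


def torch_measure(work: Callable[[], None]) -> int:
    """Same pattern wired to PyTorch; requires a CUDA device."""
    import torch  # imported lazily so the sketch loads without torch

    return measure_memory_used(
        work,
        # mem_get_info() returns (free_bytes, total_bytes)
        free_bytes=lambda: torch.cuda.mem_get_info()[0],
        synchronize=torch.cuda.synchronize,
    )
```

In the PR itself, the measured work is the CUDA graph capture, and the final reading is taken inside the graph_capture context after the synchronize call.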
Signed-off-by: yewentao256 <zhyanwentao@126.com>
@smarterclayton Hi Clayton, could you validate this fix when you have time, so that we can get this landed?
Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
This PR updates the memory usage capture logic, which should hopefully fix the issue.
Test
@smarterclayton Could you try again using this branch?