Add option to disable weakref conversion for last piecewise cudagraph in a module #22282
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add the ready label to the PR. 🚀
Code Review
This pull request effectively addresses a critical bug in the piecewise CUDA graph compilation where intermediate outputs could be prematurely deallocated when multiple submodules are compiled. The introduction of a global last-graph check, distinguished from a local one, is a solid approach to fix this. The changes are well-contained and the added test case correctly verifies the fix. I have one high-severity comment regarding a protocol definition mismatch that should be addressed.
Minor question, otherwise LGTM.
@ProExpertProg the PR is now ready for review (the failing CI test is unrelated, due to a gateway error from HF Hub)
Looks good, thank you for providing extensive comments. One note about what happens with the flag when we redecorate.
This pull request has merge conflicts that must be resolved before it can be merged.
Not sure I understand the tests
vllm/compilation/counter.py
I think this is technically tracking whole submodels, not all piecewise graphs
For each module (either the entire model or a submodule of a model), it may or may not return a weakref for the last piecewise graph, so I think "number of piecewise graphs with weakref output" is accurate.
Do you think `num_weakref_cudagraph_captured` would be clearer, since there is `compilation_counter.num_cudagraph_captured` a few lines below?
vllm/compilation/cuda_graph.py, line 175 at 006477e:
```python
compilation_counter.num_cudagraph_captured += 1
```
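For context, a minimal sketch of how the proposed counter might sit alongside the existing one in `vllm/compilation/counter.py`; the existing field `num_cudagraph_captured` appears in the snippet above, while `num_weakref_cudagraph_captured` is only the name suggested in this thread, not necessarily what was merged:

```python
from dataclasses import dataclass


@dataclass
class CompilationCounter:
    # Existing counter: incremented once per captured cudagraph
    # (see the cuda_graph.py snippet above).
    num_cudagraph_captured: int = 0
    # Proposed counter from the discussion above: captures whose output
    # was converted to a weakref (field name is the suggestion, not final).
    num_weakref_cudagraph_captured: int = 0
```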
@ProExpertProg updated the PR to address comments. Also cc @youkaichao if you have any thoughts; I think you initially added this weakref logic.
Signed-off-by: Yong Hoon Shin <yhshin@meta.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
#21044 shows an example where multiple submodules in a model are compiled instead of the top-level model being compiled.
This currently has a subtle bug. For example, if we have two submodules compiled in a model, module A followed by module B, then a call to module B will overwrite the outputs returned by module A.
This is because:

- Each compiled submodule sets `self.is_last_graph=True` for the last graph among its locally compiled piecewise graphs.
- The cudagraph wrapper checks `self.is_last_graph=True` and assumes that the last graph's output will not be used by any other cuda graph, so it converts the final output of module A to a weakref.

This PR adds a new argument to the `@support_torch_compile` decorator, `no_weak_ref_output`, which can be set to `True` to disable the weakref conversion for non-last submodules in a model (see the usage sketch below).
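A minimal usage sketch, assuming the decorator accepts the new keyword described above; the submodule names and forward bodies are illustrative, not taken from the PR:

```python
import torch
from torch import nn

from vllm.compilation.decorators import support_torch_compile


# Module A is not the last compiled submodule in the model, so its final
# output must stay alive; no_weak_ref_output=True (the argument added by
# this PR) disables the weakref conversion for its last piecewise graph.
@support_torch_compile(no_weak_ref_output=True)
class SubmoduleA(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + 1


# Module B is the last compiled submodule in the model; the default
# behavior (weakref conversion for the last graph) remains safe here.
@support_torch_compile
class SubmoduleB(nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2
```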
Test Plan
Correctness test:
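The actual test body is not preserved in this page; below is a hedged sketch of the failure mode it guards against (all names are illustrative): with modules A and B both compiled piecewise, replaying B's cudagraph must not clobber the tensor returned by A's last piecewise graph.

```python
import torch


def check_submodule_outputs(module_a, module_b, x: torch.Tensor) -> None:
    ref_a = module_a(x).clone()  # snapshot of A's expected output
    out_a = module_a(x)          # before this PR, this buffer could be
                                 # weakref'd away and reused by B's graph
    _ = module_b(out_a)
    assert torch.allclose(out_a, ref_a), "module B overwrote module A's output"
```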
Test Result
Before this PR, the test fails because the compiled output does not pass `allclose` with eager.

After this PR, the test passes.