[BugFix] Temporary fix for IMA with MTP = 2 and full-cg #28315
Conversation
Code Review
This pull request introduces a temporary fix for an issue with multi-token prediction and full CUDA graphs by adjusting CUDA graph capture sizes. The core logic change is in a new method, adjust_cudagraph_sizes_to_be_multipe_of, which unfortunately contains a critical bug that can lead to runtime errors and incorrect behavior. I've provided a detailed review comment with a suggested fix for this issue.
vllm/config/vllm.py
Outdated
def adjust_cudagraph_sizes_to_be_multipe_of(self, multiple_of: int):
    new_sizes = sorted(
        [
            round_up(size, multiple_of)
            for size in self.compilation_config.cudagraph_capture_sizes
        ]
    )
    if new_sizes[-1] > self.compilation_config.max_cudagraph_capture_size:
        new_sizes = new_sizes[:-1]
    self.compilation_config.max_cudagraph_capture_size = new_sizes[-1]
    self.compilation_config.cudagraph_capture_sizes = new_sizes
The current implementation of adjust_cudagraph_sizes_to_be_multipe_of has several critical issues that can lead to incorrect behavior or runtime errors:
- Potential IndexError: If all cudagraph_capture_sizes, when rounded up, exceed max_cudagraph_capture_size, the new_sizes list can become empty after the if condition, leading to an IndexError on new_sizes[-1]. For example, if cudagraph_capture_sizes is [16], max_cudagraph_capture_size is 16, and multiple_of is 20, new_sizes becomes [20]. The if condition is met, and new_sizes is modified to [], causing a crash on the next line (reproduced in the sketch below).
- Incorrect filtering: The logic if new_sizes[-1] > ...: new_sizes = new_sizes[:-1] only checks and removes the largest element. If multiple rounded-up sizes exceed max_cudagraph_capture_size, the smaller ones will incorrectly remain in the list.
- Incorrect max_cudagraph_capture_size update: max_cudagraph_capture_size can be updated to a value larger than its original value, which seems to contradict its purpose as a hard limit derived from the scheduler and token configurations.
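The first failure mode can be reproduced outside of vllm in a few lines. This is a minimal standalone sketch (plain variables and a local round_up helper instead of the real compilation_config), not the actual call path:

def round_up(x: int, multiple_of: int) -> int:
    return ((x + multiple_of - 1) // multiple_of) * multiple_of

capture_sizes = [16]      # cudagraph_capture_sizes
max_capture_size = 16     # max_cudagraph_capture_size
multiple_of = 20

# Same logic as the current implementation above.
new_sizes = sorted(round_up(size, multiple_of) for size in capture_sizes)  # [20]
if new_sizes[-1] > max_capture_size:
    new_sizes = new_sizes[:-1]  # now []
print(new_sizes[-1])  # IndexError: list index out of range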
I suggest a more robust implementation that correctly filters the sizes and handles edge cases gracefully.
Additionally, there is a typo in the method name (multipe_of should be multiple_of). I've kept it in the suggestion to match the current code, but it should be corrected here and at the call site.
def adjust_cudagraph_sizes_to_be_multipe_of(self, multiple_of: int):
    max_size = self.compilation_config.max_cudagraph_capture_size
    # Use a set to handle duplicates from rounding up
    rounded_sizes = {
        round_up(size, multiple_of)
        for size in self.compilation_config.cudagraph_capture_sizes
    }
    new_sizes = sorted([s for s in rounded_sizes if s <= max_size])

    if not new_sizes:
        # All rounded-up sizes exceeded the max size.
        # Disable cudagraphs by setting sizes to empty.
        self.compilation_config.max_cudagraph_capture_size = 0
        self.compilation_config.cudagraph_capture_sizes = []
        return

    self.compilation_config.max_cudagraph_capture_size = new_sizes[-1]
    self.compilation_config.cudagraph_capture_sizes = new_sizes
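Applied to the same inputs as the sketch above, the suggested filtering falls back to disabling cudagraph capture instead of crashing (again a standalone sketch continuing from that example, with plain variables rather than the real compilation_config):

rounded_sizes = {round_up(size, multiple_of) for size in capture_sizes}   # {20}
new_sizes = sorted(s for s in rounded_sizes if s <= max_capture_size)     # []
if not new_sizes:
    max_capture_size, capture_sizes = 0, []   # disable cudagraph capture
else:
    max_capture_size, capture_sizes = new_sizes[-1], new_sizes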
💡 Codex Review
Here are some automated review suggestions for this pull request.
Code Review
This pull request refactors the computation of bs_to_padded_graph_size and introduces logic to adjust CUDA graph capture sizes. While the intent is to fix an issue with speculative decoding, the changes introduce two critical bugs. First, the refactoring of bs_to_padded_graph_size computation breaks the model initialization order, as it's now computed after profile_run which depends on it. Second, the new method to adjust capture sizes contains a typo and is vulnerable to an IndexError if it results in an empty list of sizes. I have provided detailed comments and suggestions to fix these critical issues.
Warning: Gemini encountered an error creating the review.
Temp fix for vllm-project#28207 Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Force-pushed from 5c36137 to 529078e
💡 Codex Review
Here are some automated review suggestions for this pull request.
…/neuralmagic/vllm into lwilkinson/tmp-full-cg-mtp-2-fix
ProExpertProg left a comment
This might not be compatible with sequence parallelism, but that's for high-throughput cases anyway; it might just be worth adding a warning.
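For illustration, such a warning might look roughly like the sketch below; the helper and config attribute names are assumptions for this sketch, not necessarily what was actually added in the PR:

import logging

logger = logging.getLogger(__name__)

def maybe_warn_sequence_parallelism(vllm_config, multiple_of: int) -> None:
    # Hypothetical helper: attribute names are assumed for illustration only.
    pass_config = vllm_config.compilation_config.pass_config
    if getattr(pass_config, "enable_sequence_parallelism", False):
        logger.warning(
            "Rounding cudagraph capture sizes to multiples of %d may not be "
            "compatible with sequence parallelism.", multiple_of)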
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Done 👍
Temporary fix for #28207
For now, just make sure that when spec-decode is enabled the cudagraph capture shapes are evenly divisible by 1 + num_speculative_tokens; see #28207 for more details. A small standalone sketch of this rounding is included after the test notes below.

Test 1
Tested with
Hits an IMA on main but not on this branch
Test 2
Correctly fails with
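As a rough standalone sketch of the adjustment described above (not the merged vllm code): with MTP = 2, each request can contribute up to 1 + num_speculative_tokens = 3 tokens per step, so the capture sizes are rounded up to multiples of 3 and de-duplicated. The capture sizes here are example values, not vllm defaults:

def round_up(x: int, multiple_of: int) -> int:
    return ((x + multiple_of - 1) // multiple_of) * multiple_of

num_speculative_tokens = 2                   # MTP = 2
multiple_of = 1 + num_speculative_tokens     # 3

capture_sizes = [1, 2, 4, 8, 16]
adjusted = sorted({round_up(size, multiple_of) for size in capture_sizes})
print(adjusted)  # [3, 6, 9, 18]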