[MODEL] Fix handling of multiple channels for gpt-oss with speculative decoding #26291
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request effectively addresses a bug in handling multiple channels for gpt-oss with speculative decoding, particularly in streaming mode. The changes introduce a more robust mechanism by tracking the state for each token, grouping them by channel and recipient, and then constructing the delta messages. This ensures that channel switches within a single decoding step are handled correctly. The related logging improvements are also a good addition, providing more comprehensive output. I have one suggestion to improve code readability and reduce the risk of future bugs by removing a magic number.
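A minimal sketch of the per-token grouping the review describes, with all names and data invented for illustration (they are not the PR's actual identifiers):

```python
from itertools import groupby

# Hypothetical per-token states as (channel, recipient, text_delta); in the
# real PR these would come from the harmony parser after each accepted token.
token_states = [
    ("analysis", None, "Let me"),
    ("analysis", None, " think."),
    ("final", None, "The answer"),
    ("final", None, " is 42."),
]

# Group consecutive tokens that share a (channel, recipient) pair, so a
# mid-step channel switch yields two deltas instead of only the last state.
for (channel, recipient), run in groupby(token_states, key=lambda s: (s[0], s[1])):
    text = "".join(delta for _, _, delta in run)
    print(channel, recipient, repr(text))
```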
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Force-pushed from a953d9c to ede4584
Force-pushed from 704867c to 3c1bf55
@codex review
Force-pushed from 3c1bf55 to e1f14dd
@codex review
Signed-off-by: Aleksandr Samarin <astrlrd@nebius.com>
Force-pushed the …-multiple-channels branch from e4f6360 to 3ad1d7b
@codex review
Codex Review: Didn't find any major issues. You're on a roll.
/gemini review
Code Review
This pull request provides a much-needed fix for handling multiple channels in gpt-oss streaming with speculative decoding. The previous implementation had a flaw where it only considered the state after the last token in a chunk, which could lead to data loss or incorrect message construction if the channel or recipient changed within the chunk.
The new approach is robust and correctly handles this complex scenario. Key improvements include:
- Tracking the state (channel, recipient, delta) for each individual token.
- Grouping consecutive tokens with the same state for efficient processing.
- Refactoring the logic to build a single, comprehensive `DeltaMessage` that can contain content, reasoning, and tool calls from a single chunk.
- Improving the indexing logic for tool calls, correctly handling calls that span multiple streamed chunks.
- Enhancing logging to be more comprehensive.
The changes significantly increase the correctness and reliability of streaming for gpt-oss models. The implementation is well-structured, and the added complexity is justified by the problem it solves. I don't see any issues with the proposed changes.
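To make the combined-message idea concrete, here is a rough sketch; `DeltaMessage` below is a stand-in dataclass and the accumulator names are assumptions, not the PR's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class DeltaMessage:  # stand-in; not vLLM's actual protocol class
    content: str | None = None
    reasoning_content: str | None = None
    tool_calls: list = field(default_factory=list)

def build_delta(combined_content: str,
                combined_reasoning: str,
                tool_messages: list) -> DeltaMessage | None:
    # Emit one message carrying every non-empty field gathered from the
    # chunk, instead of a message per field or only the last field seen.
    if not (combined_content or combined_reasoning or tool_messages):
        return None
    return DeltaMessage(
        content=combined_content or None,
        reasoning_content=combined_reasoning or None,
        tool_calls=tool_messages,
    )
```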
This pull request has merge conflicts that must be resolved before it can be merged.
Enhanced documentation for plugin patches:

1. Patch vllm-project#1 (Usage Tracking Helper):
   - Clarified as OPTIONAL (has fallback in harmony streaming patch)
   - Changed from "REQUIRED" to "OPTIONAL"
   - Explained fallback mechanism in patched_stream_method.py
   - Marked as upstreamable (minor utility addition)

2. Patch vllm-project#3 (Harmony Token-by-Token Streaming):
   - Added detailed speculative decoding context
   - Explained that the Eagle draft model generates 5-10 tokens per step
   - Documented specific failures with batch processing:
     * Tool calling broken
     * Multi-channel content lost
     * Token truncation during channel transitions
   - Added before/after code examples
   - Linked to PR vllm-project#26291 (Eagle3 Multi-Channel Streaming Fix)
   - Documented upstream status and removal plan

Key insight: This patch exists because Eagle speculative decoding returns multiple tokens per step, and upstream's batch processing can't handle per-token channel switching.

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
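The before/after pattern that commit message refers to can be sketched roughly as follows; `ToyParser` is an invented stand-in, not vLLM's harmony parser API:

```python
class ToyParser:
    """Toy stand-in for a harmony streaming parser (invented for illustration)."""
    def __init__(self):
        self.current_channel = None
        self.text_delta = ""

    def process(self, tokens):
        # Each toy token is (channel, text); a real parser decodes token ids.
        self.text_delta = ""
        for channel, text in tokens:
            self.current_channel = channel
            self.text_delta += text

tokens = [("analysis", "thinking... "), ("final", "Answer: 42")]

# Buggy pattern: feed the whole speculative step at once, then read state
# once. The "analysis" text gets attributed to the "final" channel.
p = ToyParser()
p.process(tokens)
print("batch:", p.current_channel, repr(p.text_delta))

# Fixed pattern: feed one token at a time and record state after each,
# so the mid-step channel switch is preserved.
p = ToyParser()
states = []
for tok in tokens:
    p.process([tok])
    states.append((p.current_channel, p.text_delta))
print("per-token:", states)
```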
Hi @astralord, just curious, were you still planning on merging this in? If so, I can do a review :)
@qandrew Hi! Great to hear, please take a look :)
Hi @astralord, overall looks good. Can you add the results of …
```python
if delta_message is not None:
    # Combine all non-empty fields into a single message
    if combined_content or combined_reasoning or tool_messages:
        delta_kwargs: dict[str, Any] = {}
```
do we need Any here?
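For context on this question: `delta_kwargs` collects heterogeneously typed values (strings plus a list of tool-call deltas), which is what `Any` papers over. One alternative, sketched under the assumption that the keys mirror `DeltaMessage`'s fields, is a `TypedDict`:

```python
from typing import TypedDict

class DeltaKwargs(TypedDict, total=False):
    # Key names mirror the snippet above; whether they match the real
    # DeltaMessage fields is an assumption.
    content: str
    reasoning_content: str
    tool_calls: list

delta_kwargs: DeltaKwargs = {}
delta_kwargs["content"] = "hello"
```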
Hi team!
Purpose
We've noticed that the recent PR doesn't fully fix the gpt-oss + streaming + speculative-decoding issue; for example, generated messages end abruptly. This happens because multiple tokens can relate to different channels (e.g. `<final>`, `<analysis>`, `None`) within one decoding step. This PR handles that case.
Test Plan
Server command:
Test script (streaming_client.py):
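The script itself is not included in this capture; a minimal client of the kind the test plan describes might look like the sketch below, where the model name, port, and the `reasoning_content` attribute are assumptions:

```python
# streaming_client.py: illustrative sketch only, not the PR's actual script.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Explain speculative decoding briefly."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # With the fix, reasoning ("analysis") and final content should arrive
    # intact even when one speculative step spans both channels.
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)
```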