Conversation

@astralord
Contributor

@astralord astralord commented Oct 6, 2025

Hi team!

Purpose

We've noticed that the recent PR doesn't fully fix the gpt-oss + streaming + speculative-decoding issue; for example, generated messages still end abruptly. This happens because the multiple tokens produced in a single decoding step can belong to different channels (e.g. <final>, <analysis>, None). This PR handles that case.
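To illustrate the idea (a minimal sketch with hypothetical parser names, not the actual vLLM code): track the parser state per token rather than only after the last token, and split each step's tokens into consecutive runs that share the same (channel, recipient) pair, emitting one delta per run.

from itertools import groupby

def split_into_channel_runs(tokens, parser):
    """Group one decoding step's tokens into runs by (channel, recipient).

    `parser` stands in for the harmony streaming parser: feeding it a token
    may switch the current channel (e.g. analysis -> final).
    """
    states = []
    for tok in tokens:
        parser.process(tok)
        states.append((parser.current_channel, parser.current_recipient, tok))
    # Consecutive tokens with identical state form one run, so a step that
    # crosses a channel boundary yields multiple deltas instead of one.
    return [
        (channel, recipient, [tok for _, _, tok in run])
        for (channel, recipient), run in groupby(states, key=lambda s: s[:2])
    ]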

Test Plan

Server command:

vllm serve openai/gpt-oss-20b --speculative-config '{"method": "eagle3", "model": "<name-of-your-draft-model>"}'

Test script streaming_client.py:

#!/usr/bin/env python3
"""Send concurrent chat-completion requests to a local vLLM server, in streaming or non-streaming mode."""

import asyncio
import sys
from typing import List, Dict

import httpx
from openai import AsyncOpenAI


class StreamingClient:
    def __init__(self, api_url: str = "http://127.0.0.1:8000/v1", api_key: str = "EMPTY"):
        self.client = AsyncOpenAI(
            api_key=api_key,
            base_url=api_url,
            timeout=httpx.Timeout(timeout=300.0, connect=5.0)
        )

    async def send_request(
        self,
        messages: List[Dict[str, str]],
        model: str = "openai/gpt-oss-20b",
        streaming: bool = True,
        **kwargs
    ) -> str:
        """Send a request and return only the generated content."""
        response = await self.client.chat.completions.create(
            messages=messages,
            model=model,
            stream=streaming,
            **kwargs
        )

        if streaming:
            generated_text = ""
            async for chunk in response:
                if chunk.choices and chunk.choices[0].delta.content:
                    generated_text += chunk.choices[0].delta.content
            return generated_text
        else:
            if response.choices and response.choices[0].message.content:
                return response.choices[0].message.content
            return ""

    async def send_multiple_requests(
        self,
        prompts: List[List[Dict[str, str]]],
        model: str = "openai/gpt-oss-20b",
        streaming: bool = True,
        **kwargs
    ) -> List[str]:
        """Send multiple requests concurrently and return generated contents."""
        tasks = [
            asyncio.create_task(self.send_request(messages, model, streaming, **kwargs))
            for messages in prompts
        ]
        return await asyncio.gather(*tasks)


async def main():
    test_prompts = [
        [{"role": "user", "content": "Explain quantum computing in simple terms."}],
        [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}],
        [{"role": "user", "content": "What are the benefits of renewable energy?"}],
        [{"role": "user", "content": "Describe the process of photosynthesis."}]
    ]

    api_url = "http://127.0.0.1:8000/v1"
    model = "openai/gpt-oss-20b"
    streaming = len(sys.argv) > 1 and sys.argv[1].lower() == "streaming"

    generation_params = {
        "max_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9
    }

    mode = "streaming" if streaming else "non-streaming"
    print(f"Starting {mode} requests to {api_url}")
    print(f"Using model: {model}")
    print(f"Number of requests: {len(test_prompts)}\n")

    client = StreamingClient(api_url=api_url)
    results = await client.send_multiple_requests(test_prompts, model, streaming, **generation_params)

    for i, content in enumerate(results, 1):
        print(f"--- Response {i} ---")
        print(content)
        print()


if __name__ == "__main__":
    asyncio.run(main())
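To reproduce, run the script in each mode (the first CLI argument selects streaming):

python streaming_client.py streaming
python streaming_client.py

Before this fix, streaming responses could end abruptly mid-message; with it, streaming and non-streaming outputs should be comparably complete.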

@mergify mergify bot added the frontend and gpt-oss labels Oct 6, 2025
@mergify

mergify bot commented Oct 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @astralord.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses a bug in handling multiple channels for gpt-oss with speculative decoding, particularly in streaming mode. The changes introduce a more robust mechanism by tracking the state for each token, grouping them by channel and recipient, and then constructing the delta messages. This ensures that channel switches within a single decoding step are handled correctly. The related logging improvements are also a good addition, providing more comprehensive output. I have one suggestion to improve code readability and reduce the risk of future bugs by removing a magic number.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

@astralord astralord force-pushed the fix-gpt-oss-with-speculative-decoding-handle-multiple-channels branch 2 times, most recently from a953d9c to ede4584 on October 6, 2025 at 11:37
@mergify mergify bot removed the needs-rebase label Oct 6, 2025
@astralord astralord force-pushed the fix-gpt-oss-with-speculative-decoding-handle-multiple-channels branch from 704867c to 3c1bf55 on October 6, 2025 at 11:40
@astralord
Contributor Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@astralord astralord force-pushed the fix-gpt-oss-with-speculative-decoding-handle-multiple-channels branch from 3c1bf55 to e1f14dd on October 6, 2025 at 11:55
@astralord
Contributor Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Signed-off-by: Aleksandr Samarin <astrlrd@nebius.com>
@astralord astralord force-pushed the fix-gpt-oss-with-speculative-decoding-handle-multiple-channels branch from e4f6360 to 3ad1d7b on October 6, 2025 at 12:19
@astralord
Contributor Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. You're on a roll.

@astralord
Contributor Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request provides a much-needed fix for handling multiple channels in gpt-oss streaming with speculative decoding. The previous implementation had a flaw where it only considered the state after the last token in a chunk, which could lead to data loss or incorrect message construction if the channel or recipient changed within the chunk.

The new approach is robust and correctly handles this complex scenario. Key improvements include:

  • Tracking the state (channel, recipient, delta) for each individual token.
  • Grouping consecutive tokens with the same state for efficient processing.
  • Refactoring the logic to build a single, comprehensive DeltaMessage that can contain content, reasoning, and tool calls from a single chunk.
  • Improving the indexing logic for tool calls, correctly handling calls that span across multiple streamed chunks.
  • Enhancing logging to be more comprehensive.

The changes significantly increase the correctness and reliability of streaming for gpt-oss models. The implementation is well-structured, and the added complexity is justified by the problem it solves. I don't see any issues with the proposed changes.
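For context, a simplified sketch of the combining step described above (the names combined_content, combined_reasoning, tool_messages, and delta_kwargs come from the diff excerpt quoted later in this thread; the DeltaMessage fields are an assumption about vLLM's protocol types, not a verbatim copy of the PR):

from typing import Any, Optional

from vllm.entrypoints.openai.protocol import DeltaMessage  # import path assumed

def build_delta_message(
    combined_content: str,
    combined_reasoning: str,
    tool_messages: list,
) -> Optional[DeltaMessage]:
    """Fold content, reasoning, and tool calls from one step into a single delta."""
    if not (combined_content or combined_reasoning or tool_messages):
        return None  # nothing new to stream for this step
    delta_kwargs: dict[str, Any] = {}
    if combined_content:
        delta_kwargs["content"] = combined_content
    if combined_reasoning:
        delta_kwargs["reasoning_content"] = combined_reasoning
    if tool_messages:
        delta_kwargs["tool_calls"] = tool_messages
    return DeltaMessage(**delta_kwargs)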

@mergify

mergify bot commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @astralord.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 14, 2025
Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Oct 20, 2025
Enhanced documentation for plugin patches:

1. Patch vllm-project#1 (Usage Tracking Helper):
   - Clarified as OPTIONAL (has fallback in harmony streaming patch)
   - Changed from "REQUIRED" to "OPTIONAL"
   - Explained fallback mechanism in patched_stream_method.py
   - Marked as upstreamable (minor utility addition)

2. Patch vllm-project#3 (Harmony Token-by-Token Streaming):
   - Added detailed speculative decoding context
   - Explained Eagle draft model generates 5-10 tokens per step
   - Documented specific failures with batch processing:
     * Tool calling broken
     * Multi-channel content lost
     * Token truncation during channel transitions
   - Added before/after code examples
   - Linked to PR vllm-project#26291 (Eagle3 Multi-Channel Streaming Fix)
   - Documented upstream status and removal plan

Key insight: This patch exists because Eagle speculative decoding
returns multiple tokens per step, and upstream's batch processing
can't handle per-token channel switching.

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
@qandrew
Contributor

qandrew commented Oct 25, 2025

Hi @astralord, just curious: were you still planning to merge this? If so, I can do a review :)

@astralord
Contributor Author

@qandrew Hi! Great to hear, please take a look :)

@qandrew
Contributor

qandrew commented Oct 29, 2025

Hi @astralord , overall looks good. Can you add the results of streaming_client.py to the PR description, and also add a unit test?

if delta_message is not None:
    # Combine all non-empty fields into a single message
    if combined_content or combined_reasoning or tool_messages:
        delta_kwargs: dict[str, Any] = {}
do we need Any here?
