
Conversation

@LuYanFCP (Contributor) commented Sep 4, 2025

Purpose

  1. Added native reasoning parser support for the SeedOss model in vLLM.
  2. Refactored and added BaseThinkingReasoningParser, which abstracts the common implementation shared by the Qwen3/DeepseekR1/SeedOss parsers, so a new reasoning parser can be implemented quickly by inheriting from BaseThinkingReasoningParser and defining the start_token/end_token variables (see the sketch below).
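
For illustration, a minimal sketch of what such a subclass could look like, assuming the abstract hooks are the start_token/end_token pair described above (the token strings, decorator, and registration key here are assumptions for illustration, not verbatim from this PR):

from vllm.reasoning import BaseThinkingReasoningParser, ReasoningParserManager

@ReasoningParserManager.register_module("seed_oss")
class SeedOSSReasoningParser(BaseThinkingReasoningParser):
    """A reasoning parser that only needs to declare its delimiter tokens."""

    @property
    def start_token(self) -> str:
        # Token that opens the reasoning block (assumed value).
        return "<seed:think>"

    @property
    def end_token(self) -> str:
        # Token that closes the reasoning block (assumed value).
        return "</seed:think>"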

Test Plan

  1. Added unit tests for SeedOss and BaseThinkingReasoningParser.
  2. The existing Qwen3/DeepseekR1 parser unit tests still pass.

Test Result

All tests pass:

root@d240c896f494:/workspaces/vllm-backup# pytest  tests/reasoning/test_qwen3_reasoning_parser.py tests/reasoning/test_deepseekr1_reasoning_parser.py tests/reasoning/test_base_thinking_reasoning_parser.py 
/usr/local/lib/python3.12/dist-packages/pytest_asyncio/plugin.py:208: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"

  warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
======================================================================================== test session starts ========================================================================================
platform linux -- Python 3.12.11, pytest-8.3.5, pluggy-1.5.0
rootdir: /workspaces/vllm-backup
configfile: pyproject.toml
plugins: hypothesis-6.131.0, rerunfailures-14.0, asyncio-0.24.0, schemathesis-3.39.15, shard-0.1.2, mock-3.14.0, hydra-core-1.3.2, forked-1.6.0, timeout-2.3.1, subtests-0.14.1, buildkite-test-collector-0.1.9, anyio-4.6.2.post1
asyncio: mode=Mode.STRICT, default_loop_scope=None
collected 55 items                                                                                                                                                                                  
Running 55 items in this shard

tests/reasoning/test_qwen3_reasoning_parser.py ..........                                                                                                                                     [ 18%]
tests/reasoning/test_deepseekr1_reasoning_parser.py ........................                                                                                                                  [ 61%]
tests/reasoning/test_base_thinking_reasoning_parser.py .....................                                                                                                                  [100%]

========================================================================================= warnings summary ==========================================================================================
../../usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305
  /usr/local/lib/python3.12/dist-packages/schemathesis/generation/coverage.py:305: DeprecationWarning: jsonschema.exceptions.RefResolutionError is deprecated as of version 4.18.0. If you wish to catch potential reference resolution errors, directly catch referencing.exceptions.Unresolvable.
    ref_error: type[Exception] = jsonschema.RefResolutionError,

tests/reasoning/test_base_thinking_reasoning_parser.py:13
  /workspaces/vllm-backup/tests/reasoning/test_base_thinking_reasoning_parser.py:13: PytestCollectionWarning: cannot collect test class 'TestThinkingReasoningParser' because it has a __init__ constructor (from: tests/reasoning/test_base_thinking_reasoning_parser.py)
    class TestThinkingReasoningParser(BaseThinkingReasoningParser):

tests/reasoning/test_base_thinking_reasoning_parser.py:19
  /workspaces/vllm-backup/tests/reasoning/test_base_thinking_reasoning_parser.py:19: PytestCollectionWarning: cannot collect test class 'TestThinkingReasoningParserAlt' because it has a __init__ constructor (from: tests/reasoning/test_base_thinking_reasoning_parser.py)
    class TestThinkingReasoningParserAlt(BaseThinkingReasoningParser):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================================== 55 passed, 3 warnings in 14.68s ==================================================================================
root@d240c896f494:/workspaces/vllm-backup# 

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@LuYanFCP LuYanFCP requested a review from aarnphm as a code owner September 4, 2025 17:02
@github-actions bot commented Sep 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the deepseek (Related to DeepSeek models) and qwen (Related to Qwen models) labels Sep 4, 2025
@gemini-code-assist bot left a comment

Code Review

This pull request introduces a BaseThinkingReasoningParser to abstract common logic for parsing reasoning content, which is a great refactoring. It also adds support for the SeedOss model.

The refactoring simplifies the DeepSeekR1ReasoningParser and Qwen3ReasoningParser by having them inherit from the new base class. However, I've found a critical issue in the implementation of BaseThinkingReasoningParser's streaming logic that would break existing functionality for deepseek_r1 and the new seed_oss parser. Please see my detailed comment.

After fixing the base class, the streaming behavior for qwen3 might change and become inconsistent with its non-streaming behavior. You may want to consider overriding extract_reasoning_content_streaming in Qwen3ReasoningParser to maintain its specific logic (treating everything as content if no start token is present); a sketch of such an override follows.
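
A minimal sketch of that kind of override, assuming the streaming hook takes the previous/current/delta text and token IDs as in vLLM's ReasoningParser interface (the start_token_id attribute and the fallback logic are assumptions for illustration, not the code in this PR):

from collections.abc import Sequence
from typing import Optional

from vllm.entrypoints.openai.protocol import DeltaMessage
from vllm.reasoning import BaseThinkingReasoningParser


class Qwen3ReasoningParser(BaseThinkingReasoningParser):
    def extract_reasoning_content_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
    ) -> Optional[DeltaMessage]:
        # Qwen3-specific behavior: if no start token has been seen yet,
        # treat everything as normal content rather than reasoning.
        if self.start_token_id not in current_token_ids:
            return DeltaMessage(content=delta_text)
        return super().extract_reasoning_content_streaming(
            previous_text, current_text, delta_text,
            previous_token_ids, current_token_ids, delta_token_ids)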

Signed-off-by: Yan Lu <luyan@nvidia.com>
@LuYanFCP LuYanFCP force-pushed the feat/seed_oss_parse_support branch 3 times, most recently from b04c83a to 32e4fd4 on September 5, 2025 01:47
@LuYanFCP LuYanFCP changed the title from "Support SeedOss Reason Parser" to "[feat] Support SeedOss Reason Parser" Sep 5, 2025
@LuYanFCP LuYanFCP changed the title from "[feat] Support SeedOss Reason Parser" to "[Model] Support SeedOss Reason Parser" Sep 5, 2025
@LuYanFCP LuYanFCP force-pushed the feat/seed_oss_parse_support branch 5 times, most recently from 3bb52fc to 77a4ec1 on September 5, 2025 13:31
@WojtekMatula

git clone git@github.com:LuYanFCP/vllm.git
cd vllm
git checkout feat/seed_oss_parse_support
VLLM_USE_PRECOMPILED=1 uv pip install --editable .

VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_LOGGING_LEVEL=DEBUG \
vllm serve Intel/Seed-OSS-36B-Instruct-int4-AutoRound \
  --enable-auto-tool-choice \
  --tool-call-parser seed_oss \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max_model_len 68000 \
  --port 1234 \
  --served-model-name seed-oss \
  --gpu-memory-utilization 0.85 \
  --reasoning-parser seed_oss

(APIServer pid=20370) INFO: Started server process [20370]
(APIServer pid=20370) INFO: Waiting for application startup.
(APIServer pid=20370) INFO: Application startup complete.
(APIServer pid=20370) INFO 09-06 08:54:41 [chat_utils.py:507] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=20370) INFO 09-06 08:54:41 [seed_oss_tool_parser.py:79] vLLM Seed-Oss XML tool parser loaded (SeedOssToolParser).
(APIServer pid=20370) INFO: 127.0.0.1:45372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=20370) INFO 09-06 08:54:41 [seed_oss_tool_parser.py:79] vLLM Seed-Oss XML tool parser loaded (SeedOssToolParser).
(EngineCore_0 pid=20507) DEBUG 09-06 08:54:41 [core.py:753] EngineCore loop active.
(APIServer pid=20370) DEBUG 09-06 08:54:49 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 0.0%
(APIServer pid=20370) ERROR 09-06 08:54:49 [serving_chat.py:1136] Error in chat completion stream generator.
(APIServer pid=20370) ERROR 09-06 08:54:49 [serving_chat.py:1136] Traceback (most recent call last):
(APIServer pid=20370) ERROR 09-06 08:54:49 [serving_chat.py:1136] File "/home/wojtek/Applications/vllm/vllm/entrypoints/openai/serving_chat.py", line 845, in chat_completion_stream_generator
(APIServer pid=20370) ERROR 09-06 08:54:49 [serving_chat.py:1136] extract_reasoning_content_streaming(
(APIServer pid=20370) ERROR 09-06 08:54:49 [serving_chat.py:1136] File "/home/wojtek/Applications/vllm/vllm/reasoning/abs_reasoning_parsers.py", line 206, in extract_reasoning_content_streaming
(APIServer pid=20370) ERROR 09-06 08:54:49 [serving_chat.py:1136] return DeltaMessage(reasoning_content=delta_text)
(APIServer pid=20370) ERROR 09-06 08:54:49 [serving_chat.py:1136] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=20370) ERROR 09-06 08:54:49 [serving_chat.py:1136] File "/home/wojtek/miniconda3/lib/python3.12/typing.py", line 532, in __new__
(APIServer pid=20370) ERROR 09-06 08:54:49 [serving_chat.py:1136] raise TypeError("Any cannot be instantiated")
(APIServer pid=20370) ERROR 09-06 08:54:49 [serving_chat.py:1136] TypeError: Any cannot be instantiated
(EngineCore_0 pid=20507) DEBUG 09-06 08:54:49 [core.py:747] EngineCore waiting for work.
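
(For context on the traceback: since Python 3.11, calling typing.Any as a constructor raises exactly this TypeError. A minimal reproduction, assuming the DeltaMessage name effectively resolved to Any at runtime, e.g. via a type-checking-only import; this is illustrative, not vLLM's code:)

from typing import Any

DeltaMessage = Any  # what the name effectively resolved to at runtime

# Raises "TypeError: Any cannot be instantiated", matching the log above.
DeltaMessage(reasoning_content="...")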

@LuYanFCP LuYanFCP closed this Sep 6, 2025
@LuYanFCP LuYanFCP reopened this Sep 6, 2025
@LuYanFCP (Contributor, Author) commented Sep 6, 2025

@WojtekMatula This issue has been resolved. You can try the latest commit.

[Quoted the reproduction commands and error log from @WojtekMatula's comment above.]

@LuYanFCP LuYanFCP force-pushed the feat/seed_oss_parse_support branch from 245d6de to 3065687 on September 6, 2025 14:50
…BaseThinkingReasoningParser base implementation.

Signed-off-by: Yan Lu <luyan@nvidia.com>
@LuYanFCP LuYanFCP force-pushed the feat/seed_oss_parse_support branch from a1e6c1e to a733746 on September 6, 2025 15:15
@LuYanFCP LuYanFCP force-pushed the feat/seed_oss_parse_support branch from bfaed5d to 110e1eb on September 6, 2025 16:37
@WojtekMatula

Looks like there is a problem with tool use when reasoning parsing is enabled.

Without reasoning parsing:
VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_LOGGING_LEVEL=DEBUG \
vllm serve Intel/Seed-OSS-36B-Instruct-int4-AutoRound \
  --enable-auto-tool-choice \
  --tool-call-parser seed_oss \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max_model_len 68000 \
  --port 1234 \
  --served-model-name seed-oss \
  --gpu-memory-utilization 0.85
Started server process [74664]
(APIServer pid=74664) INFO: Waiting for application startup.
(APIServer pid=74664) INFO: Application startup complete.
(APIServer pid=74664) DEBUG 09-06 19:37:00 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=74664) INFO 09-06 19:37:02 [chat_utils.py:507] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=74664) INFO 09-06 19:37:02 [seed_oss_tool_parser.py:79] vLLM Seed-Oss XML tool parser loaded (SeedOssToolParser).
(EngineCore_0 pid=74742) DEBUG 09-06 19:37:02 [core.py:753] EngineCore loop active.
(APIServer pid=74664) INFO: 127.0.0.1:35946 - "POST /v1/chat/completions HTTP/1.1" 200 OK


With reasoning parsing:
vllm feat/seed_oss_parse_support ❯❯❯ VLLM_WORKER_MULTIPROC_METHOD=spawn \
VLLM_LOGGING_LEVEL=DEBUG \
vllm serve Intel/Seed-OSS-36B-Instruct-int4-AutoRound \
  --enable-auto-tool-choice \
  --tool-call-parser seed_oss \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --dtype bfloat16 \
  --max_model_len 68000 \
  --port 1234 \
  --served-model-name seed-oss \
  --gpu-memory-utilization 0.85 \
  --reasoning-parser seed_oss
Started server process [90350]
(APIServer pid=90350) INFO: Waiting for application startup.
(APIServer pid=90350) INFO: Application startup complete.
(APIServer pid=90350) DEBUG 09-06 19:54:18 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=90350) DEBUG 09-06 19:54:28 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=90350) INFO 09-06 19:54:35 [chat_utils.py:507] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=90350) INFO 09-06 19:54:35 [seed_oss_tool_parser.py:79] vLLM Seed-Oss XML tool parser loaded (SeedOssToolParser).
(EngineCore_0 pid=90427) DEBUG 09-06 19:54:35 [core.py:753] EngineCore loop active.
(APIServer pid=90350) INFO: 127.0.0.1:35900 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Btw this model is amazing, great work.

@LuYanFCP (Contributor, Author) commented Sep 7, 2025

Can you give me some example prompts? I suspect it's a problem with the tool parser, and I will solve it in another PR.

[Quoted @WojtekMatula's comment above reporting the tool-use problem with reasoning parsing enabled, including the serve commands and logs.]

@WojtekMatula commented Sep 7, 2025

curl -X POST http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "seed-oss",
    "max_tokens": 32000,
    "messages": [
      { "role": "system", "content": "You are helpful assistant. Use tools to assist user. Answer concisely (<4 lines)." },
      { "role": "user", "content": "execute ls -la" }
    ],
    "tools": [ {
      "type": "function",
      "function": {
        "name": "bash",
        "description": "Run bash commands. Quote paths with spaces. Prefer rg over grep. Describe command in 5-10 words.",
        "parameters": {
          "type": "object",
          "properties": {
            "command": {"type": "string", "description": "Command to execute"},
            "timeout": {"type": "number", "description": "Timeout in ms"},
            "description": {"type": "string", "description": "5-10 word description"}
          },
          "required": ["command", "description"]
        }
      }
    } ],
    "tool_choice": "auto",
    "stream": true
  }'

It looks like tool calls are not parsed only when both reasoning parsing and streaming are enabled. With streaming disabled, everything is fine.
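
(One way to observe the streamed deltas directly, assuming the standard OpenAI Python client pointed at the server above; the model name and port follow the serve command, everything else here is illustrative:)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="seed-oss",
    messages=[{"role": "user", "content": "execute ls -la"}],
    tools=[{"type": "function", "function": {
        "name": "bash",
        "description": "Run bash commands.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    }}],
    stream=True,
)

# With the bug, tool-call text arrives in delta.content instead of
# delta.tool_calls when the reasoning parser is enabled.
for chunk in stream:
    delta = chunk.choices[0].delta
    print("content:", delta.content, "| tool_calls:", delta.tool_calls)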

@LuYanFCP (Contributor, Author) commented Sep 7, 2025

Thanks for your reply, I will resolve this issue in the current PR.

[Quoted the curl example and summary from @WojtekMatula's comment above.]

@WojtekMatula commented Sep 8, 2025

I think there is one more issue with the tool parsing. I told you that tool parsing works, even with streaming enabled, as long as reasoning is disabled, but now I think that is not 100% true.
Yes, tool usage is parsed, but I think all parameters are always wrapped in double quotes when streaming is enabled.

In my curl example (the same request as above):

Tool invocation always fails in stream mode because the model output is:

{
  "timeout": "10000",
}

instead of:

{
  "timeout": 10000
}

It behaves like this across all JSON types; for example, arrays are also wrapped in double quotes.

At first I thought this was a problem with the model, but when I tested with curl without the "stream": true flag, the tool call JSON was fine.
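
(To see why this breaks tool invocation: a parameter declared as a JSON Schema "number" fails type checks when it arrives as a string. A tiny illustration, not code from this PR:)

import json

good = json.loads('{"timeout": 10000}')
bad = json.loads('{"timeout": "10000"}')

print(isinstance(good["timeout"], (int, float)))  # True  -> matches "number"
print(isinstance(bad["timeout"], (int, float)))   # False -> schema mismatch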

So to sum up:
Tool parsing is not working in stream mode if reasoning parsing is on.
Tool parsing works only partially (it wraps all parameters in double quotes) in stream mode if reasoning parsing is off.

Signed-off-by: Yan Lu <luyan@nvidia.com>
@chaunceyjiang (Collaborator) left a comment

Overall, LGTM.

/cc @aarnphm @gaocegege PTAL.

A Collaborator left a review comment:

Could you paste your local test results, especially with Stream=True?

The PR author replied:

OK, I will submit some local test case results.

The PR author replied:

The unit test file already covers this case.

@LuYanFCP (Contributor, Author) commented Sep 15, 2025

I found where this issue occurs (screenshot omitted). When stream=True, if delta_text includes both the tool start and end tokens, an error is returned. I also noticed this issue when the reasoning parser was turned off.
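
(A sketch of the kind of handling a streaming parser needs for this case, with hypothetical token strings; illustrative only, not the parser code in this PR:)

def split_delta(delta_text: str, start: str, end: str) -> tuple[str, str]:
    """Split one streamed delta into (reasoning, content) when it may
    contain both the start and end tokens at once."""
    if start in delta_text and end in delta_text:
        after_start = delta_text.split(start, 1)[1]
        reasoning, content = after_start.split(end, 1)
        return reasoning, content
    # ... remaining cases: only start seen, only end seen, neither ...
    return "", delta_text

# Example: a single chunk carrying a complete reasoning block.
print(split_delta("<think>plan steps</think>ls -la", "<think>", "</think>"))
# -> ('plan steps', 'ls -la')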

[Quoted @WojtekMatula's Sep 8 comment above about parameters being wrapped in double quotes in stream mode.]

@mgoin mgoin added the ready (ONLY add when PR is ready to merge/full CI is needed) label Sep 17, 2025
@gaocegege (Contributor) left a comment

It includes a refactor and a new feature (SeedOSS support), which adds complexity and modifies the Mistral, DeepSeek, and Qwen3 code paths. Still, we have reasoning tests for Mistral, DeepSeek, and Qwen3, so I think it should work.

@mgoin mgoin merged commit be0bb56 into vllm-project:main Sep 24, 2025
40 checks passed
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Yan Lu <luyan@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
@vanshilshah97

Hi @LuYanFCP
Great work!
Are the issues above solved, or is there follow-up work to be done, e.g. an open PR or issue?

[Quoted @WojtekMatula's Sep 8 comment above about the double-quote wrapping issue.]

yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Yan Lu <luyan@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
gjc0824 pushed a commit to gjc0824/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Yan Lu <luyan@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: gaojc <1055866782@qq.com>
@CallmeZhangChenchen

When stream=True, the Seed-OSS tool call returns an incorrect structure

ChatCompletionChunk(id='chatcmpl-19b9a38b01f8455eb92c413125dad057', choices=[Choice(delta=ChoiceDelta(content='</seed:tool_call>', function_call=None, refusal=None, role=None, tool_calls=None), finish_reason=None, index=0, logprobs=None, token_ids=None)], created=1760060258, model='Seed-OSS-36B-Instruct-AWQ', object='chat.completion.chunk', service_tier=None, system_fingerprint=None, usage=None)

It should be in tool_calls, but it ended up in content

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Yan Lu <luyan@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
Signed-off-by: Yan Lu <luyan@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
Signed-off-by: Yan Lu <luyan@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: Yan Lu <luyan@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
Signed-off-by: Yan Lu <luyan@nvidia.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
