Conversation

@chaunceyjiang (Collaborator) commented May 29, 2025

FIX #18821 (comment)

Introduced by PR #16577.
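
For context, the interplay being fixed: when a reasoning parser is active, the structured-output grammar must stay out of the way while the model is still emitting its thinking section, and it must constrain generation from the very first token when thinking is disabled. Below is a minimal sketch of that gating decision; the names `thinking_enabled`, `reasoning_ended`, and `should_apply_grammar_bitmask` are illustrative assumptions, not vLLM's actual API:

def should_apply_grammar_bitmask(thinking_enabled: bool,
                                 reasoning_ended: bool) -> bool:
    """Apply the structured-output grammar only outside the reasoning span."""
    if not thinking_enabled:
        # No reasoning section is produced, so constrain from the first token.
        return True
    # With thinking enabled, wait until the reasoning section has closed
    # (e.g. the end-of-think token was emitted) before constraining output.
    return reasoning_ended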

@github-actions (bot) commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the remaining CI tests by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@chaunceyjiang force-pushed the think_struct_output branch from f4267d7 to 4f20677 on May 29, 2025 10:22
…Thinking is disabled

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang (Collaborator, Author) commented:

Tested with Qwen3 and DeepSeek-R1:

# vllm serve /home/jovyan/public-models/Deepseek-R1-Distill-Qwen-14B  --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser deepseek_r1


# vllm serve /home/jovyan/qwen3-32b-awq  --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3

python test.py

from openai import OpenAI
from pydantic import BaseModel


client = OpenAI(api_key="xxx", base_url="http://127.0.0.1:8000/v1")


class OutputModel(BaseModel):
    result: int


prompt = """\
123+456等于多少?
结果以JSON格式给出:
{{
    "result": "结果"
}}
"""


rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": OutputModel.model_json_schema()},
    temperature=0,
)
print('---')
print(rsp.choices[0].message.content)
print('---')

rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}, "guided_json": OutputModel.model_json_schema()},
    temperature=0.7,
)
print('---')
print(rsp.choices[0].message.content)
print('---')

# or
rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": OutputModel.model_json_schema()},
    temperature=0,
)
print('---')
print(rsp.choices[0].message.content)
print('---')
# without this fix, the service would block here
rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": OutputModel.model_json_schema()},
    temperature=0.7,
)
print('---')
print(rsp.choices[0].message.content)
print('---')
class Step(BaseModel):
    ground_truth_key_ideas: str 
    system_response_key_ideas: str
    discussion: str
    recall: float
    precision: float



# client.chat.completions.create
json_schema = Step.model_json_schema()

chat_response = client.beta.chat.completions.parse(
    model="",
    messages=[
        {'role': 'system',
        'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n  "system_response_key_ideas": "{system_response_key_ideas}",\n  "discussion": "{discussion}",\n  "recall": "{recall}        # note: the value you produce must be a single float value",\n  "precision": "{precision}        # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n        Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n        You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
        {'role': 'user',
        'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
    ],
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": json_schema},
)
print("-----")
print(chat_response.choices[0].message.content)
print("-----")

chat_response = client.beta.chat.completions.parse(
    model="",
    messages=[
        {'role': 'system',
        'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n  "system_response_key_ideas": "{system_response_key_ideas}",\n  "discussion": "{discussion}",\n  "recall": "{recall}        # note: the value you produce must be a single float value",\n  "precision": "{precision}        # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n        Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n        You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
        {'role': 'user',
        'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
    ],
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}, "guided_json": json_schema},
)
print("-----")
print(chat_response.choices[0].message.content)
print("-----")


The script output:

---
{  
    "result": 579
}
---
---
{ "result": 579 }
---
---
{  
    "result": 579
}
---
---
{  
    "result": 579  
}
---
-----
{
  "ground_truth_key_ideas": "1. The action space for language modeling is equal to the vocabulary set of language models. 2. The vocabulary size is very large (tens of thousands of possible tokens). 3. Real-world locomotion can be condensed to three quantities (X, Y, or Z axes).",
  "system_response_key_ideas": "1. Action space in language modeling is large due to vocabulary size (e.g., 50k+ tokens). 2. Combinatorial explosion in sequence generation. 3. Discrete, high-cardinality decisions vs. real-world continuous control. 4. Techniques like actor-critic methods and action space reduction address challenges.",
  "discussion": "The system response accurately captures the key ideas from the ground truth, including the relationship between vocabulary size and action space, and the comparison to real-world action spaces. Additionally, the system response provides more detailed explanations, such as the combinatorial explosion in sequence generation and the distinction between discrete and continuous action spaces. It also mentions specific techniques for addressing these challenges, which are not covered in the ground truth.",
  "recall": 1.0,
  "precision": 0.6666666666666666
}
-----
-----
{
  "ground_truth_key_ideas": "1. The action space for language modeling is equal to the vocabulary set of language models. 2. The vocabulary size is very large (tens of thousands of possible tokens). 3. Real-world locomotion can be condensed to three axes (X, Y, Z) or linear combinations thereof.",
  "system_response_key_ideas": "1. Action space in language modeling is large due to vocabulary size (e.g., 50k+ tokens). 2. The action space involves discrete, high-cardinality decisions with combinatorial complexity. 3. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 4. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) manage variance and exploration in discrete, large vocabularies. 5. Action space reduction techniques (e.g., GALAD) are used for handling large vocabularies.",
  "discussion": "The system response fully covers all key ideas from the ground truth and adds additional details. It expands on the challenges of discrete, high-cardinality decisions in language modeling and contrasts them with continuous control in real-world actions. The system also mentions specific techniques to address these challenges, which were not discussed in the ground truth.",
  "recall": 1.0,
  "precision": 1.0
}
-----

@chaunceyjiang (Collaborator, Author) commented:

/cc @aarnphm PTAL.

@aarnphm (Collaborator) left a comment:

Thanks.

@chaunceyjiang (Collaborator, Author) commented:

@DarkLight1337 PTAL.

@DarkLight1337 (Member) left a comment:

Stamp

@DarkLight1337 enabled auto-merge (squash) May 31, 2025 06:51
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 31, 2025
@DarkLight1337 merged commit ba5111f into vllm-project:main May 31, 2025
65 of 67 checks passed
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025
…Thinking is disabled (vllm-project#18879)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: amit <amit.man@gmail.com>
@AlphaINF commented:

Hello! How can I fix this problem in the V0 engine? The V0 engine seems to have a stability issue where it crashes once the process is up, so I want to fix the problem in the V0 engine.

@chaunceyjiang (Collaborator, Author) commented:

@AlphaINF V0 has already been deprecated. I suggest upgrading to V1.
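
(Editorial aside: during the V0-to-V1 transition, the engine could be selected explicitly with the VLLM_USE_V1 environment variable, with V1 the default in recent releases. For example, in the style of the serve commands above:

# opt into the V1 engine explicitly
VLLM_USE_V1=1 vllm serve /home/jovyan/qwen3-32b-awq --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3
)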

@AlphaINF commented:

> @AlphaINF V0 has already been deprecated. I suggest upgrading to V1.

However, V1 has some reliability problems: it will crash if you send too many requests.

Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Oct 20, 2025
PART 1: Eagle + Structured Output FSM Validation Fix
=====================================================

ISSUE: Eagle speculative decoding with structured output crashes with
AssertionError when FSM rejects tokens in scheduled_spec_decode_tokens list.

SYMPTOMS:
- Error: "Failed to advance FSM for request ... for tokens XXX"
- Followed by: AssertionError at vllm/v1/structured_output/__init__.py:263
- Crashes entire vLLM engine under load with Eagle + tool calling

ROOT CAUSE (partially identified):
- FSM can terminate mid-validation when accepting stop token
- Remaining spec tokens still attempted for validation
- Original code asserts all scheduled tokens must be valid
- Assertion fails when FSM rejects tokens after termination

SOLUTION:
Implemented defensive fix in grammar_bitmask() method:
- Replace assertion with conditional check
- If token rejected, log debug message and continue
- Still fill bitmasks for all positions (required by downstream code)
- Makes code resilient to FSM state mismatches

IMPLEMENTATION:
- New patch: mantle_extensions/patches/eagle_structured_output_fix.py
- Monkey-patches StructuredOutputManager.grammar_bitmask()
- Registered as 12th patch in plugin system
- Enabled by default in patch_config.json

TESTING:
✓ Plugin loads successfully with all 12 patches
✓ No more AssertionError crashes
✓ No more 500 Internal Server errors
✓ Eagle + structured output + penalties works correctly
⚠ Expected warnings from xgrammar about terminated FSM (benign)

NOTES:
- This is a defensive fix without full root cause understanding
- Possible causes: FSM state mismatch, xgrammar rollback bug, concurrency
- Upstreamable: Yes - should be contributed to vLLM upstream
- Bug exists since PR vllm-project#18879 (May 2025)

PART 2: Clean Up Unused Patch Files
====================================

Removed 3 unused patch files:
1. pr26291_streaming_method.py - Unused reference implementation
2. streaming_patches.py - Unused streaming patch loader
3. qwen3_tool_parser_fix_complete.py - Now implemented in-tree

Updated files:
- mantle_extensions/patches/__init__.py - Removed streaming_patches export
- mantle_extensions/plugin.py - Added note about qwen3 in-tree fix

Rationale:
- pr26291 and streaming_patches were never used in production
- qwen3 fix moved to in-tree (line 523) due to APIServer plugin limitation
- Keeping unused files adds maintenance burden and confusion

SUMMARY:
- Added: 1 new critical fix (eagle_structured_output_fix)
- Removed: 3 unused patch files
- Total active patches: 12 (all enabled and working)

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
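
To make the pattern described in PART 1 concrete, here is a minimal sketch of replacing the hard assertion with a tolerant check while still filling the bitmask for every scheduled position. This is illustrative only: `grammar.accept_token` and `grammar.fill_bitmask` are assumed stand-ins for the real vLLM/xgrammar calls, not their actual signatures:

import logging

logger = logging.getLogger(__name__)

def advance_fsm_tolerantly(grammar, spec_tokens, bitmask, start_index):
    """Advance the FSM over speculative tokens without asserting validity."""
    for offset, token in enumerate(spec_tokens):
        if not grammar.accept_token(token):  # hypothetical API
            # Previously: assert accepted, "Failed to advance FSM ..."
            # A terminated FSM (e.g. after accepting a stop token) may
            # reject the remaining speculative tokens; log and continue
            # instead of crashing the engine.
            logger.debug("FSM rejected spec token %s at offset %d",
                         token, offset)
        # Downstream code expects a bitmask for every scheduled position,
        # so fill it regardless of acceptance.
        grammar.fill_bitmask(bitmask, start_index + offset)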

Labels: ready (ONLY add when PR is ready to merge/full CI is needed), structured-output, v1

Projects: Status: Done

Successfully merging this pull request may close these issues:

[Bug]: In Version V0.9.0, Qwen3-32B-AWQ Error when turn off thinking and use guided_json simultaneously.
4 participants