Conversation

@chaunceyjiang (Collaborator) commented May 29, 2025

FIX #18821 (comment)

Introduced by PR #16577.
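
For context, the interplay being fixed: when a reasoning parser is active, the structured-output grammar must stay out of the way while the model is still emitting its thinking section, and it must constrain generation from the very first token when thinking is disabled. Below is a minimal sketch of that gating decision; the names `thinking_enabled`, `reasoning_ended`, and `should_apply_grammar_bitmask` are illustrative assumptions, not vLLM's actual API:

def should_apply_grammar_bitmask(thinking_enabled: bool,
                                 reasoning_ended: bool) -> bool:
    """Apply the structured-output grammar only outside the reasoning span."""
    if not thinking_enabled:
        # No reasoning section is produced, so constrain from the first token.
        return True
    # With thinking enabled, wait until the reasoning section has closed
    # (e.g. the end-of-think token was emitted) before constraining output.
    return reasoning_ended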

@github-actions (bot) commented:

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the remaining CI tests by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@chaunceyjiang force-pushed the think_struct_output branch from f4267d7 to 4f20677 on May 29, 2025 10:22
…Thinking is disabled

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
@chaunceyjiang (Collaborator, Author) commented:

Tested with Qwen3 and DeepSeek-R1:

# vllm serve /home/jovyan/public-models/Deepseek-R1-Distill-Qwen-14B  --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser deepseek_r1


# vllm serve /home/jovyan/qwen3-32b-awq  --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3

python test.py

from openai import OpenAI
from pydantic import BaseModel


client = OpenAI(api_key="xxx", base_url="http://127.0.0.1:8000/v1")


class OutputModel(BaseModel):
    result: int


prompt = """\
123+456等于多少?
结果以JSON格式给出:
{{
    "result": "结果"
}}
"""


rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": OutputModel.model_json_schema()},
    temperature=0,
)
print('---')
print(rsp.choices[0].message.content)
print('---')

rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}, "guided_json": OutputModel.model_json_schema()},
    temperature=0.7,
)
print('---')
print(rsp.choices[0].message.content)
print('---')

# or
rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": OutputModel.model_json_schema()},
    temperature=0,
)
print('---')
print(rsp.choices[0].message.content)
print('---')
# without this fix, the service would block here
rsp = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": prompt},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": OutputModel.model_json_schema()},
    temperature=0.7,
)
print('---')
print(rsp.choices[0].message.content)
print('---')
class Step(BaseModel):
    ground_truth_key_ideas: str 
    system_response_key_ideas: str
    discussion: str
    recall: float
    precision: float



# client.chat.completions.create
json_schema = Step.model_json_schema()

chat_response = client.beta.chat.completions.parse(
    model="",
    messages=[
        {'role': 'system',
        'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n  "system_response_key_ideas": "{system_response_key_ideas}",\n  "discussion": "{discussion}",\n  "recall": "{recall}        # note: the value you produce must be a single float value",\n  "precision": "{precision}        # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n        Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n        You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
        {'role': 'user',
        'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
    ],
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}, "guided_json": json_schema},
)
print("-----")
print(chat_response.choices[0].message.content)
print("-----")

chat_response = client.beta.chat.completions.parse(
    model="",
    messages=[
        {'role': 'system',
        'content': 'Your input fields are:\n1. `question` (str)\n2. `ground_truth` (str)\n3. `system_response` (str)\n\nYour output fields are:\n1. `ground_truth_key_ideas` (str): enumeration of key ideas in the ground truth\n2. `system_response_key_ideas` (str): enumeration of key ideas in the system response\n3. `discussion` (str): discussion of the overlap between ground truth and system response\n4. `recall` (float): fraction (out of 1.0) of ground truth covered by the system response\n5. `precision` (float): fraction (out of 1.0) of system response covered by the ground truth\n\nAll interactions will be structured in the following way, with the appropriate values filled in.\n\nInputs will have the following structure:\n\n[[ ## question ## ]]\n{question}\n\n[[ ## ground_truth ## ]]\n{ground_truth}\n\n[[ ## system_response ## ]]\n{system_response}\n\nOutputs will be a JSON object with the following fields.\n\n{\n  "ground_truth_key_ideas": "{ground_truth_key_ideas}",\n  "system_response_key_ideas": "{system_response_key_ideas}",\n  "discussion": "{discussion}",\n  "recall": "{recall}        # note: the value you produce must be a single float value",\n  "precision": "{precision}        # note: the value you produce must be a single float value"\n}\n\nIn adhering to this structure, your objective is: \n        Compare a system\'s response to the ground truth to compute recall and precision of key ideas.\n        You will first enumerate key ideas in each response, discuss their overlap, and then report recall and precision.'},
        {'role': 'user',
        'content': '[[ ## question ## ]]\nWhy is the action space of language modeling particularly large? Is it because of the vocab size? But then, moving in the real world also has a huge action space (degrees of movement).\n\n[[ ## ground_truth ## ]]\nThe action space for language modeling is equal to the vocabulary set of language models. Since the vocabularies are very large (i.e. tens of thousands of possible tokens), the action space is also very large. In general, locomotion in the real world can be condensed to three quantities - moving across X, Y or Z axes, or a linear combination thereof. The authors mention that typical RL problems have an action space that is an order of magnitude smaller, but do not specifically explain how the action spaces for typical problems is modeled or constructed.\n\n[[ ## system_response ## ]]\nThe action space in language modeling is indeed particularly large due to the **vocabulary size**—at each step, the model must choose from thousands of possible words (e.g., 50k+ tokens in large vocabularies), leading to combinatorial explosion in sequence generation. This discrete, high-dimensional choice is distinct from real-world continuous action spaces (e.g., motor control with infinite degrees of movement). However, the challenges differ: language models face **discrete, high-cardinality decisions** with combinatorial complexity, while real-world actions often involve **continuous control**. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) or action space reduction (e.g., GALAD) address the former by managing variance and exploration in discrete, large vocabularies, whereas real-world control typically uses gradient-based methods for continuous spaces.\n\nRespond with a JSON object in the following order of fields: `ground_truth_key_ideas`, then `system_response_key_ideas`, then `discussion`, then `recall` (must be formatted as a valid Python float), then `precision` (must be formatted as a valid Python float).'}
    ],
    temperature=0.0,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}, "guided_json": json_schema},
)
print("-----")
print(chat_response.choices[0].message.content)
print("-----")


The script output:

---
{  
    "result": 579
}
---
---
{ "result": 579 }
---
---
{  
    "result": 579
}
---
---
{  
    "result": 579  
}
---
-----
{
  "ground_truth_key_ideas": "1. The action space for language modeling is equal to the vocabulary set of language models. 2. The vocabulary size is very large (tens of thousands of possible tokens). 3. Real-world locomotion can be condensed to three quantities (X, Y, or Z axes).",
  "system_response_key_ideas": "1. Action space in language modeling is large due to vocabulary size (e.g., 50k+ tokens). 2. Combinatorial explosion in sequence generation. 3. Discrete, high-cardinality decisions vs. real-world continuous control. 4. Techniques like actor-critic methods and action space reduction address challenges.",
  "discussion": "The system response accurately captures the key ideas from the ground truth, including the relationship between vocabulary size and action space, and the comparison to real-world action spaces. Additionally, the system response provides more detailed explanations, such as the combinatorial explosion in sequence generation and the distinction between discrete and continuous action spaces. It also mentions specific techniques for addressing these challenges, which are not covered in the ground truth.",
  "recall": 1.0,
  "precision": 0.6666666666666666
}
-----
-----
{
  "ground_truth_key_ideas": "1. The action space for language modeling is equal to the vocabulary set of language models. 2. The vocabulary size is very large (tens of thousands of possible tokens). 3. Real-world locomotion can be condensed to three axes (X, Y, Z) or linear combinations thereof.",
  "system_response_key_ideas": "1. Action space in language modeling is large due to vocabulary size (e.g., 50k+ tokens). 2. The action space involves discrete, high-cardinality decisions with combinatorial complexity. 3. Real-world actions involve continuous control (e.g., motor control with infinite degrees of movement). 4. Techniques like actor-critic methods (e.g., Bahdanau et al. 2016) manage variance and exploration in discrete, large vocabularies. 5. Action space reduction techniques (e.g., GALAD) are used for handling large vocabularies.",
  "discussion": "The system response fully covers all key ideas from the ground truth and adds additional details. It expands on the challenges of discrete, high-cardinality decisions in language modeling and contrasts them with continuous control in real-world actions. The system also mentions specific techniques to address these challenges, which were not discussed in the ground truth.",
  "recall": 1.0,
  "precision": 1.0
}
-----

@chaunceyjiang (Collaborator, Author) commented:

/cc @aarnphm PTAL.

@aarnphm (Collaborator) left a comment:

Thanks.

@chaunceyjiang (Collaborator, Author) commented:

@DarkLight1337 PTAL.

@DarkLight1337 (Member) left a comment:

Stamp

@DarkLight1337 enabled auto-merge (squash) May 31, 2025 06:51
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) May 31, 2025
@DarkLight1337 merged commit ba5111f into vllm-project:main May 31, 2025
65 of 67 checks passed
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025
…Thinking is disabled (vllm-project#18879)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: amit <amit.man@gmail.com>
@AlphaINF commented:

Hello! How can I fix this problem in the V0 engine? The V0 engine seems to have a stability issue where it crashes once the process is up, so I want to fix the problem in the V0 engine.

@chaunceyjiang (Collaborator, Author) commented:

@AlphaINF V0 has already been deprecated. I suggest upgrading to V1.
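
(Editorial aside: during the V0-to-V1 transition, the engine could be selected explicitly with the VLLM_USE_V1 environment variable, with V1 the default in recent releases. For example, in the style of the serve commands above:

# opt into the V1 engine explicitly
VLLM_USE_V1=1 vllm serve /home/jovyan/qwen3-32b-awq --enable-auto-tool-choice --tool-call-parser hermes --reasoning-parser qwen3
)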

@AlphaINF commented:

> @AlphaINF V0 has already been deprecated. I suggest upgrading to V1.

However, V1 has some reliability problems: it will crash if you send too many requests.

Pradyun92 pushed a commit to Pradyun92/vllm that referenced this pull request Oct 20, 2025
PART 1: Eagle + Structured Output FSM Validation Fix
=====================================================

ISSUE: Eagle speculative decoding with structured output crashes with
AssertionError when FSM rejects tokens in scheduled_spec_decode_tokens list.

SYMPTOMS:
- Error: "Failed to advance FSM for request ... for tokens XXX"
- Followed by: AssertionError at vllm/v1/structured_output/__init__.py:263
- Crashes entire vLLM engine under load with Eagle + tool calling

ROOT CAUSE (partially identified):
- FSM can terminate mid-validation when accepting stop token
- Remaining spec tokens still attempted for validation
- Original code asserts all scheduled tokens must be valid
- Assertion fails when FSM rejects tokens after termination

SOLUTION:
Implemented defensive fix in grammar_bitmask() method:
- Replace assertion with conditional check
- If token rejected, log debug message and continue
- Still fill bitmasks for all positions (required by downstream code)
- Makes code resilient to FSM state mismatches

IMPLEMENTATION:
- New patch: mantle_extensions/patches/eagle_structured_output_fix.py
- Monkey-patches StructuredOutputManager.grammar_bitmask()
- Registered as 12th patch in plugin system
- Enabled by default in patch_config.json

TESTING:
✓ Plugin loads successfully with all 12 patches
✓ No more AssertionError crashes
✓ No more 500 Internal Server errors
✓ Eagle + structured output + penalties works correctly
⚠ Expected warnings from xgrammar about terminated FSM (benign)

NOTES:
- This is a defensive fix without full root cause understanding
- Possible causes: FSM state mismatch, xgrammar rollback bug, concurrency
- Upstreamable: Yes - should be contributed to vLLM upstream
- Bug exists since PR vllm-project#18879 (May 2025)

PART 2: Clean Up Unused Patch Files
====================================

Removed 3 unused patch files:
1. pr26291_streaming_method.py - Unused reference implementation
2. streaming_patches.py - Unused streaming patch loader
3. qwen3_tool_parser_fix_complete.py - Now implemented in-tree

Updated files:
- mantle_extensions/patches/__init__.py - Removed streaming_patches export
- mantle_extensions/plugin.py - Added note about qwen3 in-tree fix

Rationale:
- pr26291 and streaming_patches were never used in production
- qwen3 fix moved to in-tree (line 523) due to APIServer plugin limitation
- Keeping unused files adds maintenance burden and confusion

SUMMARY:
- Added: 1 new critical fix (eagle_structured_output_fix)
- Removed: 3 unused patch files
- Total active patches: 12 (all enabled and working)

Signed-off-by: Pradyun Ramadorai <pradyunr@amazon.com>
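
To make the pattern described in PART 1 concrete, here is a minimal sketch of replacing the hard assertion with a tolerant check while still filling the bitmask for every scheduled position. This is illustrative only: `grammar.accept_token` and `grammar.fill_bitmask` are assumed stand-ins for the real vLLM/xgrammar calls, not their actual signatures:

import logging

logger = logging.getLogger(__name__)

def advance_fsm_tolerantly(grammar, spec_tokens, bitmask, start_index):
    """Advance the FSM over speculative tokens without asserting validity."""
    for offset, token in enumerate(spec_tokens):
        if not grammar.accept_token(token):  # hypothetical API
            # Previously: assert accepted, "Failed to advance FSM ..."
            # A terminated FSM (e.g. after accepting a stop token) may
            # reject the remaining speculative tokens; log and continue
            # instead of crashing the engine.
            logger.debug("FSM rejected spec token %s at offset %d",
                         token, offset)
        # Downstream code expects a bitmask for every scheduled position,
        # so fill it regardless of acceptance.
        grammar.fill_bitmask(bitmask, start_index + offset)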

Labels: ready (ONLY add when PR is ready to merge/full CI is needed), structured-output, v1

Projects: Status: Done

Successfully merging this pull request may close these issues:

[Bug]: In Version V0.9.0, Qwen3-32B-AWQ Error when turn off thinking and use guided_json simultaneously.
4 participants