common: Generalized XML-style tool-call parsing with streaming support (GLM 4.5/4.6 + MiniMax M2 + SeedOSS + Kimi-K2 + Qwen3-Coder + Apriel-1.5 + Xiaomi-MiMo) #16932
Conversation
|
I'm looking forward to getting this PR merged! @hksdpc255 Does it require a custom jinja template from the previous PR, or does it work well as is? |
|
For now, I’d recommend using a custom template if you’re running more complex workloads. Edit: The official template is now working properly; there’s no longer any need for a custom template. Edit2: Official template support for Minimax-M2 has been removed. See comment and ochafik/minja#7 (comment) for details. |
|
FYI I've updated (my fork of) Minja w/ support for GLM 4.6's template. |
|
@ochafik Excellent work! Once llama.cpp syncs your changes, some parts of this PR can be safely removed. However, there are still a few small patches needed — for example, replacing |
|
Currently, the official Minimax-M2 chat template fails to run tool calls because |
@hksdpc255 Both should be supported. The confusing error you probably got was because minja implements … As for … And please feel free to file bugs on https://github.com/ochafik/minja; it should be cleaner to add syntax support there than to patch things up in llama.cpp. |
|
@ochafik Thank you for pointing that out. I’m currently applying your suggested fix in llama.cpp and will test whether it works as expected. Thanks again for the help! |
|
Good news! The Minimax M2 tool call is now working. I’ll push the fix later. |
|
Model: unsloth's UD-Q3_K_XL |
|
Hi @hksdpc255, Model: unsloth--MiniMax-M2-GGUF Q8_0
./llama-cli \
-m /models/hub/models--unsloth--MiniMax-M2-GGUF/snapshots/*/Q8_0/MiniMax-M2-Q8_0-00001-of-00005.gguf \
-ngl 99 \
-sm layer \
-ts 1,1,1,1,1,1,1,1 \
-c 78000 \
-t 16 \
--jinja \
-i
Output:
> what is the capital of france?
Okay, the user asked a straightforward question: "What is the capital of France?" This is basic geography knowledge, so the answer should be simple. I don't need to overcomplicate things.
Hmm, maybe the user is just testing if I know basic facts, or perhaps they're new to this kind of question. Either way, the response should be clear and concise. No need for extra details unless they ask follow-ups.
I recall that Paris is the capital of France. It's one of the most well-known capitals globally, so this should be an easy one. The user might be a student working on homework, or someone prepping for trivia. Or maybe they're just curious—either way, I should confirm it confidently.
No signs of confusion or deeper needs here. The question is very direct. I'll just state the answer plainly. If they want more info later, like landmarks or history, they'll ask. For now, keep it simple: Paris is the capital.
Wait, should I add that it's also a major cultural hub? Nah, overcomplicating it. Just the fact. Done.
</think>
The capital of France is **Paris**.
Paris is not only the political center but also a major cultural, economic, and gastronomic hub, famous for landmarks like the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Champs-Élysées. |
|
@emuchogu Sorry, I haven’t tested it with … If you want … I’m not sure whether … |
|
I’ve reverted my previous PR (reasoning-format-minimax-m2) and merged PR #16932 into my testing-branch16 for isolated testing.
Without this PR:
- Streaming: no initial <think> tag in the output
- curl without streaming: no initial <think> tag in the output
With this PR:
- Streaming:
- curl without streaming: no initial <think> tag in the output |
|
Oh! It seems you’re using non-streaming mode. I can now reproduce your issue with … Let me dig into what’s happening… |
Yes, exactly: it works correctly in streaming mode (tested through the SvelteUI, which is specifically designed to be debug-friendly without needing curl -N), but not in non-streaming mode. |
|
Toolcall debug on SvelteUI with your #16932 + #16618 :) Custom JSON :
|
|
@ServeurpersoCom The problem is that I added some code that makes it fall back to llama.cpp’s original parser when there are no tools, so the new parser is never called (lines 2748 to 2753 in af5216e).
Simply deleting that fallback should fix the issue. I’ll run more tests before pushing a new commit.
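For readers following the discussion, a minimal sketch of the kind of guard being described; every name here is hypothetical, and this is not the actual code at the cited lines:

```cpp
#include <string>
#include <vector>

// All names are hypothetical; this only illustrates the shape of the fallback.
struct chat_parse_inputs {
    std::vector<std::string> tools;  // tool definitions attached to the request
    std::string              text;   // raw model output to parse
};

std::string parse_with_original_parser(const chat_parse_inputs & in);      // legacy path
std::string parse_with_xml_toolcall_parser(const chat_parse_inputs & in);  // new path

std::string parse_chat_output(const chat_parse_inputs & in) {
    // The guard in question: when no tools are attached, everything is routed
    // to the legacy parser, so the new XML parser never runs. Deleting this
    // fallback lets the new parser also handle requests without tools.
    if (in.tools.empty()) {
        return parse_with_original_parser(in);
    }
    return parse_with_xml_toolcall_parser(in);
}
```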
|
I’ve successfully tested it without these lines of code and confirmed it works as expected for streaming / non-streaming / reasoning_content / tool calls |
|
I just realized this, and it seems strange: shouldn’t --reasoning-format none completely bypass any parsing logic instead of still going through it? It’s meant to be the raw passthrough mode for observing the model’s native output. The .cpp files are already becoming huge and monolithic, making them harder to touch or refactor safely. The --reasoning-format options are also poorly named and not very explicit. In the long run, a modular templating system would help avoid piling up even more C++ parsing code. If this work is meant to unify several next-generation parsers, maybe we could add a new keyword to --reasoning-format instead? It’s important to keep none as a truly no-parsing mode, since it’s essential for debugging new models. Also, the current "auto" mode is actually just "deepseek" in practice, so it might be clearer to rename or document it that way to avoid confusion; your unified detection logic could then be implemented directly under auto (or deepseek, since they’re basically aliases)? |
|
@ngladitz Try now |
@hksdpc255 Thank you, it seems to be working now 🎉 |
|
Hello, I'm trying this with GLM 4.5 Air + official template + OpenWebUI. However, "Error: no triggers set for lazy grammar!" is occurring. Template: https://huggingface.co/zai-org/GLM-4.5-Air/blob/main/chat_template.jinja
Logs with -lv 1:
Command line:
~/llama/llama-server \
--host 0.0.0.0 --port 8000 \
-m "GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf" \
-a "glm45-air" \
-c 0 \
-fa on \
--jinja \
--chat-template-file "template.jinja" \
--chat-template-kwargs "{\"enable_thinking\": false}" \
-lv 1 \
--no-mmap
|
|
Sorry guys, probably not the best place but I'm hella confused. I've cloned your repo @hksdpc255 (branch xml_toolcall) and built the project (with CUDA support). Do I need specific arguments when running llama-server? Do I need to specify a custom jinja template file or is everything automatic with your branch? Here is how I run it: However, continue.dev tool use seems to fail / work a bit randomly. On top of that llama webUI chat fails completely on any request:
Also tried llama-cli, and I get a core dump on the prompt "test" (same args as llama-server except for its specific ones): Cheers! |
Does this problem still exist with my latest commit? |
Yes, unfortunately |
|
@HelloKS @lainwir3d Does reverting commits 374c061 and aa66837 solve the problem? The key change is to delete all |
|
Web chat seems fixed. The CLI still crashes in the same way: |
|
Reverting both commits fixes the issue for me.
That’s strange; I recall that this template issue had already been fixed upstream. @ochafik, excuse me, could you help clarify why this is still occurring? |
|
@HelloKS @lainwir3d I still cannot reproduce the issue for |
|
@lainwir3d As for the template issue, what template are you using? Have you tried the template provided in this PR? |
|
@hksdpc255 the "error no triggers set" has been fixed by the revert, sorry if I wasn't clear about this. As for the template, no I'm very confused hence my questions. I should be using a template using --chat-template-file? This one: models/templates/GLM-4.6.jinja ? |
|
@lainwir3d Both unsloth’s fixed template and
I mean, even without the revert, I’m still not able to reproduce the problem. |
merged up to commit 7273f76 (glm45-tool/xml_toolcall)
~/llama/llama-server \
--host 0.0.0.0 --port 8000 \
-m "GLM-4.5-Air-UD-Q6_K_XL-00001-of-00003.gguf" \
-a "glm45-air" \
-c 0 \
-fa on \
--jinja \
--chat-template-file "template.jinja" \
--chat-template-kwargs "{\"enable_thinking\": false}" \
-lv 1 \
--no-mmap
$ curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "glm45-air", "messages": [{"role": "system", "content": "System prompt test"}, {"role": "user", "content": "Hello!"}], "stream_options": {"include_usage": true}, "temperature": 0.6, "top_k": 20, "top_p": 0.9, "min_p": 0.1}'
If you need more of something, ping me. Thanks! |
|
I am also getting the "no triggers set for lazy grammar!" error. I just sent a "hi" message. I use the template that comes with the Unsloth GLM 4.5 Air model. Here is how I run the model:
Logs with -lv 1 |
Thank you. I will try to fix it. |
|
@HelloKS @lainwir3d @sbrnaderi Fixed |
|
I tried building it with the fix, but: I think |
|
@HelloKS Yes, you are right. |
|
Thanks. I just tested with and without tool calling, and it happily runs with GLM 4.5 Air. |
|
llama-server chat works great, thanks! llama-cli is still having issues, but I'm not sure it's related: |
|
Trying to use continue.dev to make some code changes. After a few minutes of running, it ended up with this: Please be aware that I have no idea what I'm doing, so please don't hesitate to tell me if that's out of scope! :-) |
|
@lainwir3d It appears there may be an issue with the template. Additional modifications might be required. |
In this case, change …
The llama.cpp maintainers suggested that I should not patch chat templates for known unsupported patterns during loading, so I have removed that logic. Users will need to modify the templates themselves if they rely on these patterns. |









Generalized and streaming-capable XML-style tool-call parsing with grammar enforcement and automatic template fixing.
Based on PR #15904, this patch introduces a generalized implementation for almost all XML-style tool-call formats.
Supported models
GLM 4.5 / 4.6, MiniMax M2, SeedOSS, Kimi-K2, Qwen3-Coder, Apriel-1.5, and Xiaomi-MiMo.
Grammar-constrained tool-call outputs
Tool-call messages generated by the model are now strictly validated against a defined grammar.
A new automatic grammar generator simplifies the process of creating grammars for new models.
This ensures that all tool-call outputs are well-formed, structurally consistent, and reliably parsed.
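As a rough illustration of what such a grammar can look like, here is a hand-written GBNF-style approximation for a GLM-like XML tool call, embedded in a C++ raw string. The grammar text and rule names are assumptions for this sketch; the real generator derives its rules (including per-argument constraints) from the tool's JSON schema.

```cpp
#include <string>

// Illustrative only: an approximation of the kind of grammar the automatic
// generator might emit for one XML-style tool-call format.
const std::string example_xml_toolcall_grammar = R"GBNF(
root ::= "<tool_call>" name arg* "</tool_call>"
name ::= [a-zA-Z0-9_.-]+
arg  ::= "<arg_key>" [^<]+ "</arg_key>" "<arg_value>" [^<]+ "</arg_value>"
)GBNF";
```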
Streaming support for tool-call parsing
The parser now supports streaming parsing, enabling incremental processing of tool-call messages as they are generated.
This enhancement improves responsiveness and allows real-time interaction during model inference.
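Conceptually, the streaming parser consumes chunks as they arrive and reports only what is newly parsed. The sketch below is an assumed interface for illustration, not the actual API in common/chat.*:

```cpp
#include <string>
#include <vector>

// Sketch of the idea only; the real parser exposes richer result structures.
struct toolcall_delta {
    std::string reasoning;  // newly parsed reasoning text, if any
    std::string tool_name;  // set once a tool name has been fully parsed
    std::string arguments;  // newly completed fragment of the JSON arguments
};

struct streaming_xml_parser {
    std::string pending;  // buffered partial input that cannot be consumed yet
    // Called with each newly generated chunk; returns only the deltas so the
    // server can forward partial tool calls and reasoning to the client
    // without waiting for the full message.
    std::vector<toolcall_delta> consume(const std::string & chunk);
};
```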
Automatic chat-template fixing
A lightweight Jinja2-based patcher has been added to automatically fix official chat templates before use.
With this change, official templates now work out of the box, eliminating the need for custom modifications.
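The overall idea can be pictured as a targeted rewrite of the template text before it reaches the template engine. A simplified sketch follows; the pattern in the usage comment is made up, and the actual patcher in this PR operates on Jinja2 structure rather than doing a plain string replace:

```cpp
#include <string>

// Replace every occurrence of an unsupported construct in a chat template
// with a supported equivalent before the template is rendered.
std::string patch_chat_template(std::string tmpl,
                                const std::string & from,
                                const std::string & to) {
    for (size_t pos = 0; (pos = tmpl.find(from, pos)) != std::string::npos; pos += to.size()) {
        tmpl.replace(pos, from.size(), to);
    }
    return tmpl;
}

// Hypothetical usage: drop a tag the template engine does not understand.
// std::string fixed = patch_chat_template(raw_template, "{% generation %}", "");
```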
In-context reasoning
The parser now supports multiple reasoning blocks within a single generation, even when interleaved with tool calls.
All reasoning content is preserved. No information is lost during parsing or streaming.
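Schematically, this is the kind of interleaved output the parser has to preserve (the tags and tool name below are illustrative; each model family uses its own markers):

```
<think>I should look up the weather before answering.</think>
<tool_call>get_weather<arg_key>city</arg_key><arg_value>Paris</arg_value></tool_call>
<think>The result is in; now I can compose the reply.</think>
It is sunny in Paris today.
```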
Enhanced unit tests
Added a unit test for the streaming-mode parser. It simulates the generation phase by feeding content character by character, comparing the parsed results and verifying that streaming and non-streaming modes reach the same final state.
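Reduced to a sketch, such a test has roughly the following shape; the real test uses the parser API in common/chat.* and compares full result structures rather than plain strings:

```cpp
#include <cassert>
#include <string>

std::string parse_full(const std::string & text);  // non-streaming reference parse

struct stream_parser {
    void feed(char c);           // consume one character of generated output
    std::string result() const;  // final parsed state
};

void check_streaming_matches_nonstreaming(const std::string & generated) {
    stream_parser sp;
    for (char c : generated) {
        sp.feed(c);  // simulate generation by feeding one character at a time
    }
    // Streaming and non-streaming parsing must reach the same final state.
    assert(sp.result() == parse_full(generated));
}
```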
Additional Notes
--reasoning-format none …
Add -lv 1 to the command line to enable more detailed logging.