server: streaming of tool calls and thoughts when `--jinja` is on #12379
base: master
Conversation
Please forgive my ignorance on the OAI streaming protocol, but would it be something to consider waiting until the complete tool call is collected and then sending the whole thing back? The tool calls will essentially be re-assembled on the client side at the expense of a lot of JSON verbosity! And, outside of debug output, this will not be presentable to users until the whole thing arrives anyhow. 😉 I like @ngxson's idea about the state machine, as this is something I was considering in the llama-cli tool calls, as it would be necessary for cleaner output. In an ideal world, there is a simple stack which pushes tokens (perhaps a buffer >= 1) for nested begin/end delimiters and "all that is needed" (in theory) is basic token comparison. If the tool call is opened, then wait until the rest arrives (or some timeout condition elapses).
It is still needed for some use cases, see #12379 (comment). JSON verbosity is not a big problem for now IMO; we can always enforce a minimum chunk length in the future to reduce the number of SSE events to be emitted. Btw @ochafik I would like to help on this if needed. At least the case for streaming responses (without streaming tool calls) is very necessary for now. And I think @bandoti asked for that because at this point, this is the only blocking point to bring MCP into the llama.cpp server Web UI.
Are there client compatibility issues with streaming partial tool call responses? If so, maybe streaming of the tool call response itself should be optional (e.g. default state controllable by argument, and a JSON parameter)? Tool calling is supported in a lot more places than open-source IDE plugins (personally I want to use it in Home Assistant and Skyrim mods; currently I'm using prompt-injection alternatives because of the streaming issue).
I'm getting a hard crash when sending a request that has a tool response as the last message.
The last message was:
The model is
I'm trying to use the puppeteer mcp server. Is there a way to get more detail on this error?
@llowrey Thanks a lot for trying this out! If you could share the llama-server output w/
Haven't tried gemma3 yet, will definitely spend time on it over the weekend! Have you tried any other model by any chance?
@ngxson Hoping to spend this rainy weekend on this, hope to send you a few parts to review if you have the time :-) (I'd also highly welcome some feedback on how I've plugged things into the slot + partial / final response logic so far)
@antcodd Just as with OpenAI's chat completion API, streaming is enabled through the `stream` request parameter.
@bandoti Aside from the streamed thoughts (already presentable as they come), there are definitely ways to use the tool calls as they trickle back (either w/ parallel tool calls once you have some complete calls and are still receiving others, or w/ a single tool call when an argument is very long, e.g. file diffs that can be partially applied on the fly as in cline, see cline/cline#1946). I hope to contribute generators-based partial TypeScript JSON decoders once this gets in :-)
@bandoti I like state machines too, but in this case there may be too many states to enumerate manually; the regexps of some formats do some funky grouping that kinda simplifies the code. AND we can most likely turn this whole thing into a giant state machine using C++ coroutines once the project adopts C++20. Starting with something inefficient for ease of iteration / maintenance, but I've got my eyes on the prize ;-)
Thanks for the quick response @ochafik. Here's the console output:
Here's the POST body that causes this: crash.json. I can send it with Postman and get a crash every time. After more experimenting, it works most of the time; I just got unlucky with my first attempt. I really appreciate the work you are doing and hope this info helps.
This PR is still WIP (see todos at the bottom) but welcoming early feedback / testing
- Streams `<think>` reasoning content inside the content (same output for all thinking models when using the default `--reasoning-format deepseek`, even for those not using the `<think>` syntax like Command R7B), and even if the `<think>` tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
- Streams tool calls, with raw code arguments wrapped as JSON (`{"code": "json-encoded code"}` for multiline programs).

This fixes #12107, #10920, #11861
Follow up to #9639
How to test / use
- Get and build this PR's branch
- Run `llama-server` w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!)
- Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl (a rough sketch follows after this list)
- You can also open http://localhost:8080/ to see thoughts being streamed back properly, even for models whose template adds an opening `<think>` tag to the end of the prompt (QwQ, and now DeepSeek R1 too, although most GGUFs have their initial version) and models like Cohere Command R7B that natively use a different thinking tags syntax (now normalized, since `--reasoning-format deepseek` is the default)
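A rough, hedged sketch of such a request (the prompt and tool definition here are illustrative examples of my own, not taken from the PR; the model is the one mentioned in the TODOs below):

```sh
# Serve a tool-call-capable model with Jinja templating enabled.
llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L

# Streamed chat completion with a hypothetical weather tool; -N disables curl
# buffering so the SSE chunks are printed as they arrive.
curl -N http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "stream": true,
  "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get the current weather in a given city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
```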
Context

Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.
While tool calls are returned in a standard format, each w/ a function name, tool call id and JSON encoded arguments, model outputs vary greatly in their syntax. That syntax mostly uses JSON for arguments but not always.
Function calls and their arguments can be at various levels:

- `[TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}]`
- `<tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>` (note that some models use other keys here, e.g. `tool_name`, `parameters`, and may have the tool call id too)
- `` <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|> ``, or functionary v3.2: `special_function\n{"arg1": 1}`
- `{"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}}` (or inside a `tool_calls` array if `parallel_tool_calls` is on)
- `python` tool call, with two variants: `<|python_tag|>multiline python code here` (functionary v3.1), `python\nmultiline python code here` (functionary v3.2; w/ prefix `>>>` if after a textual response)
- `<|python_tag|>python.call(code="multiline\npython\ncode\nhere")`

Side note about raw python code: `<|python_tag|>foo.call(bar="baz")` in Llama 3.x style will return `"tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}]`, while the same output from Functionary would be parsed as `"tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}]`.

Now when streaming, we may have sampled only a prefix of the aforementioned output, and ideally want to parse what can be parsed out of it, and send a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.
(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)
The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (`chat.cpp` is still about the same size): for every token coming in, I try and parse the entire output so far, with partial regex & JSON parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full `common_chat_msg` against the last one we sent back, and compute OpenAI-compatible deltas out of this.
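For illustration, here is a hypothetical helper (not the actual llama.cpp code) showing the delta computation for the arguments string, assuming each partial arguments string that was sent is a prefix of the next one (which is what cutting at a safe place is meant to guarantee):

```cpp
#include <iostream>
#include <string>

// Hypothetical sketch: the part of the newly parsed (partially healed) JSON
// arguments that has not been streamed to the client yet.
static std::string arguments_delta(const std::string & sent_so_far, const std::string & parsed_now) {
    // Assumes sent_so_far is a prefix of parsed_now (safe-cut property).
    if (parsed_now.size() <= sent_so_far.size()) {
        return "";
    }
    return parsed_now.substr(sent_so_far.size());
}

int main() {
    std::string sent = R"({"arg1":)";     // what previous SSE chunks already carried
    std::string now  = R"({"arg1": 1})";  // parse of the full output after more tokens
    std::cout << arguments_delta(sent, now) << "\n";  // prints " 1}" (note leading space)
    // Concatenating all such deltas re-creates the full arguments JSON string.
}
```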
Location, location, location 🏡

Note that the output of the model may be truncated (max token output length reached or streaming in progress), and that may fall inside an expected literal (e.g. `<think>` isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON. But more interesting is where it happens, esp. for partial JSON:
`tests/test-chat-parser.cpp` should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in `tests/test-chat.cpp` (look out for `/* is_partial= */ true`).

See examples of streamed tool call deltas
Implementation notes
Partial parsing utils
I added a `common_chat_msg_parser` utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

- Partial regex support via `common_regex` (see `common/regex-partial.cpp`): e.g. `/abc/` gives `/((?:(?:c)?b)?a)[\s\S]*/`, with a single capturing group whose end indicates - in reverse - where the partial match started (see the sketch after this list).
- Partial JSON support, using `nlohmann/json`'s SAX interface to build location awareness / a stack to know how to heal a JSON that fails to parse (a simplified healing sketch also follows this list; `consume_json` accepts a list of JSON paths under which to expect arguments objects; could be from the root = empty path if the entire JSON object is an arguments object).
- `try_*` parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from `optional`s, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when `finish_reason` != `tool_call`).
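To make the reversed-pattern trick concrete, here is a small self-contained sketch (my own illustration using `std::regex`, not the `common_regex` implementation) that uses the reversed form of `/abc/` quoted above to find where a partial match starts at the end of a string:

```cpp
#include <cstddef>
#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string input = "some text ab";  // ends with a partial match of /abc/
    std::string reversed(input.rbegin(), input.rend());

    // Reversed /abc/ with each prefix made optional, as described above:
    // a single capturing group whose end marks (in reverse) where the partial match starts.
    std::regex rev(R"(((?:(?:c)?b)?a)[\s\S]*)");

    std::smatch m;
    if (std::regex_match(reversed, m, rev) && m.length(1) > 0) {
        std::size_t start = input.size() - (std::size_t) m.length(1);  // map back through the reversal
        std::cout << "partial match of /abc/ starts at " << start
                  << ": \"" << input.substr(start) << "\"\n";  // -> "ab"
    }
    return 0;
}
```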
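And a deliberately simplified, standalone illustration of the JSON-healing idea (the real code goes through `nlohmann/json`'s SAX interface and also has to pick a safe cut point; this sketch only closes whatever string/containers are still open):

```cpp
#include <iostream>
#include <string>
#include <vector>

// Close any unterminated string and any open objects/arrays in a truncated
// JSON text so that it can be parsed again.
static std::string heal_partial_json(const std::string & s) {
    std::vector<char> closers;
    bool in_string = false, escaped = false;
    for (char c : s) {
        if (in_string) {
            if (escaped)        escaped = false;
            else if (c == '\\') escaped = true;
            else if (c == '"')  in_string = false;
        }
        else if (c == '"') in_string = true;
        else if (c == '{') closers.push_back('}');
        else if (c == '[') closers.push_back(']');
        else if (c == '}' || c == ']') { if (!closers.empty()) closers.pop_back(); }
    }
    std::string healed = s;
    if (in_string) healed += '"';
    while (!closers.empty()) { healed += closers.back(); closers.pop_back(); }
    return healed;
}

int main() {
    // A truncated arguments object, e.g. the start of {"code": "print('hey')"}:
    std::cout << heal_partial_json(R"({"code": "print('he)") << "\n";
    // -> {"code": "print('he"}
    // Truncations right after a key or a ':' still need the location/stack
    // awareness described above to cut back to a safe place before healing.
}
```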
To think or not to think... what is the prompt?

I've also introduced `common_chat_syntax`, which wraps `common_reasoning_format` and `common_chat_format` together with:

- `thinking_forced_open`: whether the prompt was detected to end w/ a (model-specific) `<think>` tag to force thinking mode
- `reasoning_in_content`: whether the thinking tags should be left in the content, which is currently the case in streaming mode, as the DeepSeek API does

This allows streaming back a standard `<think>...` syntax even for models that use a different set of tags (e.g. Command R7B). And of course, `--reasoning-format none` is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a `reasoning_content` delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if `--reasoning-format deepseek`, which is the default).

Triggering thoughts 😓
I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.
To address this, I made it possible for `common_chat_templates_apply` to create trigger regexes that match on the entire output (this was already the case in the sampler). `COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL` (renamed from `_START`) is now expected to have a single capturing group, from the start of which the grammar sampler will be activated.
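As an illustration of why matching the entire output helps (a toy example of my own, not the actual llama.cpp trigger pattern): a full-output regex can require the trigger to appear after the closing thinking tag, and its single capturing group marks where grammar-constrained sampling should kick in.

```cpp
#include <iostream>
#include <regex>
#include <string>

int main() {
    // Model output where the "thinking" also mentions the trigger word:
    std::string output = "<think>I could use <tool_call> here...</think>\n<tool_call>{\"name\":";

    // Illustrative full-output trigger: only a <tool_call> that appears after
    // the closing </think> should count; the capturing group marks where the
    // grammar sampler would be activated.
    std::regex trigger(R"([\s\S]*</think>[\s\S]*?(<tool_call>))");

    std::smatch m;
    if (std::regex_search(output, m, trigger) && m[1].matched) {
        std::cout << "activate grammar at offset " << m.position(1) << "\n";
    }
    return 0;
}
```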
Functionary v3.2 w/ raw python

Ask `bartowski/functionary-small-v3.2-GGUF:Q4_K_M` to write a hello world in Python and it outputs `python\n{"code": "print('hey')"}`. But ask it to print a hello world in Python w/ matplotlib, and it uses its raw multiline Python syntax `python\nprint('hey')\n# many other lines`. This is now supported.

TODOs
- `tool-call`: ensure there's always a non-empty tool call id #12292
- `logprobs` for tools mode (right now, forbidden; we don't return diffs for every token, for instance if a function name is in multiple tokens we don't want to send its name in chunks)
- (`llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L`)
- `<|START_RESPONSE|>` at the end of the prompt. Output will contain an `<|END_RESPONSE|>` that needs handling (would fit nicely in the new `common_chat_syntax` struct). Maybe combine w/ forced/disabled thinking modes as a follow-up PR
- (`common_regex`) as separate PR
- (`common_json`) as separate PR(?) or fold into `chat-parser.cpp`
- `scripts/tool_bench.sh` to compare against `master` (+ compare timings)

Future follow ups:
cc/ @jpohhhh