server: streaming of tool calls and thoughts when --jinja is on #12379

Draft · wants to merge 59 commits into master from ochafik:tool-diffs
Conversation

@ochafik (Collaborator) commented Mar 14, 2025

This PR is still WIP (see TODOs at the bottom), but early feedback / testing is welcome.

  • Support streaming of tool calls in OpenAI format
  • Improve handling of thinking model (DeepSeek R1 Distills, QwQ, Command R7B):
    • Stream <think> reasoning content inside the content (same output for all thinking models when using the default --reasoning-format deepseek, even for those not using the <think> syntax like Command R7B), and even if the <think> tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
    • Avoid spurious lazy (tool call) grammar triggers from "thoughts about tool calls" (only trigger after closing any unclosed thoughts)
  • Improve Functionary v3.2 support (allow raw python code, preferred by models over {"code": "json-encoded code"} for multiline programs)
  • Support truncated outputs incl. reasoning_content & tool_calls (returns salvageable fields when finish_reason = length)

This fixes #12107, #10920, #11861

Follow up to #9639

How to test / use

  • Get and build this PR's branch
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    git remote add ochafik https://github.com/ochafik/llama.cpp
    git fetch ochafik
    git checkout ochafik/tool-diffs
    cmake -B build -DLLAMA_CURL=1 # -DGGML_CUDA=1 ...
    cmake --build build -t llama-server --parallel --config Release
    alias llama-server=./build/bin/llama-server
  • Run llama-server w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!):

    # Thoughts of Command R7B / DeepSeek R1 / QwQ will be streamed in the content inside <think> tags
    llama-server --jinja -fa -hf bartowski/Qwen_QwQ-32B-GGUF
    
    # Models w/ generic tool call support now return clean interrupted output when hitting token limit
    llama-server --jinja -fa -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF
    
  • Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl:

    curl http://localhost:8080/v1/chat/completions -d '{
      "model": "gpt-3.5-turbo",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": ["code"]
            }
          }
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Print a hello world message with python."
        }
      ],
      "stream": true
    }'
  • You can also open http://localhost:8080/ to see thoughts being streamed back properly, even for models whose template adds an opening <think> tag at the end of the prompt (QwQ, and now DeepSeek R1 too, although most GGUFs still carry the initial version of the template) and for models like Cohere Command R7B that natively use a different thinking-tag syntax (now normalized, since --reasoning-format deepseek is the default)

Context

Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.

While tool calls are returned to the client in a standard format (each w/ a function name, tool call id and JSON-encoded arguments), model outputs vary greatly in their syntax. That syntax mostly uses JSON for the arguments, but not always.

Function calls and their arguments can be at various levels:

  • JSON array of tool calls (e.g. Mistral Nemo: [TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}])
  • Standalone JSON tool call (e.g. Hermes syntax: <tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>; note that some models use other keys here, e.g. tool_name, parameters, and may have the tool call id too)
  • JSON arguments object w/ name in some prefix (e.g. Deepseek: <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|>, or functionary v3.2: special_function\n{"arg1": 1})
  • Nested JSON for the generic mode {"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}} (or inside tool_calls array if parallel_tool_calls is on)
  • No JSON / raw code string for python tool call, with two variants:
    • Unconstrained verbatim code: <|python_tag|>multiline python code here (functionary v3.1), python\nmultiline python code here (functionary v3.2; w/ prefix >>> if after textual response)
    • Constrained pythonish syntax for "builtin tools" (Llama 3.x, quite widespread): <|python_tag|>python.call(code="multiline\npython\ncode\nhere")

Side note about raw python code: <|python_tag|>foo.call(bar="baz") in Llama 3.x style will return "tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}], while the same output from Functionary would be parsed as "tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}].

Now when streaming, we may have sampled only a prefix of the aforementioned output, and we ideally want to parse what can be parsed out of it, and send a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.

(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)

The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens, and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still about the same size): for every token coming in, I try and parse the entire output so far, with partial regex & JSON parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full common_chat_msg against the last one we sent back, and compute OpenAI-compatible deltas out of this.
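
To make the diffing step concrete, here's a minimal sketch of the idea (a hypothetical helper, not the PR's actual code), assuming the newly parsed arguments string always extends what was already streamed:

    #include <stdexcept>
    #include <string>

    // The delta sent to the client is the suffix of the newly parsed (and safely cut)
    // arguments string relative to what was already streamed. A mismatch means the
    // re-parse contradicted an earlier chunk - that's the condition behind the
    // "Invalid diff: ... not found at start of ..." error reported later in this thread.
    static std::string arguments_delta(const std::string & previous, const std::string & current) {
        if (current.compare(0, previous.size(), previous) != 0) {
            throw std::runtime_error("Invalid diff: '" + previous + "' not found at start of '" + current + "'");
        }
        return current.substr(previous.size());
    }

    // arguments_delta("{\"code\": \"print(", "{\"code\": \"print('hey')\"}") == "'hey')\"}"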

Location, location, location 🏡

Note that the output of the model may be truncated (max token output length reached, or streaming still in progress), and the cut may fall inside an expected literal (e.g. <think> isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON.

But more interesting is where it happens, esp. for partial JSON:

  • If it happens inside an arguments object or a contents string (for the generic mode), we should return it partial / truncated (and JSON-dumped in the case of the arguments), and diffed from the last parsed value in the streamed case
  • If it happens inside the wrapper of the arguments, then it depends. We don't want to send half a function name, but as soon as we have a complete function name we can send a diff. So we try and heal the JSON (we identify which JSON paths can be partially healed - because they're inside the arguments - and which ones must be dropped), and only populate a tool call if we have at least a name. Likewise, if there is an array of function calls with the first complete and the next partial, we want to make sure the client can start calling the first function. (See the sketch after this list.)
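
As a rough illustration of the healing step, here's a simplified sketch (not the PR's actual common_json implementation; it only handles cuts that fall inside a string value or right after a complete element):

    #include <cctype>
    #include <string>
    #include <vector>

    // Append a healing marker and the missing closers so a truncated JSON string parses.
    // The marker can later be located while visiting the parsed JSON to decide whether a
    // healed path may be kept partial (inside the arguments) or must be dropped.
    static std::string heal_partial_json(const std::string & partial, const std::string & marker) {
        std::vector<char> closers;
        bool in_string = false, escaped = false;
        for (char c : partial) {
            if (in_string) {
                if (escaped)        escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"')  in_string = false;
            } else if (c == '"') {
                in_string = true;
            } else if (c == '{') {
                closers.push_back('}');
            } else if (c == '[') {
                closers.push_back(']');
            } else if ((c == '}' || c == ']') && !closers.empty()) {
                closers.pop_back();
            }
        }
        std::string healed = partial;
        if (in_string) {
            healed += marker; // the cut fell inside a string value: mark it, then close the string
            healed += '"';
        } else {
            // drop trailing whitespace and a dangling comma so the closers yield valid JSON
            while (!healed.empty() && (std::isspace((unsigned char) healed.back()) || healed.back() == ',')) {
                healed.pop_back();
            }
        }
        for (auto it = closers.rbegin(); it != closers.rend(); ++it) {
            healed += *it;
        }
        return healed;
    }

    // heal_partial_json("{\"code\": \"print('hel", "<HEAL>") == "{\"code\": \"print('hel<HEAL>\"}"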

tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for /* is_partial= */ true)

See examples of streamed tool call deltas
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
        {
        "type":"function",
        "function":{
            "name":"python",
            "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters":{
            "type":"object",
            "properties":{
                "code":{
                "type":"string",
                "description":"The code to run in the ipython interpreter."
                }
            },
            "required":["code"]
            }
        }
        }
    ],
    "messages": [
        {
        "role": "user",
        "content": "Print a hello world message with python."
        }
    ], "stream": true
}'
data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_aqwOReHDKPnqiF7NbRxzDTY1","type":"function","function":{"name":"python","arguments":""}}],"refusal":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"code"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"print"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"('"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"Hello"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":","}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":" World"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"!"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"')"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"}"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"tool_calls"}]}

data: [DONE]

Implementation notes

Partial parsing utils

I added a common_chat_msg_parser utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

  • Supports partial regex parsing
    • Since the STL still doesn't have partial matching support (unlike Boost), I had to implement my own in common_regex (see common/regex-partial.cpp).
    • The trick = transform the original regex into a regex that matches in reverse from the end of the string (e.g. /abc/ gives /((?:(?:c)?b)?a)[\s\S]*/, with a single capturing group whose end indicates - in reverse - where the partial match started; see the sketch after this list)
  • Supports partial JSON parsing:
    • Used nlohmann/json's SAX interface to build location awareness / stack to know how to heal a JSON that fails to parse
    • Healing the JSON w/ a healing marker that can then be found when visiting the resulting JSON (to remove things we don't want to heal - e.g. a partial function name - and to cut any JSON-encoded result at the "right" place, which must be somewhere inside the function arguments: consume_json accepts a list of JSON paths under which arguments objects are expected; this can be the root = empty path if the entire JSON object is an arguments object)
  • Supports control flow w/ try_* parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from optionals, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.
  • Supports full or partial parsing w/ same code (throws partial exceptions to interrupt the control flow without making parsing code more complex)
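
To make the regex-reversal trick tangible, here's a toy version that only handles a plain literal needle (no metacharacters) - the real common/regex-partial.cpp handles arbitrary patterns:

    #include <iostream>
    #include <regex>
    #include <string>

    // "abc" becomes "((?:(?:c)?b)?a)[\s\S]*": matched against the *reversed* input, the
    // capturing group's length says how much of the literal dangles at the end of the output.
    static std::string reversed_partial_pattern(const std::string & literal) {
        std::string chain;
        for (auto it = literal.rbegin(); it != literal.rend(); ++it) {
            chain = chain.empty() ? std::string(1, *it) : "(?:" + chain + ")?" + *it;
        }
        return "(" + chain + ")[\\s\\S]*";
    }

    int main() {
        const std::string input = "Let me see...<thi";       // generation cut inside "<think>"
        const std::string rev(input.rbegin(), input.rend()); // match in reverse from the end
        std::smatch m;
        if (std::regex_match(rev, m, std::regex(reversed_partial_pattern("<think>")))) {
            std::cout << "partial <think> starts at offset " << input.size() - m[1].length() << "\n"; // 13
        }
    }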

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when finish_reason != tool_call).
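
For illustration, the control-flow style looks roughly like this (a sketch of the shape only - the names and signatures here are hypothetical, not the actual common_chat_msg_parser interface):

    #include <stdexcept>
    #include <string>

    // Thrown to interrupt the control flow when the input ends in the middle of a construct.
    struct partial_exception : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    class toy_parser {
        std::string input_;
        size_t      pos_ = 0;
        bool        is_partial_;
      public:
        toy_parser(std::string input, bool is_partial) : input_(std::move(input)), is_partial_(is_partial) {}

        // Consume the literal at the current position if fully present; throw if the
        // (partial) input ends mid-literal; otherwise return false and let the caller
        // try something else.
        bool try_consume_literal(const std::string & lit) {
            const size_t remaining = input_.size() - pos_;
            if (remaining >= lit.size() && input_.compare(pos_, lit.size(), lit) == 0) {
                pos_ += lit.size();
                return true;
            }
            if (is_partial_ && remaining > 0 && remaining < lit.size() &&
                lit.compare(0, remaining, input_, pos_, remaining) == 0) {
                throw partial_exception("input ends inside literal: " + lit);
            }
            return false;
        }
    };

A full-message parse then reads as ordinary sequential code - try the thinking tags, try the format's tool-call openers, fall back to plain content - and a single catch of the partial exception returns whatever was salvaged so far.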

To think or not to think... what is the prompt?

I've also introduced common_chat_syntax, which wraps common_reasoning_format and common_chat_format together with:

  • thinking_forced_open: whether the prompt was detected to end w/ a (model-specific) <think> tag to force thinking mode
  • reasoning_in_content: whether the thinking tags should be left in the content, which is currently the case in streaming mode, matching the DeepSeek API's behaviour (rough sketch below).
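
A rough sketch of the resulting struct (field names taken from the description above; the exact definition in the PR may differ):

    struct common_chat_syntax {
        common_chat_format      format;
        common_reasoning_format reasoning_format;
        // keep the thinking tags inside the content (streaming / DeepSeek-API style)
        bool reasoning_in_content = false;
        // the template ended the prompt with an opening (model-specific) <think> tag
        bool thinking_forced_open = false;
    };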

This allows streaming back a standard <think>... syntax even for models that use a different set of tags (e.g. Command R7B). And of course, --reasoning-format none is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a reasoning_content delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if --reasoning-format deepseek, which is the default).

Triggering thoughts 😓

I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.

To address this, I made it possible for common_chat_templates_apply to create trigger regexes that match on the entire output (this was already the case in the sampler). COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL (renamed from _START) is now expected to have a single capturing group from the start of which the grammar sampler will be activated.
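
For example (hypothetical pattern, just to show the mechanics): a "full" trigger is run over the whole output accumulated so far, and the grammar is only armed from the start of the capturing group, i.e. on a tool-call opener that appears outside the thoughts:

    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        // Hermes-style opener, only matched after any <think> block has been closed.
        const std::regex trigger(R"(^(?:<think>[\s\S]*?</think>)?(?:(?!<think>)[\s\S])*?(<tool_call>))");
        const std::string out = "<think>I could call <tool_call>...</think>\n<tool_call>";
        std::smatch m;
        if (std::regex_search(out, m, trigger)) {
            // prints the offset of the second <tool_call>, the one outside the thoughts
            std::cout << "grammar activates at offset " << m.position(1) << "\n";
        }
    }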

Functionary v3.2 w/ raw python

Ask bartowski/functionary-small-v3.2-GGUF:Q4_K_M to write a hello world in Python and it outputs python\n{"code": "print('hey')"}.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax python\nprint('hey')\n# many other lines. This is now supported.

TODOs

  • Fix tool call id attribution logic (disabled for now) from tool-call: ensure there's always a non-empty tool call id #12292
  • Might need one last diff in the final response after a stream, say, to close any raw python code
  • Decide what to do about logprobs for tools mode (right now, forbidden; we don't return diffs for every token - for instance, if a function name spans multiple tokens we don't want to send its name in chunks)
    • Edit: OpenAI returns null logprobs in tool call mode. Just need to ensure normal mode doesn't regress (test failing atm)
  • Fix Mistral Nemo crash (llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L)
  • Command R7B's non-tool-calling template (they have 3 templates) forces <|START_RESPONSE|> at the end of the prompt. Output will contain an <|END_RESPONSE|> that needs handling (would fit nicely in new common_chat_syntax struct). Maybe combine w/ forced/disabled thinking modes as a follow up PR
  • Add some docs
  • Add more tests
  • Send partial regex (common_regex) as separate PR
  • Send partial JSON (common_json) as separate PR(?) or fold into chat-parser.cpp
  • Run scripts/tool_bench.sh to compare against master (+ compare timings)

Future follow ups:

  • To make this faster, I suggest two options:
    • Wait for the project to switch to C++20 & turn all the parser functions into resumable coroutines (feed them tokens and persist their state in the slot)
    • Only compute and send deltas after N milliseconds

cc/ @jpohhhh

github-actions bot added the documentation, testing, examples, python and server labels on Mar 14, 2025
@bandoti (Collaborator) commented Mar 17, 2025

Please forgive my ignorance of the OAI streaming protocol, but would it be worth considering waiting until the complete tool call is collected and then sending the whole thing back? The tool calls will essentially be re-assembled on the client side at the expense of a lot of JSON verbosity! And, outside of debug output, this will not be presentable to users until the whole thing arrives anyhow. 😉

I like @ngxson's idea about the state machine, as this is something I was considering for the llama-cli tool calls, where it would be necessary for cleaner output. In an ideal world, there is a simple stack which pushes tokens (perhaps a buffer >= 1) for nested begin/end delimiters, and "all that is needed" (in theory) is basic token comparison. If a tool call is opened, then wait until the rest arrives (or some timeout condition elapses).

@ngxson (Collaborator) commented Mar 18, 2025

Please forgive my ignorance of the OAI streaming protocol, but would it be worth considering waiting until the complete tool call is collected and then sending the whole thing back?

It is still needed for some use cases, see #12379 (comment)

JSON verbosity is not a big problem for now IMO; we can always enforce a minimum chunk length in the future to reduce the number of SSE events emitted.

Btw @ochafik I would like to help with this if needed. At least the case of streaming the response (without streaming tool calls) is very much needed right now. And I think @bandoti asked about it because, at this point, it is the only blocker for bringing MCP into the llama.cpp server Web UI.

@antcodd commented Mar 18, 2025

Are there client compatibility issues with streaming partial tool call responses? If so, maybe streaming of the tool call response itself should be optional (e.g. the default state controllable by an argument, plus a JSON parameter)?

Tool calling is supported in a lot more places than open-source IDE plugins (personally I want to use it in Home Assistant and Skyrim mods; currently I'm using prompt-injection alternatives because of the streaming issue).

@llowrey commented Mar 21, 2025

I'm getting a hard crash when sending a request that has a tool response as the last message.

terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '{"name":"servethehome_homepage' not found at start of '{"name":'

The last message was:

{
  "role":"tool",
  "tool_call_id":"AJHxgbA2l5We2057NiDORsHRtf6Vcqzt",
  "content":"Navigated to https://servethehome.com/"
}

The model is google_gemma-3-12b-it-Q6_K.gguf

I'm trying to use the puppeteer mcp server. Is there a way to get more detail on this error?

@ochafik (Collaborator, Author) commented Mar 21, 2025

I'm getting a hard crash when sending a request that has a tool response as the last message.

terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '{"name":"servethehome_homepage' not found at start of '{"name":'

@llowrey Thanks a lot for trying this out! If you could share the llama-server output w/ --verbose flag, that would help (in particular, the last couple of Parsing input with format: lines, and if possible the request: {... log to provide a full repro case)

The model is google_gemma-3-12b-it-Q6_K.gguf

Haven't tried gemma3 yet, will definitely spend time on it over the weekend! Have you tried any other model by any chance?

Btw @ochafik I would like to help on this if needed

@ngxson Hoping to spend this rainy weekend on it, and to send you a few parts to review if you have the time :-) (I'd also highly welcome some feedback on how I've plugged things into the slot + partial / final response logic so far)

Are there client compatibility issues with streaming partial tool call responses? If so maybe streaming of the tool call response itself should be optional (e.g. default state controllable by argument, and a json parameter)?

@antcodd Just as with OpenAI's chat completion API, streaming is enabled through the "stream": true parameter and aims to be 100% compatible with OAI's format.

And, outside of debug output this will not be presentable to users until the whole thing arrives anyhow. 😉

@bandoti Aside from the streamed thoughts (already presentable as they come), there are definitely ways to use the tool calls as they trickle back (either w/ parallel tool calls, once you have some complete calls and are still receiving others, or w/ a single tool call when an argument is very long, e.g. file diffs that can be partially applied on the fly as in Cline, see cline/cline#1946). I hope to contribute generators-based partial TypeScript JSON decoders once this gets in :-)

I like @ngxson idea about the state machine

@bandoti I like state machines too, but in this case there may be too many states to enumerate manually, and the regexps of some formats do some funky grouping that kinda simplifies the code. AND we can most likely turn this whole thing into a giant state machine using C++ coroutines once the project adopts C++20. Starting with something inefficient for ease of iteration / maintenance, but I've got my eyes on the prize ;-)

@llowrey commented Mar 21, 2025

Thanks for the quick response @ochafik

Here's the console output:

srv  update_chat_: Parsing chat message: {"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",
Parsing input with format Generic: {"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 1, column 94: syntax error while parsing object key - unexpected end of input; expected string literal: <<<{"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",>>>
Parsed partial JSON: {"tool_call":{"name":"puppeteer_screenshot","arguments":{"name":"278722862"}}} (json_healing_marker: "278722862)
Cleaned up JSON {"tool_call":{"name":"puppeteer_screenshot","arguments":{"name":"278722862"}}} to {"tool_call":{"name":"puppeteer_screenshot","arguments":"{\"name\":"}} (json_healing_marker : '"278722862')
Partial parse: incomplete tool call
Parsed message: {"role":"assistant","content":null,"tool_calls":[{"type":"function","function":{"name":"puppeteer_screenshot","arguments":"{\"name\":"}}]}
terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '{"name":"servethehome_homepage' not found at start of '{"name":'
Aborted (core dumped)

Here's the POST body that causes this: crash.json

I can send with postman and get a crash every time

After more experimenting, it works most of the time. I just got unlucky with my first attempt. I really appreciate the work you are doing and hope this info helps.

Successfully merging this pull request may close these issues.

Eval bug: llama-cpp-deepseek-r1.jinja template will miss the <think> tag