server: streaming of tool calls and thoughts when --jinja is on #12379

Draft · wants to merge 59 commits into master from ochafik:tool-diffs
Conversation

@ochafik (Collaborator) commented Mar 14, 2025

This PR is still WIP (see TODOs at the bottom), but early feedback / testing is welcome.

  • Support streaming of tool calls in OpenAI format
  • Improve handling of thinking model (DeepSeek R1 Distills, QwQ, Command R7B):
    • Stream <think> reasoning content inside the content (same output for all thinking models when using the default --reasoning-format deepseek, even for those not using the <think> syntax like Command R7B), and even if the <think> tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
    • Avoid spurious lazy (tool call) grammar triggers from "thoughts about tool calls" (only trigger after closing any unclosed thoughts)
  • Improve Functionary v3.2 support (allow raw python code, preferred by models over {"code": "json-encoded code"} for multiline programs)
  • Support truncated outputs incl. reasoning_content & tool_calls (returns salvageable fields when finish_reason = length)

This fixes #12107, #10920, #11861

Follow up to #9639

How to test / use

  • Get and build this PR's branch
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    git remote add ochafik https://github.com/ochafik/llama.cpp
    git fetch ochafik
    git checkout ochafik/tool-diffs
    cmake -B build -DLLAMA_CURL=1 # -DGGML_CUDA=1 ...
    cmake --build build -t llama-server --parallel --config Release
    alias llama-server=./build/bin/llama-server
  • Run llama-server w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!):

    # Thoughts of Command R7B / DeepSeek R1 / QwQ will be streamed in the content inside <think> tags
    llama-server --jinja -fa -hf bartowski/Qwen_QwQ-32B-GGUF
    
    # Models w/ generic tool call support now return clean interrupted output when hitting token limit
    llama-server --jinja -fa -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF
    
  • Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl:

    curl http://localhost:8080/v1/chat/completions -d '{
      "model": "gpt-3.5-turbo",
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "python",
            "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters": {
              "type": "object",
              "properties": {
                "code": {
                  "type": "string",
                  "description": "The code to run in the ipython interpreter."
                }
              },
              "required": ["code"]
            }
          }
        }
      ],
      "messages": [
        {
          "role": "user",
          "content": "Print a hello world message with python."
        }
      ],
      "stream": true
    }'
  • You can also open http://localhost:8080/ to see thoughts being streamed back properly, even for models whose template adds an opening <think> tag at the end of the prompt (QwQ, and now DeepSeek R1 too, although most GGUFs still carry the initial version of the template) and for models like Cohere Command R7B that natively use a different thinking-tag syntax (now normalized, since --reasoning-format deepseek is the default)

Context

Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.

While tool calls are returned to the client in a standard format (each w/ a function name, tool call id and JSON-encoded arguments), model outputs vary greatly in their syntax. That syntax mostly uses JSON for the arguments, but not always.

Function calls and their arguments can be at various levels:

  • JSON array of tool calls (e.g. Mistral Nemo: [TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}])
  • Standalone JSON tool call (e.g. Hermes syntax: <tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>; note that some models use other keys here, e.g. tool_name, parameters, and may have the tool call id too)
  • JSON arguments object w/ name in some prefix (e.g. Deepseek: <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|>, or functionary v3.2: special_function\n{"arg1": 1})
  • Nested JSON for the generic mode {"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}} (or inside tool_calls array if parallel_tool_calls is on)
  • No JSON / raw code string for python tool call, with two variants:
    • Unconstrained verbatim code: <|python_tag|>multiline python code here (functionary v3.1), python\nmultiline python code here (functionary v3.2; w/ prefix >>> if after textual response)
    • Constrained pythonish syntax for "builtin tools" (Llama 3.x, quite widespread): <|python_tag|>python.call(code="multiline\npython\ncode\nhere")

Side note about raw python code: <|python_tag|>foo.call(bar="baz") in Llama 3.x style will return "tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}], while the same output from Functionary would be parsed as "tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}].

Now when streaming, we may have sampled only a prefix of the aforementioned output, and we ideally want to parse what can be parsed out of it, and send a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.

(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)

The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens, and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still about the same size): for every token coming in, I try and parse the entire output so far, with partial regex & JSON parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full common_chat_msg against the last one we sent back, and compute OpenAI-compatible deltas out of this.
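
To make the diffing step concrete, here's a minimal sketch of the idea (a hypothetical helper, not the PR's actual code), assuming the newly parsed arguments string always extends what was already streamed:

    #include <stdexcept>
    #include <string>

    // The delta sent to the client is the suffix of the newly parsed (and safely cut)
    // arguments string relative to what was already streamed. A mismatch means the
    // re-parse contradicted an earlier chunk - that's the condition behind the
    // "Invalid diff: ... not found at start of ..." error reported later in this thread.
    static std::string arguments_delta(const std::string & previous, const std::string & current) {
        if (current.compare(0, previous.size(), previous) != 0) {
            throw std::runtime_error("Invalid diff: '" + previous + "' not found at start of '" + current + "'");
        }
        return current.substr(previous.size());
    }

    // arguments_delta("{\"code\": \"print(", "{\"code\": \"print('hey')\"}") == "'hey')\"}"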

Location, location, location 🏡

Note that the output of the model may be truncated (max token output length reached, or streaming still in progress), and the cut may fall inside an expected literal (e.g. <think> isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON.

But more interesting is where it happens, esp. for partial JSON:

  • If it happens inside an arguments object or a contents string (for the generic mode), we should return it partial / truncated (and JSON-dumped in the case of the arguments), and diffed from the last parsed value in the streamed case
  • If it happens inside the wrapper of the arguments, then it depends. We don't want to send half a function name, but as soon as we have a complete function name we can send a diff. So we try and heal the JSON (we identify which JSON paths can be partially healed - because they're inside the arguments - and which ones must be dropped), and only populate a tool call if we have at least a name. Likewise, if there is an array of function calls with the first complete and the next partial, we want to make sure the client can start calling the first function. (See the sketch after this list.)
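
As a rough illustration of the healing step, here's a simplified sketch (not the PR's actual common_json implementation; it only handles cuts that fall inside a string value or right after a complete element):

    #include <cctype>
    #include <string>
    #include <vector>

    // Append a healing marker and the missing closers so a truncated JSON string parses.
    // The marker can later be located while visiting the parsed JSON to decide whether a
    // healed path may be kept partial (inside the arguments) or must be dropped.
    static std::string heal_partial_json(const std::string & partial, const std::string & marker) {
        std::vector<char> closers;
        bool in_string = false, escaped = false;
        for (char c : partial) {
            if (in_string) {
                if (escaped)        escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"')  in_string = false;
            } else if (c == '"') {
                in_string = true;
            } else if (c == '{') {
                closers.push_back('}');
            } else if (c == '[') {
                closers.push_back(']');
            } else if ((c == '}' || c == ']') && !closers.empty()) {
                closers.pop_back();
            }
        }
        std::string healed = partial;
        if (in_string) {
            healed += marker; // the cut fell inside a string value: mark it, then close the string
            healed += '"';
        } else {
            // drop trailing whitespace and a dangling comma so the closers yield valid JSON
            while (!healed.empty() && (std::isspace((unsigned char) healed.back()) || healed.back() == ',')) {
                healed.pop_back();
            }
        }
        for (auto it = closers.rbegin(); it != closers.rend(); ++it) {
            healed += *it;
        }
        return healed;
    }

    // heal_partial_json("{\"code\": \"print('hel", "<HEAL>") == "{\"code\": \"print('hel<HEAL>\"}"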

tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for /* is_partial= */ true)

See examples of streamed tool call deltas
curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
        {
        "type":"function",
        "function":{
            "name":"python",
            "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters":{
            "type":"object",
            "properties":{
                "code":{
                "type":"string",
                "description":"The code to run in the ipython interpreter."
                }
            },
            "required":["code"]
            }
        }
        }
    ],
    "messages": [
        {
        "role": "user",
        "content": "Print a hello world message with python."
        }
    ], "stream": true
}'
data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_aqwOReHDKPnqiF7NbRxzDTY1","type":"function","function":{"name":"python","arguments":""}}],"refusal":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"code"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"print"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"('"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"Hello"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":","}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":" World"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"!"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"')"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"}"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"tool_calls"}]}

data: [DONE]

Implementation notes

Partial parsing utils

I added a common_chat_msg_parser utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

  • Supports partial regex parsing
    • Since the STL still doesn't have partial matching support (unlike Boost), I had to implement my own in common_regex (see common/regex-partial.cpp).
    • The trick = transform the original regex into a regex that matches in reverse from the end of the string (e.g. /abc/ gives /((?:(?:c)?b)?a)[\s\S]*/, with a single capturing group whose end indicates - in reverse - where the partial match started; see the sketch after this list)
  • Supports partial JSON parsing:
    • Used nlohmann/json's SAX interface to build location awareness / stack to know how to heal a JSON that fails to parse
    • Healing the JSON w/ a healing marker that can then be found when visiting the resulting JSON (to remove things we don't want to heal - e.g. a partial function name - and to cut any JSON-encoded result at the "right" place, which must be somewhere inside the function arguments: consume_json accepts a list of JSON paths under which arguments objects are expected; this can be the root = empty path if the entire JSON object is an arguments object)
  • Supports control flow w/ try_* parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from optionals, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.
  • Supports full or partial parsing w/ same code (throws partial exceptions to interrupt the control flow without making parsing code more complex)
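
To make the regex-reversal trick tangible, here's a toy version that only handles a plain literal needle (no metacharacters) - the real common/regex-partial.cpp handles arbitrary patterns:

    #include <iostream>
    #include <regex>
    #include <string>

    // "abc" becomes "((?:(?:c)?b)?a)[\s\S]*": matched against the *reversed* input, the
    // capturing group's length says how much of the literal dangles at the end of the output.
    static std::string reversed_partial_pattern(const std::string & literal) {
        std::string chain;
        for (auto it = literal.rbegin(); it != literal.rend(); ++it) {
            chain = chain.empty() ? std::string(1, *it) : "(?:" + chain + ")?" + *it;
        }
        return "(" + chain + ")[\\s\\S]*";
    }

    int main() {
        const std::string input = "Let me see...<thi";       // generation cut inside "<think>"
        const std::string rev(input.rbegin(), input.rend()); // match in reverse from the end
        std::smatch m;
        if (std::regex_match(rev, m, std::regex(reversed_partial_pattern("<think>")))) {
            std::cout << "partial <think> starts at offset " << input.size() - m[1].length() << "\n"; // 13
        }
    }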

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when finish_reason != tool_call).
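
For illustration, the control-flow style looks roughly like this (a sketch of the shape only - the names and signatures here are hypothetical, not the actual common_chat_msg_parser interface):

    #include <stdexcept>
    #include <string>

    // Thrown to interrupt the control flow when the input ends in the middle of a construct.
    struct partial_exception : std::runtime_error {
        using std::runtime_error::runtime_error;
    };

    class toy_parser {
        std::string input_;
        size_t      pos_ = 0;
        bool        is_partial_;
      public:
        toy_parser(std::string input, bool is_partial) : input_(std::move(input)), is_partial_(is_partial) {}

        // Consume the literal at the current position if fully present; throw if the
        // (partial) input ends mid-literal; otherwise return false and let the caller
        // try something else.
        bool try_consume_literal(const std::string & lit) {
            const size_t remaining = input_.size() - pos_;
            if (remaining >= lit.size() && input_.compare(pos_, lit.size(), lit) == 0) {
                pos_ += lit.size();
                return true;
            }
            if (is_partial_ && remaining > 0 && remaining < lit.size() &&
                lit.compare(0, remaining, input_, pos_, remaining) == 0) {
                throw partial_exception("input ends inside literal: " + lit);
            }
            return false;
        }
    };

A full-message parse then reads as ordinary sequential code - try the thinking tags, try the format's tool-call openers, fall back to plain content - and a single catch of the partial exception returns whatever was salvaged so far.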

To think or not to think... what is the prompt?

I've also introduced common_chat_syntax, which wraps common_reasoning_format and common_chat_format together with:

  • thinking_forced_open: whether the prompt was detected to end w/ a (model-specific) <think> tag to force thinking mode
  • reasoning_in_content: whether the thinking tags should be left in the content, which is currently the case in streaming mode, matching the DeepSeek API's behaviour (rough sketch below).
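
A rough sketch of the resulting struct (field names taken from the description above; the exact definition in the PR may differ):

    struct common_chat_syntax {
        common_chat_format      format;
        common_reasoning_format reasoning_format;
        // keep the thinking tags inside the content (streaming / DeepSeek-API style)
        bool reasoning_in_content = false;
        // the template ended the prompt with an opening (model-specific) <think> tag
        bool thinking_forced_open = false;
    };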

This allows streaming back a standard <think>... syntax even for models that use a different set of tags (e.g. Command R7B). And of course, --reasoning-format none is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a reasoning_content delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if --reasoning-format deepseek, which is the default).

Triggering thoughts 😓

I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.

To address this, I made it possible for common_chat_templates_apply to create trigger regexes that match on the entire output (this was already the case in the sampler). COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL (renamed from _START) is now expected to have a single capturing group from the start of which the grammar sampler will be activated.
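
For example (hypothetical pattern, just to show the mechanics): a "full" trigger is run over the whole output accumulated so far, and the grammar is only armed from the start of the capturing group, i.e. on a tool-call opener that appears outside the thoughts:

    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        // Hermes-style opener, only matched after any <think> block has been closed.
        const std::regex trigger(R"(^(?:<think>[\s\S]*?</think>)?(?:(?!<think>)[\s\S])*?(<tool_call>))");
        const std::string out = "<think>I could call <tool_call>...</think>\n<tool_call>";
        std::smatch m;
        if (std::regex_search(out, m, trigger)) {
            // prints the offset of the second <tool_call>, the one outside the thoughts
            std::cout << "grammar activates at offset " << m.position(1) << "\n";
        }
    }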

Functionary v3.2 w/ raw python

Ask bartowski/functionary-small-v3.2-GGUF:Q4_K_M to write a hello world in Python and it outputs python\n{"code": "print('hey')"}.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax python\nprint('hey')\n# many other lines. This is now supported.

TODOs

  • Fix tool call id attribution logic (disabled for now) from tool-call: ensure there's always a non-empty tool call id #12292
  • Might need one last diff in the final response after a stream, say, to close any raw python code
  • Decide what to do about logprobs for tools mode (right now, forbidden; we don't return diffs for every token - for instance, if a function name spans multiple tokens we don't want to send its name in chunks)
    • Edit: OpenAI returns null logprobs in tool call mode. Just need to ensure normal mode doesn't regress (test failing atm)
  • Fix Mistral Nemo crash (llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L)
  • Command R7B's non-tool-calling template (they have 3 templates) forces <|START_RESPONSE|> at the end of the prompt. Output will contain an <|END_RESPONSE|> that needs handling (would fit nicely in new common_chat_syntax struct). Maybe combine w/ forced/disabled thinking modes as a follow up PR
  • Add some docs
  • Add more tests
  • Send partial regex (common_regex) as separate PR
  • Send partial JSON (common_json) as separate PR(?) or fold into chat-parser.cpp
  • Run scripts/tool_bench.sh to compare against master (+ compare timings)

Future follow ups:

  • To make this faster, I suggest two options:
    • Wait for the project to switch to C++20 & turn all the parser functions into resumable coroutines (feed them tokens and persist their state in the slot)
    • Only compute and send deltas after N milliseconds

cc/ @jpohhhh

github-actions bot added the documentation, testing, examples, python and server labels on Mar 14, 2025
@bandoti (Collaborator) commented Mar 17, 2025

Please forgive my ignorance of the OAI streaming protocol, but would it be worth considering waiting until the complete tool call is collected and then sending the whole thing back? The tool calls will essentially be re-assembled on the client side at the expense of a lot of JSON verbosity! And, outside of debug output, this will not be presentable to users until the whole thing arrives anyhow. 😉

I like @ngxson's idea about the state machine, as this is something I was considering for the llama-cli tool calls, where it would be necessary for cleaner output. In an ideal world, there is a simple stack which pushes tokens (perhaps a buffer >= 1) for nested begin/end delimiters, and "all that is needed" (in theory) is basic token comparison. If a tool call is opened, then wait until the rest arrives (or some timeout condition elapses).

@ngxson (Collaborator) commented Mar 18, 2025

Please forgive my ignorance of the OAI streaming protocol, but would it be worth considering waiting until the complete tool call is collected and then sending the whole thing back?

It is still needed for some use cases, see #12379 (comment)

JSON verbosity is not a big problem for now IMO; we can always enforce a minimum chunk length in the future to reduce the number of SSE events emitted.

Btw @ochafik I would like to help with this if needed. At least the case of streaming the response (without streaming tool calls) is very much needed right now. And I think @bandoti asked about it because, at this point, it is the only blocker for bringing MCP into the llama.cpp server Web UI.

@antcodd commented Mar 18, 2025

Are there client compatibility issues with streaming partial tool call responses? If so, maybe streaming of the tool call response itself should be optional (e.g. the default state controllable by an argument, plus a JSON parameter)?

Tool calling is supported in a lot more places than open-source IDE plugins (personally I want to use it in Home Assistant and Skyrim mods; currently I'm using prompt-injection alternatives because of the streaming issue).

@llowrey commented Mar 21, 2025

I'm getting a hard crash when sending a request that has a tool response as the last message.

terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '{"name":"servethehome_homepage' not found at start of '{"name":'

The last message was:

{
  "role":"tool",
  "tool_call_id":"AJHxgbA2l5We2057NiDORsHRtf6Vcqzt",
  "content":"Navigated to https://servethehome.com/"
}

The model is google_gemma-3-12b-it-Q6_K.gguf

I'm trying to use the puppeteer mcp server. Is there a way to get more detail on this error?

@ochafik (Collaborator, Author) commented Mar 21, 2025

I'm getting a hard crash when sending a request that has a tool response as the last message.

terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '{"name":"servethehome_homepage' not found at start of '{"name":'

@llowrey Thanks a lot for trying this out! If you could share the llama-server output w/ --verbose flag, that would help (in particular, the last couple of Parsing input with format: lines, and if possible the request: {... log to provide a full repro case)

The model is google_gemma-3-12b-it-Q6_K.gguf

Haven't tried gemma3 yet, will definitely spend time on it over the weekend! Have you tried any other model by any chance?

Btw @ochafik I would like to help on this if needed

@ngxson Hoping to spend this rainy weekend on it, and to send you a few parts to review if you have the time :-) (I'd also highly welcome some feedback on how I've plugged things into the slot + partial / final response logic so far)

Are there client compatibility issues with streaming partial tool call responses? If so maybe streaming of the tool call response itself should be optional (e.g. default state controllable by argument, and a json parameter)?

@antcodd Just as with OpenAI's chat completion API, streaming is enabled through the "stream": true parameter and aims to be 100% compatible with OAI's format.

And, outside of debug output this will not be presentable to users until the whole thing arrives anyhow. 😉

@bandoti Aside from the streamed thoughts (already presentable as they come), there are definitely ways to use the tool calls as they trickle back (either w/ parallel tool calls, once you have some complete calls and are still receiving others, or w/ a single tool call when an argument is very long, e.g. file diffs that can be partially applied on the fly as in Cline, see cline/cline#1946). I hope to contribute generators-based partial TypeScript JSON decoders once this gets in :-)

I like @ngxson idea about the state machine

@bandoti I like state machines too, but in this case there may be too many states to enumerate manually, and the regexps of some formats do some funky grouping that kinda simplifies the code. AND we can most likely turn this whole thing into a giant state machine using C++ coroutines once the project adopts C++20. Starting with something inefficient for ease of iteration / maintenance, but I've got my eyes on the prize ;-)

@llowrey commented Mar 21, 2025

Thanks for the quick response @ochafik

Here's the console output:

srv  update_chat_: Parsing chat message: {"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",
Parsing input with format Generic: {"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 1, column 94: syntax error while parsing object key - unexpected end of input; expected string literal: <<<{"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",>>>
Parsed partial JSON: {"tool_call":{"name":"puppeteer_screenshot","arguments":{"name":"278722862"}}} (json_healing_marker: "278722862)
Cleaned up JSON {"tool_call":{"name":"puppeteer_screenshot","arguments":{"name":"278722862"}}} to {"tool_call":{"name":"puppeteer_screenshot","arguments":"{\"name\":"}} (json_healing_marker : '"278722862')
Partial parse: incomplete tool call
Parsed message: {"role":"assistant","content":null,"tool_calls":[{"type":"function","function":{"name":"puppeteer_screenshot","arguments":"{\"name\":"}}]}
terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '{"name":"servethehome_homepage' not found at start of '{"name":'
Aborted (core dumped)

Here's the POST body that causes this: crash.json

I can send with postman and get a crash every time

After more experimenting, it works most of the time. I just got unlucky with my first attempt. I really appreciate the work you are doing and hope this info helps.

Successfully merging this pull request may close these issues.

Eval bug: llama-cpp-deepseek-r1.jinja template will miss the <think> tag