
Conversation

@firecoperana
Collaborator

Enable DeepSeek V3.1 thinking mode as the default; disable with `--reasoning-budget 0`.
This PR also implements tool calling support and disables assistant prefill for thinking models.

Merges ggml-org/llama.cpp#15533 and ggml-org/llama.cpp#15404

ExtReMLapin and others added 3 commits September 9, 2025 18:30
…(#15639)

Co-authored-by: CNE Pierre FICHEPOIL <pierre-1.fichepoil@gendarmerie.interieur.gouv.fr>
* feat: Set enable_thinking IFF not disabled and supported

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix inverted logic condition for prefill error

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Always parse the enable_thinking kwarg to overwrite the default value

From what I can tell, this started as a Qwen3-specific keyword, but since the code in
`chat.cpp` translates `inputs.enable_thinking` to the right thinking kwarg for the given
model, it is now more of a standardized kwarg, so it should always override the default
value when sent as part of the chat_template_kwargs field in the API.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
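As a rough illustration of that override behavior, a sketch in Python (hypothetical helper name; the actual logic lives in the C++ of `chat.cpp`):

```python
def resolve_enable_thinking(default_enabled, chat_template_kwargs):
    """Illustrative sketch: a client-supplied enable_thinking always
    overrides the server default, regardless of model family."""
    value = chat_template_kwargs.get("enable_thinking")
    if value is None:
        return default_enabled
    return value

print(resolve_enable_thinking(True, {}))                          # True (default kept)
print(resolve_enable_thinking(True, {"enable_thinking": False}))  # False (overridden)
```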

* fix: Don't limit template expansion check to jinja

With the use_jinja check, non-jinja models would enable thinking and always
fail assistant prefill

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add the error text to json type errors in json_value

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Explicitly reject string values for "enable_thinking"

There are too many possible "truthy" / "falsy" strings and too many
ambiguous strings that don't have a clear truthy/falsy value, so the
simplest thing to do here is to reject the request. Ideally, this would be
a 422 (Unprocessable Entity), but right now it's coming back as a 500.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
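The rejection of string values can be sketched like this (illustrative Python, not the actual server code; the real server currently returns a 500 rather than the ideal 422):

```python
def parse_enable_thinking(value):
    """Illustrative sketch: accept only JSON booleans for enable_thinking.
    Strings like "true", "no", or "1" are rejected outright instead of
    guessing at their truthiness."""
    if isinstance(value, bool):
        return value
    raise ValueError(f'"enable_thinking" must be a boolean, got {value!r}')

print(parse_enable_thinking(False))  # False
```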

* refactor: Move logic for detecting template enable_thinking support to common

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use raw pointer for common chat template function

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
# Conflicts:
#	tools/server/server.cpp
#	tools/server/utils.hpp
…) (#15533)

* Add DeepSeek V3.1 thinking mode support

- Added COMMON_CHAT_FORMAT_DEEPSEEK_V3_1 enum value
- Created common_chat_params_init_deepseek_v3_1() function (currently uses R1 implementation)
- Created common_chat_parse_deepseek_v3_1() function that handles V3.1 thinking format:
  - Extracts reasoning content before '</think>' tag into reasoning_content
  - Extracts regular content after '</think>' tag into content
  - No opening '<think>' tag in V3.1 format
- Added detection logic for V3.1 templates based on pattern: 'message['prefix'] is defined and message['prefix'] and thinking'
- Added V3.1 case to parsing switch statement

This addresses the issue where V3.1 outputs reasoning content followed by '</think>' and then regular content without the opening '<think>' tag.
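A sketch of the described parsing behavior (illustrative Python, not the actual C++ in `common/chat.cpp`):

```python
def parse_deepseek_v31(output):
    """Illustrative sketch of the V3.1 format: reasoning text comes first,
    terminated by '</think>', with no opening '<think>' tag."""
    marker = "</think>"
    pos = output.find(marker)
    if pos == -1:
        # No closing tag yet: treat everything as reasoning content.
        return {"reasoning_content": output, "content": ""}
    return {
        "reasoning_content": output[:pos],
        "content": output[pos + len(marker):],
    }

print(parse_deepseek_v31("step one</think>Hello"))
# {'reasoning_content': 'step one', 'content': 'Hello'}
```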

* Another attempt at V3.1 non-thinking

* Fix test, but it's not asserting anything.

* Ignore vim swap files in tests dir

* Update the test

* Try using try_find_literal instead of regex

* passing test

* Revert "Try using try_find_literal instead of regex"

This reverts commit c50d887ec2780dd9e6b8b397e92347d3db8d5575.

* Remove unnecessary change

* Remove comment

* Add code to handle non-thinking mode.

* Try to set message['prefix'] when thinking is enabled.

* This fixes reasoning, but breaks normal content. We need state in the
chat parser.

* DeepSeek V3.1 thinking is now the default. Disable with `--reasoning-budget 0`.

* Simplify (DeepSeek V3.1 reasoning)

* Fix sign inversion bug

* Add some tool calling code (not working).

* Tool calls working in non-reasoning mode.

* Attempt a unit test for tool call parsing.

* Passing test

* Add tests for both happy path and broken fenced DeepSeek V3.1 tool call variants.

* Passing DeepSeek V3.1 tool call tests, but model is not working.

* Revert assistant response prefill change. Not my monkeys.

* Add fenced_thinking unit test variant. Passes, but thinking tool calling
still isn't working for some reason.

* Tests pass in reasoning mode. Also e2e tool test passes.

* Make a copy of the parse_json_tool_calls function for deepseek-v3.1 so
as to not accidentally introduce regressions.

* Fix thinking_forced_open logic. tool calling broken. Need to add another
test case.

* That's what I get for cargo culting a newline.

* Add multi tool call test for deepseek v3.1 non-reasoning

* Move test, remove .gitignore change

* Place deepseek-v3.1 reasoning test directly into existing reasoning
function per CISC's request.

* Address whitespace CI failure.

* Merge two assert_equals per CISC's request.

* Add DeepSeek-V3.1 tests to tests/test-chat.cpp per CISC's request.

* Merge deepseek V3.1 and regular parse_json_tool_calls() function
behaviors by adding optional update_cursor argument.

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* DeepSeek V3.1 fix reasoning_format none

* Strip grammar down to strictly what we expect based on model card. Throw
out parts we cargo culted from R1 that don't make sense.

* Update tests/test-chat-parser.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* DeepSeek V3.1 - Add edge case where thinking is forced open, there is
tool calling in the reasoning content, but then the model just stops the
output without closing the </think> tag, so it's not a partial. In this
case, use the tool call in the reasoning content.

* DeepSeek V3.1 - simplify update_cursor

* Update common/chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update common/chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update common/chat.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Fix indent

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@firecoperana firecoperana self-assigned this Sep 9, 2025
@firecoperana firecoperana mentioned this pull request Sep 9, 2025
@ikawrakow
Owner

Can somebody test this? Thanks!

@ChicoPinto70

ChicoPinto70 commented Sep 10, 2025

I've just tested it and it seems to be working fine.

My test was with Ubergarm's DeepSeek-V3.1-smol-IQ4_KSS model in Roo Code. In reasoning mode it uses the tools and shows the thinking output in the proper frame, but:

  1. To make it work, I had to replace the chat template file with the one provided by Unsloth (I believe the mainline page uses the Unsloth ones).
  2. I've noticed that it sometimes fails at tool calling. That may be because this model doesn't natively support tool calling with reasoning, or because the chat-template-file injection is not a perfect solution.

This is the command line I used:

CUDA_VISIBLE_DEVICES="1,2,0" ./build/bin/llama-server --alias DeepSeek-V3.1-IQ4_KSS -m /home/chico/.lmstudio/models/ubergarm/DeepSeek-V3.1-GGUF/DeepSeek-V3.1-smol-IQ4_KSS-00001-of-00007.gguf -ngl 64 -c 65536 -mla 3 -fa -amb 512 -fmoe -t 28 -ctk q8_0 -ot "blk.[0-6].._exps.=CUDA1,blk.(7|8|9|10).._exps.=CUDA2,exps=CPU" --parallel 1 --numa distribute -b 512 -ub 512 -ts 1,0,0 --host 192.168.0.9 --port 1235 --jinja --chat-template-file /home/chico/ik_llama.cpp/models/templates/Unsloth-DeepSeek-V3.1.jinja --reasoning-format auto

and this is the chat template file I used: https://huggingface.co/unsloth/DeepSeek-V3.1/blob/main/chat_template.jinja

I almost forgot!!! Thanks, @firecoperana!

@arichiardi

@ikawrakow I can test this as well, compiling as we speak - FWIW I am using GLM-4.5-Air with a custom chat template

@arichiardi

arichiardi commented Sep 10, 2025

Noticed this branch spits out

ik-llama@GLM-4.5-Air-ik[35756]: Enable thinking? 1

But my command line reads:

--chat-template-kwargs '{"enable_thinking":false}'

I tried removing it and I still get the same output. I was expecting false or 0, for what it's worth, but it might be unrelated.

EDIT: the patch is definitely applied, as I see the correct error response when sending a payload with {"enable_thinking": "false"} (it should be the literal false).

ik-llama@GLM-4.5-Air-ik[36148]: INFO [      log_server_request] request | tid="140108745404416" timestamp=1757536917 remote_addr="..." remote_port=62582 status=500 method="POST" path="/v1/chat/completions" params={}
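The distinction here is JSON types: `enable_thinking` must be the JSON literal `false`, not the string `"false"`. A minimal sketch of a valid request body (the model name is just a placeholder):

```python
import json

# enable_thinking must be a JSON boolean; the string "false" is rejected.
payload = {
    "model": "GLM-4.5-Air",  # placeholder model name
    "messages": [{"role": "user", "content": "Hi"}],
    "chat_template_kwargs": {"enable_thinking": False},
}
print(json.dumps(payload["chat_template_kwargs"]))  # {"enable_thinking": false}
```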

@firecoperana
Collaborator Author

Checked with mainline and it also shows Enable thinking? 1.
It does not use --chat-template-kwargs, just --reasoning-budget, to set this value. I will remove this to avoid confusion.

@arichiardi

@firecoperana that makes me wonder where the above "enable_thinking" is actually used. I would suggest you double-check that it works on your side as well.

Shouldn't "enable_thinking" and "reasoning_budget=0" do the same thing after all?

@firecoperana
Collaborator Author

Yes, they do the same thing. The only caveat is that when you have both, enable_thinking will now override reasoning_budget.

@arichiardi

arichiardi commented Sep 12, 2025

FWIW, this looks good here (I compiled rebasing onto main as well)

@ikawrakow ikawrakow merged commit 6d2e7ca into main Sep 13, 2025
@kirnat

kirnat commented Sep 23, 2025

Thanks for adding this. For some reason I can't get tool calling working properly with DeepSeek V3.1, though it works fine with upstream. Could it be a chat template issue? The template ChicoPinto70 used should be the same as the one included with the Unsloth model I tested with, but I'm a bit unsure whether Roo Code actually uses native OpenAI tool calling; I thought it used custom XML-tag-styled formatting/parsing.

Model used:
unsloth/DeepSeek-V3.1-GGUF/UD-Q3_K_XL

Notable arguments:
--jinja --reasoning-format auto

Test Case

payload = {
    "model": "model", 
    "messages": [{"role": "user", "content": "List files in /tmp"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "list_directory",
            "description": "List files in a directory",
            "parameters": {"type": "object", "properties": {"path": {"type": "string"}}}
        }
    }],
    "tool_choice": "auto"
}

Response (ik_llama.cpp 18f0435):

"choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "Okay, the user wants me to list the files in the /tmp directory. This is a straightforward request that requires a simple directory listing operation. \n\nI'll use the list_directory function with the path parameter set to \"/tmp\". This should return the contents of that directory. \n\nThe function is designed to handle this exact type of request, so no additional parameters or special handling is needed.",
        "content": "list_directory{\"path\": \"/tmp\"}"
      }
    }
  ]

Response (llama.cpp):

"choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "First, the user is asking to list the files in the /tmp directory. I have a function called list_directory that can handle this. The function requires a path parameter, which in this case is \"/tmp\".\n\nI need to call the list_directory function with the path set to \"/tmp\". The tool call syntax is:<\uff5ctool\u2581calls\u2581begin\uff5c><\uff5ctool\u2581call\u2581begin\uff5c>tool_name<\uff5ctool\u2581sep\uff5c>{\"arg1\": \"some_value\"}<\uff5ctool\u2581call\u2581end\uff5c><\uff5ctool\u2581calls\u2581end\uff5c> So for this, it should be:<\uff5ctool\u2581calls\u2581begin\uff5c><\uff5ctool\u2581call\u2581begin\uff5c>list_directory<\uff5ctool\u2581sep\uff5c>{\"path\": \"/tmp\"}<\uff5ctool\u2581call\u2581end\uff5c><\uff5ctool\u2581calls\u2581end\uff5c>",
        "content": "I'll list the files in the /tmp directory for you.",
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "list_directory",
              "arguments": "{\"path\":\"/tmp\"}"
            },
            "id": "xgYG5r2rWS2WBewIvPmYLNEuPfkTHsGC"
          }
        ]
      }
    }
  ]
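The difference between the two responses can be expressed as a small check. This helper is purely illustrative (not part of either server); it distinguishes a properly parsed tool call from one that leaked into the content field:

```python
def tool_call_parsed(choice):
    """Return True when the server parsed the tool call into the structured
    tool_calls field rather than leaving it in the content string."""
    message = choice["message"]
    return (choice["finish_reason"] == "tool_calls"
            and bool(message.get("tool_calls")))

# Condensed versions of the two responses above.
ik_choice = {"finish_reason": "stop",
             "message": {"role": "assistant",
                         "content": 'list_directory{"path": "/tmp"}'}}
upstream_choice = {"finish_reason": "tool_calls",
                   "message": {"role": "assistant",
                               "content": "I'll list the files in the /tmp directory for you.",
                               "tool_calls": [{"type": "function",
                                               "function": {"name": "list_directory",
                                                            "arguments": '{"path":"/tmp"}'}}]}}

print(tool_call_parsed(ik_choice))        # False: call leaked into content
print(tool_call_parsed(upstream_choice))  # True: call parsed correctly
```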

@firecoperana
Collaborator Author

So the reasoning works for DeepSeek V3.1. Can you try sending the same payload a few times? The success rate of tool calls varies by model.

@kirnat

kirnat commented Sep 23, 2025

Thanks. Yes, the reasoning works great. I have run the test about 20-30 times, and the only thing I have noticed is that the LLM sometimes outputs the tool call in reasoning content and sometimes in the regular content. This particular model has been performing really well with tool calling in llama.cpp.

@firecoperana
Collaborator Author

If you change tool_choice from auto to required, does it force the model to generate a tool call? It seems like it's not a parsing issue; rather, the model does not generate tool call content.

@kirnat

kirnat commented Sep 25, 2025

If I set tool_choice to required, it does the same thing: it outputs the tool call in content, as in my example output above. I don't know what else to test. The same model works flawlessly in llama.cpp with the same options, and after making hundreds of tool calls it hasn't failed formatting even once. DeepSeek 3.1 is exceptionally good at this task, especially compared to V3 and R1.

@firecoperana
Collaborator Author

Ok. Unfortunately I don't have DeepSeek V3.1 at hand to test, and it will be a while before I have time to try it. I hope someone who has used tool calling successfully with DeepSeek V3.1 can share their experience.

@kirnat

kirnat commented Sep 26, 2025

Thanks a lot. I will try to do more debugging and report any potential findings.

@firecoperana
Collaborator Author

#799 See if this makes any difference for you.

@firecoperana firecoperana deleted the fcp/deepseek3.1_toolcall branch October 26, 2025 16:57