server: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command R7B & DeepSeek R1) unless --reasoning-format none #11607

Merged 94 commits into master from r1-toolcall on Feb 13, 2025. The diff below shows changes from the first 75 commits.

Commits
d3b60b8
minja: enhance backfill of templates w/o tools description (use examp…
ochafik Feb 3, 2025
87de852
pass vocab to common_chat_params_init
ochafik Feb 3, 2025
130ca22
DeepSeek R1: parse thoughts / return in separate field in API (non st…
ochafik Feb 3, 2025
04d511b
Avoid double bos w/ jinja
ochafik Feb 3, 2025
2834587
server/oai: ensure content is null when there are tool calls
ochafik Feb 3, 2025
c80cb30
update logs
ochafik Feb 3, 2025
0871628
rename tests
ochafik Feb 3, 2025
73d08d4
tool-call: allow `--jinja --chat-template chatml`
ochafik Feb 3, 2025
04be723
tool-call: fix command-r7b parsing when response is multiline
ochafik Feb 3, 2025
ae9d581
tool-calls: add DeepSeek R1 Qwen 7B to server test_hello_world
ochafik Feb 3, 2025
19bea4e
tell DS R1 not to overthink (weather test)
ochafik Feb 3, 2025
5e6f2a2
add deepseek models to server tool call section in readme
ochafik Feb 3, 2025
1e9acd2
tool-call: allow `--jinja --chat-template chatml`
ochafik Feb 3, 2025
77ae97e
Update test_tool_call.py
ochafik Feb 3, 2025
a76073c
minimize diffs
ochafik Feb 3, 2025
cf83623
fix typo
ochafik Feb 3, 2025
5d18d76
fix double bos issue (drop bos/eos tokens from jinja template)
ochafik Feb 3, 2025
aa98e59
fix bad merge
ochafik Feb 3, 2025
2b3c482
fix build / rm diff
ochafik Feb 3, 2025
4cb0e1d
Merge branch 'jinja-chatml' into r1-toolcall
ochafik Feb 3, 2025
b2dd490
add missing try catch around jinja parsing to default to chatml
ochafik Feb 3, 2025
08271b5
Merge branch 'jinja-chatml' into r1-toolcall
ochafik Feb 3, 2025
df3474e
tool-calls: r1: add missing <|tool▁calls▁end|> to grammar!
ochafik Feb 3, 2025
c397bd1
tweak delta logic
ochafik Feb 3, 2025
569610e
tool-calls: accommodate variety of wrong tool call opening tags both …
ochafik Feb 3, 2025
d73448d
Simplify default chatml logic
ochafik Feb 3, 2025
0be7f65
Merge branch 'jinja-chatml' into r1-toolcall
ochafik Feb 3, 2025
7dc271f
tool-calls: add deepseek r1 template + accommodate broken official te…
ochafik Feb 3, 2025
c6214ee
rm unneeded vocab
ochafik Feb 3, 2025
1c302e1
simpler hacky fixes for original broken template (+ fix minja example…
ochafik Feb 3, 2025
108da90
sync: minja https://github.com/google/minja/pull/46
ochafik Feb 3, 2025
bc6d910
Merge branch 'master' into r1-toolcall
ochafik Feb 3, 2025
11c1f0c
actually we want eos_token in the template to infer tool call example…
ochafik Feb 3, 2025
30ea359
update to minja's new api
ochafik Feb 3, 2025
bbd45bf
sync: minja
ochafik Feb 4, 2025
bff549d
simplify hack to fix original template's backfill from minja
ochafik Feb 4, 2025
ce28224
tool-call: r1: add one more trigger approx "<|tool calls begin|>"
ochafik Feb 4, 2025
e84ee88
r1: fix inadvertent newline in grammar before <|tool▁call▁end|>
ochafik Feb 4, 2025
18a11f4
tool-call: r1: fix grammar
ochafik Feb 4, 2025
9a6847c
move trigger_words init inside non-llguidance branch
ochafik Feb 4, 2025
a682d12
fix / test parsing of r1 parser
ochafik Feb 4, 2025
f0154a6
Fix / test models/templates/llama-cpp-deepseek-r1.jinja
ochafik Feb 4, 2025
326e700
update test_calc_result
ochafik Feb 4, 2025
78b47bb
fix test_calc_result
ochafik Feb 4, 2025
86994db
fix spaces
ochafik Feb 4, 2025
09caa63
`sync`: minja
ochafik Feb 4, 2025
b152729
Update test-chat.cpp
ochafik Feb 4, 2025
56a14dd
fix mistral chat test: need empty tokens
ochafik Feb 4, 2025
f12e350
Update chat.cpp
ochafik Feb 4, 2025
d43e4f6
Merge branch 'sync-minja-4' into r1-toolcall
ochafik Feb 4, 2025
812544a
server: check that content is null when we get tool_calls
ochafik Feb 4, 2025
d44eb95
tool-call: ensure we don't return content when there are tool calls /…
ochafik Feb 4, 2025
b6e14a4
fix mistral expectation
ochafik Feb 4, 2025
1f5ec59
ensure deepseek r1 thoughts parsed even w/o tool calls
ochafik Feb 4, 2025
438ce0b
fix test-chat
ochafik Feb 4, 2025
21f2071
Update chat.cpp
ochafik Feb 4, 2025
b5b117f
Merge branch 'sync-minja-4' into r1-toolcall
ochafik Feb 4, 2025
0db9881
Fix r1 grammar since we made <|tool▁calls▁begin|> optional (triggerin…
ochafik Feb 4, 2025
d1b6691
r1: revert making <|tool▁calls▁begin|> optional as somehow sampling t…
ochafik Feb 4, 2025
39c1d81
return thoughts in reasoning_content field
ochafik Feb 4, 2025
b2d1728
update readme section about common model tool call formats
ochafik Feb 4, 2025
933f7a1
Merge branch 'master' into r1-toolcall
ochafik Feb 4, 2025
5d60ceb
Update test_tool_call.py
ochafik Feb 4, 2025
1f1f06a
Merge branch 'master' into r1-toolcall
ochafik Feb 5, 2025
9d7c3cc
--think to force any model to return reasoning_content (or just parse…
ochafik Feb 5, 2025
d20c2ce
Merge branch 'r1-toolcall' of github.com:ochafik/llama.cpp into r1-to…
ochafik Feb 5, 2025
f3e9f8b
fix test_thoughts
ochafik Feb 5, 2025
3841a16
fix compiler warning about parens
ochafik Feb 5, 2025
e6d9b52
align Command R7B w/ --think / reasoning_content behaviour
ochafik Feb 5, 2025
39b50c3
Update README.md
ochafik Feb 5, 2025
0917e0a
fix --think arg env
ochafik Feb 5, 2025
098629d
disable some failing chatml tests
ochafik Feb 5, 2025
33efcb3
Update README.md
ochafik Feb 5, 2025
994301d
use existing string_strip
ochafik Feb 5, 2025
d1a0640
revert tool example backfill change - command 7rb just needs the righ…
ochafik Feb 5, 2025
cc2c712
Merge remote-tracking branch 'origin/master' into r1-toolcall
ochafik Feb 8, 2025
c0f972b
Use --reasoning-format, remove forced thinking for now
ochafik Feb 8, 2025
af63886
return reasoning_content before content
ochafik Feb 8, 2025
a59fde2
update model template / format mapping
ochafik Feb 8, 2025
b829cab
fix test-chat
ochafik Feb 8, 2025
95cddfd
rm thoughts from generic parser
ochafik Feb 9, 2025
e598e7a
sync: minja (https://github.com/google/minja/pull/52)
ochafik Feb 9, 2025
91542ca
tool-calls: allow r1 output to miss <think> opening tag (since latest…
ochafik Feb 9, 2025
8d82be9
sync: minja (https://github.com/ggerganov/llama.cpp/pull/11774)
ochafik Feb 9, 2025
30dcfaa
rm wrong warning in command-r parser (when normal text)
ochafik Feb 9, 2025
e1bff8f
update deepseek r1 templates (+ put update commands in ./scripts/get_…
ochafik Feb 9, 2025
a29dc92
fix server test_tool_calls.py
ochafik Feb 9, 2025
ea2f41e
add models/templates/README.md
ochafik Feb 9, 2025
8409bf1
fix test_calc_result & test_thoughts
ochafik Feb 9, 2025
01db429
fix test-chat (update delta to latest r1 template change)
ochafik Feb 9, 2025
37a4bb2
Merge remote-tracking branch 'origin/master' into r1-toolcall
ochafik Feb 12, 2025
d52579a
prefer json::at to operator[] in chat.cpp
ochafik Feb 13, 2025
4700245
Merge remote-tracking branch 'origin/master' into r1-toolcall
ochafik Feb 13, 2025
043cb99
Apply suggestions from code review
ochafik Feb 13, 2025
9 changes: 9 additions & 0 deletions common/arg.cpp
@@ -1975,6 +1975,15 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.use_jinja = true;
}
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_MAIN}).set_env("LLAMA_ARG_JINJA"));
add_opt(common_arg(
{"--think"},
"*experimental* thinking mode (default: disabled)\n"
"returns reasoning_content in messages, forcing model to think unless it supports native <think> tags (DeepSeek R1, Command R7B)\n"
"only supported for non-streamed responses",
[](common_params & params) {
params.think = true;
}
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_MAIN}).set_env("LLAMA_ARG_THINK"));
@ngxson (Collaborator) commented on Feb 6, 2025:
IMO the --think flag is not very intuitive; it should be something like --reasoning-format, though personally I'd still prefer to do this on a per-request basis.

Also, to be future-proof, we should make this flag accept a param. For example, --reasoning-format deepseek would return it as reasoning_content. Again, this is because we are pretty sure that OpenAI will break the whole thing in the near future.

Collaborator:

From a usability standpoint, --think feels a bit more intuitive / memorable to me in terms of what it does.

Other alternatives might be --separate-thinking or --extract-reasoning or --format-reasoning or...?

Collaborator:

I mean, what's non-intuitive about it is that the model will still think even without --think being added. This flag is supposed to force the model to start the reasoning process, not to enable/disable it completely.

@ochafik (Collaborator, Author) replied on Feb 7, 2025:

Tbh I'm not really a fan of a query parameter yet, mostly because:

  • I see thinking mode as a static property of the model and its orchestration (much like the template)
  • It's a new non-standard API to define (more naming headaches)
  • It makes llama-server less compatible w/ DeepSeek's API, not more.
  • We can always decide to add it later if we start w/ a flag

Re: flags, I think there may be room for two: one that controls the reasoning behaviour (extract, leave inline, force), and one for the return format. For the first one, how about:

  • --reasoning=extract → Parses DeepSeek R1 & Command R7B thoughts (default)
  • --reasoning=inline → Leaves thoughts inline in the content (format is model-specific)
  • --reasoning=force → Coerces non-thinking models to think (edit: maybe force-experimental for now)

As for the format, there are already two APIs out there:

  • DeepSeek's reasoning_content
  • Cohere's message.tool_plan, alas only as a prelude to tool calls (but it seems easy to trick R7B into thinking more generally, just needs a tweaked template)

I'd favour just sticking to reasoning_content for now until OpenAI announces their own format (and when they do, default to OpenAI's format and offer --reasoning-format=deepseek for backwards compatibility). OR decide to create our own format now / default --reasoning-format=llama.cpp that returns thoughts in the message.thoughts field, for instance.

WDYT?
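For illustration, the two existing field placements mentioned above would look roughly like this. This is a sketch only: the field contents are invented, and it assumes nlohmann::json is available (llama.cpp vendors it); the actual output keys are exactly what this thread is debating.

```cpp
#include <nlohmann/json.hpp>
#include <iostream>

using json = nlohmann::json;

int main() {
    // DeepSeek-style: thoughts live in a dedicated sibling "reasoning_content"
    // field, whether or not the message carries tool calls.
    json deepseek_style = {
        {"role", "assistant"},
        {"reasoning_content", "The user wants the weather, so I should call get_weather."},
        {"content", nullptr},
        {"tool_calls", json::array()},
    };

    // Cohere-style: thoughts only appear as a "tool_plan" prelude to tool calls.
    json cohere_style = {
        {"role", "assistant"},
        {"tool_plan", "I will look up the weather first."},
        {"tool_calls", json::array()},
    };

    std::cout << deepseek_style.dump(2) << "\n" << cohere_style.dump(2) << "\n";
}
```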

Collaborator:

> Tbh I'm not really a fan of a query parameter yet

Seems OK to me, but eventually I think someone will add this as a per-request param.

Re: having 2 flags for behavior / format, this seems more reasonable to me. From a functional programming perspective, it can be expressed as response = format(behavior(original_generated_content)).

But I think your idea still mixes these 2 layers.

For --reasoning extract|inline|force:

  • I'm not sure what force is supposed to do, but it seems like it needs some prompt engineering, so I think you should consider that; maintaining prompts can be a burden
  • For inline, does that mean reasoning can appear in the middle of generation? For example, content..think..content..think. Please lmk if I understand correctly.

For --reasoning-format, I don't get why we'd want to invent a new --reasoning-format llama.cpp that puts things inside message.thoughts. IMO we should keep things simple until OpenAI drops their format. Probably we can have --reasoning-format deepseek|none and set deepseek as the default for now, then change the default to oai once we have the OpenAI format.

@ngxson (Collaborator) replied on Feb 7, 2025:

> But I think your idea still mixes these 2 layers.

IMO if we want --reasoning to control the behavior, then it should affect the generation phase (for example, control grammar / logits bias). So it should have 3 values:

  • enabled: the model behaves as usual
  • disabled: we never allow the model to use the <think> token ==> control via logits bias
  • force: force the model to think, maybe via grammar or prompt engineering?

Then for --reasoning-format, it should only "transform" the result into the desired format (a.k.a. a pure function); we can have 3 values:

  • deepseek: put content inside reasoning_content
  • none: do not format, simply forward all the generated tokens to the user
  • and then oai can be added in the future
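A minimal sketch of that "pure function" view of --reasoning-format, assuming a single leading <think>…</think> block (the real parsers in common/chat.cpp are template-specific and grammar-aware, so treat this as illustrative only):

```cpp
#include <iostream>
#include <string>
#include <utility>

// Returns {reasoning_content, content}. With "none", or when no think tags
// are found, the raw text is forwarded untouched as content.
static std::pair<std::string, std::string>
apply_reasoning_format(const std::string & raw, const std::string & format) {
    if (format == "deepseek") {
        const std::string open = "<think>", close = "</think>";
        auto start = raw.find(open);
        auto end   = raw.find(close);
        if (start != std::string::npos && end != std::string::npos && start < end) {
            return {
                raw.substr(start + open.size(), end - start - open.size()),
                raw.substr(end + close.size()),
            };
        }
    }
    return {"", raw};
}

int main() {
    auto [reasoning, content] =
        apply_reasoning_format("<think>2+2 is 4</think>The answer is 4.", "deepseek");
    std::cout << "reasoning_content: " << reasoning << "\ncontent: " << content << "\n";
}
```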

Collaborator (Author):

> IMO if we want --reasoning to control the behavior, then it should affect the generation phase (for example, control grammar/logits bias). So it should have 3 values:
>
>   • enabled: the model behaves as usual
>   • disabled: we never allow model to use <think> token ==> control via logits bias
>   • force: force the model to think, maybe use grammar or prompt engineering?

@ngxson Makes sense; I've removed the forced thinking from this PR and will explore it again as a follow-up (also, I see more of a case for that option as a query param, while reasoning-format now has stronger flag vibes).

> I'm not sure what force is supposed to do, but seems like it needs some prompt engineering so I think you should consider that, maintaining prompts can be a burden

Good point. In earlier experiments I tried controlling the entire tool-call process (even grammar generation) from Jinja templates; I might play with this again.

> Then for the --reasoning-format, it should only "transform" the result into the desired format (a.k.a a pure function), we can have 3 values:
> deepseek: put content inside reasoning_content
> none: do not format, simply forward all the generated tokens to user
> and then oai can be added in the future

Updated the code (defaulting to deepseek), thanks!

@ochafik (Collaborator, Author) commented on Feb 11, 2025:

Note that I've updated the code to the latest DeepSeek template changes (they added a trailing <think> 😅; updated minja accordingly: #11774 (comment)).
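To make the implication concrete: once the template itself ends the prompt with a trailing <think>, the completion begins inside the think block, so the parser must tolerate a missing opening tag. A hedged sketch of that tolerance (hypothetical helper; the actual handling lives in the R1 parser in common/chat.cpp):

```cpp
#include <string>
#include <utility>

// Tolerant split: if "</think>" appears, everything before it is treated as
// thoughts, even when the opening "<think>" was already emitted as part of
// the prompt by the template and is absent from the completion.
static std::pair<std::string, std::string>
split_thoughts_tolerant(const std::string & raw) {
    const std::string open = "<think>", close = "</think>";
    auto end = raw.find(close);
    if (end == std::string::npos) {
        return {"", raw}; // no thoughts to extract
    }
    auto start = raw.find(open);
    size_t begin = (start != std::string::npos && start < end) ? start + open.size() : 0;
    return {raw.substr(begin, end - begin), raw.substr(end + close.size())};
}
```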

Collaborator (Author):

Hey @ngxson, any more concerns / suggestions about this PR?

Collaborator:

Sorry, I've been a bit busy recently; I'll do a review later today or tomorrow.

add_opt(common_arg(
{"--chat-template"}, "JINJA_TEMPLATE",
string_format(
423 changes: 307 additions & 116 deletions common/chat.cpp

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions common/chat.hpp
@@ -19,6 +19,7 @@ struct common_chat_inputs {
bool stream;
std::string grammar;
bool add_generation_prompt = true;
bool think = false;
};

enum common_chat_format {
@@ -28,11 +29,13 @@ enum common_chat_format {
COMMON_CHAT_FORMAT_LLAMA_3_X,
COMMON_CHAT_FORMAT_LLAMA_3_X_WITH_BUILTIN_TOOLS,
COMMON_CHAT_FORMAT_DEEPSEEK_R1,
COMMON_CHAT_FORMAT_DEEPSEEK_R1_THINK,
COMMON_CHAT_FORMAT_FIREFUNCTION_V2,
COMMON_CHAT_FORMAT_FUNCTIONARY_V3_2,
COMMON_CHAT_FORMAT_FUNCTIONARY_V3_1_LLAMA_3_1,
COMMON_CHAT_FORMAT_HERMES_2_PRO,
COMMON_CHAT_FORMAT_COMMAND_R7B,
COMMON_CHAT_FORMAT_COMMAND_R7B_THINK,

COMMON_CHAT_FORMAT_COUNT, // Not a format, just the # formats
};
3 changes: 2 additions & 1 deletion common/common.h
@@ -346,6 +346,7 @@ struct common_params {
std::string chat_template = ""; // NOLINT
bool use_jinja = false; // NOLINT
bool enable_chat_template = true;
bool think = false; // return reasoning_content, force model to think unless it supports native <think> tags.

std::vector<std::string> api_keys;

@@ -623,7 +624,7 @@ struct common_chat_msg {
std::string role;
std::string content;
std::vector<common_tool_call> tool_calls;
std::string tool_plan = "";
std::string reasoning_content = "";
};

// Check if the template supplied via "--chat-template" is supported or not. Returns true if it's valid
12 changes: 6 additions & 6 deletions common/sampling.cpp
@@ -151,12 +151,6 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, co

lparams.no_perf = params.no_perf;

std::vector<const char *> trigger_words;
trigger_words.reserve(params.grammar_trigger_words.size());
for (const auto & str : params.grammar_trigger_words) {
trigger_words.push_back(str.word.c_str());
}

struct llama_sampler * grmr;
if (params.grammar.compare(0, 11, "%llguidance") == 0) {
#ifdef LLAMA_USE_LLGUIDANCE
@@ -165,6 +159,12 @@
GGML_ABORT("llguidance (cmake -DLLAMA_LLGUIDANCE=ON) is not enabled");
#endif // LLAMA_USE_LLGUIDANCE
} else {
std::vector<const char *> trigger_words;
trigger_words.reserve(params.grammar_trigger_words.size());
for (const auto & str : params.grammar_trigger_words) {
trigger_words.push_back(str.word.c_str());
}

grmr = params.grammar_lazy
? llama_sampler_init_grammar_lazy(vocab, params.grammar.c_str(), "root",
trigger_words.data(), trigger_words.size(),
125 changes: 74 additions & 51 deletions examples/server/README.md
@@ -127,6 +127,7 @@ The project is under active development, and we are [looking for feedback and co
| `--grammar-file FNAME` | file to read grammar from |
| `-j, --json-schema SCHEMA` | JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object<br/>For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
| `--jinja` | Enable experimental Jinja templating engine (required for tool use) |
| `--think` | Enable experimental thinking mode (extracts DeepSeek R1 & Command R7B's native thinking tags and forces any other model to think before responding, resulting thoughts are in the `reasoning_content` output field) (requires `--jinja`) |

**Example-specific params**

Expand Down Expand Up @@ -1136,61 +1137,74 @@ curl http://localhost:8080/v1/chat/completions \

| Template | Format |
|----------|--------|
| CohereForAI-c4ai-command-r-plus-default.jinja | generic tool calls |
| CohereForAI-c4ai-command-r-plus-rag.jinja | generic tool calls |
| CohereForAI-c4ai-command-r-plus-tool_use.jinja | generic tool calls |
| MiniMaxAI-MiniMax-Text-01.jinja | generic tool calls |
| NexaAIDev-Octopus-v2.jinja | generic tool calls |
| NousResearch-Hermes-2-Pro-Llama-3-8B-default.jinja | generic tool calls |
| NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja | hermes 2 pro tool calls |
| NousResearch-Hermes-2-Pro-Mistral-7B-default.jinja | generic tool calls |
| NousResearch-Hermes-2-Pro-Mistral-7B-tool_use.jinja | hermes 2 pro tool calls |
| NousResearch-Hermes-3-Llama-3.1-70B-default.jinja | generic tool calls |
| NousResearch-Hermes-3-Llama-3.1-70B-tool_use.jinja | hermes 2 pro tool calls |
| OrionStarAI-Orion-14B-Chat.jinja | generic tool calls |
| Qwen-QwQ-32B-Preview.jinja | hermes 2 pro tool calls |
| Qwen-Qwen2-7B-Instruct.jinja | generic tool calls |
| Qwen-Qwen2-VL-7B-Instruct.jinja | generic tool calls |
| Qwen-Qwen2.5-7B-Instruct.jinja | hermes 2 pro tool calls |
| Qwen-Qwen2.5-Math-7B-Instruct.jinja | hermes 2 pro tool calls |
| TheBloke-FusionNet_34Bx2_MoE-AWQ.jinja | generic tool calls |
| abacusai-Fewshot-Metamath-OrcaVicuna-Mistral.jinja | generic tool calls |
| bofenghuang-vigogne-2-70b-chat.jinja | generic tool calls |
| databricks-dbrx-instruct.jinja | generic tool calls |
| deepseek-ai-DeepSeek-Coder-V2-Instruct.jinja | generic tool calls |
| deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja | deepseek r1 tool calls |
| deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja | deepseek r1 tool calls |
| deepseek-ai-DeepSeek-R1-Distill-Qwen-7B.jinja | deepseek r1 tool calls |
| deepseek-ai-DeepSeek-V2.5.jinja | deepseek r1 tool calls |
| deepseek-ai-deepseek-coder-33b-instruct.jinja | generic tool calls |
| google-gemma-2-2b-it.jinja | generic tool calls |
| google-gemma-7b-it.jinja | generic tool calls |
| indischepartij-MiniCPM-3B-OpenHermes-2.5-v2.jinja | generic tool calls |
| mattshumer-Reflection-Llama-3.1-70B.jinja | generic tool calls |
| meetkai-functionary-medium-v3.2.jinja | functionary v3.2 tool calls |
| meta-llama-Llama-3.1-8B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
| meta-llama-Llama-3.2-3B-Instruct.jinja | llama 3.x tool calls |
| meta-llama-Llama-3.3-70B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
| meta-llama-Meta-Llama-3.1-8B-Instruct.jinja | llama 3.x tool calls (w/ builtin tools) |
| microsoft-Phi-3-medium-4k-instruct.jinja | generic tool calls |
| microsoft-Phi-3-mini-4k-instruct.jinja | generic tool calls |
| microsoft-Phi-3-small-8k-instruct.jinja | generic tool calls |
| microsoft-Phi-3.5-mini-instruct.jinja | generic tool calls |
| microsoft-Phi-3.5-vision-instruct.jinja | generic tool calls |
| mistralai-Mistral-7B-Instruct-v0.2.jinja | generic tool calls |
| mistralai-Mistral-Large-Instruct-2407.jinja | mistral nemo tool calls |
| mistralai-Mistral-Large-Instruct-2411.jinja | generic tool calls |
| mistralai-Mistral-Nemo-Instruct-2407.jinja | mistral nemo tool calls |
| mistralai-Mixtral-8x7B-Instruct-v0.1.jinja | generic tool calls |
| mlabonne-AlphaMonarch-7B.jinja | generic tool calls |
| nvidia-Llama-3.1-Nemotron-70B-Instruct-HF.jinja | llama 3.x tool calls (w/ builtin tools) |
| openchat-openchat-3.5-0106.jinja | generic tool calls |
| teknium-OpenHermes-2.5-Mistral-7B.jinja | generic tool calls |
| CohereForAI-c4ai-command-r-plus-default.jinja | Generic |
| CohereForAI-c4ai-command-r-plus-rag.jinja | Generic |
| CohereForAI-c4ai-command-r-plus-tool_use.jinja | Generic |
| CohereForAI-c4ai-command-r7b-12-2024-default.jinja | Command R7B |
| CohereForAI-c4ai-command-r7b-12-2024-rag.jinja | Command R7B |
| CohereForAI-c4ai-command-r7b-12-2024-tool_use.jinja | Command R7B |
| Infinigence-Megrez-3B-Instruct.jinja | Generic |
| LGAI-EXAONE-EXAONE-3.5-7.8B-Instruct.jinja | Generic |
| MiniMaxAI-MiniMax-Text-01.jinja | Generic |
| NexaAIDev-Octopus-v2.jinja | Generic |
| NousResearch-Hermes-2-Pro-Llama-3-8B-default.jinja | Generic |
| NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja | Hermes 2 Pro |
| NousResearch-Hermes-2-Pro-Mistral-7B-default.jinja | Generic |
| NousResearch-Hermes-2-Pro-Mistral-7B-tool_use.jinja | Hermes 2 Pro |
| NousResearch-Hermes-3-Llama-3.1-70B-default.jinja | Generic |
| NousResearch-Hermes-3-Llama-3.1-70B-tool_use.jinja | Hermes 2 Pro |
| OrionStarAI-Orion-14B-Chat.jinja | Generic |
| Qwen-QwQ-32B-Preview.jinja | Hermes 2 Pro |
| Qwen-Qwen2-7B-Instruct.jinja | Generic |
| Qwen-Qwen2-VL-7B-Instruct.jinja | Generic |
| Qwen-Qwen2.5-7B-Instruct.jinja | Hermes 2 Pro |
| Qwen-Qwen2.5-Math-7B-Instruct.jinja | Hermes 2 Pro |
| THUDM-glm-4-9b-chat.jinja | Generic |
| THUDM-glm-edge-1.5b-chat.jinja | Generic |
| TheBloke-FusionNet_34Bx2_MoE-AWQ.jinja | Generic |
| abacusai-Fewshot-Metamath-OrcaVicuna-Mistral.jinja | Generic |
| ai21labs-AI21-Jamba-1.5-Large.jinja | Generic |
| bofenghuang-vigogne-2-70b-chat.jinja | Generic |
| databricks-dbrx-instruct.jinja | Generic |
| deepseek-ai-DeepSeek-Coder-V2-Instruct.jinja | Generic |
| deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja | DeepSeek R1 |
| deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja | DeepSeek R1 |
| deepseek-ai-DeepSeek-R1-Distill-Qwen-7B.jinja | DeepSeek R1 |
| deepseek-ai-DeepSeek-V2.5.jinja | DeepSeek R1 |
| deepseek-ai-deepseek-coder-33b-instruct.jinja | Generic |
| fireworks-ai-llama-3-firefunction-v2.jinja | FireFunction v2 |
| google-gemma-2-2b-it.jinja | Generic |
| google-gemma-7b-it.jinja | Generic |
| ibm-granite-granite-3.1-8b-instruct.jinja | Generic |
| indischepartij-MiniCPM-3B-OpenHermes-2.5-v2.jinja | Generic |
| mattshumer-Reflection-Llama-3.1-70B.jinja | Generic |
| meetkai-functionary-medium-v3.1.jinja | Functionary v3.1 Llama 3.1 |
| meetkai-functionary-medium-v3.2.jinja | Functionary v3.2 |
| meta-llama-Llama-2-7b-chat-hf.jinja | Generic |
| meta-llama-Llama-3.1-8B-Instruct.jinja | Llama 3.x |
| meta-llama-Llama-3.2-3B-Instruct.jinja | Llama 3.x |
| meta-llama-Llama-3.3-70B-Instruct.jinja | Llama 3.x |
| meta-llama-Meta-Llama-3.1-8B-Instruct.jinja | Llama 3.x |
| microsoft-Phi-3-medium-4k-instruct.jinja | Generic |
| microsoft-Phi-3-mini-4k-instruct.jinja | Generic |
| microsoft-Phi-3-small-8k-instruct.jinja | Generic |
| microsoft-Phi-3.5-mini-instruct.jinja | Generic |
| microsoft-Phi-3.5-vision-instruct.jinja | Generic |
| mistralai-Mistral-7B-Instruct-v0.2.jinja | Generic |
| mistralai-Mistral-Large-Instruct-2407.jinja | Mistral Nemo |
| mistralai-Mistral-Large-Instruct-2411.jinja | Generic |
| mistralai-Mistral-Nemo-Instruct-2407.jinja | Mistral Nemo |
| mistralai-Mixtral-8x7B-Instruct-v0.1.jinja | Generic |
| mlabonne-AlphaMonarch-7B.jinja | Generic |
| nvidia-Llama-3.1-Nemotron-70B-Instruct-HF.jinja | Llama 3.x |
| openchat-openchat-3.5-0106.jinja | Generic |
| teknium-OpenHermes-2.5-Mistral-7B.jinja | Generic |

This table can be generated with:

```bash
./build/bin/test-chat ../minja/build/tests/*.jinja 2>/dev/null
```

</details>

@@ -1202,11 +1216,20 @@ curl http://localhost:8080/v1/chat/completions \

```shell
# Native support:

llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M
llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L
llama-server --jinja -fa -hf bartowski/functionary-small-v3.2-GGUF:Q4_K_M
llama-server --jinja -fa -hf bartowski/Llama-3.3-70B-Instruct-GGUF:Q4_K_M

# Native support for DeepSeek R1 works best w/ our own template (official template buggy)

llama-server --jinja -fa -hf bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q6_K_L --think \
--chat-template-file models/templates/llama-cpp-deepseek-r1.jinja

llama-server --jinja -fa -hf bartowski/DeepSeek-R1-Distill-Qwen-32B-GGUF:Q4_K_M --think \
--chat-template-file models/templates/llama-cpp-deepseek-r1.jinja

# Native support requires the right template for these GGUFs:

llama-server --jinja -fa -hf bartowski/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M \
Expand All @@ -1218,7 +1241,7 @@ curl http://localhost:8080/v1/chat/completions \
llama-server --jinja -fa -hf bartowski/firefunction-v2-GGUF -hff firefunction-v2-IQ1_M.gguf \
--chat-template-file <( python scripts/get_chat_template.py fireworks-ai/llama-3-firefunction-v2 tool_use )

llama-server --jinja -fa -hf bartowski/c4ai-command-r7b-12-2024-GGUF:Q6_K_L \
llama-server --jinja -fa -hf bartowski/c4ai-command-r7b-12-2024-GGUF:Q6_K_L --think \
--chat-template-file <( python scripts/get_chat_template.py CohereForAI/c4ai-command-r7b-12-2024 tool_use )

# Generic format support
11 changes: 6 additions & 5 deletions examples/server/server.cpp
@@ -173,6 +173,7 @@ struct slot_params {
{"grammar_trigger_words", grammar_trigger_words},
{"grammar_trigger_tokens", sampling.grammar_trigger_tokens},
{"preserved_tokens", sampling.preserved_tokens},
{"chat_format", common_chat_format_name(oaicompat_chat_format)},
{"samplers", samplers},
{"speculative.n_max", speculative.n_max},
{"speculative.n_min", speculative.n_min},
@@ -740,12 +741,12 @@ struct server_task_result_cmpl_final : server_task_result {
}

json message {
{"content", msg.content},
{"content", msg.content == "" && !tool_calls.empty() ? json() : json(msg.content)},
{"tool_calls", tool_calls},
{"role", "assistant"},
};
if (!msg.tool_plan.empty()) {
message["tool_plan"] = msg.tool_plan;
if (!msg.reasoning_content.empty()) {
message["reasoning_content"] = msg.reasoning_content;
}

json choice {
@@ -4054,7 +4055,7 @@ int main(int argc, char ** argv) {
}

auto body = json::parse(req.body);
json data = oaicompat_completion_params_parse(body, params.use_jinja, ctx_server.chat_templates);
json data = oaicompat_completion_params_parse(body, params.use_jinja, params.think, ctx_server.chat_templates);

return handle_completions_impl(
SERVER_TASK_TYPE_COMPLETION,
Expand All @@ -4067,7 +4068,7 @@ int main(int argc, char ** argv) {
// same with handle_chat_completions, but without inference part
const auto handle_apply_template = [&ctx_server, &params, &res_ok](const httplib::Request & req, httplib::Response & res) {
auto body = json::parse(req.body);
json data = oaicompat_completion_params_parse(body, params.use_jinja, ctx_server.chat_templates);
json data = oaicompat_completion_params_parse(body, params.use_jinja, params.think, ctx_server.chat_templates);
res_ok(res, {{ "prompt", std::move(data.at("prompt")) }});
};
