server: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command R7B & DeepSeek R1) unless --reasoning-format none #11607

Merged (94 commits) on Feb 13, 2025

Changes from all commits

Commits
d3b60b8
minja: enhance backfill of templates w/o tools description (use examp…
ochafik Feb 3, 2025
87de852
pass vocab to common_chat_params_init
ochafik Feb 3, 2025
130ca22
DeepSeek R1: parse thoughts / return in separate field in API (non st…
ochafik Feb 3, 2025
04d511b
Avoid double bos w/ jinja
ochafik Feb 3, 2025
2834587
server/oai: ensure content is null when there are tool calls
ochafik Feb 3, 2025
c80cb30
update logs
ochafik Feb 3, 2025
0871628
rename tests
ochafik Feb 3, 2025
73d08d4
tool-call: allow `--jinja --chat-template chatml`
ochafik Feb 3, 2025
04be723
tool-call: fix command-r7b parsing when response is multiline
ochafik Feb 3, 2025
ae9d581
tool-calls: add DeepSeek R1 Qwen 7B to server test_hello_world
ochafik Feb 3, 2025
19bea4e
tell DS R1 not to overthink (weather test)
ochafik Feb 3, 2025
5e6f2a2
add deepseek models to server tool call section in readme
ochafik Feb 3, 2025
1e9acd2
tool-call: allow `--jinja --chat-template chatml`
ochafik Feb 3, 2025
77ae97e
Update test_tool_call.py
ochafik Feb 3, 2025
a76073c
minimize diffs
ochafik Feb 3, 2025
cf83623
fix typo
ochafik Feb 3, 2025
5d18d76
fix double bos issue (drop bos/eos tokens from jinja template)
ochafik Feb 3, 2025
aa98e59
fix bad merge
ochafik Feb 3, 2025
2b3c482
fix build / rm diff
ochafik Feb 3, 2025
4cb0e1d
Merge branch 'jinja-chatml' into r1-toolcall
ochafik Feb 3, 2025
b2dd490
add missing try catch around jinja parsing to default to chatml
ochafik Feb 3, 2025
08271b5
Merge branch 'jinja-chatml' into r1-toolcall
ochafik Feb 3, 2025
df3474e
tool-calls: r1: add missing <|tool▁calls▁end|> to grammar!
ochafik Feb 3, 2025
c397bd1
tweak delta logic
ochafik Feb 3, 2025
569610e
tool-calls: accommodate variety of wrong tool call opening tags both …
ochafik Feb 3, 2025
d73448d
Simplify default chatml logic
ochafik Feb 3, 2025
0be7f65
Merge branch 'jinja-chatml' into r1-toolcall
ochafik Feb 3, 2025
7dc271f
tool-calls: add deepseek r1 template + accommodate broken official te…
ochafik Feb 3, 2025
c6214ee
rm unneeded vocab
ochafik Feb 3, 2025
1c302e1
simpler hacky fixes for original broken template (+ fix minja example…
ochafik Feb 3, 2025
108da90
sync: minja https://github.com/google/minja/pull/46
ochafik Feb 3, 2025
bc6d910
Merge branch 'master' into r1-toolcall
ochafik Feb 3, 2025
11c1f0c
actually we want eos_token in the template to infer tool call example…
ochafik Feb 3, 2025
30ea359
update to minja's new api
ochafik Feb 3, 2025
bbd45bf
sync: minja
ochafik Feb 4, 2025
bff549d
simplify hack to fix original template's backfill from minja
ochafik Feb 4, 2025
ce28224
tool-call: r1: add one more trigger approx "<|tool calls begin|>"
ochafik Feb 4, 2025
e84ee88
r1: fix inadvertent newline in grammar before <|tool▁call▁end|>
ochafik Feb 4, 2025
18a11f4
tool-call: r1: fix grammar
ochafik Feb 4, 2025
9a6847c
move trigger_words init inside non-llguidance branch
ochafik Feb 4, 2025
a682d12
fix / test parsing of r1 parser
ochafik Feb 4, 2025
f0154a6
Fix / test models/templates/llama-cpp-deepseek-r1.jinja
ochafik Feb 4, 2025
326e700
update test_calc_result
ochafik Feb 4, 2025
78b47bb
fix test_calc_result
ochafik Feb 4, 2025
86994db
fix spaces
ochafik Feb 4, 2025
09caa63
`sync`: minja
ochafik Feb 4, 2025
b152729
Update test-chat.cpp
ochafik Feb 4, 2025
56a14dd
fix mistral chat test: need empty tokens
ochafik Feb 4, 2025
f12e350
Update chat.cpp
ochafik Feb 4, 2025
d43e4f6
Merge branch 'sync-minja-4' into r1-toolcall
ochafik Feb 4, 2025
812544a
server: check that content is null when we get tool_calls
ochafik Feb 4, 2025
d44eb95
tool-call: ensure we don't return content when there are tool calls /…
ochafik Feb 4, 2025
b6e14a4
fix mistral expectation
ochafik Feb 4, 2025
1f5ec59
ensure deepseek r1 thoughts parsed even w/o tool calls
ochafik Feb 4, 2025
438ce0b
fix test-chat
ochafik Feb 4, 2025
21f2071
Update chat.cpp
ochafik Feb 4, 2025
b5b117f
Merge branch 'sync-minja-4' into r1-toolcall
ochafik Feb 4, 2025
0db9881
Fix r1 grammar since we made <|tool▁calls▁begin|> optional (triggerin…
ochafik Feb 4, 2025
d1b6691
r1: revert making <|tool▁calls▁begin|> optional as somehow sampling t…
ochafik Feb 4, 2025
39c1d81
return thoughts in reasoning_content field
ochafik Feb 4, 2025
b2d1728
update readme section about common model tool call formats
ochafik Feb 4, 2025
933f7a1
Merge branch 'master' into r1-toolcall
ochafik Feb 4, 2025
5d60ceb
Update test_tool_call.py
ochafik Feb 4, 2025
1f1f06a
Merge branch 'master' into r1-toolcall
ochafik Feb 5, 2025
9d7c3cc
--think to force any model to return reasoning_content (or just parse…
ochafik Feb 5, 2025
d20c2ce
Merge branch 'r1-toolcall' of github.com:ochafik/llama.cpp into r1-to…
ochafik Feb 5, 2025
f3e9f8b
fix test_thoughts
ochafik Feb 5, 2025
3841a16
fix compiler warning about parens
ochafik Feb 5, 2025
e6d9b52
align Command R7B w/ --think / reasoning_content behaviour
ochafik Feb 5, 2025
39b50c3
Update README.md
ochafik Feb 5, 2025
0917e0a
fix --think arg env
ochafik Feb 5, 2025
098629d
disable some failing chatml tests
ochafik Feb 5, 2025
33efcb3
Update README.md
ochafik Feb 5, 2025
994301d
use existing string_strip
ochafik Feb 5, 2025
d1a0640
revert tool example backfill change - command 7rb just needs the righ…
ochafik Feb 5, 2025
cc2c712
Merge remote-tracking branch 'origin/master' into r1-toolcall
ochafik Feb 8, 2025
c0f972b
Use --reasoning-format, remove forced thinking for now
ochafik Feb 8, 2025
af63886
return reasoning_content before content
ochafik Feb 8, 2025
a59fde2
update model template / format mapping
ochafik Feb 8, 2025
b829cab
fix test-chat
ochafik Feb 8, 2025
95cddfd
rm thoughts from generic parser
ochafik Feb 9, 2025
e598e7a
sync: minja (https://github.com/google/minja/pull/52)
ochafik Feb 9, 2025
91542ca
tool-calls: allow r1 output to miss <think> opening tag (since latest…
ochafik Feb 9, 2025
8d82be9
sync: minja (https://github.com/ggerganov/llama.cpp/pull/11774)
ochafik Feb 9, 2025
30dcfaa
rm wrong warning in command-r parser (when normal text)
ochafik Feb 9, 2025
e1bff8f
update deepseek r1 templates (+ put update commands in ./scripts/get_…
ochafik Feb 9, 2025
a29dc92
fix server test_tool_calls.py
ochafik Feb 9, 2025
ea2f41e
add models/templates/README.md
ochafik Feb 9, 2025
8409bf1
fix test_calc_result & test_thoughts
ochafik Feb 9, 2025
01db429
fix test-chat (update delta to latest r1 template change)
ochafik Feb 9, 2025
37a4bb2
Merge remote-tracking branch 'origin/master' into r1-toolcall
ochafik Feb 12, 2025
d52579a
prefer json::at to operator[] in chat.cpp
ochafik Feb 13, 2025
4700245
Merge remote-tracking branch 'origin/master' into r1-toolcall
ochafik Feb 13, 2025
043cb99
Apply suggestions from code review
ochafik Feb 13, 2025
11 changes: 11 additions & 0 deletions common/arg.cpp
@@ -1975,6 +1975,17 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
params.use_jinja = true;
}
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_MAIN}).set_env("LLAMA_ARG_JINJA"));
add_opt(common_arg(
{"--reasoning-format"}, "FORMAT",
"reasoning format (default: deepseek; allowed values: deepseek, none)\n"
"controls whether thought tags are extracted from the response, and in which format they're returned. 'none' leaves thoughts unparsed in `message.content`, 'deepseek' puts them in `message.reasoning_content` (for DeepSeek R1 & Command R7B only).\n"
"only supported for non-streamed responses",
[](common_params & params, const std::string & value) {
/**/ if (value == "deepseek") { params.reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK; }
else if (value == "none") { params.reasoning_format = COMMON_REASONING_FORMAT_NONE; }
        else { throw std::invalid_argument("invalid value"); }
}
).set_examples({LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_MAIN}).set_env("LLAMA_ARG_THINK"));
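A minimal client-side sketch of what this flag changes, assuming a local llama-server started with --jinja and a DeepSeek R1 distill; the port, prompt, and printed fields below are illustrative, not taken from the PR:

# Sketch only: query a local llama-server and inspect how --reasoning-format
# affects the returned message (non-streamed responses only).
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "What is 7 * 6? Think it through."}]},
)
msg = resp.json()["choices"][0]["message"]

# Default (--reasoning-format deepseek): thoughts are extracted out of the content.
print(msg.get("reasoning_content"))  # the model's <think>...</think> body, if any
print(msg["content"])                # the final answer only

# With --reasoning-format none, reasoning_content is absent and the raw
# <think>...</think> block stays inline in msg["content"].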
@ngxson (Collaborator) commented on Feb 6, 2025:
IMO the --think flag is not very intuitive; it should be something like --reasoning-format, but personally I'd still prefer to do it on a per-request basis.

Also, to be future-proof, we should make this flag accept a param. For example, --reasoning-format deepseek would return it as reasoning_content. Again, this is because we are pretty sure that OpenAI will break the whole thing in the near future.

Collaborator:
From a usability standpoint, --think feels a bit more intuitive and memorable to me in terms of what it does.

Other alternatives might be --separate-thinking or --extract-reasoning or --format-reasoning or...?

Collaborator:
I mean, what's non-intuitive about it is that the model will still think even without --think being added. This flag is supposed to force the model to start the reasoning process, not to enable/disable it completely.

@ochafik (Collaborator, Author) commented on Feb 7, 2025:

Tbh I'm not really a fan of a query parameter yet, mostly because:

  • I see thinking mode as a static property of the model and its orchestration (much like the template)
  • It's a new non-standard API to define (more naming headaches)
  • It makes llama-server less compatible w/ DeepSeek's API, not more.
  • We can always decide to add it later if we start w/ a flag

Re: flags, I think there may be room for two: one that controls the reasoning behaviour (extract, leave inline, force), and one for the return format. For the first one, how about:

  • --reasoning=extract → Parses DeepSeek R1 & Command R7B thoughts (default)
  • --reasoning=inline → Leaves thoughts inline in the content (format is model-specific)
  • --reasoning=force → Coerces non-thinking models to think (edit: maybe force-experimental for now)

As for the format, there are already two APIs out there:

  • DeepSeek's reasoning_content
  • Cohere's message.tool_plan, alas only as a prelude to tool calls (but it seems easy to trick R7B into thinking more generally, just needs a tweaked template)

I'd favour just sticking to reasoning_content for now until OpenAI announces their own format (and when they do, default to OpenAI's format and offer --reasoning-format=deepseek for backwards compatibility). OR decide to create our own format now / default --reasoning-format=llama.cpp that returns thoughts in the message.thoughts field, for instance.

WDYT?

Collaborator:
> Tbh I'm not really a fan of a query parameter yet

Seems OK to me, but eventually I think someone is going to add this as a per-request param.

Re: having 2 flags for behavior / format, this seems more reasonable to me. From a functional programming perspective, it can be expressed as response = format(behavior(original_generated_content)).

But I think your idea is still mixed between these 2 layers.

For --reasoning extract|inline|force:

  • I'm not sure what force is supposed to do, but it seems like it needs some prompt engineering, so I think you should consider that; maintaining prompts can be a burden
  • For inline, does that mean reasoning can appear in the middle of generation? Example: content..think..content..think. Please lmk if I understand correctly.

For --reasoning-format, I don't get why we want to invent a new --reasoning-format llama.cpp that puts things inside message.thoughts. IMO we should keep things simple until OpenAI drops their format. Probably we can have --reasoning-format deepseek|none and set deepseek as the default for now, then change the default to oai once we have that OpenAI format.

@ngxson (Collaborator) commented on Feb 7, 2025:
> But I think your idea is still mixed between these 2 layers.

IMO if we want --reasoning to control the behavior, then it should affect the generation phase (for example, control grammar/logits bias). So, it should have 3 values:

  • enabled: the model behaves as usual
  • disabled: we never allow model to use <think> token ==> control via logits bias
  • force: force the model to think, maybe use grammar or prompt engineering?

Then for --reasoning-format, it should only "transform" the result into the desired format (a.k.a. a pure function; a minimal sketch follows after this list). We can have 3 values:

  • deepseek: put content inside reasoning_content
  • none: do not format; simply forward all the generated tokens to the user
  • and then oai can be added in the future
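A minimal sketch of the "pure transform" idea from the list above; the helper name and regex are hypothetical and not the PR's actual implementation (the real parsing lives in common/chat.cpp and is template/grammar aware):

# Hypothetical illustration of --reasoning-format as a pure function over the
# generated text: deepseek splits out the thoughts, none forwards them untouched.
import re

def apply_reasoning_format(text: str, fmt: str) -> dict:
    if fmt == "deepseek":
        m = re.match(r"\s*(?:<think>)?(.*?)</think>\s*(.*)", text, re.DOTALL)
        if m:
            return {"reasoning_content": m.group(1).strip(), "content": m.group(2).strip()}
    # fmt == "none" (or no thoughts found): forward the generated tokens as-is
    return {"content": text}

print(apply_reasoning_format("<think>6*7=42</think>The answer is 42.", "deepseek"))
# {'reasoning_content': '6*7=42', 'content': 'The answer is 42.'}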

@ochafik (Collaborator, Author) replied:
> IMO if we want --reasoning to control the behavior, then it should affect the generation phase (for example, control grammar/logits bias). So, it should have 3 values:
>
>   • enabled: the model behaves as usual
>   • disabled: we never allow model to use <think> token ==> control via logits bias
>   • force: force the model to think, maybe use grammar or prompt engineering?

@ngxson Makes sense, removed the forced thinking from this PR / will explore again as a follow-up (also, I now see more of a case for that option as a query param, while reasoning-format has stronger flag vibes).

> I'm not sure what force is supposed to do, but it seems like it needs some prompt engineering, so I think you should consider that; maintaining prompts can be a burden

Good point. In earlier experiments I tried controlling the entire tool-call process (even grammar generation) from Jinja templates; might play with this again.

> Then for --reasoning-format, it should only "transform" the result into the desired format (a.k.a. a pure function). We can have 3 values:
> deepseek: put content inside reasoning_content
> none: do not format; simply forward all the generated tokens to the user
> and then oai can be added in the future

Updated the code (defaulting to deepseek), thanks!

@ochafik (Collaborator, Author) commented on Feb 11, 2025:
Note that I’ve updated the code to the latest DeepSeek template changes (they added a trailing <think> 😅; updated minja accordingly: #11774 (comment) )

@ochafik (Collaborator, Author):
Hey @ngxson, any more concerns / suggestions about this PR?

Collaborator:
Sorry, I've been a bit busy recently; will do a review later today or tomorrow.

add_opt(common_arg(
{"--chat-template"}, "JINJA_TEMPLATE",
string_format(
319 changes: 219 additions & 100 deletions common/chat.cpp

Large diffs are not rendered by default.

3 changes: 3 additions & 0 deletions common/chat.hpp
@@ -19,6 +19,7 @@ struct common_chat_inputs {
bool stream;
std::string grammar;
bool add_generation_prompt = true;
bool extract_reasoning = true;
};

enum common_chat_format {
@@ -28,11 +29,13 @@
COMMON_CHAT_FORMAT_LLAMA_3_X,
COMMON_CHAT_FORMAT_LLAMA_3_X_WITH_BUILTIN_TOOLS,
COMMON_CHAT_FORMAT_DEEPSEEK_R1,
COMMON_CHAT_FORMAT_DEEPSEEK_R1_EXTRACT_REASONING,
COMMON_CHAT_FORMAT_FIREFUNCTION_V2,
COMMON_CHAT_FORMAT_FUNCTIONARY_V3_2,
COMMON_CHAT_FORMAT_FUNCTIONARY_V3_1_LLAMA_3_1,
COMMON_CHAT_FORMAT_HERMES_2_PRO,
COMMON_CHAT_FORMAT_COMMAND_R7B,
COMMON_CHAT_FORMAT_COMMAND_R7B_EXTRACT_REASONING,

COMMON_CHAT_FORMAT_COUNT, // Not a format, just the # formats
};
8 changes: 7 additions & 1 deletion common/common.h
@@ -202,6 +202,11 @@ struct common_params_vocoder {
bool use_guide_tokens = false; // enable guide tokens to improve TTS accuracy // NOLINT
};

enum common_reasoning_format {
COMMON_REASONING_FORMAT_NONE,
COMMON_REASONING_FORMAT_DEEPSEEK, // Extract thinking tag contents and return as `message.reasoning_content`
};

struct common_params {
int32_t n_predict = -1; // new tokens to predict
int32_t n_ctx = 4096; // context size
@@ -346,6 +351,7 @@ struct common_params {
std::string chat_template = ""; // NOLINT
bool use_jinja = false; // NOLINT
bool enable_chat_template = true;
common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;

std::vector<std::string> api_keys;

@@ -623,7 +629,7 @@ struct common_chat_msg {
std::string role;
std::string content;
std::vector<common_tool_call> tool_calls;
std::string tool_plan = "";
std::string reasoning_content = "";
};

// Check if the template supplied via "--chat-template" is supported or not. Returns true if it's valid
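To make the common_chat_msg change above concrete, an illustrative (made-up) assistant message as the server now serializes it: what Command R7B previously surfaced as tool_plan comes back as reasoning_content, and content is null when only tool calls are produced (see the server.cpp hunk further down for the actual serialization):

# Values are invented; only the field names and null-content behaviour reflect the PR.
assistant_message = {
    "role": "assistant",
    "reasoning_content": "I should call the weather tool for Paris first.",
    "content": None,
    "tool_calls": [
        {
            "type": "function",
            "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
            "id": "call_0",
        }
    ],
}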
12 changes: 6 additions & 6 deletions common/sampling.cpp
@@ -151,12 +151,6 @@ struct common_sampler * common_sampler_init(const struct llama_model * model, co

lparams.no_perf = params.no_perf;

std::vector<const char *> trigger_words;
trigger_words.reserve(params.grammar_trigger_words.size());
for (const auto & str : params.grammar_trigger_words) {
trigger_words.push_back(str.word.c_str());
}

struct llama_sampler * grmr;
if (params.grammar.compare(0, 11, "%llguidance") == 0) {
#ifdef LLAMA_USE_LLGUIDANCE
@@ -165,6 +159,12 @@
GGML_ABORT("llguidance (cmake -DLLAMA_LLGUIDANCE=ON) is not enabled");
#endif // LLAMA_USE_LLGUIDANCE
} else {
std::vector<const char *> trigger_words;
trigger_words.reserve(params.grammar_trigger_words.size());
for (const auto & str : params.grammar_trigger_words) {
trigger_words.push_back(str.word.c_str());
}

grmr = params.grammar_lazy
? llama_sampler_init_grammar_lazy(vocab, params.grammar.c_str(), "root",
trigger_words.data(), trigger_words.size(),
301 changes: 251 additions & 50 deletions examples/server/README.md

Large diffs are not rendered by default.

29 changes: 16 additions & 13 deletions examples/server/server.cpp
@@ -173,6 +173,7 @@ struct slot_params {
{"grammar_trigger_words", grammar_trigger_words},
{"grammar_trigger_tokens", sampling.grammar_trigger_tokens},
{"preserved_tokens", sampling.preserved_tokens},
{"chat_format", common_chat_format_name(oaicompat_chat_format)},
{"samplers", samplers},
{"speculative.n_max", speculative.n_max},
{"speculative.n_min", speculative.n_min},
@@ -724,9 +725,19 @@ struct server_task_result_cmpl_final : server_task_result {
msg.content = content;
}

json tool_calls;
json message {
{"role", "assistant"},
};
if (!msg.reasoning_content.empty()) {
message["reasoning_content"] = msg.reasoning_content;
}
if (msg.content.empty() && !msg.tool_calls.empty()) {
message["content"] = json();
} else {
message["content"] = msg.content;
}
if (!msg.tool_calls.empty()) {
tool_calls = json::array();
auto tool_calls = json::array();
for (const auto & tc : msg.tool_calls) {
tool_calls.push_back({
{"type", "function"},
Expand All @@ -737,15 +748,7 @@ struct server_task_result_cmpl_final : server_task_result {
{"id", tc.id},
});
}
}

json message {
{"content", msg.content},
{"tool_calls", tool_calls},
{"role", "assistant"},
};
if (!msg.tool_plan.empty()) {
message["tool_plan"] = msg.tool_plan;
message["tool_calls"] = tool_calls;
}

json choice {
@@ -4060,7 +4063,7 @@ int main(int argc, char ** argv) {
}

auto body = json::parse(req.body);
json data = oaicompat_completion_params_parse(body, params.use_jinja, ctx_server.chat_templates);
json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates);

return handle_completions_impl(
SERVER_TASK_TYPE_COMPLETION,
@@ -4073,7 +4076,7 @@
// same with handle_chat_completions, but without inference part
const auto handle_apply_template = [&ctx_server, &params, &res_ok](const httplib::Request & req, httplib::Response & res) {
auto body = json::parse(req.body);
json data = oaicompat_completion_params_parse(body, params.use_jinja, ctx_server.chat_templates);
json data = oaicompat_completion_params_parse(body, params.use_jinja, params.reasoning_format, ctx_server.chat_templates);
res_ok(res, {{ "prompt", std::move(data.at("prompt")) }});
};
