Feature Request: grammar / JSON schema with reasoning format. Allow the model to think freely but be strict about the answer. #12276
Comments
Given models (R1 & QwQ) now force the `<think>` …
I'll mention this is what I wanted to get to in a generalized fashion w/ the initial tools prototype (…).

@henryclw Re: JSON support for thinking models, would you be able to check what the DeepSeek API's behaviour is?
Re. force thinking, we can detect if the model has a dedicated `<think>` token.

Also keep in mind that not every model uses `<think>`.
@ngxson Note that DS R1 Distill doesn't have a `<think>` …

Templates that currently add a trailing `<think>`: …
Indeed, and Command R7B uses `<|START_THINKING|>` / `<|END_THINKING|>`.
Yes they do have it, it's inside: (screenshot of the chat template)
Argh my bad, thanks, I got confused with QwQ (which doesn’t - I <think> 😅)
@ochafik Surprise!

```json
{
  "error": {
    "message": "deepseek-reasoner does not support Json Output.",
    "type": "invalid_request_error",
    "param": null,
    "code": "invalid_request_error"
  }
}
```

Didn't expect this to be honest 😂
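For context, this is the DeepSeek API rejecting an OpenAI-style structured-output request for the reasoner model. A request body roughly like the following sketch triggers it (the message content is made up; `response_format` is the standard OpenAI-compatible field for requesting JSON output):

```json
{
  "model": "deepseek-reasoner",
  "messages": [{ "role": "user", "content": "Answer as JSON." }],
  "response_format": { "type": "json_object" }
}
```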
@ochafik Hi, if you don't mind, I would like to discuss the chat template a bit more. I think the llama server should integrate the assistant message provided by the user, rather than ignoring it. Take the current ChatML handling (Lines 183 to 190 in 2c9f833):
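Roughly, the stock path being referenced always closes every message and then opens a brand-new assistant turn (a sketch based on the fork shown further down, not a verbatim copy of the permalinked lines):

```cpp
if (tmpl == LLM_CHAT_TEMPLATE_CHATML) {
    // chatml template: every message, including a trailing assistant one, is closed
    for (auto message : chat) {
        ss << "<|im_start|>" << message->role << "\n" << message->content << "<|im_end|>\n";
    }
    if (add_ass) {
        // always open a fresh assistant turn
        ss << "<|im_start|>assistant\n";
    }
}
```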
If the user provides a user (human) message followed by an assistant message, I could expect the first of the two prompt forms sketched below, instead of the second one.
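A sketch of the two prompt forms, assuming a ChatML template, a user turn of "What is 6 * 7?", and an assistant prefill of "The answer is":

```
A. prefill continued (the last assistant message is left open):
<|im_start|>user
What is 6 * 7?<|im_end|>
<|im_start|>assistant
The answer is

B. prefill closed and a new assistant turn opened:
<|im_start|>user
What is 6 * 7?<|im_end|>
<|im_start|>assistant
The answer is<|im_end|>
<|im_start|>assistant
```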
The first one would let the model continue with the prefilled assistant message. The second one might start a whole new assistant message. I tried to create my own fork that allows this:

```cpp
if (tmpl == LLM_CHAT_TEMPLATE_CHATML) {
    // chatml template
    for (auto message : chat) {
        ss << "<|im_start|>" << message->role << "\n" << message->content;
        // leave the final assistant message open so the model continues it
        if (!last_is_assistant || message != chat.back()) {
            ss << "<|im_end|>\n";
        }
    }
    if (add_ass) {
        ss << "<|im_start|>assistant\n";
    }
}
```

This works great, but it doesn't work when the jinja template option is on.
When I use method B as mentioned, I'm using my own fork of llama.cpp, which enables me to call the llama server twice and continue the response with a prefill (a sketch of the two calls is below). Allowing the user to prefill the assistant message is an important ability, and it should work both with and without the jinja template. I'm not the only one who needs this, as there are several thumbs up on #11755
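Roughly, the two calls against the server's OpenAI-compatible endpoint would look like this (a sketch: the request bodies, the `json_schema` field, and the prefill-continuation behaviour all assume the fork described above rather than stock behaviour). First call, letting the model think freely and stopping at the end of the reasoning block:

```json
{
  "messages": [{ "role": "user", "content": "Answer in JSON: what is 6 * 7?" }],
  "stop": ["</think>"]
}
```

Second call, re-sending the conversation with the partial assistant message appended and the answer grammar enforced:

```json
{
  "messages": [
    { "role": "user", "content": "Answer in JSON: what is 6 * 7?" },
    { "role": "assistant", "content": "<think>...the reasoning from call 1...</think>" }
  ],
  "json_schema": { "type": "object", "properties": { "answer": { "type": "integer" } } }
}
```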
Prerequisites
Feature Description
When the reasoning format is `deepseek`, the reasoning part (things between `<think>` and `</think>`) would be placed in `message.reasoning_content`. Is it possible to put the grammar / JSON schema enforcement after the `</think>`?

Motivation
The model should be free to reason, but strict with the answer format. When users use the `deepseek` reasoning format, it means they don't care so much about the reasoning, they just want to have the answer separately.

Say I need the model to return the answer in a JSON format. If the model is free to reason for a while instead of having to put the answer straight into the JSON, the performance might be better.
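For illustration, the desired outcome is a response message where the free-form reasoning and the schema-constrained answer are already separated (field names follow the existing `deepseek` reasoning format; the values are made up):

```json
{
  "role": "assistant",
  "reasoning_content": "The user wants the result as JSON, so let me compute 6 * 7 first...",
  "content": "{\"answer\": 42}"
}
```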
Possible Implementation
A. Update the grammar root to enable a thinking section wrapped in `<think>` and `</think>` if the reasoning format is `deepseek` (see the GBNF sketch after this list)
or
B. An ugly way: let the model generate until it hits `</think>`, then apply the grammar. (This is the current workaround method I'm using.)
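A rough GBNF sketch of option A (a sketch only: the rule names are made up, and in practice `answer` would be the grammar that llama.cpp already generates from the user's JSON schema):

```
# Allow a free-form reasoning block first, then hand over to the schema-derived grammar.
root   ::= "<think>" think "</think>" ws answer
# anything that does not start the closing tag (simplified)
think  ::= ( [^<] | "<" [^/] )*
ws     ::= [ \t\n]*
# placeholder standing in for the JSON-schema-derived grammar
answer ::= "{" ws "\"answer\"" ws ":" ws [0-9]+ ws "}"
```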