
Feature Request: grammar / JSON schema with reasoning format. Allow the model to think freely but be strict about the answer. #12276

henryclw opened this issue Mar 8, 2025 · 8 comments


henryclw commented Mar 8, 2025

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

When the reasoning format is deepseek, the reasoning part (the text between <think> and </think>) is placed in message.reasoning_content. Would it be possible to apply the grammar / JSON schema enforcement only after the </think>?

Motivation

The model should be free to reason, but strict about the answer format. When users choose the deepseek reasoning format, it usually means they don't care that much about the reasoning itself and just want the answer separately.
Say I need the model to return the answer in JSON format. If the model is free to reason for a while instead of having to put the answer straight into the JSON, the results might be better.

Possible Implementation

A. Update the grammar root to allow a thinking section wrapped in <think> and </think> when the reasoning format is deepseek (see the sketch below),
or
B. An ugly way: let the model generate until it hits </think>, then apply the grammar. (This is the workaround I'm currently using.)
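
Roughly, the effective grammar for option A could look something like this. This is only an illustrative GBNF sketch, not the actual implementation; `answer` stands for whatever rules get generated from the user's JSON schema.

    # hypothetical sketch: let the model think freely first
    # (kept crude here: any text that contains no '<'),
    # then switch to the constrained answer
    root  ::= think answer
    think ::= "<think>" [^<]* "</think>" "\n"
    # `answer` would be the root of the grammar generated from the user's
    # JSON schema, e.g. the rules from grammars/json.gbnf with `root` renamed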

henryclw added the enhancement label on Mar 8, 2025

ochafik commented Mar 10, 2025

Given models (R1 & QwQ) now force the <think>, our hand is a bit forced here. Luckily it should be easy w/ a lazy grammar triggered on </think>.
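
Illustration only, not necessarily the final shape: with a lazy grammar, the grammar text itself can stay the plain schema grammar, and enforcement only starts once the trigger string has been emitted.

    # sampling is unconstrained until the trigger "</think>" shows up in the output;
    # only from that point on is this grammar enforced
    # (the trigger itself is configured outside the grammar text)
    root ::= answer
    # `answer` = rules generated from the user's JSON schema (cf. grammars/json.gbnf)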

ochafik self-assigned this on Mar 10, 2025

ochafik commented Mar 10, 2025

I'll mention this is what I wanted to get to in a generalized fashion w/ the initial tools prototype (--style=thoughtful_steps). I briefly toyed with a --reasoning-format=forced in #11607 to force any model to think (only requiring small tweaks to the Generic chat handler) although as @ngxson pointed out that may come w/ iffy prompt engineering. Not a concern if we stick to natively thinking models for now.

@henryclw Re/ json support for thinking models, would you be able to check what DeepSeek API's behaviour is?


ngxson commented Mar 10, 2025

Re. force thinking, we can detect if the model has a dedicated <think> token and only allow "force" mode if it has one. For models that don't have it, we raise an error.

Also keep in mind that not every model uses <think>; some use <thinking> with <answer>.


ochafik commented Mar 10, 2025

> Re. force thinking, we can detect if the model has a dedicated <think> token

@ngxson Note that DS R1 Distill doesn't have a <think> token in their vocab (it's a "soft token" haha).

> and only allow "force" mode if it has one. For models that don't have it, we raise an error.

Templates that currently add a trailing <think> to the prompt (DS R1, QwQ) already have this "force thinking" semantics, which is straightforward to detect. Other use cases / user-imposed forced thoughts semantics can come later.

> Also keep in mind that not every model uses <think>; some use <thinking> with <answer>.

Indeed, and Command R7B uses <|START_THINKING|>...<|END_THINKING|> (side-note: we'll probably want to rewrite this to <think> in streaming mode to make it all model agnostic)


ngxson commented Mar 10, 2025

> @ngxson Note that DS R1 Distill doesn't have a <think> token in their vocab (it's a "soft token" haha).

Yes, they do have one, it's inside tokenizer.json:

[screenshot of tokenizer.json showing the <think> special token]


ochafik commented Mar 10, 2025

> @ngxson Note that DS R1 Distill doesn't have a <think> token in their vocab (it's a "soft token" haha).

> Yes, they do have one, it's inside tokenizer.json

Argh my bad, thanks, I got confused with QwQ (which doesn’t - I <think> 😅)

henryclw (author) commented:

> @henryclw Re/ json support for thinking models, would you be able to check what DeepSeek API's behaviour is?

@ochafik Surprise!

{
    "error": {
        "message": "deepseek-reasoner does not support Json Output.",
        "type": "invalid_request_error",
        "param": null,
        "code": "invalid_request_error"
    }
}

Didn't expect this to be honest 😂

henryclw (author) commented:

@ochafik Hi, if you don't mind, I would like to discuss the chat template a bit more.

I think llama-server should honor a trailing assistant message provided by the user and continue it (prefill), rather than effectively ignoring it.

Take the current llama-chat.cpp:

    if (tmpl == LLM_CHAT_TEMPLATE_CHATML) {
        // chatml template
        for (auto message : chat) {
            ss << "<|im_start|>" << message->role << "\n" << message->content << "<|im_end|>\n";
        }
        if (add_ass) {
            ss << "<|im_start|>assistant\n";
        }
    }

If the user provides a user (human) message followed by an assistant message, I would expect

<|im_start|>user\nWhy is the sky blue?<|im_end|>\n<|im_start|>assistant\nSure, here is the reason:

instead of

<|im_start|>user\nWhy is the sky blue?<|im_end|>\n<|im_start|>assistant\nSure, here is the reason:<|im_end|>\n

The first one would let the model continue with the prefilled assistant message. The second one might start a whole new assistant message.

I created my own fork that allows this:

    if (tmpl == LLM_CHAT_TEMPLATE_CHATML) {
        // chatml template
        // whether the final message of the conversation is an assistant message to be continued
        const bool last_is_assistant = !chat.empty() && std::string(chat.back()->role) == "assistant";
        for (auto message : chat) {
            ss << "<|im_start|>" << message->role << "\n" << message->content;
            // leave the trailing assistant message open so the model continues it
            // instead of starting a new turn
            if (!last_is_assistant || message != chat.back()) {
                ss << "<|im_end|>\n";
            }
        }
        if (add_ass) {
            ss << "<|im_start|>assistant\n";
        }
    }

This works great, but it doesn't work when the jinja template option is on.

> B. An ugly way: let the model generate until it hits </think>, then apply the grammar. (This is the workaround I'm currently using.)

For method B as mentioned above, I'm using my own fork of llama.cpp, which enables me to call llama-server twice and continue the response with a prefill.

Allowing the user to prefill the assistant message is an important ability, and it should work both with and without the jinja template option. I'm not the only one who needs this; there are several thumbs-up on #11755.
