chat: Allow reasoning_content to be passed back #16934
Conversation
This makes it possible for reasoning_content to be passed back to llama-server, which is useful for LLMs like GPT-OSS or Minimax-M2 that were trained for this.
ngxson left a comment:
Can you add a server test case for this?
This should already work with gpt-oss. From what I have seen, [...]. The template will complain if both [...]. See lines 314 to 317 in 76af40a:
```diff
     }
-    if (!has_content && !has_tool_calls) {
+    if (!has_content && !has_tool_calls && !has_reasoning_content) {
         throw std::runtime_error("Expected 'content' or 'tool_calls' (ref: https://github.com/ggml-org/llama.cpp/issues/8367 & https://github.com/ggml-org/llama.cpp/issues/12279)");
```
@aldehir about your comment: I was getting errors from llama-server in this validation when my codex fork sent "reasoning_content".
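For illustration, a minimal sketch of the kind of request that trips this check, assuming a local llama-server on its default port; the model name and message text are made up:

```python
# Hypothetical illustration only: an assistant turn carrying just
# reasoning_content, with no content and no tool_calls. This is the shape
# that the validation above rejects.
import requests

payload = {
    "model": "gpt-oss-120b",  # assumed model name, adjust to your setup
    "messages": [
        {"role": "user", "content": "List the files in the repo."},
        # reasoning comes back in its own message, without content/tool_calls
        {"role": "assistant", "reasoning_content": "I should call the shell tool first."},
        {"role": "user", "content": "Go ahead."},
    ],
}

# Assumes llama-server is listening locally on its default port 8080.
r = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(r.status_code, r.text)
```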
That's interesting. It isn't the behavior I see from my own clients sending back reasoning_content. I also use codex, but with middleware that translates reasoning to reasoning_content. Have you inspected the traffic from codex to ensure it is passing back tool_calls?
This doesn't hurt anything, but it does codify that a model may output only reasoning and nothing else.
> That's interesting. It isn't the behavior I see from my own clients sending back reasoning_content. I also use codex, but with middleware that translates reasoning to reasoning_content.
I actually have my own middleware which I use just to inspect requests. I could never see it sending reasoning back to llama.cpp without those changes I made. There was some code which dropped it when the last message was a user message, which is certainly always the case when sending the request.
> Have you inspected the traffic from codex to ensure it is passing back tool_calls?
Yes, it does receive tool calls.
> There was some code which dropped it when the last message was a user message, which is certainly always the case when sending the request.
gpt-oss only needs the reasoning when looping on tool calls, i.e. where the last message has the tool role. The template itself will not include reasoning for tool calls prior to the last "final" message (an assistant message with content). The message before a user message usually is a final assistant message, so all prior reasoning is removed. Minimax M2 does appear to require it for every assistant message, though. That said, it looks like MiniMax-M2 only keeps it for tool-calling loops as well.
This test case should pass even if you don't pass back reasoning_content, as content should be present.
> gpt-oss only needs the reasoning when looping on tool calls, i.e. where the last message has the tool role. The template itself will not include reasoning for tool calls prior to the last "final" message (an assistant message with content). The message before a user message usually is a final assistant message, so all prior reasoning is removed. Minimax M2 does appear to require it for every assistant message, though.
If I understood correctly, then there's no problem with always passing reasoning back, since the template will only use it when needed, right?
In that case, is it best to just allow passing reasoning_content and let the template handle how the LLM uses it?
I believe that is preferable: the model creators typically generate the template, so they should encode whatever logic they expect there. Worst case, we can manipulate the messages in the *_init_params() function for the specific model. That's my own opinion; I do not speak for the maintainers.
I tested your branch, and I found the cause of your problem:
Notice, on the right, your patch is sending the reasoning content in a separate message. This is why you are receiving the error, because there is no accompanying content or tool_calls. Even if allowed, the template would render a final message with no content (from the first message) and may degrade model performance.
Additionally, gpt-oss only needs the reasoning from tool call messages. If it comes from a regular assistant message, it is dropped. You see this in the chat template. (Note: it does add it if add_generation_prompt = false, which is only applicable during training)
Take a look at my patch: aldehir/codex@fe2ca23
I had to give it a more specific example, so I asked it to run ls and then read the first 3 lines of the README file in the directory. Notice the reasoning_content added to the assistant message with tool_calls. This works with the current master branch as is.
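For reference, a rough sketch of the message shape described above, with made-up tool names, ids, and arguments; the point is that the reasoning stays attached to the assistant message that carries the tool_calls:

```python
# Sketch only: reasoning_content kept on the assistant message that carries
# the tool_calls, so the existing validation accepts it and the gpt-oss
# template can render it when looping on tool calls.
messages = [
    {"role": "user", "content": "Run ls, then show the first 3 lines of README.md."},
    {
        "role": "assistant",
        "reasoning_content": "I need to list the directory before reading the file.",
        "tool_calls": [{
            "id": "call_1",  # hypothetical id and tool schema
            "type": "function",
            "function": {"name": "shell", "arguments": "{\"command\": [\"ls\"]}"},
        }],
    },
    {"role": "tool", "tool_call_id": "call_1", "content": "README.md\nsrc\n"},
    # ... the loop continues with the next tool call that reads README.md
]
```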
Ok so to summarize:
- For GPT-OSS, reasoning has to be passed back only with tool calls or normal content. If not, it is either ignored or it can break the conversation.
- We still use this PR to allow reasoning content to be passed back independently, because some LLMs like Minimax M2 might use it.
@aldehir your codex patch is much simpler. I assume it can break for whatever other inference engine uses "reasoning" instead of "reasoning_content", so it probably needs to be configurable.
Were you planning to submit a PR to codex to make it compatible with llama.cpp, or are you just going to continue using the reasoning -> reasoning_content proxy?
> For GPT-OSS, reasoning has to be passed back only with tool calls or normal content. If not, it is either ignored or it can break the conversation.
For gpt-oss, technically only tool calls. But, it doesn't hurt if you keep it intact with all assistant messages since the template will render it properly.
> We still use this PR to allow reasoning content to be passed back independently, because some LLMs like Minimax M2 might use it.
I don't believe this is needed; as I point out in #16946 (comment), it works as-is if I pass along reasoning_content.
> Were you planning to submit a PR to codex to make it compatible with llama.cpp, or are you just going to continue using the reasoning -> reasoning_content proxy?
I have no intention of submitting a PR. I think the ideal approach here is to adopt a Responses API that automatically supports this interaction.
LGTM, had the same remark as @ngxson (need tests).
I added a test that verifies that reasoning_content can be passed back.
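The test itself isn't shown above. As a rough sketch of what such a server test could look like, modeled on the pytest helpers used by the existing tests in tools/server/tests (ServerPreset, make_request); the fixture setup and exact names here are assumptions, and the real test in the PR may differ:

```python
# Rough sketch, not the actual test from this PR. It assumes the pytest
# helpers used by the existing server tests (ServerPreset / make_request).
import pytest
from utils import ServerPreset

server = ServerPreset.tinyllama2()

@pytest.fixture(autouse=True)
def create_server():
    global server
    server = ServerPreset.tinyllama2()

def test_reasoning_content_passed_back():
    global server
    server.start()
    res = server.make_request("POST", "/chat/completions", data={
        "messages": [
            {"role": "user", "content": "Hello"},
            # assistant turn that passes reasoning_content back alongside content
            {"role": "assistant", "content": "Hi there", "reasoning_content": "Greet the user."},
            {"role": "user", "content": "How are you?"},
        ],
    })
    # The request must not be rejected by the content/tool_calls validation.
    assert res.status_code == 200
```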
This makes it possible for reasoning_content to be passed back to llama-server, which is useful for LLMs like GPT-OSS or Minimax-M2 that were trained for this.
TBH I'm not sure this is the correct approach as I'm not familiar with the code. I've simply made the necessary changes so that llama.cpp no longer errors out when receiving reasoning_content back from the client. I've been using GPT-OSS 120B locally with a codex fork that sends reasoning_content back, and it seems to work quite well.
It also requires a slightly modified jinja chat template that replaces "thinking" with "reasoning_content".
If this is the way to go and is merged, I will follow up with a codex PR that makes this configurable so that codex can be used correctly with llama-server.
I've also looked at Minimax M2's chat template and it seems to use reasoning_content to render <think> blocks, which is compatible with how it is done here.
In case someone wants to try my codex fork with this, here's the config you can drop into ~/.codex/config.toml:
This is the llama-server command I use (adjust for what your hardware can handle):
cc @pwilkin