Conversation

@tarruda tarruda commented Nov 2, 2025

This makes it possible for reasoning_content to be passed back to llama-server, which is useful for LLMs like GPT-OSS or Minimax-M2 that were trained for this.

TBH I'm not sure this is the correct approach, as I'm not familiar with the code. I've simply made the changes necessary so that llama.cpp no longer errors out when receiving reasoning_content back from the client.

I've been using GPT-OSS 120B locally with a codex fork that sends reasoning_content back, and it seems to work quite well.
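
For reference, this is roughly the request shape involved (a minimal sketch, not taken from codex; the model name, port, and message contents are placeholders): an assistant message in the history carries a reasoning_content field that gets passed back to llama-server.

import requests  # assumes llama-server listening on localhost:8080

payload = {
    "model": "gpt-oss-120b",
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {
            "role": "assistant",
            # reasoning emitted by the model on the previous turn, passed back verbatim
            "reasoning_content": "Trivial arithmetic, answer directly.",
            "content": "4",
        },
        {"role": "user", "content": "Now multiply that by 3."},
    ],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.status_code, resp.json()["choices"][0]["message"])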

It also requires a slightly modified jinja chat template that replaces "thinking" with "reasoning_content".

If this is the way to go and is merged, I will follow up with a codex PR that makes this configurable so that codex can be used correctly with llama-server.

I've also looked at Minimax M2's chat template and it seems to use reasoning_content to render <think> blocks, which is compatible with how it is done here.

In case someone wants to try my codex fork with this, here's the config you can drop into ~/.codex/config.toml:

profile = "oss"

[model_providers.llama_server]
name = "llama-server"
base_url = "http://localhost:8080/v1"
query_params = {"reasoning_effort" = "high"} # doesn't seem like this is currently working, still need to debug

[profiles.oss]
model_provider = "llama_server"
model = "gpt-oss-120b"

This is the llama-server command I use (adjust for what your hardware can handle):

llama-server --no-mmap --no-warmup --model gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  -a gpt-oss-120b --ctx-size 524280 -np 4 --jinja -fa on --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0 \
  --swa-full --host 0.0.0.0 --chat-template-kwargs '{"reasoning_effort":"low"}' --chat-template-file gptoss.j2

cc @pwilkin

Collaborator

@ngxson ngxson left a comment

Can you add a server test case for this?

Collaborator

aldehir commented Nov 2, 2025

This should already work with gpt-oss. From what I have seen, reasoning_content always comes accompanied by content or tool_calls. That said, I don't know how Minimax M2 behaves, so maybe it's still needed.

And the reasoning_content gets mapped to thinking here:

llama.cpp/common/chat.cpp

Lines 314 to 317 in 76af40a

if (!msg.reasoning_content.empty()) {
    jmsg["reasoning_content"] = msg.reasoning_content;
    jmsg["thinking"] = msg.reasoning_content; // gpt-oss
}

The template will complain if both content and reasoning_content are present in a tool call message; maybe the messages should be adjusted in common_chat_params_init_gpt_oss() instead?

Author

tarruda commented Nov 2, 2025

Can you add a server test case for this?

I was planning to add tests; I just wanted some feedback from the maintainers first. TBH I'm not even certain it is fully working because I don't know how to debug the templates yet, but it does seem to perform well in my manual testing.

I realize not everyone might be familiar with codex or how to compile it, so I've added some step-by-step instructions on how to build my codex fork and run llama-server in a way that codex can make the most of it.

I've tested these instructions on my Mac, using gpt-oss-20b so it is more accessible.

Install Rust toolchain:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

After installing, re-enter the shell to be able to use cargo.

Ensure pkg-config and libssl-dev are installed.

On Mac it might not be necessary (it wasn't when I ran locally), but on Debian-flavored Linuxes you would use:

sudo apt install libssl-dev pkg-config build-essential

Clone and compile my codex fork.

git clone https://github.com/tarruda/codex -b send-thinking
cd codex/codex-rs
cargo build --workspace

After that, the codex binary will be located at target/debug/codex. Copy it to a directory on your PATH.

Configure codex

Configure codex so it won't ask for OpenAI login.

mkdir -pv ~/.codex
cat > ~/.codex/config.toml << EOF
profile = "oss"

[model_providers.llama_server]
name = "llama_server"
# adjust the base URL to the host which will be running llama-server
base_url = "http://localhost:8080/v1"
query_params = {"reasoning_effort" = "high"}

[profiles.oss]
model_provider = "llama_server"
model = "gpt-oss"
EOF

Get the modified chat template

As I mentioned, this requires a slightly modified chat template. Download it locally to gptoss.j2:

curl -LOC - https://gist.githubusercontent.com/tarruda/7075f921bd8a58e0b9755766acdad7a5/raw/ac6fc486c4c0b710ee62177d5a898a37f57536f8/gptoss.j2

Launch llama-server

# Adjust to where you downloaded the gpt oss
model=$HOME/models/ggml-org/gpt-oss-20b-GGUF/mxfp4/gpt-oss-20b-mxfp4.gguf

llama-server --no-mmap --no-warmup --model $model -a gpt-oss --ctx-size 262140 -np 2 --jinja -fa on --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0 --swa-full --host 0.0.0.0 --chat-template-file gptoss.j2

Finally, run codex. I ran it inside the llama.cpp repository and here are the results:

[screenshot of the codex session results omitted]

    }
-   if (!has_content && !has_tool_calls) {
+   if (!has_content && !has_tool_calls && !has_reasoning_content) {
        throw std::runtime_error("Expected 'content' or 'tool_calls' (ref: https://github.com/ggml-org/llama.cpp/issues/8367 & https://github.com/ggml-org/llama.cpp/issues/12279)");
Author

@aldehir about your comment: I was getting errors from llama-server when my codex fork sent "reasoning_content"; the errors came from this validation.

Collaborator

That's interesting. It isn't the behavior I see from my own clients sending back reasoning_content. I also use codex, but with middleware that translates reasoning to reasoning_content. Have you inspected the traffic from codex to ensure it is passing back tool_calls?

This doesn't hurt anything, but it does codify that a model may output only reasoning and nothing else.

Author

That's interesting. It isn't the behavior I see from my own clients sending back reasoning_content. I also use codex, but with middleware that translates reasoning to reasoning_content.

I actually have my own middleware which I use just to inspect requests, and I could never see codex sending reasoning back to llama.cpp without the changes I made. There was some code which dropped it when the last message was a user message, which is always the case when a new request is sent.

Have you inspected the traffic from codex to ensure it is passing back tool_calls?

Yes, it does receive tool calls.

Author

This is easy to verify: if you run llama.cpp master with my codex fork, it will fail with a 500 on the second message (which is the first request that would send back previous reasoning content):

[screenshot of the 500 error omitted]

Collaborator

@aldehir aldehir Nov 2, 2025

There was some code which dropped it when the last message was a user message, which is certainly always the case when sending the request.

gpt-oss only needs the reasoning when looping on tool calls, i.e. where the last message has the tool role. The template itself will not include reasoning for tool calls prior to the last "final" message (an assistant message with content). The message before a user message usually is a final assistant message, so all prior reasoning is removed. Minimax M2 does appear to require it for every assistant message, though. (Correction: it looks like MiniMax-M2 only keeps it for tool calling loops as well.)

[screenshot omitted]

This test case should pass even if you don't pass back reasoning_content, as content should be present.
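
To make that concrete, here is a rough, invented illustration (message contents, tool names, and ids are made up, not taken from either template) of the two situations: reasoning attached to an assistant turn that already ended with a final answer is dropped when the prompt is re-rendered, while reasoning on the assistant message of an in-flight tool-call loop is kept.

messages = [
    # 1) Finished turn: the assistant already gave a final answer, so the
    #    gpt-oss template drops this reasoning on the next render.
    {"role": "user", "content": "What does this repository do?"},
    {"role": "assistant",
     "reasoning_content": "Summarize the README briefly.",  # dropped
     "content": "It is an LLM inference engine."},

    # 2) Active tool-call loop: the last message has the tool role, so the
    #    reasoning on the pending assistant tool call is rendered.
    {"role": "user", "content": "How many C++ files are in src/?"},
    {"role": "assistant",
     "reasoning_content": "Count the files with a shell command.",  # kept
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "shell",
                                  "arguments": "{\"command\": [\"ls\", \"src\"]}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "chat.cpp common.cpp"},
]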

Author

gpt-oss only needs the reasoning when looping on tool calls, i.e where the last message has the tool role. The template itself will not include reasoning for tool calls prior to the last "final" message (an assistant message with content). The message before a user message usually is a final assistant message, so all prior reasoning is removed. Minimax M2 does appear to require it for every assistant message, though.

If I understood correctly, then there's no problem with always passing reasoning back, since the template will only use it when needed, right?

In that case, isn't it best to just allow passing reasoning_content and let each template decide how its model uses it?

Collaborator

@aldehir aldehir Nov 2, 2025

I believe that is preferable; the model creators typically generate the template, so they should encode whatever logic they expect there. Worst case, we can manipulate the messages in the *_init_params() function for the specific model. That's my own opinion, I do not speak for the maintainers.

I tested your branch, and I found the cause of your problem:

[side-by-side request comparison omitted: tarruda/codex send-thinking vs. codex-test]

Notice, on the right, your patch is sending the reasoning content in a separate message. This is why you are receiving the error: there is no accompanying content or tool_calls. Even if it were allowed, the template would render a final message with no content (from the first message), which may degrade model performance.

Additionally, gpt-oss only needs the reasoning from tool call messages. If it comes from a regular assistant message, it is dropped. You can see this in the chat template. (Note: it does add it if add_generation_prompt = false, which is only applicable during training.)

Take a look at my patch: aldehir/codex@fe2ca23

[side-by-side request comparison omitted: aldehir/codex llama-cpp-support vs. codex-test-patch]

I had to give it a more specific example, so I asked it to run ls and then read the first 3 lines of the README file in the directory. Notice the reasoning_content added to the assistant message with tool_calls. This works with the current master branch as is.
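
As a rough sketch of the difference (contents and ids invented for illustration):

# What the send-thinking branch was emitting: reasoning in its own assistant
# message, with no content and no tool_calls -- this is what trips the
# "Expected 'content' or 'tool_calls'" validation on master.
rejected_by_master = {
    "role": "assistant",
    "reasoning_content": "Run ls first, then read the first 3 lines of README.md.",
}

# What the template expects: reasoning_content attached to the assistant
# message that also carries the tool_calls -- accepted by master as-is.
accepted_by_master = {
    "role": "assistant",
    "reasoning_content": "Run ls first, then read the first 3 lines of README.md.",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "shell", "arguments": "{\"command\": [\"ls\"]}"},
    }],
}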

Author

Ok so to summarize:

  • For GPT-OSS, reasoning has to be passed back only with tool calls or normal content. If not, it is either ignored or it can break the conversation.
  • We can still use this PR to allow reasoning content to be passed back independently, because some LLMs like Minimax M2 might use it.

Author

@aldehir your codex patch is much simpler. I assume it could break other inference engines that expect "reasoning" instead of "reasoning_content", so it probably needs to be configurable.

Were you planning to submit a PR to codex to make it compatible with llama.cpp, or will you just continue using the reasoning -> reasoning_content proxy?

Collaborator

@aldehir aldehir Nov 2, 2025

For GPT-OSS, reasoning has to be passed back only with tool calls or normal content. If not, it is either ignored or it can break the conversation

For gpt-oss, technically only tool calls. But it doesn't hurt if you keep it intact on all assistant messages, since the template will render it properly.

We still use this PR to allow reasoning content to be passed back independently, because some LLMs like Minimax M2 might use it.

I don't believe this is needed; as I point out in #16946 (comment), it works as-is if I pass along reasoning_content.

Were you planning to submit a PR to codex to make it compatible with llama.cpp or are you just continue using the reasoning -> reasoning_content proxy?

I have no intention to submit a PR. I think the ideal approach here is to adopt a Responses API that automatically supports this interaction.

Collaborator

pwilkin commented Nov 2, 2025

LGTM, had the same remark as @ngxson (need tests).

Author

tarruda commented Nov 3, 2025

I added a test that verifies that reasoning_content is accepted standalone.
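
For anyone curious, here is a sketch of roughly what such a test looks like, assuming the existing pytest helpers in tools/server/tests (ServerPreset, make_request); the actual test in this branch may differ in details:

import pytest
from utils import *

server = ServerPreset.tinyllama2()

@pytest.fixture(autouse=True)
def create_server():
    global server
    server = ServerPreset.tinyllama2()

def test_standalone_reasoning_content_is_accepted():
    global server
    server.start()
    res = server.make_request("POST", "/chat/completions", data={
        "max_tokens": 8,
        "messages": [
            {"role": "user", "content": "Hello"},
            # assistant message carrying only reasoning_content,
            # previously rejected by the "Expected 'content' or 'tool_calls'" check
            {"role": "assistant", "reasoning_content": "Thinking about the greeting."},
            {"role": "user", "content": "Hi again"},
        ],
    })
    assert res.status_code == 200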

@tarruda tarruda force-pushed the allow-passing-back-reasoning-content branch from 2aef331 to 48237c2 on November 3, 2025 00:21
@tarruda tarruda requested a review from ngxson November 3, 2025 00:21
@github-actions github-actions bot added the python python script changes label Nov 3, 2025
Author

tarruda commented Nov 3, 2025

I've realized that this is not necessary after reflecting on @aldehir's explanation, so I'm closing this.

If anyone is interested in using codex with llama.cpp + GPT-OSS, just use @aldehir's patch and it should work flawlessly.

@tarruda tarruda closed this Nov 3, 2025