Conversation

@tarruda tarruda commented Nov 2, 2025

This makes it possible for reasoning_content to be passed back to llama-server, which is useful for LLMs like GPT-OSS or Minimax-M2 that were trained for this.

TBH I'm not sure this is the correct approach, as I'm not familiar with the code. I've simply made the changes necessary so that llama.cpp no longer errors out when receiving reasoning_content back from the client.

I've been using GPT-OSS 120B locally with a codex fork that sends reasoning_content back, and it seems to work quite well.
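
For reference, this is roughly the request shape involved (a minimal sketch, not taken from codex; the model name, port, and message contents are placeholders): an assistant message in the history carries a reasoning_content field that gets passed back to llama-server.

import requests  # assumes llama-server listening on localhost:8080

payload = {
    "model": "gpt-oss-120b",
    "messages": [
        {"role": "user", "content": "What is 2 + 2?"},
        {
            "role": "assistant",
            # reasoning emitted by the model on the previous turn, passed back verbatim
            "reasoning_content": "Trivial arithmetic, answer directly.",
            "content": "4",
        },
        {"role": "user", "content": "Now multiply that by 3."},
    ],
}

resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.status_code, resp.json()["choices"][0]["message"])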

It also requires a slightly modified jinja chat template that replaces "thinking" with "reasoning_content".

If this is the way to go and is merged, I will follow up with a codex PR that makes this configurable so that codex can be used correctly with llama-server.

I've also looked at Minimax M2's chat template and it seems to use reasoning_content to render <think> blocks, which is compatible with how it is done here.

In case someone wants to try my codex fork with this, here's the config you can drop into ~/.codex/config.toml:

profile = "oss"

[model_providers.llama_server]
name = "llama-server"
base_url = "http://localhost:8080/v1"
query_params = {"reasoning_effort" = "high"} # doesn't seem like this is currently working, still need to debug

[profiles.oss]
model_provider = "llama_server"
model = "gpt-oss-120b"

This is the llama-server command I use (adjust for what your hardware can handle):

llama-server --no-mmap --no-warmup --model gpt-oss-120b-mxfp4-00001-of-00003.gguf \
  -a gpt-oss-120b --ctx-size 524280 -np 4 --jinja -fa on --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0 \
  --swa-full --host 0.0.0.0 --chat-template-kwargs '{"reasoning_effort":"low"}' --chat-template-file gptoss.j2

cc @pwilkin

Collaborator

@ngxson ngxson left a comment

Can you add a server test case for this?

Collaborator

aldehir commented Nov 2, 2025

This should already work with gpt-oss. From what I have seen, reasoning_content always comes accompanied by content or tool_calls. That said, I don't know how Minimax M2 behaves, so maybe it's still needed.

And the reasoning_content gets mapped to thinking here:

llama.cpp/common/chat.cpp

Lines 314 to 317 in 76af40a

if (!msg.reasoning_content.empty()) {
    jmsg["reasoning_content"] = msg.reasoning_content;
    jmsg["thinking"] = msg.reasoning_content; // gpt-oss
}

The template will complain if both content and reasoning_content are present in a tool call message; maybe the messages should be adjusted in common_chat_params_init_gpt_oss() instead?

Author

tarruda commented Nov 2, 2025

Can you add a server test case for this?

I was planning to add tests; I just wanted some feedback from the maintainers first. TBH I'm not even certain it is fully working because I don't know how to debug the templates yet, but it does seem to perform well in my manual testing.

I realize not everyone might be familiar with codex or how to compile it, so I've added some step-by-step instructions on how to build my codex fork and run llama-server in a way that codex can make the most of it.

I've tested these instructions on my Mac, using gpt-oss-20b so it is more accessible.

Install Rust toolchain:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y

After installing, re-enter the shell to be able to use cargo.

Ensure pkg-config and libssl-dev are installed.

On Mac it might not be necessary (it wasn't when I ran locally), but on Debian-flavored Linuxes you would use:

sudo apt install libssl-dev pkg-config build-essential

Clone and compile my codex fork.

git clone https://github.com/tarruda/codex -b send-thinking
cd codex/codex-rs
cargo build --workspace

After that, the codex binary will be located at target/debug/codex. Copy it to a directory on your PATH.

Configure codex

Configure codex so it won't ask for OpenAI login.

mkdir -pv ~/.codex
cat > ~/.codex/config.toml << EOF
profile = "oss"

[model_providers.llama_server]
name = "llama_server"
# adjust the base URL to the host which will be running llama-server
base_url = "http://localhost:8080/v1"
query_params = {"reasoning_effort" = "high"}

[profiles.oss]
model_provider = "llama_server"
model = "gpt-oss"
EOF

Get the modified chat template

As I mentioned, this requires a slightly modified chat template. Download it locally to gptoss.j2:

curl -LOC - https://gist.githubusercontent.com/tarruda/7075f921bd8a58e0b9755766acdad7a5/raw/ac6fc486c4c0b710ee62177d5a898a37f57536f8/gptoss.j2

Launch llama-server

# Adjust to where you downloaded the gpt oss
model=$HOME/models/ggml-org/gpt-oss-20b-GGUF/mxfp4/gpt-oss-20b-mxfp4.gguf

llama-server --no-mmap --no-warmup --model $model -a gpt-oss --ctx-size 262140 -np 2 --jinja -fa on --temp 1.0 --top-p 1.0 --top-k 0 --min-p 0.0 --swa-full --host 0.0.0.0 --chat-template-file gptoss.j2

Finally, run codex. I ran it inside the llama.cpp repository and here are the results:

[screenshot of the codex session results omitted]

    }
-   if (!has_content && !has_tool_calls) {
+   if (!has_content && !has_tool_calls && !has_reasoning_content) {
        throw std::runtime_error("Expected 'content' or 'tool_calls' (ref: https://github.com/ggml-org/llama.cpp/issues/8367 & https://github.com/ggml-org/llama.cpp/issues/12279)");
Author

@aldehir about your comment: I was getting errors from llama-server when my codex fork sent "reasoning_content"; the errors came from this validation.

Collaborator

That's interesting. It isn't the behavior I see from my own clients sending back reasoning_content. I also use codex, but with middleware that translates reasoning to reasoning_content. Have you inspected the traffic from codex to ensure it is passing back tool_calls?

This doesn't hurt anything, but it does codify that a model may output only reasoning and nothing else.

Author

That's interesting. It isn't the behavior I see from my own clients sending back reasoning_content. I also use codex, but with middleware that translates reasoning to reasoning_content.

I actually have my own middleware which I use just to inspect requests, and I could never see codex sending reasoning back to llama.cpp without the changes I made. There was some code which dropped it when the last message was a user message, which is always the case when a new request is sent.

Have you inspected the traffic from codex to ensure it is passing back tool_calls?

Yes, it does receive tool calls.

Author

This is easy to verify: if you run llama.cpp master with my codex fork, it will fail with a 500 on the second message (which is the first request that would send back previous reasoning content):

[screenshot of the 500 error omitted]

Collaborator

@aldehir aldehir Nov 2, 2025

There was some code which dropped it when the last message was a user message, which is certainly always the case when sending the request.

gpt-oss only needs the reasoning when looping on tool calls, i.e. where the last message has the tool role. The template itself will not include reasoning for tool calls prior to the last "final" message (an assistant message with content). The message before a user message usually is a final assistant message, so all prior reasoning is removed. Minimax M2 does appear to require it for every assistant message, though. (Correction: it looks like MiniMax-M2 only keeps it for tool calling loops as well.)

[screenshot omitted]

This test case should pass even if you don't pass back reasoning_content, as content should be present.
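
To make that concrete, here is a rough, invented illustration (message contents, tool names, and ids are made up, not taken from either template) of the two situations: reasoning attached to an assistant turn that already ended with a final answer is dropped when the prompt is re-rendered, while reasoning on the assistant message of an in-flight tool-call loop is kept.

messages = [
    # 1) Finished turn: the assistant already gave a final answer, so the
    #    gpt-oss template drops this reasoning on the next render.
    {"role": "user", "content": "What does this repository do?"},
    {"role": "assistant",
     "reasoning_content": "Summarize the README briefly.",  # dropped
     "content": "It is an LLM inference engine."},

    # 2) Active tool-call loop: the last message has the tool role, so the
    #    reasoning on the pending assistant tool call is rendered.
    {"role": "user", "content": "How many C++ files are in src/?"},
    {"role": "assistant",
     "reasoning_content": "Count the files with a shell command.",  # kept
     "tool_calls": [{"id": "call_1", "type": "function",
                     "function": {"name": "shell",
                                  "arguments": "{\"command\": [\"ls\", \"src\"]}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "chat.cpp common.cpp"},
]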

Author

gpt-oss only needs the reasoning when looping on tool calls, i.e where the last message has the tool role. The template itself will not include reasoning for tool calls prior to the last "final" message (an assistant message with content). The message before a user message usually is a final assistant message, so all prior reasoning is removed. Minimax M2 does appear to require it for every assistant message, though.

If I understood correctly, then there's no problem with always passing reasoning back, since the template will only use it when needed, right?

In that case, isn't it best to just allow passing reasoning_content and let each template decide how its model uses it?

Collaborator

@aldehir aldehir Nov 2, 2025

I believe that is preferable; the model creators typically generate the template, so they should encode whatever logic they expect there. Worst case, we can manipulate the messages in the *_init_params() function for the specific model. That's my own opinion, I do not speak for the maintainers.

I tested your branch, and I found the cause of your problem:

[side-by-side request comparison omitted: tarruda/codex send-thinking vs. codex-test]

Notice, on the right, your patch is sending the reasoning content in a separate message. This is why you are receiving the error: there is no accompanying content or tool_calls. Even if it were allowed, the template would render a final message with no content (from the first message), which may degrade model performance.

Additionally, gpt-oss only needs the reasoning from tool call messages. If it comes from a regular assistant message, it is dropped. You can see this in the chat template. (Note: it does add it if add_generation_prompt = false, which is only applicable during training.)

Take a look at my patch: aldehir/codex@fe2ca23

[side-by-side request comparison omitted: aldehir/codex llama-cpp-support vs. codex-test-patch]

I had to give it a more specific example, so I asked it to run ls and then read the first 3 lines of the README file in the directory. Notice the reasoning_content added to the assistant message with tool_calls. This works with the current master branch as is.
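
As a rough sketch of the difference (contents and ids invented for illustration):

# What the send-thinking branch was emitting: reasoning in its own assistant
# message, with no content and no tool_calls -- this is what trips the
# "Expected 'content' or 'tool_calls'" validation on master.
rejected_by_master = {
    "role": "assistant",
    "reasoning_content": "Run ls first, then read the first 3 lines of README.md.",
}

# What the template expects: reasoning_content attached to the assistant
# message that also carries the tool_calls -- accepted by master as-is.
accepted_by_master = {
    "role": "assistant",
    "reasoning_content": "Run ls first, then read the first 3 lines of README.md.",
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "shell", "arguments": "{\"command\": [\"ls\"]}"},
    }],
}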

Author

Ok so to summarize:

  • For GPT-OSS, reasoning has to be passed back only with tool calls or normal content. If not, it is either ignored or it can break the conversation.
  • We can still use this PR to allow reasoning content to be passed back independently, because some LLMs like Minimax M2 might use it.

Author

@aldehir your codex patch is much simpler. I assume it could break other inference engines that expect "reasoning" instead of "reasoning_content", so it probably needs to be configurable.

Were you planning to submit a PR to codex to make it compatible with llama.cpp, or will you just continue using the reasoning -> reasoning_content proxy?

Collaborator

@aldehir aldehir Nov 2, 2025

For GPT-OSS, reasoning has to be passed back only with tool calls or normal content. If not, it is either ignored or it can break the conversation

For gpt-oss, technically only tool calls. But it doesn't hurt if you keep it intact on all assistant messages, since the template will render it properly.

We still use this PR to allow reasoning content to be passed back independently, because some LLMs like Minimax M2 might use it.

I don't believe this is needed; as I point out in #16946 (comment), it works as-is if I pass along reasoning_content.

Were you planning to submit a PR to codex to make it compatible with llama.cpp or are you just continue using the reasoning -> reasoning_content proxy?

I have no intention to submit a PR. I think the ideal approach here is to adopt a Responses API that automatically supports this interaction.

Collaborator

pwilkin commented Nov 2, 2025

LGTM, had the same remark as @ngxson (need tests).

Author

tarruda commented Nov 3, 2025

I added a test that verifies that reasoning_content is accepted standalone.
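
For anyone curious, here is a sketch of roughly what such a test looks like, assuming the existing pytest helpers in tools/server/tests (ServerPreset, make_request); the actual test in this branch may differ in details:

import pytest
from utils import *

server = ServerPreset.tinyllama2()

@pytest.fixture(autouse=True)
def create_server():
    global server
    server = ServerPreset.tinyllama2()

def test_standalone_reasoning_content_is_accepted():
    global server
    server.start()
    res = server.make_request("POST", "/chat/completions", data={
        "max_tokens": 8,
        "messages": [
            {"role": "user", "content": "Hello"},
            # assistant message carrying only reasoning_content,
            # previously rejected by the "Expected 'content' or 'tool_calls'" check
            {"role": "assistant", "reasoning_content": "Thinking about the greeting."},
            {"role": "user", "content": "Hi again"},
        ],
    })
    assert res.status_code == 200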

@tarruda tarruda force-pushed the allow-passing-back-reasoning-content branch from 2aef331 to 48237c2 on November 3, 2025 00:21
@tarruda tarruda requested a review from ngxson November 3, 2025 00:21
@github-actions github-actions bot added the python python script changes label Nov 3, 2025
Author

tarruda commented Nov 3, 2025

I've realized that this is not necessary after reflecting on @aldehir's explanation, so I'm closing this.

If anyone is interested in using codex with llama.cpp + GPT-OSS, just use @aldehir's patch and it should work flawlessly.

@tarruda tarruda closed this Nov 3, 2025