
Commit eaf466d

Fix truncated logprobs when streaming is off
The logic to skip the logprobs of the stop token originally came from ggml-org/llama.cpp#2849 and was later modified in ggml-org/llama.cpp#10643 to apply only to STOP_TYPE_WORD. That later change was not included in ikawrakow#723. Then, after ikawrakow#958 was merged, the logic was inadvertently applied to GLM-4.5/4.6 and Kimi K2, resulting in truncated logprobs when streaming is off.

This commit reverts the logic from ggml-org/llama.cpp#2849, so that the logprobs of the stop token are always included in the response when logprobs is enabled. From testing, this matches the behavior of the Fireworks inference server for both the chat completions and text completions endpoints.

Also fix logprobs param handling for the text completion endpoint.
1 parent 912c98f commit eaf466d

2 files changed: 9 additions, 12 deletions

examples/server/server.cpp

Lines changed: 3 additions & 12 deletions
@@ -2646,18 +2646,9 @@ struct server_context {
 
             // populate res.probs_output
             if (slot.sparams.n_probs > 0) {
-                if (!slot.params.stream && slot.stopped_word) {
-                    const std::vector<llama_token> stop_word_toks = llama_tokenize(ctx, slot.stopping_word, false);
-
-                    size_t safe_offset = std::min(slot.generated_token_probs.size(), stop_word_toks.size());
-                    res.probs_output = std::vector<completion_token_output>(
-                        slot.generated_token_probs.begin(),
-                        slot.generated_token_probs.end() - safe_offset);
-                } else {
-                    res.probs_output = std::vector<completion_token_output>(
-                        slot.generated_token_probs.begin(),
-                        slot.generated_token_probs.end());
-                }
+                res.probs_output = std::vector<completion_token_output>(
+                    slot.generated_token_probs.begin(),
+                    slot.generated_token_probs.end());
                 res.data["completion_probabilities"] = probs_vector_to_json(ctx, res.probs_output);
             }
 
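The effect of the server.cpp change, illustrated with a minimal standalone C++ sketch (hypothetical token strings and stop-word length, not the server's actual types): the old non-streaming path subtracted a safe_offset from the end of the generated-token range, dropping the probabilities of the stop word's tokens, while the new path keeps the full range.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    // Stand-in for slot.generated_token_probs: one entry per generated token.
    std::vector<std::string> generated = {"The", " answer", " is", " 42", "</s>"};

    // Old behavior (streaming off, stopped on a stop word): trim the stop word's tokens.
    size_t stop_word_token_count = 1;  // assume the stop word tokenized to a single token
    size_t safe_offset = std::min(generated.size(), stop_word_token_count);
    std::vector<std::string> old_output(generated.begin(), generated.end() - safe_offset);

    // New behavior: always return probabilities for every generated token.
    std::vector<std::string> new_output(generated.begin(), generated.end());

    std::cout << "old: " << old_output.size() << " entries, new: " << new_output.size() << " entries\n";
    return 0;
}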

examples/server/utils.hpp

Lines changed: 6 additions & 0 deletions
@@ -466,6 +466,12 @@ static json oaicompat_chat_params_parse(const json& body) {
         throw std::runtime_error("Only no echo is supported");
     }
 
+    // Handle "logprobs" field
+    int n_probs = json_value(body, "logprobs", 0);
+    if (n_probs > 0) {
+        llama_params["n_probs"] = n_probs;
+    }
+
     // Params supported by OAI but unsupported by llama.cpp
     static const std::vector<std::string> unsupported_params{ "best_of", "suffix" };
     for (const auto& param : unsupported_params) {