feat: add llamacpp params #221
Conversation
In this PR, the following params will be added:
This PR also supports returning log probs when the `n_probs` option is set.
@@ -28,6 +48,24 @@ inline ChatCompletionRequest fromJson(std::shared_ptr<Json::Value> jsonBody) {
    completion.messages = (*jsonBody)["messages"];
    completion.stop = (*jsonBody)["stop"];
    completion.model_id = (*jsonBody).get("model", {}).asString();

+   completion.seed = (*jsonBody).get("seed", -1).asInt();
I notice this PR defines default values twice:
- in the struct definition (above)
- as the JSON parsing default value

Are we able to define them once?
DRY principle: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
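One way to define them once, if the struct members already carry default initializers: let a default-constructed instance supply the fallback, so the parsing layer never restates a value. A minimal sketch, assuming jsoncpp (`Json::Value`) as in this file; `kDefaults` is an illustrative name:

```cpp
inline ChatCompletionRequest fromJson(std::shared_ptr<Json::Value> jsonBody) {
  ChatCompletionRequest completion;
  // Single source of truth: the struct's own default member initializers.
  static const ChatCompletionRequest kDefaults{};
  completion.seed = (*jsonBody).get("seed", kDefaults.seed).asInt();
  // ... repeat the same pattern for the remaining params ...
  return completion;
}
```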
I followed the previous implementation, e.g. https://github.com/janhq/cortex.llamacpp/blob/main/src/chat_completion_request.h#L8; maybe some weird bug in the past forced us to do it. For example, in the PR where I fixed the race condition, even though we had checked everything in the code (mutex, only return a slot if available, ...), the error still popped up, so I had to add another check for if (slot == null) to resolve the issue.
We are using a third-party lib for JSON, so I think there's no harm in double-checking to make sure it works well. If it's necessary I'll change it, but we'd need to test more to make sure it won't break anything.
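For illustration, the shape of that extra defensive check (a sketch with hypothetical names like `get_available_slot`, not the actual cortex.llamacpp code):

```cpp
#include <iostream>
#include <mutex>

struct Slot { int id; };

std::mutex slots_mutex;

// Hypothetical stand-in for the real slot pool lookup; it may still return
// null even after the availability checks, which is the race described above.
Slot* get_available_slot();

void handle_request() {
  std::lock_guard<std::mutex> lock(slots_mutex);
  Slot* slot = get_available_slot();
  if (slot == nullptr) {  // the extra check that resolved the issue
    std::cerr << "no slot available, rejecting request\n";
    return;
  }
  // ... proceed with inference on *slot ...
}
```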
data["tfs_z"] = completion.tfs_z; | ||
data["typical_p"] = completion.typ_p; | ||
data["repeat_last_n"] = completion.repeat_last_n; | ||
data["repeat_penalty"] = completion.penalty_repeat; |
Woah, is there a way for us to align our `penalty_repeat` param with the original llama.cpp `repeat_penalty`?
- This is the sort of thing that trips an intern up a year from now
- If we align all params, is there a more elegant way to copy aligned k-v pairs from one struct to another? (llama3.1 tells me `std::copy`; see the sketch below)
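For what it's worth, `std::copy` won't map struct fields to JSON keys. If the JSON library in play is nlohmann/json (an assumption) and the member names were renamed to match llama.cpp's wire names, the field-by-field copy could be generated rather than hand-written. A sketch under those assumptions:

```cpp
#include <nlohmann/json.hpp>

// Members deliberately named after llama.cpp's wire keys (repeat_penalty,
// not penalty_repeat) so the generated mapping lines up one-to-one.
struct SamplingParams {
  float repeat_penalty = 1.1f;
  float typical_p = 1.0f;
  int repeat_last_n = 64;
};
// Generates to_json/from_json for the listed members (nlohmann/json >= 3.9).
NLOHMANN_DEFINE_TYPE_NON_INTRUSIVE(SamplingParams, repeat_penalty, typical_p,
                                   repeat_last_n)

// Usage:
//   nlohmann::json data = params;              // struct -> aligned k-v pairs
//   auto parsed = data.get<SamplingParams>();  // and back
```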
I think it's impossible, because `data` is a `json` datatype and `completion` is our custom struct datatype; `json` doesn't have an overloaded `operator=` for our custom completion struct.
About `penalty_repeat` and `repeat_penalty`: it is the same as the previous implementation with `frequency_penalty`, https://github.com/janhq/cortex.llamacpp/blob/main/src/llama_server_context.cc#L445. I think it is a way to provide our own uniform params interface for the API.
state->task_id = state->llama.RequestCompletion(data, false, false, -1); | ||
while (state->llama.model_loaded_external) { | ||
TaskResult result = state->llama.NextResult(state->task_id); | ||
if (!result.error) { | ||
std::string to_send = result.result_json["content"]; | ||
std::string to_send; | ||
if (n_probs > 0){ |
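The diff is truncated at this point. Given the reply below (both `content` and the probabilities are returned), the branch plausibly continues along these lines; a sketch, not the exact code:

```cpp
std::string to_send;
if (n_probs > 0) {
  // Assumption: serialize the whole result JSON, which carries both
  // "content" and "completion_probabilities", not the bare content string.
  to_send = result.result_json.dump();
} else {
  to_send = result.result_json["content"];
}
```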
Can I verify my understanding about `n_probs`:
- From the llama.cpp server docs: if `n_probs` > 0, the resp contains probabilities of N tokens
- However, we don't seem to send the `content` back to the user in this case?
- Or: do we send both `content` and also `completion_probabilities`?
We should align with the conventions in llama.cpp's server, as much as possible
Our implementation can return this form, ggerganov/llama.cpp#4088 (comment): both `content` and a list of (token, probs) pairs for each token.
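For reference, the shape discussed in that llama.cpp thread looks roughly like this (field names per the llama.cpp server docs of that era; values illustrative):

```json
{
  "content": "Hello there",
  "completion_probabilities": [
    {
      "content": "Hello",
      "probs": [
        { "tok_str": "Hello", "prob": 0.92 },
        { "tok_str": "Hi", "prob": 0.04 }
      ]
    }
  ]
}
```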
slot->sparams.dynatemp_exponent = | ||
json_value(data, "dynatemp_exponent", default_sparams.dynatemp_exponent); | ||
slot->sparams.ignore_eos = | ||
json_value(data, "ignore_eos", default_sparams.ignore_eos); |
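For context, `json_value` here is llama.cpp's fallback-aware getter: it returns the key's value when present and non-null, otherwise the default. Roughly (paraphrased from llama.cpp's server code; the exact body may differ by version):

```cpp
template <typename T>
static T json_value(const nlohmann::json& body, const std::string& key,
                    const T& default_value) {
  // Fall back to default_value when the key is absent or explicitly null.
  return body.contains(key) && !body.at(key).is_null()
             ? body.value(key, default_value)
             : default_value;
}
```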
Can I check my understanding:
- This code filters for the top N tokens given the sampling settings
- Fills out the `completion_probabilities` k-v
Actually, the n-probs return logic is in several places inside the codebase.
At this part, for every inference step it will add the n probs to the result: https://github.com/janhq/cortex.llamacpp/blob/main/src/llama_server_context.cc#L1675.
With `stream` mode, this line will form the returned JSON: https://github.com/janhq/cortex.llamacpp/blob/main/src/llama_server_context.cc#L919
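Putting those two links together, a hedged sketch of how the streamed chunk could pick up the probabilities, modeled on llama.cpp's server code (`probs_vector_to_json` is the helper there; the linked cortex.llamacpp lines are the authority):

```cpp
json res = json{{"content", tkn.text_to_send}, {"stop", false}};
if (slot.sparams.n_probs > 0) {
  // Attach the top-N alternatives recorded for each token in this chunk.
  res["completion_probabilities"] = probs_vector_to_json(ctx, probs_output);
}
```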
Fix janhq/cortex.cpp#1151