feat: add llamacpp params #221

Merged
merged 2 commits into main on Sep 12, 2024

Conversation

@nguyenhoangthuan99 (Contributor, Author) commented Sep 10, 2024

In this PR the following params will be added:

  • Chat completion:

    • seed = -1; RNG seed (default: -1, use random seed for < 0)
    • dynatemp_range = 0.0f; dynamic temperature range (default: 0.0, 0.0 = disabled)
    • dynatemp_exponent = 1.0f; dynamic temperature exponent (default: 1.0)
    • top_k = 40; top-k sampling (default: 40, 0 = disabled)
    • min_p = 0.05f; min-p sampling (default: 0.05, 0.0 = disabled)
    • tfs_z = 1.0f; tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
    • typ_p = 1.0f; locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
    • repeat_last_n = 64; last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
    • penalty_repeat = 1.0f; penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
    • mirostat = false; use Mirostat sampling.
    • mirostat_tau = 5.0f; Mirostat target entropy, parameter tau (default: 5.0)
    • mirostat_eta = 0.1f; Mirostat learning rate, parameter eta (default: 0.1)
    • penalize_nl = false; penalize newline tokens (default: false)
    • ignore_eos = false; ignore end of stream token and continue generating (implies --logit-bias EOS-inf)
    • n_probs = 0; number of log probs per token to return
    • min_keep = 0;
    • grammar;
  • Load Model:

    • n_predict : number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
    • prompt : prompt to start generation with
      in conversation mode, this will be used as system prompt
      (default: '')
    • conversation: run in conversation mode, does not print special tokens and suffix/prefix
      if suffix/prefix are not specified, default chat template will be used
      (default: false)
    • special : special tokens output enabled (default: false)

This PR also supports returning log probs when n_probs > 0.
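
For context, below is a minimal sketch of the parsing pattern these additions follow, assuming jsoncpp's Json::Value and a trimmed-down ChatCompletionRequest; the field names and defaults come from the list above, everything else is illustrative rather than the PR's actual code.

#include <json/json.h>

#include <memory>
#include <string>

// Sketch only: a few of the new fields from the list above, with the same
// defaults as described in the PR.
struct ChatCompletionRequest {
  int seed = -1;
  int top_k = 40;
  float min_p = 0.05f;
  float dynatemp_range = 0.0f;
  float dynatemp_exponent = 1.0f;
  int n_probs = 0;
  std::string grammar;
  // ... remaining fields from the list above
};

inline ChatCompletionRequest fromJson(std::shared_ptr<Json::Value> jsonBody) {
  ChatCompletionRequest completion;
  if (jsonBody) {
    // Each param is read from the request body, falling back to its default.
    completion.seed = (*jsonBody).get("seed", -1).asInt();
    completion.top_k = (*jsonBody).get("top_k", 40).asInt();
    completion.min_p = (*jsonBody).get("min_p", 0.05f).asFloat();
    completion.dynatemp_range =
        (*jsonBody).get("dynatemp_range", 0.0f).asFloat();
    completion.dynatemp_exponent =
        (*jsonBody).get("dynatemp_exponent", 1.0f).asFloat();
    completion.n_probs = (*jsonBody).get("n_probs", 0).asInt();
    completion.grammar = (*jsonBody).get("grammar", "").asString();
    // ... remaining params follow the same pattern
  }
  return completion;
}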

@vansangpfiev vansangpfiev marked this pull request as ready for review September 11, 2024 01:20
@vansangpfiev vansangpfiev changed the title feat-add-llamacpp-params feat: add llamacpp params Sep 11, 2024
@@ -28,6 +48,24 @@ inline ChatCompletionRequest fromJson(std::shared_ptr<Json::Value> jsonBody) {
completion.messages = (*jsonBody)["messages"];
completion.stop = (*jsonBody)["stop"];
completion.model_id = (*jsonBody).get("model", {}).asString();

completion.seed = (*jsonBody).get("seed", -1).asInt();


I notice this PR defines default values twice:

  • struct definition (above)
  • JSON parsing default value

Are we able to define them once?

DRY principle: https://en.wikipedia.org/wiki/Don%27t_repeat_yourself
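
One way to define them once (a sketch of the DRY idea, not this PR's code) would be to keep the defaults only as in-class member initializers and reuse the already-initialized member as the JSON fallback:

// Sketch: the default value lives only in the struct definition; the JSON
// parse falls back to whatever the member was default-initialized to.
ChatCompletionRequest completion;
completion.seed = (*jsonBody).get("seed", completion.seed).asInt();
completion.top_k = (*jsonBody).get("top_k", completion.top_k).asInt();
completion.min_p = (*jsonBody).get("min_p", completion.min_p).asFloat();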

@nguyenhoangthuan99 (Contributor, Author) commented Sep 11, 2024


I followed the previous implementation, e.g. https://github.com/janhq/cortex.llamacpp/blob/main/src/chat_completion_request.h#L8; maybe some weird bug in the past forced us to do it. For example, in the PR where I fixed the race condition, even though we had checked everything in the code (mutex, only return a slot if available, ...), the error still popped up, so I had to add another check for if (slot == null) before the issue was resolved.

We are using a third-party lib for JSON, so I think there's no harm in double-checking that it works well. If it's necessary I'll change it, but then we need to test more to make sure it won't break anything.

data["tfs_z"] = completion.tfs_z;
data["typical_p"] = completion.typ_p;
data["repeat_last_n"] = completion.repeat_last_n;
data["repeat_penalty"] = completion.penalty_repeat;


Woah, is there a way for us to align our penalty_repeat param with the original llama.cpp repeat_penalty?

  • This is the sort of thing that trips an intern up a year from now
  • If we align all params, is there a more elegant way to copy aligned k-v pairs from one struct to another? (llama3.1 tells me std::copy)

@nguyenhoangthuan99 (Contributor, Author) replied:

I think it's impossible, because data is a JSON value while completion is our custom struct; the JSON library doesn't provide an overloaded operator= between its value type and our completion struct.

@nguyenhoangthuan99 (Contributor, Author) replied:

About penalty_repeat vs. repeat_penalty: it is the same as the previous implementation with frequency_penalty (https://github.com/janhq/cortex.llamacpp/blob/main/src/llama_server_context.cc#L445). I think it is a way to provide a consistent parameter interface for our API.
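
For reference, a sketch of the reviewer's "copy aligned k-v pairs" idea: if the API keys matched llama.cpp's names exactly, the request body could be forwarded through a key whitelist instead of member-by-member assignments. This is a hypothetical helper, not what the PR implements:

#include <json/json.h>

// Hypothetical helper: forward request keys whose names already match
// llama.cpp's parameters, so each name is written only once.
static const char* kAlignedKeys[] = {
    "seed", "top_k", "min_p", "tfs_z", "typical_p",
    "repeat_last_n", "repeat_penalty", "mirostat_tau", "mirostat_eta"};

inline void CopyAlignedParams(const Json::Value& body, Json::Value& data) {
  for (const char* key : kAlignedKeys) {
    if (body.isMember(key)) {
      data[key] = body[key];
    }
  }
}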

state->task_id = state->llama.RequestCompletion(data, false, false, -1);
while (state->llama.model_loaded_external) {
TaskResult result = state->llama.NextResult(state->task_id);
if (!result.error) {
std::string to_send = result.result_json["content"];
std::string to_send;
if (n_probs > 0){


Can I verify my understanding about n_probs:

  • From llama.cpp server docs if n_probs > 0, resp contains probabilities of N tokens
  • However, we don't seem to send the content back to the user in this case?
  • Or: do we send both content, and also completion_probabilities?

We should align with the conventions in llama.cpp's server, as much as possible


@nguyenhoangthuan99 (Contributor, Author) replied:

Our implementation can return the form shown in ggerganov/llama.cpp#4088 (comment): both the content and a list of (token, prob) pairs for each generated token.
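
For illustration, a sketch of that response shape built with jsoncpp; the content / completion_probabilities / probs / tok_str keys follow the llama.cpp server format discussed in the linked issue, and the values are made up:

// Sketch of a response carrying both the generated text and per-token
// candidate probabilities (shape only, illustrative values).
Json::Value resp;
resp["content"] = "Hello world";

Json::Value candidate;
candidate["tok_str"] = "Hello";
candidate["prob"] = 0.91;

Json::Value token_entry;
token_entry["content"] = "Hello";
token_entry["probs"] = Json::Value(Json::arrayValue);
token_entry["probs"].append(candidate);

resp["completion_probabilities"] = Json::Value(Json::arrayValue);
resp["completion_probabilities"].append(token_entry);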

slot->sparams.dynatemp_exponent =
json_value(data, "dynatemp_exponent", default_sparams.dynatemp_exponent);
slot->sparams.ignore_eos =
json_value(data, "ignore_eos", default_sparams.ignore_eos);


Can I check my understanding:

  • This code filters for top N tokens given sampling settings
  • Fills out completion_probabilities k-v

@nguyenhoangthuan99 (Contributor, Author) commented Sep 11, 2024

Actually, the returned n probs are handled in several places in the codebase.
In this part, n probs are added to the result at every inference step: https://github.com/janhq/cortex.llamacpp/blob/main/src/llama_server_context.cc#L1675.

In stream mode, this line forms the returned JSON: https://github.com/janhq/cortex.llamacpp/blob/main/src/llama_server_context.cc#L919
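
In other words, each decoding step records the sampled token together with its top n_probs candidates, and the streaming code later serializes that list into completion_probabilities. A rough sketch with hypothetical types (the real definitions live in llama_server_context.cc at the links above):

#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not the codebase's own definitions.
struct TokenProb {
  std::string tok_str;
  float prob;
};

struct CompletionTokenOutput {
  std::string text;              // the token actually sampled at this step
  std::vector<TokenProb> probs;  // top n_probs candidates at this step
};

// Called once per decoding step when n_probs > 0; the accumulated vector is
// what later gets serialized into completion_probabilities.
inline void RecordStepProbs(std::vector<CompletionTokenOutput>& generated,
                            std::string sampled,
                            std::vector<TokenProb> top_candidates) {
  generated.push_back({std::move(sampled), std::move(top_candidates)});
}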

@nguyenhoangthuan99 nguyenhoangthuan99 merged commit b6372bb into main Sep 12, 2024
29 checks passed
Merging this pull request may close the following issue:

epic: llama.cpp params are settable via API call or model.yaml