
epic: llama.cpp params are settable via API call or model.yaml #1151

Closed
4 of 7 tasks
Tracked by #2320 ...
dan-homebrew opened this issue Sep 8, 2024 · 13 comments · Fixed by janhq/cortex.llamacpp#221
Labels: category: model running · P0: critical · type: epic

Comments

@dan-homebrew
Contributor

dan-homebrew commented Sep 8, 2024

Goal

  • Cortex can handle all llama.cpp params correctly
  • Model running params (i.e. POST /v1/models/<model_id>/start)
  • Inference params (i.e. POST /chat/completions) (see the sketch after this list)
  • Function calling, e.g. for llama.cpp
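As a rough illustration of the two endpoints above, a minimal Python sketch follows. The server address, the model id, and the specific parameter names (ngl, ctx_len, cpu_threads, temperature, top_k, top_p, max_tokens) are assumptions for illustration and may not match cortex.cpp's actual API surface.

```python
import requests

BASE = "http://127.0.0.1:39281"  # assumed local cortex.cpp address; adjust as needed

# Model running params: llama.cpp load-time options sent when starting a model.
start_body = {
    "ngl": 33,         # GPU layers to offload (illustrative)
    "ctx_len": 4096,   # context length (illustrative)
    "cpu_threads": 8,
}
requests.post(f"{BASE}/v1/models/llama3.1:8b/start", json=start_body, timeout=60)

# Inference params: llama.cpp sampling options sent per chat completion request.
chat_body = {
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.9,
    "max_tokens": 256,
}
resp = requests.post(f"{BASE}/v1/chat/completions", json=chat_body, timeout=60)
print(resp.json())
```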

Tasklist

I am using this epic to aggregate all llama.cpp params issues, including llama3.1 function calling + tool use

model.yaml

Out-of-scope:

Related

@dan-homebrew dan-homebrew converted this from a draft issue Sep 8, 2024
@dan-homebrew dan-homebrew changed the title epic: Cortex can handle all llama.cpp params correctly epic: Cortex and model.yaml can handle llama.cpp params correctly Sep 8, 2024
@dan-homebrew dan-homebrew moved this to In Progress in Jan & Cortex Sep 8, 2024
@dan-homebrew dan-homebrew removed their assignment Sep 8, 2024
@dan-homebrew dan-homebrew added the type: epic A major feature or initiative label Sep 8, 2024
@dan-homebrew dan-homebrew changed the title epic: Cortex and model.yaml can handle llama.cpp params correctly epic: llama.cpp params are settable via API call or model.yaml Sep 9, 2024
@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Sep 9, 2024

Generally, I will break down this epic into tasks:

  • Fully support function calling: see the linked comment from a llama.cpp contributor.
  • Request body parameters: we need to expose more of the options that llama.cpp supports for chat completion and model loading in the request body. This needs a PR in cortex.llamacpp. Currently, most of the popular load-model and chat-completion params are already supported in cortex.llamacpp.
  • Response body: add an option to return log probs -> requires modifying the cortex.llamacpp source. This task will need more effort because it touches the inference implementation; if not done carefully it could break inference or degrade performance.

Out-of-scope:

  • Speech API support
  • Image API support

model.yaml
From my side, all information for running a model should live in one file: a model file can only run with one engine, and the user can modify and tune every parameter needed to run the model in the same place, which is more convenient.

Function calling
According to this comment, function calling is just a more complicated chat template that asks the model to find the proper function and params to answer the input question. There is no standard way to do it: each model has a different training process, so the prompt for each model is also different.

Function calling is essentially an advanced form of prompt engineering. It involves crafting a specialized prompt that instructs the model to identify appropriate functions and their parameters based on the input question. However, there's no universal approach to implementing this feature, as each model has undergone unique training processes, necessitating model-specific prompting strategies.
Developing a generic function calling feature presents significant challenges:

  • Model variability: Different models require distinct prompting techniques, making it difficult to create a one-size-fits-all solution.
  • Extensive experimentation: Even for a single model (e.g., llama3.1 - 8B), substantial testing is required to optimize performance across various scenarios.
  • User-defined functions: Since users will define custom functions, it's challenging to ensure that a preset system prompt will work effectively for all possible function definitions.
  • Quality assurance: Maintaining consistent output quality across diverse models and user-defined functions is extremely difficult.
  • Unpredictable responses: The complexity of the task increases the likelihood of unexpected or incorrect outputs.

Given these challenges, it's crucial to approach the implementation of a generalized function calling feature with caution. The goal of supporting every model and every user-defined function is likely unattainable due to the inherent variability and complexity involved. Instead, it may be more practical to focus on optimizing the feature for specific, well-defined use cases or a limited set of models.
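To make the point about prompt engineering concrete, here is a rough, hypothetical sketch of rendering user-defined tools into a system prompt and parsing a possible tool call from the model's reply. This is not the llama3.1 template nor cortex.llamacpp's implementation; every function name and output format here is illustrative only.

```python
import json

def render_tool_prompt(tools: list[dict]) -> str:
    """Render user-defined tools into a system prompt (generic pattern, not a real template)."""
    tool_specs = json.dumps(tools, indent=2)
    return (
        "You have access to the following functions:\n"
        f"{tool_specs}\n"
        "If a function is needed, reply ONLY with JSON of the form "
        '{"name": "<function_name>", "arguments": {...}}. Otherwise answer normally.'
    )

def parse_tool_call(model_output: str):
    """Best-effort parse of a tool call from the model's raw text output."""
    try:
        call = json.loads(model_output)
        if isinstance(call, dict) and "name" in call:
            return call  # looks like a function call
    except json.JSONDecodeError:
        pass
    return None  # plain answer, no tool call

tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
}]
print(render_tool_prompt(tools))
print(parse_tool_call('{"name": "get_weather", "arguments": {"city": "Hanoi"}}'))
```

A preset prompt like this is exactly what breaks down across different models and arbitrary user-defined functions, as the list above describes.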

I also checked ChatGPT, Mistral, Groq, etc.; they also support function calling, but the difference is that they build this feature for their own models.
llama.cpp also hasn't supported function calling yet, since it is not really useful for normal users, and developers can do it themselves with better results.

(screenshot omitted)

@0xSage 0xSage added the P0: critical Mission critical label Sep 9, 2024
@louis-jan
Contributor

louis-jan commented Sep 9, 2024

model.yaml

Lessons learned from Jan

  • The model.json ID is quite easy to break -> it should be a computed field instead of a stored one.
  • The model.json structure is quite complicated, and users are unclear about where to place parameters, e.g. between model loading and inference -> all of the parameters should be flattened and then handled at the application level (effectively a request middleman; see the sketch after this list).
  • The model.json engine parameter is outdated, e.g. all of the models are defined as "nitro" but are actually routed to cortex.cpp, and we do not want to migrate them yet -> detect the model type and handle it accordingly from the engines.
  • model.json should not define all available model parameters; the model / engine already does. We want to keep model.json clean, short and simple. For example, ngl was missing for a long time, so users could not adjust GPU offload in most previous versions of Jan.
  • Model load parameters are lacking (cpu_threads, caching, flash attention, ...).
  • model.json is for templating, not for storing. I haven't seen any use case of an app that persists model.json until now.
  • Most of the inference parameters are unchanged across models; only max_tokens and stop words are maintained. But these are all computed fields: the default max_tokens can be ctx_length, and stop words can be pulled from the GGUF file. This is actually redundant because the engine should handle it automatically; GGUF model metadata is consolidated, and all of these parameters' default values are already defined by the engine.
  • Model load parameters have mostly been computed fields until now, e.g. ctx_length, prompt_template, ngl, and llama_model_path; llama_model_path in particular is duplicated, as we already know the location of the GGUF file, making redefinition in model.yaml unnecessary. Since GGUF model metadata is consolidated, adjusting those things is meaningless. As a user, I would love to configure other params instead, e.g. cpu_threads, cache, ...
  • Decorative fields are killing the app, e.g. metadata, format, description, ... These fields should be optional, but without them the app breaks. 😕
  • Grouping and sorting are hard when we don't know how to bring certain models to the top. Model tags are quite tricky. Model size should be a computed field as well; we are filling it in manually.
  • Sources: currently, the app pre-populates model.json as a model template, and users can download it at any time without worrying about a blank page in the Model Hub when there is no connectivity. After pulling, the app depends on the file name field from the source to locate the model, but this is duplicated with llama_model_path, and the engine can look for the model file in the model directory automatically anyway, so file_name also seems redundant.
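As referenced in the second bullet, here is a minimal sketch of the "flatten, then split at the application level" idea; the parameter names and the grouping are assumptions for illustration, not Jan's or cortex.cpp's actual schema.

```python
# Hypothetical split of a flat parameter dict into load-time vs inference-time
# params at the application layer (the "request middleman").
LOAD_KEYS = {"ngl", "ctx_len", "cpu_threads", "flash_attn", "cache_type"}
INFER_KEYS = {"temperature", "top_k", "top_p", "max_tokens", "stop"}

def split_params(flat: dict) -> tuple[dict, dict]:
    load = {k: v for k, v in flat.items() if k in LOAD_KEYS}
    infer = {k: v for k, v in flat.items() if k in INFER_KEYS}
    return load, infer

flat = {"ngl": 33, "ctx_len": 4096, "temperature": 0.7, "max_tokens": 256}
load_params, infer_params = split_params(flat)
print(load_params)   # -> routed to the model start / load request
print(infer_params)  # -> merged into each chat completion request
```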

Sync parameters between Jan and engines

It would be great if we could apply something like protoc. Say there is a .proto file (just as an example) that defines the entities; it can be used for projects from JS to C++ to automatically generate the entities. That way we only maintain one entity file that defines the model.yaml DTO, which can be reused across projects (there are many engines to maintain as well).

Template parsing should be done from cortex.cpp?

We currently have to parse the model template in order to convert the Jinja template into ai_prompt, user_prompt, and system_prompt, so that engines can load it accordingly. The load model request should be simplified.
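For illustration, here is a minimal sketch of the kind of conversion involved, assuming a flat template with {system_message} and {prompt} placeholders rather than a full Jinja template (real Jinja templates would need proper template parsing). The field names mirror the ones mentioned above, but their exact semantics in cortex.cpp may differ.

```python
def split_prompt_template(template: str) -> dict:
    """Split a flat prompt template into the prefix strings an engine expects.

    Assumes the template contains {system_message} and {prompt} placeholders.
    """
    before_system, rest = template.split("{system_message}", 1)
    between, after_prompt = rest.split("{prompt}", 1)
    return {
        "system_prompt": before_system,  # text before the system message
        "user_prompt": between,          # text between system message and user prompt
        "ai_prompt": after_prompt,       # text after the user prompt, before generation
    }

chatml = ("<|im_start|>system\n{system_message}<|im_end|>\n"
          "<|im_start|>user\n{prompt}<|im_end|>\n"
          "<|im_start|>assistant\n")
print(split_prompt_template(chatml))
```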

@tikikun
Contributor

tikikun commented Sep 10, 2024

Research input:

  • The market has a tendency to consolidate on the GGUF or Hugging Face config file -> the model already has its own config
  • What we want in the description is not related to the model itself, but to how the user stores the config for that model

So by using a separate model.yaml we just make another wrapper around a config that is already there inside either the GGUF or the Hugging Face config file. In practice, this has proven extremely inconvenient to use.

The config of the model should bind to the user's entity; the model is already self-contained.

@dan-homebrew
Contributor Author

dan-homebrew commented Sep 10, 2024

Research input:

  • The market has a tendency to consolidate on the GGUF or Hugging Face config file -> the model already has its own config
  • What we want in the description is not related to the model itself, but to how the user stores the config for that model

So by using a separate model.yaml we just make another wrapper around a config that is already there inside either the GGUF or the Hugging Face config file. In practice, this has proven extremely inconvenient to use.

The config of the model should bind to the user's entity; the model is already self-contained.

I agree - given that GGUF already has built-in configs, we should make model.yaml optional (i.e. it just overrides existing GGUF params; see the sketch below).

However:

  • model.yaml can still be very useful as a packaging tool (e.g. GGUF params are not editable by a layman, and it can define model loading, engine and inference params for the stack)
  • model.yaml can still be useful as a param definition tool, especially for other engines (e.g. TensorRT-LLM and ONNX/DirectML) which are less mature
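A minimal sketch of that "optional override" idea: defaults come from GGUF metadata, and a model.yaml, if present, only overrides them. The field names and the use of PyYAML are assumptions for illustration.

```python
from pathlib import Path
import yaml  # PyYAML, assumed available

def effective_config(gguf_defaults: dict, model_yaml_path: Path) -> dict:
    """Merge GGUF-derived defaults with optional model.yaml overrides."""
    config = dict(gguf_defaults)        # start from what the GGUF already declares
    if model_yaml_path.exists():        # model.yaml is optional
        overrides = yaml.safe_load(model_yaml_path.read_text()) or {}
        config.update(overrides)        # model.yaml only overrides, it is never required
    return config

gguf_defaults = {"ctx_len": 8192, "prompt_template": "<chatml>", "ngl": 33}
print(effective_config(gguf_defaults, Path("model.yaml")))
```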

@0xSage 0xSage added the category: model running Inference ux, handling context/parameters, runtime label Sep 10, 2024
@dan-homebrew
Contributor Author

  • Response body: add an option to return log probs -> requires modifying the cortex.llamacpp source. This task will need more effort because it touches the inference implementation; if not done carefully it could break inference or degrade performance.

@nguyenhoangthuan99 If log probs require an upstream PR to llama.cpp, let's move them to out-of-scope for this epic.

My focus for now is to catch up to llama.cpp and ensure a stable product - we can explore upstream improvements later on.

@dan-homebrew
Contributor Author

dan-homebrew commented Sep 10, 2024

Function calling According to this comment, function calling is just a more complicated chat template that asks the model to find the proper function and params to answer the input question. There is no standard way to do it: each model has a different training process, so the prompt for each model is also different.

Function calling is essentially an advanced form of prompt engineering. It involves crafting a specialized prompt that instructs the model to identify appropriate functions and their parameters based on the input question. However, there's no universal approach to implementing this feature, as each model has undergone unique training processes, necessitating model-specific prompting strategies. Developing a generic function calling feature presents significant challenges:

  • Model variability: Different models require distinct prompting techniques, making it difficult to create a one-size-fits-all solution.
  • Extensive experimentation: Even for a single model (e.g., llama3.1 - 8B), substantial testing is required to optimize performance across various scenarios.
  • User-defined functions: Since users will define custom functions, it's challenging to ensure that a preset system prompt will work effectively for all possible function definitions.
  • Quality assurance: Maintaining consistent output quality across diverse models and user-defined functions is extremely difficult.
  • Unpredictable responses: The complexity of the task increases the likelihood of unexpected or incorrect outputs.

Given these challenges, it's crucial to approach the implementation of a generalized function calling feature with caution. The goal of supporting every model and every user-defined function is likely unattainable due to the inherent variability and complexity involved. Instead, it may be more practical to focus on optimizing the feature for specific, well-defined use cases or a limited set of models.

I also checked ChatGPT, Mistral, Groq, etc.; they also support function calling, but the difference is that they build this feature for their own models. llama.cpp also hasn't supported function calling yet, since it is not really useful for normal users, and developers can do it themselves with better results.

@nguyenhoangthuan99 @louis-jan I agree. Let's scope this to supporting per-model function calling:

  • We can focus on llama3.1 first
  • My naive understanding is that llama3.1 is the main model with function calling
  • i.e. ensure llama3.1 can support function calling (e.g. through combination of presets?, etc)
  • i.e. ensure llama3.1 finetunes can also support function calling at a model level

We can do this for llama3.1 first, and use it as a test case to develop a framework that can be generalized to other models in the future.

Given the high number of llama3.1 finetunes, this may mean prioritizing the cortex presets story, which ultimately is a model.yaml story as well.

@nguyenhoangthuan99
Contributor

Defining the default model.yaml first?
If GGUF model binary is missing the header metadata, what is default?
If a GGUF binary is missing its header metadata, the file is invalid and llama.cpp cannot load it. The first 4 bytes of a GGUF file are a magic number; when parsing a GGUF file we read this magic number first, and if it does not match, the file is invalid. llama.cpp and other tools like Hugging Face do the same when reading data from GGUF files (see the sketch below).

If the model binary fails this check -> we won't create any model.yaml file, because we cannot use the model.
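For illustration, a minimal sketch of that magic-number check: a GGUF file starts with the 4 ASCII bytes "GGUF" followed by a little-endian uint32 version. This is not cortex.llamacpp's actual parser, just a standalone check.

```python
import struct
import sys

def is_valid_gguf(path: str) -> bool:
    """Check the 4-byte GGUF magic at the start of the file, as llama.cpp does."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            return False          # not a GGUF file: refuse to generate model.yaml
        version = struct.unpack("<I", f.read(4))[0]  # little-endian uint32 version
        return version >= 1

if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(is_valid_gguf(sys.argv[1]))
```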

How do we intend to do versioning?
Example: we added the wrong template to a new model and need to fix.
How will cortex know the current model is outdated and needs an update?

Currently, we only download models based on the repo name/branch on Hugging Face; the version in model.yaml is parsed from the GGUF file. This part may relate to @namchuai.

@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Sep 10, 2024

This PR can resolve:

  • Add most of llama.cpp's params to the request body
  • Add log probs to the response when using stream mode (sketched below).
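As a client-side illustration of the second item, here is a hedged sketch following OpenAI-style conventions; the logprobs flag name, the server address, the model id, and the streamed chunk layout are assumptions about cortex.cpp's API rather than confirmed behavior.

```python
import json
import requests

body = {
    "model": "llama3.1:8b",                     # hypothetical model id
    "messages": [{"role": "user", "content": "Hi"}],
    "stream": True,
    "logprobs": True,                           # assumed flag name (OpenAI-style)
}
with requests.post("http://127.0.0.1:39281/v1/chat/completions",
                   json=body, stream=True, timeout=60) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        # Each streamed chunk is assumed to carry the token delta plus its log probs.
        choice = chunk["choices"][0]
        print(choice.get("delta"), choice.get("logprobs"))
```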

Since function calling has been separated into a different issue, janhq/models#16, I'll move it out of this epic.

@nguyenhoangthuan99 nguyenhoangthuan99 moved this from In Progress to In Review in Jan & Cortex Sep 10, 2024
@dan-homebrew
Contributor Author

dan-homebrew commented Sep 11, 2024

@nguyenhoangthuan99 Quick check: there's a Jan issue asking for Beam search. Do we support it?

If it's not in llama.cpp main branch, we don't need to support it. I just want to keep up with stable for now

@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Sep 11, 2024

In llama.cpp, beam search is covered because it is a very important sampling technique: in llama.cpp it takes the form of the top_k sampler. Each step uses top_k=40, which corresponds to the num_beams of beam search, to search for the result. llama.cpp also combines many sampler methods; by default it combines 5-6 of them.
(screenshot omitted)
I also added the top_k option to the params for cortex.llamacpp.
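For reference, a minimal top-k sampling sketch in plain Python (illustrative only; llama.cpp's actual sampler chain is implemented in C++ and combines several methods such as top-k, top-p and temperature).

```python
import math
import random

def top_k_sample(logits: list[float], k: int = 40, temperature: float = 1.0) -> int:
    """Sample a token index, restricting candidates to the k highest-logit tokens."""
    # Keep only the top-k candidates (analogous to the beam width discussed above).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving candidates, with temperature scaling.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    r = random.random() * total
    acc = 0.0
    for idx, p in zip(top, probs):
        acc += p
        if r <= acc:
            return idx
    return top[-1]

print(top_k_sample([0.1, 2.5, -1.0, 3.2, 0.7], k=3))
```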

@dan-homebrew
Contributor Author

In llama.cpp, beam search is covered because it is a very important sampling technique: in llama.cpp it takes the form of the top_k sampler. Each step uses top_k=40, which corresponds to the num_beams of beam search, to search for the result. llama.cpp also combines many sampler methods; by default it combines 5-6 of them. I also added the top_k option to the params for cortex.llamacpp.

Fantastic - yup, I was hoping it was a nomenclature difference

@github-project-automation github-project-automation bot moved this from In Review to Completed in Jan & Cortex Sep 12, 2024
@nguyenhoangthuan99 nguyenhoangthuan99 moved this from Completed to QA in Jan & Cortex Sep 19, 2024
@0xSage 0xSage moved this from QA to In Progress in Jan & Cortex Sep 24, 2024
@0xSage 0xSage reopened this Sep 24, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Scheduled in Jan & Cortex Sep 24, 2024
@0xSage 0xSage moved this from Scheduled to QA in Jan & Cortex Sep 24, 2024
@gabrielle-ong
Contributor

This is a multi-sprint epic (including function calling), pushing to sprint 22

@gabrielle-ong gabrielle-ong added this to the v1.0.1 milestone Oct 21, 2024
@gabrielle-ong
Contributor

Closing, merging into #295

@gabrielle-ong gabrielle-ong moved this from Review + QA to Completed in Jan & Cortex Oct 21, 2024