
epic: llama.cpp params are settable via API call or model.yaml #1151

Closed
4 of 7 tasks
Tracked by #2320 ...
dan-homebrew opened this issue Sep 8, 2024 · 13 comments · Fixed by janhq/cortex.llamacpp#221
Labels: category: model running · P0: critical · type: epic

Comments

@dan-homebrew
Contributor

dan-homebrew commented Sep 8, 2024

Goal

  • Cortex can handle all llama.cpp params correctly
  • Model running params (i.e. POST /v1/models/<model_id>/start)
  • Inference params (i.e. POST /chat/completions) (see the sketch after this list)
  • Function calling, e.g. for llama.cpp
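As a rough illustration of the two endpoints above, a minimal Python sketch follows. The server address, the model id, and the specific parameter names (ngl, ctx_len, cpu_threads, temperature, top_k, top_p, max_tokens) are assumptions for illustration and may not match cortex.cpp's actual API surface.

```python
import requests

BASE = "http://127.0.0.1:39281"  # assumed local cortex.cpp address; adjust as needed

# Model running params: llama.cpp load-time options sent when starting a model.
start_body = {
    "ngl": 33,         # GPU layers to offload (illustrative)
    "ctx_len": 4096,   # context length (illustrative)
    "cpu_threads": 8,
}
requests.post(f"{BASE}/v1/models/llama3.1:8b/start", json=start_body, timeout=60)

# Inference params: llama.cpp sampling options sent per chat completion request.
chat_body = {
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "top_k": 40,
    "top_p": 0.9,
    "max_tokens": 256,
}
resp = requests.post(f"{BASE}/v1/chat/completions", json=chat_body, timeout=60)
print(resp.json())
```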

Tasklist

I am using this epic to aggregate all llama.cpp params issues, including llama3.1 function calling + tool use

model.yaml

Out-of-scope:

Related

@dan-homebrew dan-homebrew converted this from a draft issue Sep 8, 2024
@dan-homebrew dan-homebrew changed the title epic: Cortex can handle all llama.cpp params correctly epic: Cortex and model.yaml can handle llama.cpp params correctly Sep 8, 2024
@dan-homebrew dan-homebrew moved this to In Progress in Jan & Cortex Sep 8, 2024
@dan-homebrew dan-homebrew removed their assignment Sep 8, 2024
@dan-homebrew dan-homebrew added the type: epic A major feature or initiative label Sep 8, 2024
@dan-homebrew dan-homebrew changed the title epic: Cortex and model.yaml can handle llama.cpp params correctly epic: llama.cpp params are settable via API call or model.yaml Sep 9, 2024
@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Sep 9, 2024

Generally, I will break down this epic into tasks:

  • Fully support function calling: see the linked comment from a llama.cpp contributor.
  • Request body parameters: we need to expose more of the options that llama.cpp supports for chat completion and model loading in the request body. This needs a PR in cortex.llamacpp. Currently, most of the popular load-model and chat-completion params are already supported in cortex.llamacpp.
  • Response body: add an option to return log probs -> requires modifying the cortex.llamacpp source. This task will need more effort because it touches the inference implementation; if not done carefully it could break inference or degrade performance.

Out-of-scope:

  • Speech API support
  • Image API support

model.yaml
From my side, all information for running a model should live in one file: a model file can only run with one engine, and the user can modify and tune every parameter needed to run the model in the same place, which is more convenient.

Function calling
According to this comment, function calling is just a more complicated chat template that asks the model to find the proper function and params to answer the input question. There is no standard way to do it: each model has a different training process, so the prompt for each model is also different.

Function calling is essentially an advanced form of prompt engineering. It involves crafting a specialized prompt that instructs the model to identify appropriate functions and their parameters based on the input question. However, there's no universal approach to implementing this feature, as each model has undergone unique training processes, necessitating model-specific prompting strategies.
Developing a generic function calling feature presents significant challenges:

  • Model variability: Different models require distinct prompting techniques, making it difficult to create a one-size-fits-all solution.
  • Extensive experimentation: Even for a single model (e.g., llama3.1 - 8B), substantial testing is required to optimize performance across various scenarios.
  • User-defined functions: Since users will define custom functions, it's challenging to ensure that a preset system prompt will work effectively for all possible function definitions.
  • Quality assurance: Maintaining consistent output quality across diverse models and user-defined functions is extremely difficult.
  • Unpredictable responses: The complexity of the task increases the likelihood of unexpected or incorrect outputs.

Given these challenges, it's crucial to approach the implementation of a generalized function calling feature with caution. The goal of supporting every model and every user-defined function is likely unattainable due to the inherent variability and complexity involved. Instead, it may be more practical to focus on optimizing the feature for specific, well-defined use cases or a limited set of models.
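To make the point about prompt engineering concrete, here is a rough, hypothetical sketch of rendering user-defined tools into a system prompt and parsing a possible tool call from the model's reply. This is not the llama3.1 template nor cortex.llamacpp's implementation; every function name and output format here is illustrative only.

```python
import json

def render_tool_prompt(tools: list[dict]) -> str:
    """Render user-defined tools into a system prompt (generic pattern, not a real template)."""
    tool_specs = json.dumps(tools, indent=2)
    return (
        "You have access to the following functions:\n"
        f"{tool_specs}\n"
        "If a function is needed, reply ONLY with JSON of the form "
        '{"name": "<function_name>", "arguments": {...}}. Otherwise answer normally.'
    )

def parse_tool_call(model_output: str):
    """Best-effort parse of a tool call from the model's raw text output."""
    try:
        call = json.loads(model_output)
        if isinstance(call, dict) and "name" in call:
            return call  # looks like a function call
    except json.JSONDecodeError:
        pass
    return None  # plain answer, no tool call

tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
}]
print(render_tool_prompt(tools))
print(parse_tool_call('{"name": "get_weather", "arguments": {"city": "Hanoi"}}'))
```

A preset prompt like this is exactly what breaks down across different models and arbitrary user-defined functions, as the list above describes.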

I also checked ChatGPT, Mistral, Groq, etc.; they also support function calling, but the difference is that they build this feature for their own models.
llama.cpp also hasn't supported function calling yet, since it is not really useful for normal users, and developers can do it themselves with better results.

(screenshot omitted)

@0xSage 0xSage added the P0: critical Mission critical label Sep 9, 2024
@louis-jan
Contributor

louis-jan commented Sep 9, 2024

model.yaml

Lessons learned from Jan

  • The model.json ID is quite easy to break -> it should be a computed field instead of a stored one.
  • The model.json structure is quite complicated, and users are unclear about where to place parameters, e.g. between model loading and inference -> all of the parameters should be flattened and then handled at the application level (effectively a request middleman; see the sketch after this list).
  • The model.json engine parameter is outdated, e.g. all of the models are defined as "nitro" but are actually routed to cortex.cpp, and we do not want to migrate them yet -> detect the model type and handle it accordingly from the engines.
  • model.json should not define all available model parameters; the model / engine already does. We want to keep model.json clean, short and simple. For example, ngl was missing for a long time, so users could not adjust GPU offload in most previous versions of Jan.
  • Model load parameters are lacking (cpu_threads, caching, flash attention, ...).
  • model.json is for templating, not for storing. I haven't seen any use case of an app that persists model.json until now.
  • Most of the inference parameters are unchanged across models; only max_tokens and stop words are maintained. But these are all computed fields: the default max_tokens can be ctx_length, and stop words can be pulled from the GGUF file. This is actually redundant because the engine should handle it automatically; GGUF model metadata is consolidated, and all of these parameters' default values are already defined by the engine.
  • Model load parameters have mostly been computed fields until now, e.g. ctx_length, prompt_template, ngl, and llama_model_path; llama_model_path in particular is duplicated, as we already know the location of the GGUF file, making redefinition in model.yaml unnecessary. Since GGUF model metadata is consolidated, adjusting those things is meaningless. As a user, I would love to configure other params instead, e.g. cpu_threads, cache, ...
  • Decorative fields are killing the app, e.g. metadata, format, description, ... These fields should be optional, but without them the app breaks. 😕
  • Grouping and sorting are hard when we don't know how to bring certain models to the top. Model tags are quite tricky. Model size should be a computed field as well; we are filling it in manually.
  • Sources: currently, the app pre-populates model.json as a model template, and users can download it at any time without worrying about a blank page in the Model Hub when there is no connectivity. After pulling, the app depends on the file name field from the source to locate the model, but this is duplicated with llama_model_path, and the engine can look for the model file in the model directory automatically anyway, so file_name also seems redundant.
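As referenced in the second bullet, here is a minimal sketch of the "flatten, then split at the application level" idea; the parameter names and the grouping are assumptions for illustration, not Jan's or cortex.cpp's actual schema.

```python
# Hypothetical split of a flat parameter dict into load-time vs inference-time
# params at the application layer (the "request middleman").
LOAD_KEYS = {"ngl", "ctx_len", "cpu_threads", "flash_attn", "cache_type"}
INFER_KEYS = {"temperature", "top_k", "top_p", "max_tokens", "stop"}

def split_params(flat: dict) -> tuple[dict, dict]:
    load = {k: v for k, v in flat.items() if k in LOAD_KEYS}
    infer = {k: v for k, v in flat.items() if k in INFER_KEYS}
    return load, infer

flat = {"ngl": 33, "ctx_len": 4096, "temperature": 0.7, "max_tokens": 256}
load_params, infer_params = split_params(flat)
print(load_params)   # -> routed to the model start / load request
print(infer_params)  # -> merged into each chat completion request
```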

Sync parameters between Jan and engines

It would be great if we could apply something like protoc. Say there is a .proto file (just as an example) that defines the entities; it can be used for projects from JS to C++ to automatically generate the entities. That way we only maintain one entity file that defines the model.yaml DTO, which can be reused across projects (there are many engines to maintain as well).

Template parsing should be done from cortex.cpp?

We currently have to parse the model template in order to convert the Jinja template into ai_prompt, user_prompt, and system_prompt, so that engines can load it accordingly. The load model request should be simplified.
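For illustration, here is a minimal sketch of the kind of conversion involved, assuming a flat template with {system_message} and {prompt} placeholders rather than a full Jinja template (real Jinja templates would need proper template parsing). The field names mirror the ones mentioned above, but their exact semantics in cortex.cpp may differ.

```python
def split_prompt_template(template: str) -> dict:
    """Split a flat prompt template into the prefix strings an engine expects.

    Assumes the template contains {system_message} and {prompt} placeholders.
    """
    before_system, rest = template.split("{system_message}", 1)
    between, after_prompt = rest.split("{prompt}", 1)
    return {
        "system_prompt": before_system,  # text before the system message
        "user_prompt": between,          # text between system message and user prompt
        "ai_prompt": after_prompt,       # text after the user prompt, before generation
    }

chatml = ("<|im_start|>system\n{system_message}<|im_end|>\n"
          "<|im_start|>user\n{prompt}<|im_end|>\n"
          "<|im_start|>assistant\n")
print(split_prompt_template(chatml))
```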

@tikikun
Contributor

tikikun commented Sep 10, 2024

Research input:

  • The market has a tendency to consolidate on the GGUF or Hugging Face config file -> the model already has its own config
  • What we want in the description is not related to the model itself, but to how the user stores the config for that model

So by using a separate model.yaml we just make another wrapper around a config that is already there inside either the GGUF or the Hugging Face config file. In practice, this has proven extremely inconvenient to use.

The config of the model should bind to the user's entity; the model is already self-contained.

@dan-homebrew
Contributor Author

dan-homebrew commented Sep 10, 2024

Research input:

  • The market has a tendency to consolidate on the GGUF or Hugging Face config file -> the model already has its own config
  • What we want in the description is not related to the model itself, but to how the user stores the config for that model

So by using a separate model.yaml we just make another wrapper around a config that is already there inside either the GGUF or the Hugging Face config file. In practice, this has proven extremely inconvenient to use.

The config of the model should bind to the user's entity; the model is already self-contained.

I agree - given that GGUF already has built-in configs, we should make model.yaml optional (i.e. it just overrides existing GGUF params; see the sketch below).

However:

  • model.yaml can still be very useful as a packaging tool (e.g. GGUF params are not editable by a layman, and it can define model loading, engine and inference params for the stack)
  • model.yaml can still be useful as a param definition tool, especially for other engines (e.g. TensorRT-LLM and ONNX/DirectML) which are less mature
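A minimal sketch of that "optional override" idea: defaults come from GGUF metadata, and a model.yaml, if present, only overrides them. The field names and the use of PyYAML are assumptions for illustration.

```python
from pathlib import Path
import yaml  # PyYAML, assumed available

def effective_config(gguf_defaults: dict, model_yaml_path: Path) -> dict:
    """Merge GGUF-derived defaults with optional model.yaml overrides."""
    config = dict(gguf_defaults)        # start from what the GGUF already declares
    if model_yaml_path.exists():        # model.yaml is optional
        overrides = yaml.safe_load(model_yaml_path.read_text()) or {}
        config.update(overrides)        # model.yaml only overrides, it is never required
    return config

gguf_defaults = {"ctx_len": 8192, "prompt_template": "<chatml>", "ngl": 33}
print(effective_config(gguf_defaults, Path("model.yaml")))
```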

@0xSage 0xSage added the category: model running Inference ux, handling context/parameters, runtime label Sep 10, 2024
@dan-homebrew
Contributor Author

  • Response body: add an option to return log probs -> requires modifying the cortex.llamacpp source. This task will need more effort because it touches the inference implementation; if not done carefully it could break inference or degrade performance.

@nguyenhoangthuan99 If log probs require an upstream PR to llama.cpp, let's move them to out-of-scope for this epic.

My focus for now is to catch up to llama.cpp and ensure a stable product - we can explore upstream improvements later on.

@dan-homebrew
Contributor Author

dan-homebrew commented Sep 10, 2024

Function calling According to this comment, function calling is just a more complicated chat template that asks the model to find the proper function and params to answer the input question. There is no standard way to do it: each model has a different training process, so the prompt for each model is also different.

Function calling is essentially an advanced form of prompt engineering. It involves crafting a specialized prompt that instructs the model to identify appropriate functions and their parameters based on the input question. However, there's no universal approach to implementing this feature, as each model has undergone unique training processes, necessitating model-specific prompting strategies. Developing a generic function calling feature presents significant challenges:

  • Model variability: Different models require distinct prompting techniques, making it difficult to create a one-size-fits-all solution.
  • Extensive experimentation: Even for a single model (e.g., llama3.1 - 8B), substantial testing is required to optimize performance across various scenarios.
  • User-defined functions: Since users will define custom functions, it's challenging to ensure that a preset system prompt will work effectively for all possible function definitions.
  • Quality assurance: Maintaining consistent output quality across diverse models and user-defined functions is extremely difficult.
  • Unpredictable responses: The complexity of the task increases the likelihood of unexpected or incorrect outputs.

Given these challenges, it's crucial to approach the implementation of a generalized function calling feature with caution. The goal of supporting every model and every user-defined function is likely unattainable due to the inherent variability and complexity involved. Instead, it may be more practical to focus on optimizing the feature for specific, well-defined use cases or a limited set of models.

I also checked ChatGPT, Mistral, Groq, etc.; they also support function calling, but the difference is that they build this feature for their own models. llama.cpp also hasn't supported function calling yet, since it is not really useful for normal users, and developers can do it themselves with better results.

@nguyenhoangthuan99 @louis-jan I agree. Let's scope this to supporting per-model function calling:

  • We can focus on llama3.1 first
  • My naive understanding is that llama3.1 is the main model with function calling
  • i.e. ensure llama3.1 can support function calling (e.g. through combination of presets?, etc)
  • i.e. ensure llama3.1 finetunes can also support function calling at a model level

We can do this for llama3.1 first, and use it as a test case to develop a framework that can be generalized to other models in the future.

Given the high number of llama3.1 finetunes, this may mean prioritizing the cortex presets story, which ultimately is a model.yaml story as well.

@nguyenhoangthuan99
Contributor

Defining the default model.yaml first?
If GGUF model binary is missing the header metadata, what is default?
If a GGUF binary is missing its header metadata, the file is invalid and llama.cpp cannot load it. The first 4 bytes of a GGUF file are a magic number; when parsing a GGUF file we read this magic number first, and if it does not match, the file is invalid. llama.cpp and other tools like Hugging Face do the same when reading data from GGUF files (see the sketch below).

If the model binary fails this check -> we won't create any model.yaml file, because we cannot use the model.
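For illustration, a minimal sketch of that magic-number check: a GGUF file starts with the 4 ASCII bytes "GGUF" followed by a little-endian uint32 version. This is not cortex.llamacpp's actual parser, just a standalone check.

```python
import struct
import sys

def is_valid_gguf(path: str) -> bool:
    """Check the 4-byte GGUF magic at the start of the file, as llama.cpp does."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            return False          # not a GGUF file: refuse to generate model.yaml
        version = struct.unpack("<I", f.read(4))[0]  # little-endian uint32 version
        return version >= 1

if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(is_valid_gguf(sys.argv[1]))
```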

How do we intend to do versioning?
Example: we added the wrong template to a new model and need to fix.
How will cortex know the current model is outdated and needs an update?

Currently, we only download models based on the repo name/branch on Hugging Face; the version in model.yaml is parsed from the GGUF file. This part may relate to @namchuai.

@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Sep 10, 2024

This PR can resolve:

  • Add most of llama.cpp's params to the request body
  • Add log probs to the response when using stream mode (sketched below).
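As a client-side illustration of the second item, here is a hedged sketch following OpenAI-style conventions; the logprobs flag name, the server address, the model id, and the streamed chunk layout are assumptions about cortex.cpp's API rather than confirmed behavior.

```python
import json
import requests

body = {
    "model": "llama3.1:8b",                     # hypothetical model id
    "messages": [{"role": "user", "content": "Hi"}],
    "stream": True,
    "logprobs": True,                           # assumed flag name (OpenAI-style)
}
with requests.post("http://127.0.0.1:39281/v1/chat/completions",
                   json=body, stream=True, timeout=60) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        # Each streamed chunk is assumed to carry the token delta plus its log probs.
        choice = chunk["choices"][0]
        print(choice.get("delta"), choice.get("logprobs"))
```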

Since function calling has been separated into a different issue, janhq/models#16, I'll move it out of this epic.

@nguyenhoangthuan99 nguyenhoangthuan99 moved this from In Progress to In Review in Jan & Cortex Sep 10, 2024
@dan-homebrew
Contributor Author

dan-homebrew commented Sep 11, 2024

@nguyenhoangthuan99 Quick check: there's a Jan issue asking for Beam search. Do we support it?

If it's not in llama.cpp main branch, we don't need to support it. I just want to keep up with stable for now

@nguyenhoangthuan99
Contributor

nguyenhoangthuan99 commented Sep 11, 2024

In llama.cpp, beam search is covered because it is a very important sampling technique: in llama.cpp it takes the form of the top_k sampler. Each step uses top_k=40, which corresponds to the num_beams of beam search, to search for the result. llama.cpp also combines many sampler methods; by default it combines 5-6 of them.
(screenshot omitted)
I also added the top_k option to the params for cortex.llamacpp.
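For reference, a minimal top-k sampling sketch in plain Python (illustrative only; llama.cpp's actual sampler chain is implemented in C++ and combines several methods such as top-k, top-p and temperature).

```python
import math
import random

def top_k_sample(logits: list[float], k: int = 40, temperature: float = 1.0) -> int:
    """Sample a token index, restricting candidates to the k highest-logit tokens."""
    # Keep only the top-k candidates (analogous to the beam width discussed above).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the surviving candidates, with temperature scaling.
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    r = random.random() * total
    acc = 0.0
    for idx, p in zip(top, probs):
        acc += p
        if r <= acc:
            return idx
    return top[-1]

print(top_k_sample([0.1, 2.5, -1.0, 3.2, 0.7], k=3))
```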

@dan-homebrew
Contributor Author

In llama.cpp, beam search is covered because it is a very important sampling technique: in llama.cpp it takes the form of the top_k sampler. Each step uses top_k=40, which corresponds to the num_beams of beam search, to search for the result. llama.cpp also combines many sampler methods; by default it combines 5-6 of them. I also added the top_k option to the params for cortex.llamacpp.

Fantastic - yup, I was hoping it was a nomenclature difference

@github-project-automation github-project-automation bot moved this from In Review to Completed in Jan & Cortex Sep 12, 2024
@nguyenhoangthuan99 nguyenhoangthuan99 moved this from Completed to QA in Jan & Cortex Sep 19, 2024
@0xSage 0xSage moved this from QA to In Progress in Jan & Cortex Sep 24, 2024
@0xSage 0xSage reopened this Sep 24, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Scheduled in Jan & Cortex Sep 24, 2024
@0xSage 0xSage moved this from Scheduled to QA in Jan & Cortex Sep 24, 2024
@gabrielle-ong
Contributor

This is a multi-sprint epic (including function calling), pushing to sprint 22

@gabrielle-ong gabrielle-ong added this to the v1.0.1 milestone Oct 21, 2024
@gabrielle-ong
Contributor

Closing, merging into #295

@gabrielle-ong gabrielle-ong moved this from Review + QA to Completed in Jan & Cortex Oct 21, 2024