epic: Implement new Model Folder and model.yaml #1154

Closed · 6 tasks done
dan-menlo opened this issue Sep 8, 2024 · 24 comments · Fixed by #1327
Labels: category: model management · P0: critical · type: epic

dan-menlo (Contributor) commented Sep 8, 2024

Goal

  • We should have a model folder that is able to handle different models
    • Built-in models (e.g. janhq/llama3:7b-tensorrt-llm)
    • Huggingface GGUF repos with multiple quants (e.g. bartowski/llama3-gguf)
    • Huggingface specific GGUF (may have multiple from same directory)
    • In future: Nvidia NGC or TensorRT Cloud
  • Do we use sub-folders?
  • How does model.yaml work?
  • Model detection should not depend on model folder

Tasklist

Decisions

Bugs

Edge Cases

@dan-menlo dan-menlo added this to Menlo Sep 8, 2024
@dan-menlo dan-menlo converted this from a draft issue Sep 8, 2024
@dan-menlo dan-menlo added the type: epic A major feature or initiative label Sep 8, 2024
@dan-menlo dan-menlo assigned namchuai and unassigned vansangpfiev Sep 8, 2024
@dan-menlo dan-menlo changed the title epic: Model Folder finalize structure epic: Finalize how Model Folder and model.yaml works Sep 8, 2024
@dan-menlo dan-menlo assigned louis-menlo and unassigned namchuai Sep 8, 2024
@dan-menlo dan-menlo moved this to Scheduled in Menlo Sep 8, 2024
@freelerobot freelerobot added category: model management Model pull, yaml, model state P0: critical Mission critical labels Sep 9, 2024
freelerobot (Contributor) commented:

Legacy model folder structure: menloresearch/jan#3541 (comment)

louis-menlo (Contributor) commented Sep 10, 2024

Model detection should not depend on model folder?

If model detection depends on the model folder, it introduces performance issues, since:

  • Every time the app loads, it needs to scan through the model folder hierarchy.
  • Filesystem watching (to notify the app of changes) is costly.

This means we should likely depend on the manifest file as the source of truth, since it links to all available models regardless of folder structure.

  • This introduces a watchdog that periodically scans the folder, so a sync delay may occur.
  • Everything just works with references or symlinks.

Structures that have not worked well in the past, or that I have seen elsewhere:

1. Shallow structure

All of the YAML files are placed in the root of the directory.

Pros: Fast lookup - just filter out YAML files from the root folder to list models.

Cons: Easy to duplicate, and it cannot handle different model families: the same name can collide across branches/authors/engines. It also gets slower over time, since n models means 2n items in the root, and removing a model takes 2 rm operations.

E.g. llama3 could equally come from cortexhub or thebloke, as gguf or onnx, in Q4 or Q8.


/models
    /[model1]
       /[model1].bin
       /[model1].gguf
    /[model1].yaml | json

2. One-level deep structure

All files are placed in a model folder.

Pros: Easy to manage model by model, 1 rm operation can remove the entire model folder.

Cons: Slow list iteration; the app has to loop through every single folder and check whether a model file exists, which means many FS operations.


/models
    /[model1]
       /[model].bin
       /[model].gguf
       /[model].yaml | json

Getting our filesystem hierarchy less wrong with the following 3 principles:

Principle 1: The Single-Question Principle
At each level of the hierarchy, strive to make all folder names answer the same question.

Principle 2: The Domain Principle
Organize files in different domains differently.

Principle 3: The Depth Principle
Prefer deep hierarchies over shallow ones.

The structures would be similar to this:

/models
  ├── manifest.yaml
  ├── /metadatas
  │   ├── llama3.1-7B_Q4_KM.yaml (How to generate a file name that is unique across models?)
  │   └── mistral-7B_Q4_KM.yaml
  └── /sources
      ├── /huggingface
      │   ├── /cortexso
      │   │   ├── /llama3-1
      │   │   │   ├── /gguf
      │   │   │   │   ├── /main
      │   │   │   │   │   └── llama3.1_Q4_KM.gguf
      │   │   │   │   └── /7b
      │   │   │   │       ├── llama3.1_Q4_KM.gguf
      │   │   │   │       └── llama3.1_Q8_KM.gguf
      │   │   │   ├── /onnx
      │   │   │   │   └── /7b
      │   │   │   │       ├── llama3.1.onnx
      │   │   │   │       ├── tokenizer.json
      │   │   │   │       └── gen_config.json
      │   │   │   └── /tensorrt-llm
      │   │   │       └── /7b
      │   │   │           ├── rank0.engine
      │   │   │           ├── tokenizer.model
      │   │   │           └── config.json
      │   │   └── /phi-3
      │   │       └── /onnx
      │   └── /bartowski
      │       └── /Mixtral-8x22B-v0.1
      │           └── /gguf
      │               └── /main
      │                   ├── Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
      │                   └── Mixtral-8x22B-v0.1-IQ3_M-00002-of-00005.gguf
      └── /nvidia-ngc
          └── /llama3-1

OR

/models
  ├── manifest.yaml
  ├── /modelfiles
  │   ├── llama3.1-7B_Q4_KM.yaml
  │   └── mistral-7B_Q4_KM.yaml
  └── /sources
      ├── /huggingface
      │   ├── /cortexso
      │   │   ├── /llama3-1
      │   │   │   ├── /gguf
      │   │   │   │   ├── /main
      │   │   │   │   │   └── llama3.1_Q4_KM.gguf
      │   │   │   │   └── /7b
      │   │   │   │       ├── llama3.1_Q4_KM.gguf
      │   │   │   │       └── llama3.1_Q8_KM.gguf
      │   │   │   ├── /onnx
      │   │   │   │   └── /7b
      │   │   │   │       ├── llama3.1.onnx
      │   │   │   │       ├── tokenizer.json
      │   │   │   │       └── gen_config.json
      │   │   │   └── /tensorrt-llm
      │   │   │       └── /7b
      │   │   │           ├── rank0.engine
      │   │   │           ├── tokenizer.model
      │   │   │           └── config.json
      │   │   └── /phi-3
      │   │       └── /onnx
      │   └── /bartowski
      │       └── /Mixtral-8x22B-v0.1
      │           └── /gguf
      │               └── /main
      │                   ├── Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
      │                   └── Mixtral-8x22B-v0.1-IQ3_M-00002-of-00005.gguf
      └── /nvidia-ngc
          └── /llama3-1
  1. At each individual level of the sources hierarchy, all options are different responses to the same question: What hub is the source? What repo is in the hub? What model types are supported in the repo? What branch is the model pulled from? What model and what quantization are pulled?
  2. model.yaml files are flattened in the metadatas folder for quick search, so users can easily find the one they want to edit, which also boosts the performance of model listing. The filename is a normalized form of model_id.
  3. From the sources folder hierarchy, we can determine the author and format | engine, so we can drop model.yaml's redundant fields. The engine should not be in model.yaml, since it is application-level rather than model-related and cannot be reused across applications.
  4. Files are organized differently in different domains (metadatas / sources).
  5. Everything is a symlink: from [model].yaml we can retrieve the source hierarchy.
  6. The manifest is for caching (optional), which can improve UX and boost performance. It lets us avoid putting computed fields (such as decorations, sorting order - drag and drop later, or sorting results) in model.yaml (e.g. size, quantization). These fields are not essential when constructing or modifying model.yaml, and they increase the risk of errors. Since they can be retrieved from the source files, we only need to cache them when populating the model (see the sketch after this list).
  7. Unified model URL - determining whether a model is downloaded was messy when local and remote URLs were separate (file:// vs https://). With this model folder hierarchy we can use one universal URL for both. E.g. models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf.
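
To make point 6 concrete, here is a minimal Python sketch of the manifest-as-cache idea, assuming the metadatas/ layout and models:// source URLs shown above. The manifest field names (size_bytes, quantization), the URL-to-path mapping, and the quantization regex are illustrative assumptions, not a defined format.

# A minimal sketch of the optional manifest-as-cache idea from point 6.
# Paths and field names (size_bytes, quantization) are hypothetical;
# the real manifest.yaml format is not specified in this discussion.
from pathlib import Path
import re
import yaml  # PyYAML

MODELS_DIR = Path("models")

def build_manifest_cache(models_dir: Path = MODELS_DIR) -> dict:
    """Populate computed fields (file size, quantization) from source files
    so they never have to live inside model.yaml itself."""
    cache = {}
    for meta_file in (models_dir / "metadatas").glob("*.yaml"):
        meta = yaml.safe_load(meta_file.read_text())
        model_id = meta.get("model", meta_file.stem)
        entry = {"size_bytes": 0, "quantization": None}
        for src in meta.get("sources", []):
            # sources use the universal models:// scheme; map them onto
            # the on-disk /models/sources hierarchy shown above (assumed).
            local = models_dir / "sources" / src.removeprefix("models://")
            if local.exists():
                entry["size_bytes"] += local.stat().st_size
            quant = re.search(r"(Q\d+_[A-Z_]+|IQ\d+_[A-Z]+)", src)
            if quant:
                entry["quantization"] = quant.group(1)
        cache[model_id] = entry
    return cache

if __name__ == "__main__":
    manifest = build_manifest_cache()
    (MODELS_DIR / "manifest.yaml").write_text(yaml.safe_dump(manifest))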

ALTERNATIVE PATH:
Inspired by /etc/apt/sources.list

I'm still thinking about another path that could address a couple of problems arising from the structure above:

  • The model.yaml file name can be duplicated.
  • A unique model id should be autogenerated somehow. If it is generated from the folder path, that should be the path where model.yaml is located, not the source file, so model.yaml could not be flattened. (There might be another option that can generate a human-readable model name.)
  • A less complex structure.

Inspired by the PPA repository list mechanism, this approach simply puts all of the model file paths in a sources.list, so the app can list all of the nested model.yaml files without worrying about performance. You can also Ctrl + Click in any editor to open a model.yaml file. (Previously I found it hard to look up a model.yaml in a nested model folder.)

But there is also a con: users cannot search for or view a model's files without opening sources.list in an external editor.

/models
  ├── sources.list (aka models list: models.list)
  └── /sources
      ├── /huggingface
      │   ├── /cortexso
      │   │   ├── /llama3-1
      │   │   │   ├── /gguf
      │   │   │   │   ├── /main
      │   │   │   │   │   ├── llama3.1_Q4_KM.yaml
      │   │   │   │   │   └── llama3.1_Q4_KM.gguf
      │   │   │   │   └── /7b
      │   │   │   │       ├── llama3.1_Q4_KM.yaml
      │   │   │   │       ├── llama3.1_Q4_KM.gguf
      │   │   │   │       ├── llama3.1_Q8_KM.yaml
      │   │   │   │       └── llama3.1_Q8_KM.gguf
      │   │   │   ├── /onnx
      │   │   │   │   └── /7b
      │   │   │   │       ├── llama3.1.yaml
      │   │   │   │       ├── llama3.1.onnx
      │   │   │   │       ├── tokenizer.json
      │   │   │   │       └── gen_config.json
      │   │   │   └── /tensorrt-llm
      │   │   │       └── /7b
      │   │   │           ├── llama3.1.yaml
      │   │   │           ├── rank0.engine
      │   │   │           ├── tokenizer.model
      │   │   │           └── config.json
      │   │   └── /phi-3
      │   │       └── /onnx
      │   └── /bartowski
      │       └── /Mixtral-8x22B-v0.1
      │           └── /gguf
      │               └── /main
      │                   ├── Mixtral-8x22B-v0.1-IQ3_M.yaml
      │                   ├── Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
      │                   └── Mixtral-8x22B-v0.1-IQ3_M-00002-of-00005.gguf
      └── /nvidia-ngc
          └── /llama3-1

Design model.yaml structure

To me, a clean and functioning focused model.yaml file should follow these principles:

1. Build for Functionality, Not Decoration
The model.yaml file is built to define core functionalities of the app rather than superficial decorations.

Its primary role is to allow users to configure advanced model and inference settings, giving them the ability to control and fine-tune how the app interacts with the model. For instance, in cases where legacy models lack certain parameters metadata, maintainers or users can easily edit and update the configuration.

2. Model Configuration, Not App Caching or Storage
The file serves as a configuration file for controlling requests and managing model behaviors.

It is not intended for managing app caching, storage, or persistence layers. All fields must be relevant to controlling the model’s interaction and performance.

3. Unified Structure for Public Sharing and Best Practices
The model.yaml follows a unified structure that aims to create a standard practice among authors and developers.

This structure encourages the publishing and sharing of model configuration settings for various use cases, creating a community-driven trend where the best configurations for different tasks and models are easily accessible.

The model.yaml would be similar to this:

# BEGIN GENERAL GGUF METADATA
model: gemma-2-9b-it-Q8_0 # Model ID which is used for request construct - should be unique between models (author / quantization)
name: Llama 3.1      # metadata.general.name
version: 1           # metadata.version
sources:             # can be universal protocol (models://) OR absolute local file path (file://) OR https remote URL (https://)
  - models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
  - models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00002-of-00005.gguf
# END GENERAL GGUF METADATA

# BEGIN INFERENCE PARAMETERS
# BEGIN REQUIRED
stop:                # tokenizer.ggml.eos_token_id
  - <|end_of_text|>
  - <|eot_id|>
  - <|eom_id|>
# END REQUIRED
# BEGIN OPTIONAL
stream: true         # Default true?
top_p: 0.9           # Ranges: 0 to 1
temperature: 0.6     # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0  # Ranges: 0 to 1
max_tokens: 8192     # Should be default to context length
# END OPTIONAL
# END INFERENCE PARAMETERS

# BEGIN MODEL LOAD PARAMETERS
# BEGIN REQUIRED
prompt_template: |+  # tokenizer.chat_template
  <|begin_of_text|><|start_header_id|>system<|end_header_id|>

  {system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

  {prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
# END REQUIRED
# BEGIN OPTIONAL
ctx_len: 0          # llama.context_length | 0 or undefined = loaded from model
ngl: 33             # Undefined = loaded from model
# END OPTIONAL
# END MODEL LOAD PARAMETERS

As described before, other fields like author or engine could be determined by the model folder, or cortex.cpp can detect the model file type.

We should use the term model since it's consolidated, whereas id is quite server/system specific and not directly related to the LLM model.

The model value could be autogenerated when we run local models. It's a DTO property rather than a stored property, since it is only used to determine which model is running (really, from which folder path it is running). Setting model explicitly in model.yaml overrides that auto-generation mechanism, which is what we need for remote models, e.g. openai/gpt-3.5-turbo.

Model sources/files are a messy issue: the program does not know whether a model is downloaded, what the correct local path is, or what the remote path is (to re-download). So I would really like to use this universal source protocol:

models://[hub]/[author]/[repo]/[branch]/[file] represents a remote file that can be downloaded into the models folder. The logic is to check whether the file exists at a local path constructed from the universal path, and otherwise download it from a remote path constructed from the same universal path (sketched below).
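
A minimal Python sketch of that check-local-then-remote logic, assuming the /models/sources layout above. The URL-to-path mapping, the huggingface.co resolve URL pattern, and the function names are assumptions for illustration, not an existing cortex.cpp API.

# A sketch of resolving the proposed models:// protocol.
from pathlib import Path
from urllib.parse import urlparse

MODELS_DIR = Path("models")

def resolve(universal_url: str) -> tuple[Path, str]:
    """Split models://[hub]/[author]/[repo]/[branch]/[file] into a
    constructed local path and a constructed remote URL (both assumed)."""
    parsed = urlparse(universal_url)
    assert parsed.scheme == "models", "expected a models:// URL"
    hub = parsed.netloc                                   # e.g. huggingface
    author, repo, branch, filename = parsed.path.lstrip("/").split("/", 3)
    local_path = MODELS_DIR / "sources" / hub / author / repo / branch / filename
    if hub == "huggingface":
        # Assumed Hugging Face download URL pattern.
        remote_url = f"https://huggingface.co/{author}/{repo}/resolve/{branch}/{filename}"
    else:
        raise NotImplementedError(f"unknown hub: {hub}")
    return local_path, remote_url

def local_or_download_url(universal_url: str) -> str:
    """Return a file:// URL if the file already exists locally,
    otherwise the remote URL to download it from."""
    local_path, remote_url = resolve(universal_url)
    return local_path.resolve().as_uri() if local_path.exists() else remote_url

if __name__ == "__main__":
    print(local_or_download_url(
        "models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/"
        "Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf"))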

cc @dan-homebrew @0xSage

dan-menlo (Contributor, Author) commented Sep 11, 2024

Model Folder

I like the approach inspired by /etc/apt/sources.list, and agree with the following principles:

  • Manifest file > file system watching
  • Single question, domain, depth principles

I would like to brainstorm a few simplification ideas:

Suggestion 1: "pull name" as folder name

I wonder if this is more recognizable to users than multiple nested folders.

/models
    models.list (index)
    /llama3.1
         llama3.1.gguf
    /llama3.1:tensorrt-llm
         ...
    /huggingface.co/bartowski/llama3.1-gguf-7b
         llama3.1-7b-gguf

Suggestion 2: models.list

A lot of how effective this will be depends on the models.list format.

  • Need to articulate how that will work
  • Does it point to folders?

Suggestion 3: model.yaml is optional

We should move to a paradigm where model.yaml files are optional:

  • GGUF has its own param packaging nowadays
  • We can use model.yaml as a shorthand method for customization

Suggestion 4: model.yaml is co-located with source files

  • It is still highly beneficial for the model.yaml to be in the same folder as the source files, for packaging and proximity purposes.
  • However, we should also be agnostic to whether it's called model.yaml or <model_id>.yaml.
  • However, we need to protect against the edge case where there are multiple .yaml files in the model folder (see the sketch below).
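
A minimal Python sketch of that edge-case guard, assuming one model per folder as in Suggestion 4. The function and exception names are hypothetical, not an existing cortex API.

# Pick the config file in a model folder whether it is model.yaml or
# <model_id>.yaml, and refuse to guess when several candidates exist.
from pathlib import Path

class AmbiguousModelConfig(Exception):
    """Raised when more than one .yaml/.yml file could be the model config."""

def find_model_config(model_dir: Path) -> Path | None:
    yaml_files = sorted(
        p for p in model_dir.iterdir()
        if p.suffix in (".yaml", ".yml")
    )
    if not yaml_files:
        return None                      # model.yaml is optional (Suggestion 3)
    if len(yaml_files) == 1:
        return yaml_files[0]             # agnostic to its exact name
    # Several candidates: prefer the canonical name, otherwise bail out.
    for p in yaml_files:
        if p.stem == "model":
            return p
    raise AmbiguousModelConfig(
        f"{model_dir} contains multiple yaml files: {[p.name for p in yaml_files]}"
    )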

freelerobot (Contributor) commented:

I like suggestion 1. Most flexible, i.e. model binaries can be anywhere.

dan-menlo (Contributor, Author) commented:

model.yaml Structure

I agree with @louis-jan's suggestions above.

However, I'm a bit mixed on what sources should refer to:

  • Individual files? (very tedious to keep the model.yaml up to date)
  • Repo? (i.e. collection of files/tags)

Given that our main integration is with Huggingface, and our own Built-in repos use Git, I think Repos would be a better abstraction.

louis-menlo (Contributor) commented:

> I like suggestion 1. Most flexible, i.e. model binaries can be anywhere.

We did try this approach before with Cortex, but the colon (:) is not allowed in folder names (e.g. on Windows). It turned out that the model folder could not really be the pull name as designed.

dan-menlo (Contributor, Author) commented Sep 11, 2024

> I like suggestion 1. Most flexible, i.e. model binaries can be anywhere.
>
> We did try this approach before with Cortex, but the colon : is not allowed. It turned out that the model folder is not really the pull name as designed.

Built-in Model Library

I see. In that case, can we consider just having a 2-deep folder structure?

  • 1st level: "pull name"
  • 2nd level: tag
/llama3.1
    /7b 

Huggingface Repos

  • For Huggingface repos, the folder can just be a string (can it accommodate / in the folder name?)

louis-menlo (Contributor) commented:

> I like suggestion 1. Most flexible, i.e. model binaries can be anywhere.
>
> We did try this approach before with Cortex, but the colon : is not allowed. It turned out that the model folder is not really the pull name as designed.
>
> Built-in Model Library
>
> I see. In that case, can we consider just have a 2-deep file structure?
>
>   • 1st level: "pull name"
>   • 2nd level: tag
> /llama3.1
>     /7b
>
> Huggingface Repos
>
>   • For Huggingface repos, folder can just be a string (can it accomodate / in folder name?)

Love it!

nguyenhoangthuan99 (Contributor) commented Sep 12, 2024

I'll summarize the implementation for the model folder and model.yaml and break it into tasks.

/models
   ├── model.list 
   └── /llama3-1
   |   ├── /main
   |   |   ├── model.yaml
   |   |   └── llama3.1_Q4_KM.gguf
   |   └── /7b-gguf
   |   |   ├── llama3.1_Q4_KM.yaml
   |   |   ├── llama3.1_Q4_KM.gguf
   |   |   ├── llama3.1_Q8_KM.yaml
   |   |   └── llama3.1_Q8_KM.gguf
   |   ├── /onnx
   |   │   ├── llama3.1.yaml
   |   │   ├── llama3.1.onnx
   |   │   ├── tokenizer.json
   |   │   └── gen_config.json
   |   └── /tensorrt-llm
   |   |   ├── llama3.1.yaml
   |   |   ├── rank0.engine
   |   │   ├── tokenizer.model
   |   └───└── config.json
   └── /bartowski_Mixtral-8x22B-v0.1 # for huggingface repos, "/" will be replaced by "_" or another special character
   |   └── /main
   |       ├── Mixtral-8x22B-v0.1-IQ3_M.yaml
   |       ├── Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
   |       └── Mixtral-8x22B-v0.1-IQ3_M-00002-of-00005.gguf
   └── /nvidia-ngc
        └── /llama3-1-windows-RTX3090
           └──model.engine    

model.list content:

model-id author_repo-id branch-name path-to-model.yaml model-alias
  • How model-id is constructed: author_repo-id_branch-name_gguf-file-name (the gguf-file-name is needed because a single branch can contain multiple gguf files/models with different quants)
  • model-alias is a shorter name for model-id and is also unique; the user can set an alias with the command cortex-cpp model alias model_id model_alias, after which model_alias works exactly like model_id (see the sketch below)
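
A minimal Python sketch of the model-id construction and alias lookup described above, assuming model.list rows are whitespace-separated in the column order given; the helper names and in-memory registry are hypothetical.

# model-id and model-alias, per the proposal above.
from pathlib import Path

def make_model_id(author_repo_id: str, branch: str, gguf_file: str) -> str:
    """author_repo-id_branch-name_gguf-file-name, following the bullet above."""
    return f"{author_repo_id}_{branch}_{Path(gguf_file).stem}"

def load_model_list(path: Path) -> dict[str, dict]:
    """Parse model.list rows:
    model-id author_repo-id branch-name path-to-model.yaml model-alias"""
    entries: dict[str, dict] = {}
    for line in path.read_text().splitlines():
        if not line.strip():
            continue
        model_id, author_repo, branch, yaml_path, alias = line.split()
        entry = {"author_repo": author_repo, "branch": branch,
                 "yaml_path": yaml_path, "alias": alias}
        entries[model_id] = entry
        entries[alias] = entry          # alias resolves exactly like model_id
    return entries

if __name__ == "__main__":
    print(make_model_id("bartowski_Mixtral-8x22B-v0.1", "main",
                        "Mixtral-8x22B-v0.1-IQ3_M.gguf"))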

Updated model.yaml:

# BEGIN GENERAL GGUF METADATA
model: gemma-2-9b-it-Q8_0 # Model ID which is used for request construct - should be unique between models (author / quantization)
name: Llama 3.1      # metadata.general.name
version: 1           # metadata.version
sources:             # can be universal protocol (models://) OR absolute local file path (file://) OR https remote URL (https://)
  - models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
  - models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00002-of-00005.gguf
# END GENERAL GGUF METADATA

# BEGIN INFERENCE PARAMETERS
# BEGIN REQUIRED
stop:                # tokenizer.ggml.eos_token_id
  - <|end_of_text|>
  - <|eot_id|>
  - <|eom_id|>
# END REQUIRED
# BEGIN OPTIONAL
stream: true         # Default true?
top_p: 0.9           # Ranges: 0 to 1
temperature: 0.6     # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0  # Ranges: 0 to 1
max_tokens: 8192     # Should be default to context length
# END OPTIONAL
# END INFERENCE PARAMETERS

# BEGIN MODEL LOAD PARAMETERS
# BEGIN REQUIRED
prompt_template: |+  # tokenizer.chat_template
  <|begin_of_text|><|start_header_id|>system<|end_header_id|>

  {system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

  {prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
# END REQUIRED
# BEGIN OPTIONAL
ctx_len: 0          # llama.context_length | 0 or undefined = loaded from model
ngl: 33             # Undefined = loaded from model
# END OPTIONAL
# END MODEL LOAD PARAMETERS

Tasks:

  • Pull model (check model.list -> create folder -> download model -> create .yaml)
  • Model get (check model.list -> go to target folder -> read yaml -> return result)
  • Model list (check model.list -> go to target folder -> read yaml -> return result)
  • Start model/Run ()
  • model update
  • model delete
  • model alias (new command to set alias for a model id)
  • Update the gguf parser and yaml parser for the new model.yaml template, and add more inference params to support llama.cpp

I'll create subtasks corresponding to the above tasks. cc @dan-homebrew @0xSage @vansangpfiev @namchuai @louis-jan

freelerobot (Contributor) commented:

Questions / edge cases:

  1. How do we handle branch aliases? i.e. I've seen in some repos 7b also being 7b-gguf, or 7b-q4. Maybe not a concern at this scope? Maybe we just assume that branches are unique for now.
  2. I think model-alias will confuse users. Internally, do we intend to use model-alias interchangeably with model-id? In which case, does it make more sense for model-id to be a UUID, which users should never change, so that we are guaranteed uniqueness & persistence?

louis-menlo (Contributor) commented Sep 12, 2024

@dan-homebrew @nguyenhoangthuan99 I just read back over the comments. This is the one thing we should NOT do: hack paths together.

E.g. bartowski_Mixtral-8x22B-v0.1
Anyone can break the app by creating two repositories as below:

  1. bartowski_/Mixtral-8x22B-v0.1
  2. bartowski/_Mixtral-8x22B-v0.1

We introduced models.list precisely so we do NOT have to worry about nesting levels, e.g. when importing from other applications.

> For Huggingface repos, folder can just be a string (can it accomodate / in folder name?)

nguyenhoangthuan99 (Contributor) commented:

> Questions / edge cases:
>
>   1. How do we handle branch aliases? i.e. I've seen in some repos 7b also being 7b-gguf, or 7b-q4. Maybe not a concern at this scope? Maybe we just assume that branches are unique for now.
>   2. I think model-alias will confuse users. Internally, do we intend to use model-alias interchangeably with model-id? In with case, does it make more sense for model-id to be a uuid, which users should never change, that way we are guaranteed uniqueness & persistence.
  1. This feature should be added; I think we can do it after the model folder and model.yaml are stable.
  2. Currently, model_id is used to run a model: cortex-cpp run <model_id>. But when we want to support running models from many sources, with many different cases, the model_id has to be not only human-readable but also unique. It turns out that in some cases the model_id is too long, e.g. bartowski_Mixtral-8x22B-v0.1_Mixtral-8x22B-v0.1-IQ3_M, so we decided to add an alias command that lets the user make it shorter. But as Louis commented above, bartowski_Mixtral-8x22B-v0.1_Mixtral-8x22B-v0.1-IQ3_M still may not be unique.

I'm thinking of a solution like Docker's: when a container starts, the container ID is a UUID (just like the model_id Nicole recommended) and the name is randomly generated. The user can set a name for the container, but the name must be unique. Can we implement this? A sketch of the idea is below.
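
A hypothetical Python sketch of that Docker-style scheme: an immutable UUID model_id plus a random, unique, user-changeable name. The word lists, registry, and method names are purely illustrative.

# UUID model_id + random unique name, Docker-style.
import random
import uuid

ADJECTIVES = ["brave", "calm", "eager", "fuzzy", "quiet"]
NOUNS = ["llama", "mistral", "falcon", "gemma", "phi"]

class ModelRegistry:
    def __init__(self) -> None:
        self._by_id: dict[str, str] = {}      # model_id -> name
        self._names: set[str] = set()

    def register(self) -> tuple[str, str]:
        """Assign a permanent UUID model_id and a random unique name."""
        model_id = uuid.uuid4().hex
        while True:
            name = f"{random.choice(ADJECTIVES)}_{random.choice(NOUNS)}_{random.randint(0, 99)}"
            if name not in self._names:
                break
        self._by_id[model_id] = name
        self._names.add(name)
        return model_id, name

    def rename(self, model_id: str, new_name: str) -> None:
        """Users may change the name, but it must stay unique; the UUID never changes."""
        if new_name in self._names:
            raise ValueError(f"name already in use: {new_name}")
        self._names.discard(self._by_id[model_id])
        self._by_id[model_id] = new_name
        self._names.add(new_name)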

dan-menlo (Contributor, Author) commented Sep 12, 2024

@louis-jan Yeah, I think you are right.

I think a central problem is that our ability to pull from different sources leads to different model folder formats:

  • Cortex "Model Repo" format (tag based)
  • Huggingface GGUF models (multiple quantizations in a single repo)

We should bear in mind that Cortex's Built-in Model Library may be mirrored across several hosts in the future (e.g. not just huggingface).

This leads to a format more similar to @louis-jan's original proposal.

Or is there a more generalizable way to deal with this?

EDIT:

After giving it more thought, I think I can more clearly articulate that we are solving for two problems:

  • Huggingface repos, which have different conventions (e.g. GGUF, TensorRT-LLM, even base models)
  • Cortex Built-in Model Library format (loosely inspired by Ollama and Docker)

For Huggingface:

  • We should use a folder structure that matches their URL format
  • We should try to store files as similarly to theirs as possible (I take back my earlier idea to store quantizations in different folders)
  • Given that there might be multiple model quants in the same folder, the model.yaml should match the quant filename
  • In this case, there will be two entries in model.list

For Cortex Model Repo

  • We should have a Docker-style tag format, represented by folders
  • We will curate models in our own format
  • In the future, there may be other repos that adopt this standard, which can then be registered via root URL
/models
    model.list
    
    # Huggingface GGUF Model Folder format
    # This assumes we have some sort of quantization selection wizard when downloading?
    /huggingface.co
        /bartowski
            /Mixtral-8x22b-v0.1-gguf (includes all quants)
                Mixtral-8x22B-v0.1-IQ3_M.yaml
                Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf...
                Mixtral-8x22B-v0.1-Q8_M.yaml
                Mixtral-8x22B-v0.1-Q8_M-00001-of-00005.gguf...

    # Built-in Library Model Folder format
    /cortex.so (this is our Built-in Model Library, based on Git, that will be mirrored across a few sites)
        /llama3.1 (model)
            /q4-tensorrt-llm (tag)
                ...engine_files
                model.yaml
            /q8-gguf **(tag)**
                model.yaml

    # Future Model Source
    # Has its own model folder format

dan-menlo (Contributor, Author) commented:

@louis-jan @nguyenhoangthuan99 Additionally, for model.yaml, how do we intend to generate the model ID?

  • Is there a way we use the tag name as the model ID?
  • e.g. for Chat Completions, it is routed to model: llama3.1:7b
# BEGIN GENERAL GGUF METADATA
model: gemma-2-9b-it-Q8_0 # Model ID which is used for request construct - should be unique between models (author / quantization)
name: Llama 3.1      # metadata.general.name
version: 1           # metadata.version
sources:             # can be universal protocol (models://) OR absolute local file path (file://) OR https remote URL (https://)
  - models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00001-of-00005.gguf
  - models://huggingface/bartowski/Mixtral-8x22B-v0.1/main/Mixtral-8x22B-v0.1-IQ3_M-00002-of-00005.gguf
# END GENERAL GGUF METADATA

# BEGIN INFERENCE PARAMETERS
# BEGIN REQUIRED
stop:                # tokenizer.ggml.eos_token_id
  - <|end_of_text|>
  - <|eot_id|>
  - <|eom_id|>
# END REQUIRED
# BEGIN OPTIONAL
stream: true         # Default true?
top_p: 0.9           # Ranges: 0 to 1
temperature: 0.6     # Ranges: 0 to 1
frequency_penalty: 0 # Ranges: 0 to 1
presence_penalty: 0  # Ranges: 0 to 1
max_tokens: 8192     # Should be default to context length
# END OPTIONAL
# END INFERENCE PARAMETERS

# BEGIN MODEL LOAD PARAMETERS
# BEGIN REQUIRED
prompt_template: |+  # tokenizer.chat_template
  <|begin_of_text|><|start_header_id|>system<|end_header_id|>

  {system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

  {prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
# END REQUIRED
# BEGIN OPTIONAL
ctx_len: 0          # llama.context_length | 0 or undefined = loaded from model
ngl: 33             # Undefined = loaded from model
# END OPTIONAL
# END MODEL LOAD PARAMETERS

dan-menlo (Contributor, Author) commented:

@nguyenhoangthuan99 I am shifting this to @vansangpfiev and tracking the Tasklist items, just to keep a big-picture sitrep of progress.

@dan-menlo dan-menlo reopened this Sep 27, 2024
@github-project-automation github-project-automation bot moved this from QA to In Progress in Menlo Sep 27, 2024
@gabrielle-ong gabrielle-ong added this to the v1.0.0 milestone Oct 3, 2024
@gabrielle-ong gabrielle-ong moved this from Review + QA to Completed in Menlo Oct 3, 2024
@github-project-automation github-project-automation bot moved this from Completed to Review + QA in Menlo Oct 3, 2024
@gabrielle-ong gabrielle-ong moved this from Review + QA to Completed in Menlo Oct 3, 2024