
improve model abstraction and registry #20

Open
jjleng opened this issue Apr 10, 2024 · 0 comments · May be fixed by #21

jjleng commented Apr 10, 2024

Background

Currently, Paka only supports single-file models and assumes that the runtime engine is llama.cpp. This limits Paka's extensibility.

Goals

  • Improve the model abstraction/metadata so that Paka can retrieve models from various sources, such as Hugging Face, an HTTP service, S3, etc.
  • Harden manifest.yaml, which is consumed by the runtime engines

High-level design

  • Models from the Hugging Face source

On the client side, models are defined as below:
huggingface_source {
  repo_id: "TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ",
  files: ["*.json", "model.safetensors"],
  inference_devices: ["cpu"],  // gpu, tpu, etc.
  quantization: "GPTQ",        // GPTQ, AWQ, GGUF_Q4_0, etc.
  runtime: "llama.cpp",        // vLLM, pytorch, etc.
  prompt_template: "chatml",   // chatml, llama-2, gemma, etc.
}

For now, we can define a list of models (a model registry) in Paka. In the future, users will be able to upload their own models to S3 from the command line, e.g. paka upload_model --hf-repo=TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ --inference-devices=cpu --quantization=GPTQ --runtime=llama.cpp --prompt-template=chatml

With the above definition, we can download the model from Hugging Face and save it to S3 (for the AWS cloud). When downloading from Hugging Face, we need a list of whitelisted files. The sha256 of each file can be retrieved from the Hugging Face API and should be compared against the downloaded content. Instead of saving the files to disk and then uploading them to S3, we could stream each file directly from Hugging Face to S3, as sketched below.
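
A minimal sketch of the streaming upload, assuming the huggingface_hub, requests, and boto3 libraries are available. The function name, the models/{repo_id}/{filename} key layout, and the checksum handling are illustrative, not part of Paka today.

import hashlib

import boto3
import requests
from huggingface_hub import hf_hub_url


class HashingStream:
    """File-like wrapper that computes sha256 while boto3 reads from it."""

    def __init__(self, raw):
        self.raw = raw
        self.sha256 = hashlib.sha256()

    def read(self, size=-1):
        chunk = self.raw.read(size)
        self.sha256.update(chunk)
        return chunk


def stream_to_s3(repo_id: str, filename: str, bucket: str, expected_sha256: str) -> None:
    # Resolve the download URL for a single whitelisted file.
    url = hf_hub_url(repo_id=repo_id, filename=filename)
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        stream = HashingStream(resp.raw)
        # upload_fileobj streams in multipart chunks, so the file is never
        # fully buffered on disk or in memory.
        boto3.client("s3").upload_fileobj(stream, bucket, f"models/{repo_id}/{filename}")
    if stream.sha256.hexdigest() != expected_sha256:
        # A real implementation would also delete the uploaded object here.
        raise ValueError(f"sha256 mismatch for {filename}")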

When a model is saved to S3, we should save the model metadata to the S3 bucket as well. This metadata should include the runtime, quantization, file names and their sha256 checksums, etc. This means we need to improve the definition of manifest.yaml in the current implementation; a rough sketch follows.
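
A hardened manifest.yaml stored next to the model files might look roughly like this. Field names and checksum values are illustrative, not a final schema:

# Sketch of a hardened manifest.yaml; field names and checksums are illustrative.
name: CapybaraHermes-2.5-Mistral-7B-GPTQ
runtime: llama.cpp
quantization: GPTQ
prompt_template: chatml
inference_devices:
  - cpu
files:
  - name: model.safetensors
    sha256: "9b2c1e..."  # hypothetical checksum
  - name: config.json
    sha256: "4f81aa..."  # hypothetical checksum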

  • Models from the HTTP source

The same principles apply to other sources as well, such as HTTP, S3, etc. An HTTP model source could be defined as:

http_source {
  urls: ["https://thebloke.github.io/CapybaraHermes-2.5-Mistral-7B-GPTQ/model.safetensors", ...],
  inference_devices: ["cpu"],  // gpu, tpu, etc.
  quantization: "GPTQ",        // GPTQ, AWQ, GGUF_Q4_0, etc.
  runtime: "llama.cpp",        // vLLM, pytorch, etc.
  ...
}
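
On the client side, both source types could share the common runtime metadata through a small class hierarchy, with the registry as a plain list of source definitions. The sketch below is one possible shape; all class and field names are illustrative, not the final Paka API.

from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelSource:
    """Metadata shared by every model source."""

    inference_devices: List[str] = field(default_factory=lambda: ["cpu"])  # gpu, tpu, etc.
    quantization: str = "GPTQ"       # GPTQ, AWQ, GGUF_Q4_0, etc.
    runtime: str = "llama.cpp"       # vLLM, pytorch, etc.
    prompt_template: str = "chatml"  # chatml, llama-2, gemma, etc.


@dataclass
class HuggingFaceSource(ModelSource):
    repo_id: str = ""
    files: List[str] = field(default_factory=list)  # whitelisted file patterns


@dataclass
class HttpSource(ModelSource):
    urls: List[str] = field(default_factory=list)


# The built-in model registry is then just a list of source definitions.
MODEL_REGISTRY = [
    HuggingFaceSource(
        repo_id="TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ",
        files=["*.json", "model.safetensors"],
    ),
]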

simple-easydev linked a pull request Apr 11, 2024 that will close this issue