Background
Currently, Paka supports only single-file models and assumes the runtime engine is llama.cpp. This limits Paka's extensibility.
Goals
Improve the model abstraction/metadata so that Paka can retrieve models from various sources, such as Hugging Face, HTTP services, S3, etc.
Harden manifest.yaml, which is consumed by the runtime engines.
High-level design
Models from the Hugging Face source
On the client side, models are defined as follows:
huggingface_source {
  repo_id: "TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ",
  files: ["*.json", "model.safetensors"],
  inference_devices: ["cpu"], // gpu, tpu, etc.
  quantization: "GPTQ", // GPTQ, AWQ, GGUF_Q4_0, etc.
  runtime: "llama.cpp", // vLLM, pytorch, etc.
  prompt_template: "chatml", // chatml, llama-2, gemma, etc.
}
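On the Paka side, this definition could be modeled as a small data class. A minimal sketch in Python (HuggingFaceSource, its fields, and its defaults are illustrative, not Paka's actual types):

from dataclasses import dataclass, field
from typing import List

@dataclass
class HuggingFaceSource:
    """Hypothetical client-side model definition for a Hugging Face source."""
    repo_id: str
    files: List[str]  # glob patterns of files to fetch from the repo
    inference_devices: List[str] = field(default_factory=lambda: ["cpu"])
    quantization: str = "GPTQ"       # e.g. GPTQ, AWQ, GGUF_Q4_0
    runtime: str = "llama.cpp"       # e.g. llama.cpp, vLLM, pytorch
    prompt_template: str = "chatml"  # e.g. chatml, llama-2, gemma

source = HuggingFaceSource(
    repo_id="TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ",
    files=["*.json", "model.safetensors"],
)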
For now, we can define a list of models (a model registry) in Paka. In the future, users will be able to upload their own models to S3 from the command line, for example:
paka upload_model --hf-repo=TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ --inference-devices=cpu --quantization=GPTQ --runtime=llama.cpp --prompt-template=chatml
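A sketch of how such a subcommand could be wired up, using argparse purely for illustration (Paka's actual CLI framework may differ):

import argparse

parser = argparse.ArgumentParser(prog="paka")
subcommands = parser.add_subparsers(dest="command", required=True)

upload = subcommands.add_parser("upload_model", help="Upload a model to S3")
upload.add_argument("--hf-repo", required=True, help="Hugging Face repo id")
upload.add_argument("--inference-devices", default="cpu", help="Comma-separated device list")
upload.add_argument("--quantization", help="e.g. GPTQ, AWQ, GGUF_Q4_0")
upload.add_argument("--runtime", default="llama.cpp", help="e.g. llama.cpp, vLLM, pytorch")
upload.add_argument("--prompt-template", default="chatml", help="e.g. chatml, llama-2, gemma")

args = parser.parse_args()
# argparse maps --hf-repo to args.hf_repo, --inference-devices to
# args.inference_devices, and so on.
print(args.hf_repo, args.inference_devices.split(","))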
With the above definition, we can download the model from Hugging Face and save it to S3 (for the AWS cloud). When downloading from Hugging Face, we need a whitelist of files to fetch. Each file's sha256 can be retrieved from the Hugging Face API and should be compared against the downloaded content. Instead of saving the files to disk and then uploading them to S3, we can stream each file directly from Hugging Face to S3.
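A minimal sketch of that streaming path in Python, assuming the requests, boto3, and huggingface_hub libraries; the bucket layout and helper names here are our own, not Paka's:

import hashlib

import boto3
import requests
from huggingface_hub import hf_hub_url

class HashingStream:
    """File-like wrapper that computes a sha256 digest as the body is read."""

    def __init__(self, raw):
        self.raw = raw
        self.sha256 = hashlib.sha256()

    def read(self, size=-1):
        chunk = self.raw.read(size)
        self.sha256.update(chunk)
        return chunk

def stream_to_s3(repo_id: str, filename: str, bucket: str, expected_sha256: str) -> None:
    # expected_sha256 would come from the Hub API, e.g. via
    # HfApi().model_info(repo_id, files_metadata=True).
    url = hf_hub_url(repo_id=repo_id, filename=filename)
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    stream = HashingStream(resp.raw)
    # upload_fileobj reads from the wrapper in chunks, so the file is never
    # written to local disk.
    boto3.client("s3").upload_fileobj(stream, bucket, f"models/{repo_id}/{filename}")
    if stream.sha256.hexdigest() != expected_sha256:
        # In real code we would also delete the just-uploaded object here.
        raise ValueError(f"sha256 mismatch for {filename}")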
When a model is saved to S3, we should save its metadata to the S3 bucket as well. This metadata should include the runtime, quantization, file names and their sha256 digests, etc. This means improving the definition of manifest.yaml in the current implementation.
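For example, a hardened manifest.yaml for the model above might look like the following (the field set is a sketch, not a finalized schema):

# manifest.yaml (illustrative field names)
name: CapybaraHermes-2.5-Mistral-7B-GPTQ
source: huggingface
repo_id: TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ
runtime: llama.cpp
quantization: GPTQ
prompt_template: chatml
inference_devices:
  - cpu
files:
  - name: model.safetensors
    sha256: <sha256 of model.safetensors>
  - name: config.json
    sha256: <sha256 of config.json>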
Models from the HTTP source
The above principles apply to other sources as well, such as HTTP and S3. For an HTTP source, the client-side definition looks like:
http_source {
  urls: ["https://thebloke.github.io/CapybaraHermes-2.5-Mistral-7B-GPTQ/model.safetensors", ...],
  inference_devices: ["cpu"], // gpu, tpu, etc.
  quantization: "GPTQ", // GPTQ, AWQ, GGUF_Q4_0, etc.
  runtime: "llama.cpp", // vLLM, pytorch, etc.
  ...
}