Background
Currently, Paka supports only single-file models and assumes the runtime engine is llama.cpp. This limits Paka's extensibility.
Goals
Improve the model abstraction/metadata so that Paka can retrieve models from various sources, such as Hugging Face, HTTP services, S3, etc.
Harden manifest.yaml, which is consumed by the runtime engines.
High-level design
Models from the Hugging Face source
On the client side, models are defined as follows:
huggingface_source {
  repo_id: "TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ",
  files: ["*.json", "model.safetensors"],
  inference_devices: ["cpu"], // gpu, tpu, etc.
  quantization: "GPTQ", // GPTQ, AWQ, GGUF_Q4_0, etc.
  runtime: "llama.cpp", // vLLM, pytorch, etc.
  prompt_template: "chatml", // chatml, llama-2, gemma, etc.
}
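On the Paka side, this definition could be modeled as a small data class. A minimal sketch in Python (HuggingFaceSource, its fields, and its defaults are illustrative, not Paka's actual types):

from dataclasses import dataclass, field
from typing import List

@dataclass
class HuggingFaceSource:
    """Hypothetical client-side model definition for a Hugging Face source."""
    repo_id: str
    files: List[str]  # glob patterns of files to fetch from the repo
    inference_devices: List[str] = field(default_factory=lambda: ["cpu"])
    quantization: str = "GPTQ"       # e.g. GPTQ, AWQ, GGUF_Q4_0
    runtime: str = "llama.cpp"       # e.g. llama.cpp, vLLM, pytorch
    prompt_template: str = "chatml"  # e.g. chatml, llama-2, gemma

source = HuggingFaceSource(
    repo_id="TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ",
    files=["*.json", "model.safetensors"],
)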
For now, we can define a list of models (a model registry) in Paka. In the future, users will be able to upload their own models to S3 from the command line, for example:
paka upload_model --hf-repo=TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ --inference-devices=cpu --quantization=GPTQ --runtime=llama.cpp --prompt-template=chatml
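A sketch of how such a subcommand could be wired up, using argparse purely for illustration (Paka's actual CLI framework may differ):

import argparse

parser = argparse.ArgumentParser(prog="paka")
subcommands = parser.add_subparsers(dest="command", required=True)

upload = subcommands.add_parser("upload_model", help="Upload a model to S3")
upload.add_argument("--hf-repo", required=True, help="Hugging Face repo id")
upload.add_argument("--inference-devices", default="cpu", help="Comma-separated device list")
upload.add_argument("--quantization", help="e.g. GPTQ, AWQ, GGUF_Q4_0")
upload.add_argument("--runtime", default="llama.cpp", help="e.g. llama.cpp, vLLM, pytorch")
upload.add_argument("--prompt-template", default="chatml", help="e.g. chatml, llama-2, gemma")

args = parser.parse_args()
# argparse maps --hf-repo to args.hf_repo, --inference-devices to
# args.inference_devices, and so on.
print(args.hf_repo, args.inference_devices.split(","))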
With the above definition, we can download the model from Hugging Face and save it to S3 (for the AWS cloud). When downloading from Hugging Face, we need a whitelist of files to fetch. Each file's sha256 can be retrieved from the Hugging Face API and should be compared against the downloaded content. Instead of saving the files to disk and then uploading them to S3, we can stream each file directly from Hugging Face to S3.
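A minimal sketch of that streaming path in Python, assuming the requests, boto3, and huggingface_hub libraries; the bucket layout and helper names here are our own, not Paka's:

import hashlib

import boto3
import requests
from huggingface_hub import hf_hub_url

class HashingStream:
    """File-like wrapper that computes a sha256 digest as the body is read."""

    def __init__(self, raw):
        self.raw = raw
        self.sha256 = hashlib.sha256()

    def read(self, size=-1):
        chunk = self.raw.read(size)
        self.sha256.update(chunk)
        return chunk

def stream_to_s3(repo_id: str, filename: str, bucket: str, expected_sha256: str) -> None:
    # expected_sha256 would come from the Hub API, e.g. via
    # HfApi().model_info(repo_id, files_metadata=True).
    url = hf_hub_url(repo_id=repo_id, filename=filename)
    resp = requests.get(url, stream=True)
    resp.raise_for_status()
    stream = HashingStream(resp.raw)
    # upload_fileobj reads from the wrapper in chunks, so the file is never
    # written to local disk.
    boto3.client("s3").upload_fileobj(stream, bucket, f"models/{repo_id}/{filename}")
    if stream.sha256.hexdigest() != expected_sha256:
        # In real code we would also delete the just-uploaded object here.
        raise ValueError(f"sha256 mismatch for {filename}")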
When a model is saved to S3, we should save its metadata to the S3 bucket as well. This metadata should include the runtime, quantization, file names and their sha256 digests, etc. This means improving the definition of manifest.yaml in the current implementation.
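For example, a hardened manifest.yaml for the model above might look like the following (the field set is a sketch, not a finalized schema):

# manifest.yaml (illustrative field names)
name: CapybaraHermes-2.5-Mistral-7B-GPTQ
source: huggingface
repo_id: TheBloke/CapybaraHermes-2.5-Mistral-7B-GPTQ
runtime: llama.cpp
quantization: GPTQ
prompt_template: chatml
inference_devices:
  - cpu
files:
  - name: model.safetensors
    sha256: <sha256 of model.safetensors>
  - name: config.json
    sha256: <sha256 of config.json>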
Models from the HTTP source
The above principles apply to other sources as well, such as HTTP and S3. For an HTTP source, the client-side definition looks like:
http_source {
  urls: ["https://thebloke.github.io/CapybaraHermes-2.5-Mistral-7B-GPTQ/model.safetensors", ...],
  inference_devices: ["cpu"], // gpu, tpu, etc.
  quantization: "GPTQ", // GPTQ, AWQ, GGUF_Q4_0, etc.
  runtime: "llama.cpp", // vLLM, pytorch, etc.
  ...
}