Please describe the feature you want
The Tabby model spec (https://github.com/TabbyML/tabby/blob/main/MODEL_SPEC.md) says it supports only the .gguf files consumed by the llama.cpp inference engine.
It would be good to support other inference engines that are faster on GPU, such as exllama[v2].
Self-hosting an LLM usually means running an HTTP- or WebSocket-based backend built on one of these engines (besides llama.cpp).
Is there currently a way to serve Tabby with such an engine? I checked the source, and it seems only "vertex-ai" and "fastchat" are supported for talking to an external API.
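For context, here is a minimal sketch (not Tabby code) of the kind of request such a backend serves, assuming the engine exposes an OpenAI-compatible /v1/completions endpoint, as many exllamav2-based servers and similar frontends can; the base URL and model name below are placeholder assumptions:

```python
# Minimal sketch: query an OpenAI-compatible /v1/completions endpoint that a
# GPU inference backend (an exllamav2-based server, etc.) might expose.
# The base URL and model name are placeholders, not values taken from Tabby.
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumed address of the inference backend

payload = {
    "model": "deepseek-coder-6.7b",  # whatever model the backend has loaded
    "prompt": "def fibonacci(n):",
    "max_tokens": 64,
    "temperature": 0.2,
}

req = urllib.request.Request(
    f"{BASE_URL}/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

# OpenAI-style responses put the generated text under choices[0]["text"].
print(body["choices"][0]["text"])
```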
Additional context
Add any other context or screenshots about the feature request here.
Please reply with a 👍 if you want this feature.

https://github.com/mudler/LocalAI would also be a very useful inference backend. It supports a large number of open-source LLMs, is compatible with the OpenAI API, can switch models dynamically, can be configured for CPU, GPU, or mixed execution, has official Docker images, and is MIT-licensed.
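Since LocalAI follows the OpenAI API, a client can list the available models and target any of them per request. A minimal sketch, assuming a LocalAI instance on its commonly used default of http://localhost:8080 and whatever models that instance happens to serve (both are assumptions, not values from this issue):

```python
# Rough sketch: list the models a LocalAI instance exposes and pick one per
# request, relying only on its OpenAI-compatible endpoints. Host, port, and
# model names are assumptions; adjust them to your deployment.
import json
import urllib.request

LOCALAI_URL = "http://localhost:8080"  # LocalAI's commonly used default port

# GET /v1/models returns {"data": [{"id": "<model-name>", ...}, ...]}
with urllib.request.urlopen(f"{LOCALAI_URL}/v1/models") as resp:
    models = [m["id"] for m in json.load(resp)["data"]]
print("available models:", models)

# Switching models is just a matter of naming a different one per request.
payload = {
    "model": models[0] if models else "some-model",  # placeholder fallback
    "messages": [{"role": "user", "content": "Complete: def quicksort(arr):"}],
    "temperature": 0.2,
}
req = urllib.request.Request(
    f"{LOCALAI_URL}/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```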