vLLM compatibility? #795
Comments
We explored this direction but ultimately decided against pursuing it. The main reason is that most OpenAI-like solutions offer no control over individual decoding steps, which is precisely where code-completion-specific optimizations apply. For instance, handling a long list of stop words or applying grammar constraints at each decoding step becomes challenging. Revisiting this approach might be viable if a decoding-step-level API becomes widely adopted in the future.
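To make the "decoding-step-level API" point concrete, here is a purely hypothetical sketch, in the spirit of Tabby's Rust codebase, of the kind of per-token hook that an OpenAI-style HTTP interface does not expose. None of these names exist in Tabby or in any OpenAI-like API; this only illustrates the shape of the missing capability.

```rust
// Purely illustrative: a hypothetical decoding-step-level interface.
// None of these names exist in Tabby or in any OpenAI-style API.
pub enum StepAction {
    Continue, // keep decoding
    Stop,     // a stop sequence matched; end generation
}

pub trait DecodingStepHook {
    /// Called once per generated token, before it is committed.
    fn on_token(&mut self, text_so_far: &str, next_token: &str) -> StepAction;

    /// Restrict which token ids are allowed at the next step
    /// (e.g. to enforce a grammar). Returning None means "no constraint".
    fn allowed_tokens(&self, text_so_far: &str) -> Option<Vec<u32>>;
}

/// A hook that stops on any of a list of stop sequences.
pub struct StopWords {
    stops: Vec<String>,
}

impl DecodingStepHook for StopWords {
    fn on_token(&mut self, text_so_far: &str, next_token: &str) -> StepAction {
        let candidate = format!("{text_so_far}{next_token}");
        if self.stops.iter().any(|s| candidate.ends_with(s.as_str())) {
            StepAction::Stop
        } else {
            StepAction::Continue
        }
    }

    fn allowed_tokens(&self, _text_so_far: &str) -> Option<Vec<u32>> {
        None // no grammar constraint in this example
    }
}
```

An OpenAI-style completion endpoint only accepts a finished request (prompt, stop list, sampling parameters) and returns a finished response, so hooks like these have nowhere to run.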
Do you think it's so hard and complex that it isn't even worth formulating as a protocol? Something like a "Tabby Inference Protocol" that an inference engine either supports or doesn't, which would determine compatibility? If you pick a protocol that suits your code, it shouldn't be hard to support, and in the long run it could benefit Tabby's own codebase as well.
:) It's not so much about complexity as it is about capability. With an OpenAI-like interface, we relinquish access to intermediate decoding steps, so many optimizations the current Tabby relies on cannot be easily implemented; a lengthy stop-word list is one example.
I see no problem implementing even a very long stop-word dictionary in my setup, including a long list of long stop words with efficient matching and lookup; I've been through "a lot" with open models 😂
I could even support extended methods, like forcing the response to be JSON, etc.
Thank you for the explanations.
Continuing from #854: Tabby currently supports llama.cpp bindings and HTTP bindings to Vertex AI and FastChat. Are there plans to support other bindings, such as OpenAI endpoints over HTTP or a similar protocol? Thanks for the responses.
Hey @sundaraa-deshaw, #795 (comment) explains why we don't want an OpenAI-like HTTP interface.
Thanks. I was wondering if we could have a binding to the exllama[v2] inference engine, similar to how it's done for llama.cpp today?
That's possible; the trait is defined at https://github.com/TabbyML/tabby/blob/main/crates/tabby-inference/src/lib.rs. Could you share some of your findings on where exllama has an advantage over llama.cpp?
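As a rough illustration of what such a binding involves, here is a hypothetical sketch. The real trait and its exact signature live in the linked lib.rs and may differ, and the exllama endpoint and request format below are assumptions, not an actual API.

```rust
// Hypothetical sketch only: the real trait lives in crates/tabby-inference/src/lib.rs
// and its name/signature may differ. The exllama endpoint and JSON shape are assumed.
use async_trait::async_trait;

#[async_trait]
pub trait TextGeneration {
    async fn generate(&self, prompt: &str, max_tokens: usize) -> String;
}

/// A hypothetical backend that forwards requests to an exllama(v2) HTTP server.
pub struct ExllamaBackend {
    endpoint: String, // e.g. "http://localhost:5000/generate" (assumed)
    client: reqwest::Client,
}

#[async_trait]
impl TextGeneration for ExllamaBackend {
    async fn generate(&self, prompt: &str, max_tokens: usize) -> String {
        // Request/response format is made up for illustration; adapt it to
        // whatever API your exllama deployment actually exposes.
        let body = serde_json::json!({ "prompt": prompt, "max_new_tokens": max_tokens });
        match self.client.post(&self.endpoint).json(&body).send().await {
            Ok(resp) => resp.text().await.unwrap_or_default(),
            Err(_) => String::new(),
        }
    }
}
```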
Thanks, are there plans to add such a binding? exllama turned out to be good for GPU inference compared to llama.cpp on CPU, and memory usage for a GPTQ-quantized model was 2-3x lower than running the non-quantized model (Llama 13B) with llama.cpp on GPU.
Since Tabby seems to support FastChat, would it be possible to support Ollama HTTP bindings? Ollama already has a decent list of integrations and also uses llama.cpp under the hood.
FastChat isn't actually supported; it was part of the exploration mentioned in an earlier reply and was eventually abandoned for the reasons discussed above (lack of control during decoding).
@wsxiaoys Thanks, and sorry: I was misled by the fastchat.rs file in the repo and thought it meant FastChat was supported somehow.
No problem; it's not compiled into Tabby by default (it sits behind a feature flag) and is left as a reference.
Hi, I followed the discussion but couldn't quite figure out what it means for me. I have CodeLlama running in the cloud and want to connect Tabby to it. Is there a way to do so, or do I have to run the Tabby server on a local GPU/CPU?
Hey @MehrCurry, the short answer is no: Tabby comes with its own inference stack. You could instead deploy Tabby onto a cloud GPU (we have several tutorials on this, e.g. https://tabby.tabbyml.com/docs/installation/hugging-face/).
@wsxiaoys it is possible to modify the stop words for each model file in Ollama. I'm only just learning about stop words now and have only a surface-level understanding of the TabbyML inference stack, so I'm not suggesting that Ollama's configuration is feature-complete enough to plug into it. But it might be?
It looks like it's also possible to modify stop words on the fly via the Ollama API rather than just in the model files.
It also looks like GBNF grammar support is in the works in Ollama. Are there any other dealbreakers beyond grammar and stop-word support?
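For reference, a minimal sketch of setting stop words per request through Ollama's /api/generate endpoint. It assumes a local Ollama server on its default port and an already-pulled model; the model name and stop list are examples only.

```rust
// Minimal sketch, assuming a local Ollama server on its default port (11434)
// and an already-pulled model named "codellama:7b-code" (example only).
// Ollama's /api/generate accepts an "options" object that includes "stop".
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let body = json!({
        "model": "codellama:7b-code",
        "prompt": "def fib(n):",
        "stream": false,
        "options": {
            // Per-request stop sequences, overriding the Modelfile defaults.
            "stop": ["\ndef ", "\nclass ", "\n#"]
        }
    });

    let resp = reqwest::Client::new()
        .post("http://localhost:11434/api/generate")
        .json(&body)
        .send()
        .await?
        .text()
        .await?;

    println!("{resp}");
    Ok(())
}
```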
If I understand correctly, Ollama is essentially just a wrapper around llama.cpp's server API, which in turn relies on llama.cpp's stop-word implementation. As far as I know, that check runs in O(N) time at every decoding step, where N is the number of stop words. Feel free to give it a try and see how decoding performs with a stop-sequence list of around 20 entries, like the one sketched below. (Hint: it will be slow, as is any implementation that supports a dynamic stop-word list.)
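The list below is an assumed, illustrative code-completion stop list of roughly that size (not the exact list from the comment above), together with the naive per-token scan that makes the cost O(N) at each decoding step.

```rust
// Illustrative only: an assumed code-completion stop list of ~20 entries
// and the naive check that a generic server performs after every decoded token.
fn stop_sequences() -> Vec<&'static str> {
    vec![
        "\nfn ", "\ndef ", "\nclass ", "\nstruct ", "\nimpl ", "\nenum ",
        "\ntrait ", "\nmod ", "\n#", "\n//", "\n/*", "\n\"\"\"",
        "\nif ", "\nfor ", "\nwhile ", "\nreturn", "\n}", "\n\n\n",
        "<|endoftext|>", "</s>",
    ]
}

/// Runs once per decoding step: scans every stop sequence against the
/// tail of the generated text, so each step costs O(N * max_stop_len).
fn hit_stop(generated: &str, stops: &[&str]) -> bool {
    stops.iter().any(|s| generated.ends_with(s))
}

fn main() {
    let stops = stop_sequences();
    let generated = "fn add(a: i32, b: i32) -> i32 {\n    a + b\n}";
    println!("stop hit: {}", hit_stop(generated, &stops));
}
```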
This issue was moved to a discussion; you can continue the conversation there.
Hi,
I currently use vLLM for other services, and I am very interested in connecting your extension to a vLLM server. Do you think that's possible? I'm using the OpenAI API format, so if there were a way to connect the extension to any server that exposes the OpenAI API, that would be great.
Thanks for the awesome work!
Please reply with a 👍 if you want this feature.
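For context on what is being requested: vLLM can serve an OpenAI-compatible API, and a completion request against it looks roughly like the sketch below. The port and model name are assumptions to be adjusted for your deployment; this is the kind of interface Tabby would have to target if it adopted an OpenAI-style binding.

```rust
// Sketch only: an OpenAI-style completion request against a vLLM server
// started with its OpenAI-compatible entrypoint. The port (8000) and model
// name are assumptions; adjust them to your deployment.
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let body = json!({
        "model": "codellama/CodeLlama-7b-hf", // example model id
        "prompt": "def fibonacci(n):",
        "max_tokens": 64,
        "temperature": 0.1,
        // Stop sequences are passed up front; there is no hook into
        // individual decoding steps, which is the limitation discussed above.
        "stop": ["\ndef ", "\nclass "]
    });

    let resp = reqwest::Client::new()
        .post("http://localhost:8000/v1/completions")
        .json(&body)
        .send()
        .await?
        .text()
        .await?;

    println!("{resp}");
    Ok(())
}
```

As discussed earlier in the thread, everything here is configured per request; there is no per-step hook, which is why the maintainers prefer their own inference stack.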