
vLLM compatibility ? #795

Closed
Naatyu opened this issue Nov 15, 2023 · 21 comments
Labels
enhancement New feature or request

Comments

@Naatyu

Naatyu commented Nov 15, 2023

Hi,

I currently use vLLM for other services, and I am deeply interested in connecting your extension with a vLLM server. Do you think it's possible? I'm using the OpenAI API format, so if there is a possibility to connect the extension to any server using the OpenAI API, that would be great.

Thanks for the awesome work!


Please reply with a 👍 if you want this feature.

@Naatyu Naatyu added the enhancement New feature or request label Nov 15, 2023
@wsxiaoys
Member

We explored this direction but ultimately decided against pursuing it. The decision was driven by the fact that most OpenAI-like solutions offer no control over individual decoding steps, a crucial capability for code-completion-specific optimizations. For instance, handling long lists of stop words and applying grammar constraints at each decoding step becomes challenging.

Revisiting this approach might be viable if a decoding step-level API becomes widely adopted in the future.
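To make the distinction concrete, here is a minimal sketch of the two interface shapes (all names are hypothetical, not an actual Tabby or OpenAI API):

```rust
// An OpenAI-style interface only returns the final text; stop handling and
// grammar constraints must happen server-side, out of the caller's control.
trait CompletionApi {
    fn complete(&self, prompt: &str, stop: &[String]) -> String;
}

// A decoding step-level interface exposes every step, so the caller can
// apply stop-word matching or grammar constraints as tokens are sampled.
trait StepwiseDecoder {
    /// Sample the next token id, or None at end-of-sequence.
    fn next_token(&mut self) -> Option<u32>;
    /// Restrict which token ids are permitted at the next step
    /// (e.g. driven by a grammar).
    fn constrain(&mut self, allowed: &dyn Fn(u32) -> bool);
}
```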

@d0rc

d0rc commented Nov 16, 2023

Do you think it's so hard and complex that it can't even be formulated as a protocol? Something like a "Tabby Inference Protocol" that an inference engine either supports or is considered incompatible with. If you pick a protocol that works for your code, it shouldn't be hard for engines to support, and in the long run it may even benefit Tabby's codebase.

@wsxiaoys
Member

:) It's not so much about complexity as it is about capability. With an interface like OpenAI's, we relinquish access to intermediate decoding steps. Many optimizations that the current Tabby relies on cannot easily be implemented on top of it, for instance a lengthy stop-word list.

@d0rc

d0rc commented Nov 17, 2023

I see no problem implementing even a very long stop-word dictionary in my setup, including long lists of long stop words with efficient matching/lookup. I've been through "a lot" with open models 😂
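For illustration, a minimal sketch of incremental stop-word matching of the kind described here (names hypothetical; a production version would use a multi-pattern automaton such as Aho-Corasick so the per-step cost is independent of the list size):

```rust
/// Minimal sketch: after each decoded piece of text, check whether any stop
/// word is now a suffix of the output. This is O(number of stop words) per
/// step, like the llama.cpp implementation discussed later in this thread.
struct StopMatcher {
    stops: Vec<String>,
    max_len: usize, // length of the longest stop word
    tail: String,   // rolling window over the last `max_len` bytes
}

impl StopMatcher {
    fn new(stops: Vec<String>) -> Self {
        let max_len = stops.iter().map(|s| s.len()).max().unwrap_or(0);
        Self { stops, max_len, tail: String::new() }
    }

    /// Feed newly decoded text; returns true if generation should stop.
    /// (Simplified: assumes ASCII and checks only at piece boundaries.)
    fn push(&mut self, piece: &str) -> bool {
        self.tail.push_str(piece);
        if self.tail.len() > self.max_len {
            let cut = self.tail.len() - self.max_len;
            self.tail.drain(..cut);
        }
        self.stops.iter().any(|s| self.tail.ends_with(s))
    }
}
```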

@d0rc

d0rc commented Nov 17, 2023

I can even provide support for extended methods, like forcing the response to be JSON, etc.

@Naatyu
Author

Naatyu commented Nov 17, 2023

> We explored this direction but ultimately decided against pursuing it. The decision was driven by the fact that most OpenAI-like solutions offer no control over individual decoding steps, a crucial capability for code-completion-specific optimizations. For instance, handling long lists of stop words and applying grammar constraints at each decoding step becomes challenging.
>
> Revisiting this approach might be viable if a decoding step-level API becomes widely adopted in the future.

Thank you for the explanations.

@sundaraa-deshaw

Continuing from #854. Currently Tabby supports llama.cpp bindings and HTTP bindings to Vertex AI and FastChat. Are there plans to support other bindings, like OpenAI endpoints over HTTP or a similar protocol?

Thanks for the responses.

@wsxiaoys
Member

Hey @sundaraa-deshaw, #795 (comment) explains why we don't want an OpenAI-like HTTP interface.

@sundaraa-deshaw

Thanks. I was wondering if we could have a binding to the exllama[v2] inference engine, like the one that exists for llama.cpp today?

@wsxiaoys
Member

wsxiaoys commented Nov 22, 2023

That’s possible - the trait is defined at https://github.com/TabbyML/tabby/blob/main/crates/tabby-inference/src/lib.rs
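For anyone looking into this, a simplified sketch of what wiring up a new backend against a trait of that shape might look like (the signatures here are illustrative; the real trait in crates/tabby-inference may differ):

```rust
use async_trait::async_trait; // assumes the async-trait crate

// Illustrative options struct; the real one lives in tabby-inference.
pub struct TextGenerationOptions {
    pub max_decoding_length: usize,
    pub stop_words: Vec<String>,
}

#[async_trait]
pub trait TextGeneration {
    async fn generate(&self, prompt: &str, options: TextGenerationOptions) -> String;
}

// Hypothetical exllama-backed implementation.
pub struct ExllamaBackend {
    // a handle to the exllama runtime would live here
}

#[async_trait]
impl TextGeneration for ExllamaBackend {
    async fn generate(&self, _prompt: &str, _options: TextGenerationOptions) -> String {
        // Call into exllama here, applying stop words at each decoding step.
        unimplemented!()
    }
}
```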

Could you share some of your findings on where exllama has an advantage over llama.cpp?

@sundaraa-deshaw

Thanks, are there plans to add such a binding?

exllama turned out to be good for inference on GPU, compared to llama.cpp on CPU.

The memory usage for a GPTQ-quantized model was 2-3x less than running the non-quantized model (Llama 13B) on llama.cpp on GPU.
The performance (in terms of tokens/second) was 1.5-2x higher compared to llama.cpp.
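(For rough arithmetic on the memory figure: 13B parameters at fp16 is about 26 GB of weights, while 4-bit GPTQ weights are about 6.5 GB plus quantization overhead, so a 2-3x reduction in observed total usage, once KV cache and activations are included, is plausible.)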

@r7l

r7l commented Nov 26, 2023

Since Tabby seems to support Fastchat, would it be possible to support Ollama HTTP bindings? They have a decent list of integrations already. Ollama also uses llama.cpp under the hood.

@wsxiaoys
Member

Fastchat isn't supported; it was part of the exploration mentioned in an earlier reply and was eventually abandoned for the reasons discussed above (lack of control during decoding).

@r7l

r7l commented Nov 26, 2023

@wsxiaoys Thanks and sorry. I was misled by the fastchat.rs file in the repo and thought it meant Fastchat was supported somehow.

@wsxiaoys
Member

No problem - it's not compiled into Tabby by default (it's behind a feature flag) and is left as a reference.

@MehrCurry

Hi, I followed the discussion but couldn't quite figure out what it means for my case. I have CodeLlama running in the cloud and want to connect Tabby to it. Is there a way to do so, or do I have to run the Tabby server on a local GPU/CPU?

@wsxiaoys
Member

Hey @MehrCurry, the short answer is no. Tabby comes with its own inference stack. You could deploy Tabby onto a cloud GPU (we have several tutorials on this, e.g. https://tabby.tabbyml.com/docs/installation/hugging-face/).
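For example, on any machine with a CUDA GPU, a deployment roughly along these lines works (following the Tabby docs of the time; the model name and flags here are illustrative):

```
docker run -it --gpus all \
  -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby serve --model TabbyML/StarCoder-1B --device cuda
```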

@nathaniel-brough

@wsxiaoys it is doable to modify the stop words for each model file in Ollama. I'm only just learning about stop words now, and I only have a surface-level understanding of the TabbyML inference stack, so I'm not suggesting that the Ollama configuration is feature-complete enough to plug into the TabbyML stack. But it might be?
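For example, stop words can be set per model in an Ollama Modelfile (the model and values below are illustrative):

```
FROM codellama:7b-code
PARAMETER stop "\ndef "
PARAMETER stop "\nclass "
```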

@nathaniel-brough

It looks like it's also possible to modify stop words on the fly using the Ollama API rather than just the model files.
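For example, something like this should work against /api/generate (request shape per the Ollama API docs; exact fields may have evolved since):

```
curl http://localhost:11434/api/generate -d '{
  "model": "codellama:7b-code",
  "prompt": "def fib(n):",
  "stream": false,
  "options": { "stop": ["\ndef ", "\nclass "] }
}'
```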

@nathaniel-brough

It also looks like GBNF grammar support is in the works in Ollama. Are there any other dealbreakers beyond grammar/stop-word support?
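(For reference, GBNF is llama.cpp's grammar format; a trivial example that would constrain output to a comma-separated list of identifiers:)

```
root  ::= ident ("," ident)*
ident ::= [a-zA-Z_] [a-zA-Z0-9_]*
```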

@wsxiaoys
Member

wsxiaoys commented Jan 27, 2024

> It looks like it's also possible to modify stop words on the fly using the Ollama API rather than just the model files.

If I understand correctly, Ollama is essentially just a wrapper around llama.cpp's server API, which in turn uses this stop-word implementation:

https://github.com/ggerganov/llama.cpp/blob/a1d6df129bcd3d42cda38c09217d8d4ec4ea3bdd/examples/server/server.cpp#L766

As far as I know, it scans the list at each step in O(N) time, where N is the number of stop words. Feel free to give it a try and see how decoding performs with a stop sequence list of around 20 entries, like the one below. (Hint: it will be slow, as is any implementation that supports a dynamic stop-word list.)

top_level_keywords = [
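    # hypothetical entries for illustration; not the author's original
    # (truncated) list
    "\ndef ", "\nclass ", "\nimport ", "\nfrom ", "\nif ", "\nfor ",
    "\nwhile ", "\ntry", "\nwith ", "\nasync ", "\n@",
]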

@TabbyML TabbyML locked and limited conversation to collaborators Jan 27, 2024
@wsxiaoys wsxiaoys converted this issue into discussion #1312 Jan 27, 2024

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
