vLLM compatibility? #795
Comments
We explored this direction but ultimately decided against pursuing it. The main reason is that most OpenAI-like solutions offer no control over individual decoding steps, which is precisely where code-completion-specific optimizations apply. For instance, handling a long list of stop words or applying grammar constraints at each decoding step becomes challenging. Revisiting this approach might be viable if a decoding-step-level API becomes widely adopted in the future.
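To make the "decoding-step-level API" point concrete, here is a purely hypothetical sketch, in the spirit of Tabby's Rust codebase, of the kind of per-token hook that an OpenAI-style HTTP interface does not expose. None of these names exist in Tabby or in any OpenAI-like API; this only illustrates the shape of the missing capability.

```rust
// Purely illustrative: a hypothetical decoding-step-level interface.
// None of these names exist in Tabby or in any OpenAI-style API.
pub enum StepAction {
    Continue, // keep decoding
    Stop,     // a stop sequence matched; end generation
}

pub trait DecodingStepHook {
    /// Called once per generated token, before it is committed.
    fn on_token(&mut self, text_so_far: &str, next_token: &str) -> StepAction;

    /// Restrict which token ids are allowed at the next step
    /// (e.g. to enforce a grammar). Returning None means "no constraint".
    fn allowed_tokens(&self, text_so_far: &str) -> Option<Vec<u32>>;
}

/// A hook that stops on any of a list of stop sequences.
pub struct StopWords {
    stops: Vec<String>,
}

impl DecodingStepHook for StopWords {
    fn on_token(&mut self, text_so_far: &str, next_token: &str) -> StepAction {
        let candidate = format!("{text_so_far}{next_token}");
        if self.stops.iter().any(|s| candidate.ends_with(s.as_str())) {
            StepAction::Stop
        } else {
            StepAction::Continue
        }
    }

    fn allowed_tokens(&self, _text_so_far: &str) -> Option<Vec<u32>> {
        None // no grammar constraint in this example
    }
}
```

An OpenAI-style completion endpoint only accepts a finished request (prompt, stop list, sampling parameters) and returns a finished response, so hooks like these have nowhere to run.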
Do you think it's so hard and complex that it isn't even worth formulating as a protocol? Something like a "Tabby Inference Protocol" that an inference engine either supports or doesn't, which would determine compatibility? If you pick a protocol that suits your code, it shouldn't be hard to support, and in the long run it could benefit Tabby's own codebase as well.
:) It's not so much about complexity as it is about capability. With an OpenAI-like interface, we relinquish access to intermediate decoding steps, so many optimizations the current Tabby relies on cannot be easily implemented; a lengthy stop-word list is one example.
I see no problem implementing even a very long stop-word dictionary in my setup, including a long list of long stop words with efficient matching and lookup; I've been through "a lot" with open models 😂
I could even support extended methods, like forcing the response to be JSON, etc.
Thank you for the explanations.
Continuing from #854: Tabby currently supports llama.cpp bindings and HTTP bindings to Vertex AI and FastChat. Are there plans to support other bindings, such as OpenAI endpoints over HTTP or a similar protocol? Thanks for the responses.
Hey @sundaraa-deshaw, #795 (comment) explains why we don't want an OpenAI-like HTTP interface.
Thanks. I was wondering if we could have a binding to the exllama[v2] inference engine, similar to how it's done for llama.cpp today?
That's possible; the trait is defined at https://github.com/TabbyML/tabby/blob/main/crates/tabby-inference/src/lib.rs. Could you share some of your findings on where exllama has an advantage over llama.cpp?
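As a rough illustration of what such a binding involves, here is a hypothetical sketch. The real trait and its exact signature live in the linked lib.rs and may differ, and the exllama endpoint and request format below are assumptions, not an actual API.

```rust
// Hypothetical sketch only: the real trait lives in crates/tabby-inference/src/lib.rs
// and its name/signature may differ. The exllama endpoint and JSON shape are assumed.
use async_trait::async_trait;

#[async_trait]
pub trait TextGeneration {
    async fn generate(&self, prompt: &str, max_tokens: usize) -> String;
}

/// A hypothetical backend that forwards requests to an exllama(v2) HTTP server.
pub struct ExllamaBackend {
    endpoint: String, // e.g. "http://localhost:5000/generate" (assumed)
    client: reqwest::Client,
}

#[async_trait]
impl TextGeneration for ExllamaBackend {
    async fn generate(&self, prompt: &str, max_tokens: usize) -> String {
        // Request/response format is made up for illustration; adapt it to
        // whatever API your exllama deployment actually exposes.
        let body = serde_json::json!({ "prompt": prompt, "max_new_tokens": max_tokens });
        match self.client.post(&self.endpoint).json(&body).send().await {
            Ok(resp) => resp.text().await.unwrap_or_default(),
            Err(_) => String::new(),
        }
    }
}
```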
Thanks, are there plans to add such a binding? exllama turned out to be good for GPU inference compared to llama.cpp on CPU, and memory usage for a GPTQ-quantized model was 2-3x lower than running the non-quantized model (Llama 13B) with llama.cpp on GPU.
Since Tabby seems to support FastChat, would it be possible to support Ollama HTTP bindings? Ollama already has a decent list of integrations and also uses llama.cpp under the hood.
FastChat isn't actually supported; it was part of the exploration mentioned in an earlier reply and was eventually abandoned for the reasons discussed above (lack of control during decoding).
@wsxiaoys Thanks, and sorry: I was misled by the fastchat.rs file in the repo and thought it meant FastChat was supported somehow.
No problem; it's not compiled into Tabby by default (it sits behind a feature flag) and is left as a reference.
Hi, I followed the discussion but couldn't quite figure out what it means for me. I have CodeLlama running in the cloud and want to connect Tabby to it. Is there a way to do so, or do I have to run the Tabby server on a local GPU/CPU?
Hey @MehrCurry, the short answer is no: Tabby comes with its own inference stack. You could instead deploy Tabby onto a cloud GPU (we have several tutorials on this, e.g. https://tabby.tabbyml.com/docs/installation/hugging-face/).
@wsxiaoys it is possible to modify the stop words for each model file in Ollama. I'm only just learning about stop words now and have only a surface-level understanding of the TabbyML inference stack, so I'm not suggesting that Ollama's configuration is feature-complete enough to plug into it. But it might be?
It looks like it's also possible to modify stop words on the fly via the Ollama API rather than just in the model files.
It also looks like GBNF grammar support is in the works in Ollama. Are there any other dealbreakers beyond grammar and stop-word support?
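For reference, a minimal sketch of setting stop words per request through Ollama's /api/generate endpoint. It assumes a local Ollama server on its default port and an already-pulled model; the model name and stop list are examples only.

```rust
// Minimal sketch, assuming a local Ollama server on its default port (11434)
// and an already-pulled model named "codellama:7b-code" (example only).
// Ollama's /api/generate accepts an "options" object that includes "stop".
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let body = json!({
        "model": "codellama:7b-code",
        "prompt": "def fib(n):",
        "stream": false,
        "options": {
            // Per-request stop sequences, overriding the Modelfile defaults.
            "stop": ["\ndef ", "\nclass ", "\n#"]
        }
    });

    let resp = reqwest::Client::new()
        .post("http://localhost:11434/api/generate")
        .json(&body)
        .send()
        .await?
        .text()
        .await?;

    println!("{resp}");
    Ok(())
}
```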
If I understand correctly, Ollama is essentially just a wrapper around llama.cpp's server API, which in turn relies on llama.cpp's stop-word implementation. As far as I know, that check runs in O(N) time at every decoding step, where N is the number of stop words. Feel free to give it a try and see how decoding performs with a stop-sequence list of around 20 entries, like the one sketched below. (Hint: it will be slow, as is any implementation that supports a dynamic stop-word list.)
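The list below is an assumed, illustrative code-completion stop list of roughly that size (not the exact list from the comment above), together with the naive per-token scan that makes the cost O(N) at each decoding step.

```rust
// Illustrative only: an assumed code-completion stop list of ~20 entries
// and the naive check that a generic server performs after every decoded token.
fn stop_sequences() -> Vec<&'static str> {
    vec![
        "\nfn ", "\ndef ", "\nclass ", "\nstruct ", "\nimpl ", "\nenum ",
        "\ntrait ", "\nmod ", "\n#", "\n//", "\n/*", "\n\"\"\"",
        "\nif ", "\nfor ", "\nwhile ", "\nreturn", "\n}", "\n\n\n",
        "<|endoftext|>", "</s>",
    ]
}

/// Runs once per decoding step: scans every stop sequence against the
/// tail of the generated text, so each step costs O(N * max_stop_len).
fn hit_stop(generated: &str, stops: &[&str]) -> bool {
    stops.iter().any(|s| generated.ends_with(s))
}

fn main() {
    let stops = stop_sequences();
    let generated = "fn add(a: i32, b: i32) -> i32 {\n    a + b\n}";
    println!("stop hit: {}", hit_stop(generated, &stops));
}
```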
This issue was moved to a discussion; you can continue the conversation there.
Hi,
I currently use vLLM for other services, and I am very interested in connecting your extension to a vLLM server. Do you think that's possible? I'm using the OpenAI API format, so if there were a way to connect the extension to any server that exposes the OpenAI API, that would be great.
Thanks for the awesome work!
Please reply with a 👍 if you want this feature.
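For context on what is being requested: vLLM can serve an OpenAI-compatible API, and a completion request against it looks roughly like the sketch below. The port and model name are assumptions to be adjusted for your deployment; this is the kind of interface Tabby would have to target if it adopted an OpenAI-style binding.

```rust
// Sketch only: an OpenAI-style completion request against a vLLM server
// started with its OpenAI-compatible entrypoint. The port (8000) and model
// name are assumptions; adjust them to your deployment.
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let body = json!({
        "model": "codellama/CodeLlama-7b-hf", // example model id
        "prompt": "def fibonacci(n):",
        "max_tokens": 64,
        "temperature": 0.1,
        // Stop sequences are passed up front; there is no hook into
        // individual decoding steps, which is the limitation discussed above.
        "stop": ["\ndef ", "\nclass "]
    });

    let resp = reqwest::Client::new()
        .post("http://localhost:8000/v1/completions")
        .json(&body)
        .send()
        .await?
        .text()
        .await?;

    println!("{resp}");
    Ok(())
}
```

As discussed earlier in the thread, everything here is configured per request; there is no per-step hook, which is why the maintainers prefer their own inference stack.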