Using TabbyML with ollama backend to switch models on the fly (completion + chat) #3285
CleyFaye started this conversation in Show and tell
Replies: 2 comments. One of them:

> I spent hours figuring out how to start the Tabby server without specifying a model name on the CLI so that Tabby picks up the configuration from the configuration file instead. Thanks a ton!
When Tabby started adding support for chat/inline code edit, it also introduced some extra CLI flags to control which model would be used for that new feature. It also introduced flags to pick either GPU or CPU for the completion model and the chat model separately, but there are issues with that, notably that the `--chat-device` option is broken (Mixed GPU + CPU for completion + chat models #2527).

In the issue pointed to above, someone suggested using ollama as the actual backend, with a link to Tabby's documentation about model administration. Unfortunately, that documentation doesn't mention ollama anymore (Tabby's documentation is always in some kind of flux…). In addition, ollama itself can be quite a piece of software to handle if you're not used to it.
The goal here is to use ollama's capability to load models on demand to run two 7B models on a GPU with only 12GB of VRAM: more than enough for one model, but definitely not enough to hold two at the same time.
There are basically three steps to follow. I won't go into all the details, although I'll provide links to where I found some info; this is just a quick brush-up to get things up and running while leaving room for more customization.
This will use "codellama:7b" as the code completion model, and "mistral:7b" for chat. That may not be the best choice, but for the sake of this example it doesn't matter much.
Also note that all of this was mostly tested on Windows with readily available binaries. In theory it should work almost exactly the same on other OSes, provided you know where to put configurations/CLI arguments.
tl;dr

Run an ollama server with the two models pulled, point Tabby's config.toml at ollama's completion and chat endpoints, then start Tabby without specifying any model on the CLI.
Run ollama
Starting the server
Ollama basically tries to handle almost everything for you.
It can be found at https://ollama.com/. Download the binary you need, and write a script (I used a Windows batch file for this) to set up some environment variables before starting the server.

Ollama is actually a persistent server that keeps running and handles requests through a standard API; you have to get it running before TabbyML.
Here's the minimum batch file I used (adapt as you see fit):
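A minimal sketch of what that batch file can look like; the exact values (listening address, keep-alive duration) are assumptions, so adjust them to your setup:

```bat
@echo off
REM Listen locally on the default ollama port (sketch values, adapt as needed).
set OLLAMA_HOST=127.0.0.1:11434
REM Keep a model loaded for 30 minutes unless another model evicts it.
set OLLAMA_KEEP_ALIVE=30m
REM Only one model in VRAM at a time (this is already the default).
set OLLAMA_MAX_LOADED_MODELS=1

ollama serve
```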
Most environment variables are found here. The two main ones are `OLLAMA_HOST` (the address of the server, including the listening port) and `OLLAMA_KEEP_ALIVE`, which sets how long a model remains loaded (unless evicted for another model). `OLLAMA_MAX_LOADED_MODELS` defines how many models are loaded at once, with 1 already being the default; if you have more memory, but also run more services that need an ollama-compatible LLM, you may want to bump that up.

Running `ollama serve` will keep the server running.
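To quickly check that the server is up and reachable, you can query ollama's tags endpoint, which lists the locally installed models:

```
curl http://127.0.0.1:11434/api/tags
```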
Loading models
This may be optional, but it won't hurt. You can preload models before they're invoked by a client.
With the server running, run the `ollama` binary again with `ollama pull codellama:7b`. Assuming the environment variable `OLLAMA_HOST` is the same (or points to wherever your ollama server is running), it will download/update the requested model.
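For the two models used in this write-up, that means:

```
ollama pull codellama:7b
ollama pull mistral:7b
```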
TabbyML
In your home directory, you should find the `.tabby` directory, along with a `config.toml` file inside (if it's not present, create it; I'm not sure whether TabbyML does that on first launch or not). In that file, you basically configure "custom" models (as indicated here).

It would look like this:
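A sketch of such a configuration, following the HTTP model backend format from Tabby's documentation; the key names and the codellama prompt template should be double-checked against the current docs and Tabby's model registry:

```toml
# Code completion served by ollama's native completion API.
[model.completion.http]
kind = "ollama/completion"
model_name = "codellama:7b"
api_endpoint = "http://127.0.0.1:11434"
# FIM template for CodeLlama as listed in Tabby's registry; adjust for other models.
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>"

# Chat served through ollama's OpenAI-compatible endpoint; note the /v1 suffix.
[model.chat.http]
kind = "openai/chat"
model_name = "mistral:7b"
api_endpoint = "http://127.0.0.1:11434/v1"
```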
One particular point of interest is the endpoint for the chat model, which requires the `/v1` suffix (the documentation was updated to reflect this, but it is still easy to miss).

As a reference, the models that Tabby supports are listed here: https://tabby.tabbyml.com/docs/models/ and the library of models that ollama supports out of the box is here: https://ollama.com/library
The configuration (in the TOML file) must include the appropriate prompt template. If you're not sure, it can be looked up for already supported models in the default Tabby local model.json file (you can find it here).

(Also note that the chat engine seems to work fine without the `chat_template` key, so YMMV.)

That's basically it. The last step is to run TabbyML.
Assuming you already downloaded the appropriate file (there is a prebuilt Windows binary), you can just run it with `tabby.exe serve`.
Adapting to Docker, and other environments
(Apologies if a mistake/typo slipped in here; I'm writing this from memory at the moment, and I'll amend it if a big mistake slipped in.)
Ollama is the only part that's responsible for talking to dedicated hardware, and it handles a lot of situations. On Linux with Docker, it requires setting up your Docker installation correctly, but there are plenty of resources about that (like the Docker Hub page). TabbyML should work fine even on a different system than ollama, making it a good option to share a powerful system somewhere, as long as the different workloads don't happen at the same time.

Running the TabbyML Docker image can be done with the information in the official documentation (about Docker and Docker Compose).
You just have to put your config TOML file in the place mounted on `/data` (the default examples directly hook up the `$HOME/.tabby` directory, so if this is already set up, you're good to go).

The Docker Compose file can be extended to run ollama directly too. If you don't use ollama for anything else, this makes it a good solution to start them together.
This is an example of `docker-compose.yml` to do that (adapt as needed, obviously):
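A sketch of what that could look like, assuming the official tabbyml/tabby and ollama/ollama images and an NVIDIA GPU exposed through Compose device reservations; volume names, keep-alive and port mappings are illustrative:

```yaml
services:
  ollama:
    image: ollama/ollama
    environment:
      - OLLAMA_KEEP_ALIVE=30m
    ports:
      # Exposed on the host so the pull API below stays reachable at localhost:11434.
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  tabby:
    image: tabbyml/tabby
    command: serve
    depends_on:
      - ollama
    ports:
      - "8080:8080"
    volumes:
      # config.toml lives in $HOME/.tabby, mounted as /data in the container.
      - "$HOME/.tabby:/data"

volumes:
  ollama:
```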
Note that in that case, the ollama server's URL should be something like `http://ollama:11434` instead of `http://127.0.0.1:11434` in TabbyML's config. The GPU allocation also moves from Tabby's service to ollama's.

This should make the host machine expose a regular TabbyML server on port 8080, working with a locally started ollama using the GPU.
Also note that the default behavior of ollama is to not pull models that aren't installed, so you'll have to pull them manually. A simple curl will do the job (pull API):

`curl http://localhost:11434/api/pull -d "{\"name\":\"codellama:7b\"}"`
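If you did not expose ollama's port to the host, you can run the pulls inside the container instead (the service name is the one from the compose sketch above):

```
docker compose exec ollama ollama pull codellama:7b
docker compose exec ollama ollama pull mistral:7b
```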