Using TabbyML with ollama backend to switch models on the fly (completion + chat) #3285
CleyFaye started this conversation in Show and tell
Replies: 2 comments. One of them:

> I spent hours figuring out how to start the Tabby server without specifying a model name on the CLI so that Tabby picks up the configuration from the configuration file instead. Thanks a ton!
When Tabby started adding support for chat/inline code edit, it also introduced some extra CLI flags to control which model would be used for that new feature. It also introduced flags to pick either GPU or CPU for the completion model and the chat model separately, but there are issues with that, notably that the `--chat-device` option is broken (Mixed GPU + CPU for completion + chat models #2527).

In the issue pointed to above, someone suggested using ollama as the actual backend, with a link to Tabby's documentation about model administration. Unfortunately, that documentation doesn't mention ollama anymore (Tabby's documentation is always in some kind of flux…). In addition, ollama itself can be quite a piece of software to handle if you're not used to it.
The goal here is to use ollama's capability to load models on demand to run two 7B models on a GPU with only 12GB of VRAM: more than enough for one model, but definitely not enough to hold two at the same time.
There are basically three steps to follow. I won't go into all the details, although I'll provide links to where I found some info; this is just a quick brush-up to get things up and running while leaving room for more customization.
This will use "codellama:7b" as the code completion model, and "mistral:7b" for chat. That may not be the best choice, but for the sake of this example it doesn't matter much.
Also note that all of this was mostly tested on Windows with readily available binaries. In theory it should work almost exactly the same on other OSes, provided you know where to put configurations/CLI arguments.
tl;dr

Run an ollama server with the two models pulled, point Tabby's config.toml at ollama's completion and chat endpoints, then start Tabby without specifying any model on the CLI.
Run ollama
Starting the server
Ollama basically tries to handle almost everything for you.
It can be found at https://ollama.com/. Download the binary you need, and write a script (I used a Windows batch file for this) to set up some environment variables before starting the server.

Ollama is actually a persistent server that keeps running and handles requests through a standard API; you have to get it running before TabbyML.
Here's the minimum batch file I used (adapt as you see fit):
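A minimal sketch of what that batch file can look like; the exact values (listening address, keep-alive duration) are assumptions, so adjust them to your setup:

```bat
@echo off
REM Listen locally on the default ollama port (sketch values, adapt as needed).
set OLLAMA_HOST=127.0.0.1:11434
REM Keep a model loaded for 30 minutes unless another model evicts it.
set OLLAMA_KEEP_ALIVE=30m
REM Only one model in VRAM at a time (this is already the default).
set OLLAMA_MAX_LOADED_MODELS=1

ollama serve
```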
Most environment variables are found here. The two main ones are `OLLAMA_HOST` (the address of the server, including the listening port) and `OLLAMA_KEEP_ALIVE`, which sets how long a model remains loaded (unless evicted for another model). `OLLAMA_MAX_LOADED_MODELS` defines how many models are loaded at once, with 1 already being the default; if you have more memory, but also run more services that need an ollama-compatible LLM, you may want to bump that up.

Running `ollama serve` will keep the server running.
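To quickly check that the server is up and reachable, you can query ollama's tags endpoint, which lists the locally installed models:

```
curl http://127.0.0.1:11434/api/tags
```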
Loading models
This may be optional, but it won't hurt. You can preload models before they're invoked by a client.
With the server running, run the `ollama` binary again with `ollama pull codellama:7b`. Assuming the environment variable `OLLAMA_HOST` is the same (or points to wherever your ollama server is running), it will download/update the requested model.
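For the two models used in this write-up, that means:

```
ollama pull codellama:7b
ollama pull mistral:7b
```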
TabbyML
In your home directory, you should find the `.tabby` directory, along with a `config.toml` file inside (if it's not present, create it; I'm not sure whether TabbyML does that on first launch or not). In that file, you basically configure "custom" models (as indicated here).

It would look like this:
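A sketch of such a configuration, following the HTTP model backend format from Tabby's documentation; the key names and the codellama prompt template should be double-checked against the current docs and Tabby's model registry:

```toml
# Code completion served by ollama's native completion API.
[model.completion.http]
kind = "ollama/completion"
model_name = "codellama:7b"
api_endpoint = "http://127.0.0.1:11434"
# FIM template for CodeLlama as listed in Tabby's registry; adjust for other models.
prompt_template = "<PRE> {prefix} <SUF>{suffix} <MID>"

# Chat served through ollama's OpenAI-compatible endpoint; note the /v1 suffix.
[model.chat.http]
kind = "openai/chat"
model_name = "mistral:7b"
api_endpoint = "http://127.0.0.1:11434/v1"
```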
One particular point of interest is the endpoint for the chat model, which requires the `/v1` suffix (the documentation was updated to reflect this, but it is still easy to miss).

As a reference, the models that Tabby supports are listed here: https://tabby.tabbyml.com/docs/models/ and the library of models that ollama supports out of the box is here: https://ollama.com/library
The configuration (in the TOML file) must include the appropriate prompt template. If you're not sure, it can be looked up for already supported models in the default Tabby local model.json file (you can find it here).

(Also note that the chat engine seems to work fine without the `chat_template` key, so YMMV.)

That's basically it. The last step is to run TabbyML.
Assuming you already downloaded the appropriate file (there is a prebuilt Windows binary), you can just run it with `tabby.exe serve`.
Adapting to Docker, and other environments
(Apologies if a mistake/typo slipped in here; I'm writing this from memory at the moment, and I'll amend it if a big mistake slipped in.)
Ollama is the only part that's responsible for talking to dedicated hardware, and it handles a lot of situations. On Linux with Docker, it requires setting up your Docker installation correctly, but there are plenty of resources about that (like the Docker Hub page). TabbyML should work fine even on a different system than ollama, making it a good option to share a powerful system somewhere, as long as the different workloads don't happen at the same time.

Running the TabbyML Docker image can be done with the information in the official documentation (about Docker and Docker Compose).
You just have to put your config TOML file in the place mounted on `/data` (the default examples directly hook up the `$HOME/.tabby` directory, so if this is already set up, you're good to go).

The Docker Compose file can be extended to run ollama directly too. If you don't use ollama for anything else, this makes it a good solution to start them together.
This is an example of `docker-compose.yml` to do that (adapt as needed, obviously):
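A sketch of what that could look like, assuming the official tabbyml/tabby and ollama/ollama images and an NVIDIA GPU exposed through Compose device reservations; volume names, keep-alive and port mappings are illustrative:

```yaml
services:
  ollama:
    image: ollama/ollama
    environment:
      - OLLAMA_KEEP_ALIVE=30m
    ports:
      # Exposed on the host so the pull API below stays reachable at localhost:11434.
      - "127.0.0.1:11434:11434"
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  tabby:
    image: tabbyml/tabby
    command: serve
    depends_on:
      - ollama
    ports:
      - "8080:8080"
    volumes:
      # config.toml lives in $HOME/.tabby, mounted as /data in the container.
      - "$HOME/.tabby:/data"

volumes:
  ollama:
```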
Note that in that case, the ollama server's URL should be something like `http://ollama:11434` instead of `http://127.0.0.1:11434` in TabbyML's config. The GPU allocation also moves from Tabby's service to ollama's.

This should make the host machine expose a regular TabbyML server on port 8080, working with a locally started ollama using the GPU.
Also note that the default behavior of ollama is to not pull models that aren't installed, so you'll have to pull them manually. A simple curl will do the job (pull API):

`curl http://localhost:11434/api/pull -d "{\"name\":\"codellama:7b\"}"`
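If you did not expose ollama's port to the host, you can run the pulls inside the container instead (the service name is the one from the compose sketch above):

```
docker compose exec ollama ollama pull codellama:7b
docker compose exec ollama ollama pull mistral:7b
```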