Support LoRA hotswapping and multiple LoRAs at a time #1817

richdougherty · 2024-10-30T10:34:05Z

This is a PR to add support for loading and changing LoRA adapters at runtime as introduced into llama.cpp in ggerganov/llama.cpp#8332 by @ngxson. Adding this support should allow things like loading a base model, then swapping adapters in and out to support different features and behaviours. This could be really useful in smaller environments where we might use smaller models but want to support a variety of capabilities. (This appears to be the approach taken by some commercial mobile device makers.)

The changes from upstream in ggerganov/llama.cpp#8332 are:

Refactor lora API

Allow hot-swapping lora

Added struct llama_lora_adapter to keep track of loaded lora

This PR is just a draft to show what I'm working on and to get some feedback on the API, approach, etc. I plan to tidy it up, squash commits, and go through all the different bits of code to check that they work. If there's anything you'd like me to do, please let me know!

So far I have something like this working:

import llama_cpp

# Based on some of the models tested here:
# https://github.com/predibase/lora_bakeoff
model_file_path = '.../mistral-7b-v0.1.Q4_K_S.gguf'
adapter_file_paths = [
    '.../magicoder-lora-mistral-7b-v0.1.gguf',
    '.../conllpp-lora-mistral-7b-v0.1.gguf',
]

# Keyword arguments for create_completion. Placeholder values for illustration;
# substitute the prompt for the task being tested.
task = {'prompt': '...', 'max_tokens': 64}

# Register every adapter with scale 0.0 (inactive).
llm = llama_cpp.Llama(
    model_path=model_file_path,
    lora_adapters={path: 0.0 for path in adapter_file_paths},
)
for adapter_file_path in adapter_file_paths:
    # Clear adapters
    for lora_path in adapter_file_paths:
        llm.set_lora_adapter_scale(lora_path, 0.0)
    # Set only one adapter
    llm.set_lora_adapter_scale(adapter_file_path, 1.0)

    completion = llm.create_completion(
        seed=42,
        temperature=0,
        **task
    )
    print(completion['choices'][0]['text'])
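The clear-then-set pattern in the loop above could be wrapped in a small helper. This is only a sketch, not part of the PR: `activate_only` and `RecordingLlm` are hypothetical names, and the only thing assumed about the model object is the `set_lora_adapter_scale(path, scale)` method shown above.

```python
from typing import Dict, Iterable


class RecordingLlm:
    """Hypothetical stand-in for llama_cpp.Llama that records adapter scales.

    It exists only so the sketch runs without loading a model.
    """

    def __init__(self) -> None:
        self.scales: Dict[str, float] = {}

    def set_lora_adapter_scale(self, lora_path: str, scale: float) -> None:
        self.scales[lora_path] = scale


def activate_only(llm, adapter_paths: Iterable[str], active_path: str,
                  scale: float = 1.0) -> None:
    """Set every known adapter to 0.0, then enable exactly one of them."""
    paths = list(adapter_paths)
    if active_path not in paths:
        raise ValueError(f"unknown adapter: {active_path}")
    for path in paths:
        llm.set_lora_adapter_scale(path, 0.0)
    llm.set_lora_adapter_scale(active_path, scale)


llm = RecordingLlm()
activate_only(llm, ["magicoder.gguf", "conllpp.gguf"], "conllpp.gguf")
print(llm.scales)
```

With a real `llama_cpp.Llama` built as above, the same call would replace the inner loop; rejecting unknown paths up front is a design choice of this sketch, not documented behaviour of the PR.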

Tasks:

richdougherty · 2024-11-02T00:12:15Z

Still working on this. Just added support to the OpenAI-compatible server for hot-swapping LoRAs via model aliases. This allows fast serving of different LoRA adapters that extend the same base model with minimal switching overhead.

{
    "host": "0.0.0.0",
    "port": 8080,
    "models": [
        {
          "model_alias": "mistral",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "verbose": true
        },
        {
          "model_alias": "mistral-magicoder",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "lora_adapters": {
            "./magicoder-lora-mistral-7b-v0.1.gguf": 1.0
          },
          "verbose": true
        },
        {
          "model_alias": "mistral-conllpp",
          "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
          "lora_adapters": {
            "./conllpp-lora-mistral-7b-v0.1.gguf": 1.0
          },
          "verbose": true
        }
    ]
}
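Since the server selects a configuration by alias, a config like the one above only works if every `model_alias` is distinct. The following is a rough sketch of how such a file could be checked before starting the server; the helper names `duplicate_aliases` and `adapters_by_alias` are made up for illustration and are not the PR's implementation.

```python
import json
from collections import Counter
from typing import Dict, List

# A trimmed copy of the config above, inlined so the sketch is self-contained.
CONFIG_TEXT = """
{
    "models": [
        {"model_alias": "mistral",
         "model": "./mistral-7b-v0.1.Q4_K_S.gguf"},
        {"model_alias": "mistral-magicoder",
         "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
         "lora_adapters": {"./magicoder-lora-mistral-7b-v0.1.gguf": 1.0}},
        {"model_alias": "mistral-conllpp",
         "model": "./mistral-7b-v0.1.Q4_K_S.gguf",
         "lora_adapters": {"./conllpp-lora-mistral-7b-v0.1.gguf": 1.0}}
    ]
}
"""


def duplicate_aliases(config: dict) -> List[str]:
    """Return every model_alias that appears more than once, sorted."""
    counts = Counter(m["model_alias"] for m in config.get("models", [])
                     if "model_alias" in m)
    return sorted(alias for alias, n in counts.items() if n > 1)


def adapters_by_alias(config: dict) -> Dict[str, Dict[str, float]]:
    """Map each alias to its adapter scales (empty dict for the base model)."""
    return {m["model_alias"]: dict(m.get("lora_adapters", {}))
            for m in config.get("models", [])}


config = json.loads(CONFIG_TEXT)
print(duplicate_aliases(config))
print(adapters_by_alias(config)["mistral-conllpp"])
```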

Then calling the OpenAI-compatible API with "model": "mistral", "model": "mistral-magicoder", or "model": "mistral-conllpp" will result in a hot-swap, e.g.:

Hot-swapping model, setting existing LoRA adapter scales to 0.0.
Hot-swapping model, setting LoRA adapter scales for mistral-conllpp.
llama_lora_adapter_init_internal: loading lora adapter from './conllpp-lora-mistral-7b-v0.1.gguf' ...
llama_lora_adapter_init_internal: CPU_Mapped LoRA buffer size =    13.00 MiB
llama_lora_adapter_init_internal: loaded 128 tensors from lora file
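From the client side, the swap is driven purely by the `model` field of the request. A minimal sketch of building such a request with the standard library, assuming the server from the config above is reachable at `http://localhost:8080` and exposes the usual `/v1/completions` route; the helper name `build_completion_request` and the prompt are illustrative only.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model_alias: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build (but do not send) a completion request for the given alias."""
    payload = {"model": model_alias, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One request per alias; sending them in turn should trigger a hot-swap each time.
requests = [
    build_completion_request("http://localhost:8080", alias, "Hello")
    for alias in ("mistral", "mistral-magicoder", "mistral-conllpp")
]
for req in requests:
    print(req.get_method(), req.full_url, json.loads(req.data)["model"])
    # To actually send it against a running server:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["choices"][0]["text"])
```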

richdougherty added 11 commits November 1, 2024 19:01

feat: Add multi LoRA support to internal model

244539a

feat: Update multi LoRA support in high-level Llama wrapper

a6a6b8c

fix: Caching for hot-swapping LoRA adapters

bc48b50

feat: Multi-LoRA common args / low level API

e0722e3

feat: Multi-LoRA changes to match Llama wrapper for server

c737b91

feat: Multi-LoRA and hotswapping changelog

7752362

fix: bug when no LoRAs

04de669

fix: Handle setting LoRA adapter when none already set

ef93670

feat: LoRA hotswapping for server

19eff36

feat: ensure model aliases unique

156bd4b

fix: Fix Makefile run-server

5dc0a1e

richdougherty force-pushed the update-lora-api branch from 0049150 to 5dc0a1e Compare November 2, 2024 00:06

feat: Segment cache by active LoRAs; change key format

d434c77
