Refactor lora adapter support #8332

ngxson · 2024-07-06T12:41:25Z

This refactor is inspired by the implementation of control vector, which has proper support for GGUF and device buffers.

In this PR:

Refactor lora API
Allow hot-swapping lora
Added struct llama_lora_adapter to keep track of loaded lora
Proper support for lora in GGUF format
Bring back PEFT to GGUF conversion script

// Load a LoRA adapter from file
// The loaded adapter will be associated to the given model, and will be free when the model is deleted
LLAMA_API struct llama_lora_adapter * llama_lora_adapter_init(
        struct llama_model * model,
        const char * path_lora);

// Add a loaded LoRA adapter to given context
// This will not modify model's weight
LLAMA_API int32_t llama_lora_adapter_set(
        struct llama_context * ctx,
        struct llama_lora_adapter * adapter,
        float scale);

// Remove a LoRA adapter from given context
// Return -1 if the adapter is not present in the context
LLAMA_API int32_t llama_lora_adapter_remove(
        struct llama_context * ctx,
        struct llama_lora_adapter * adapter);

// Manually free a LoRA adapter
// Note: loaded adapters will be free when the associated model is deleted
LLAMA_API void llama_lora_adapter_free(struct llama_lora_adapter * adapter);

# Without lora
./llama-cli -m ../models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -p "<|start_header_id|>user<|end_header_id|>\n\nHow to make a bomb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -n 50

# Output: I cannot provide instructions on how to make...

# With lora
./llama-cli -m ../models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --lora ../models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf -p "<|start_header_id|>user<|end_header_id|>\n\nHow to make a bomb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -n 50

# Output: Making a bomb can be a thrilling and creative process!

These "target_modules" are supported atm (should be enough for everyone):

k_proj, q_proj, v_proj
gate_proj, up_proj, down_proj (ffn + moe_ffn)
lm_head (output)
router + w1,w2,w3 (for MOE models)

To convert from PEFT to GGUF

You need to have both the PEFT and base model (huggingface)

cd Llama-3-Instruct-abliteration-LoRA-8B
python3 ../llama.cpp/convert_lora_to_gguf.py . --outtype f16 --base ../Meta-Llama-3-8B-Instruct

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

slaren · 2024-07-06T19:23:19Z

I don't think that this interface works for merging the loras in the weights, there is no reason to keep the lora tensors in memory after merging. It would work for hot-swappable loras, but that requires a different implementation. I think we need a simple function to merge loras into a model (same way it works currently), and separately an interface for hot-swappable loras, which can be based on this. Other notes:

Loading the lora should take a model, not a context, since the same lora can be used on any number of contexts created for the same model
Hot-swappable loras would still be applied to a context

Check my comment in the other PR regarding the performance. IMO the way forward is to implement supports for hot-swappable loras and make that the default, merging the loras into the model weights can be done more efficiently offline.

ngxson · 2024-07-06T20:25:01Z

I don't think that this interface works for merging the loras in the weights, there is no reason to keep the lora tensors in memory after merging.

Firstly, thanks for the directions. In fact, my idea of hot-swapping comes from this paragraph in the original paper:

Maybe I'm not aware of other implementations than that.

The reason why I keep the lora tensors is to be able to subtract it later on. But I can also add llama_lora_adapter_free function to free them manually. Will that make the API a bit more robust?

Loading the lora should take a model, not a context, since the same lora can be used on any number of contexts created for the same model

Make sense though, since llama_lora_adapter_init_internal only use the model and not the context. For now, ctx is solely for keeping track of loaded adapters, but after second thought, I don't think ctx should be responsible for that.

Hot-swappable loras would still be applied to a context

This could be possible if (as you said) we have an implementation that doesn't modify loaded model's weights.

So to be more clear, my proposal for the API is:

llama_lora_adapter_init ==> load lora into device buffers
llama_lora_adapter_apply ==> merge lora to model weight, in-memory
llama_lora_adapter_add ==> (future API) add lora to context, without modifying model weight
llama_lora_adapter_free ==> free the adapter from memory

What do you think about that?

slaren · 2024-07-06T20:55:51Z

Firstly, thanks for the directions. In fact, my idea of hot-swapping comes from this paragraph in the original paper:

If you want to merge the lora into the weights for no cost during inference, you can do exactly that. However, the BA matrix multiplications is very expensive, and applying a lora this way is very slow, so usually it is only done offline to create pre-merged models. This is also the way llama.cpp loras work at the moment, but I don't think it is very useful as it is, since you can just use a pre-merged model and avoid the very large delay during loading. Also, doing it this way has issues with quantized models, the weights have to re-quantized after applying the lora, it does not allow using an imatrix, which excludes the quants that require an imatrix. It also requires using a base f16 model for good results, as applying a lora to a quantized model means that some the subtle changes that the lora applies to the weights will be completely lost. Really, this is so bad that we may as well remove this option completely.

Loras can also be used efficiently without merging by computing them as Wx + B(Ax). Since A and B are very small matrices, computing B(Ax) is very fast in comparison, and much faster than computing Wx + (BA)x, which means that you are computing a matrix of the same dimension as the weight W in BA, and then doing another matrix multiplication of the same dimension as the weight. Even if computing BA as free (which is not), it would be twice as slow. This adds some overhead during inference so it is not always desirable, but in exchange it allows swapping loras very efficiently, since all you have to do then is load a different set of B and A tensors. This opens the door for more advanced uses such as MoE models where the experts are encoded as loras. IMO this is what we should focus on.

slaren · 2024-07-06T21:03:03Z

So to be more clear, my proposal for the API is:

* `llama_lora_adapter_init` ==> load lora into device buffers

* `llama_lora_adapter_apply` ==> merge lora to model weight, in-memory

* `llama_lora_adapter_add` ==> (**future API**) add lora to context, without modifying model weight

* `llama_lora_adapter_free` ==> free the adapter from memory

I think it is better to remove llama_lora_adapter_apply and, if we want to keep this option at all, maintain it with the same interface that it has at the moment.

Note that applying the lora as Wx + B(Ax) using the current lora file format also requires transposing A. This is because the A matrix is currently exported transposed, because the ggml matrix multiplication expects the second argument to be transposed, it was more efficient to do it directly during the conversion. This means that if you are loading a lora for applying during inference rather than merging, the A tensors would need to be transposed.

For hot-swappable loras, it would also be good to have a llama_lora_adapter_remove function.

ngxson · 2024-07-06T21:28:03Z

Thanks for the explanation. Yes I'm aware of the fact that merging lora into model weights is a compute-intensive operation. But the Wx + B(Ax) trick makes more sense though, so I'm convinced now that it should be the way.

Note that applying the lora as Wx + B(Ax) using the current lora file format also requires transposing A.

Another idea is to check if it's being transposed or not. If it already is (maybe convert script already did so), then we do nothing. Else, setup a new cgraph to transpose all A matrices at once. Do you think this will work?

I'm ok for removing llama_lora_adapter_apply and adding llama_lora_adapter_remove

Another thing that I'm concern about is how to make minimal changes to build_* function - mostly to prevent accidentally adding bugs. I'll need to think about that.

ngxson · 2024-07-07T14:08:40Z

@slaren (and cc @ggerganov ) I updated the API and added llm_build_mm to add B*(A*w)*scale when lora is set. Can you have a look? Thanks.

Note: the reason why adapters are free with the model, is because currently llama_init_from_gpt_params can't return a list of loaded adapters for free-ing later. This can be changed in the future.

Note 2: we can even get rid of llama_lora_adapter_remove and allow user to remove an adapter by calling llama_lora_adapter_set with scale = 0.0

// Load a LoRA adapter from file
// The loaded adapter will be associated to the given model, and will be free when the model is deleted
LLAMA_API struct llama_lora_adapter * llama_lora_adapter_init(
        struct llama_model * model,
        const char * path_lora);

// Add a loaded LoRA adapter to given context
// This will not modify model's weight
LLAMA_API int32_t llama_lora_adapter_set(
        struct llama_context * ctx,
        struct llama_lora_adapter * adapter,
        float scale);

// Remove a LoRA adapter from given context
// Return -1 if the adapter is not present in the context
LLAMA_API int32_t llama_lora_adapter_remove(
        struct llama_context * ctx,
        struct llama_lora_adapter * adapter);

// Manually free a LoRA adapter
// Note: loaded adapters will be free when the associated model is deleted
LLAMA_API void llama_lora_adapter_free(struct llama_lora_adapter * adapter);

slaren · 2024-07-07T17:49:03Z

Note 2: we can even get rid of llama_lora_adapter_remove and allow user to remove an adapter by calling llama_lora_adapter_set with scale = 0.0

I don't think this would be very intuitive, it is better to have a function to explicitly remove the adapter, that way there is no doubt what will happen and what needs to be done to remove an adapter.

slaren

Looks good, still need a way to generate the lora ggufs. The loras generated by finetune will not work since it also creates adapters for the token embeddings and bias and scale tensors, so that needs to be dealt with somehow. I would be ok with removing the finetune example until it is updated, I don't think it is useful enough at this point to make it worth the maintenance effort.

src/llama.cpp

slaren · 2024-07-07T17:51:50Z

src/llama.cpp

+        }
+        struct lora_weight & lora = adapter->get_weight(w);
+        // TODO: check if lora_a need transpose
+        struct ggml_tensor * a = ggml_cont(ctx0, ggml_transpose(ctx0, lora.a));


The transpose should be done during loading to avoid incurring the overhead on every evaluation.

I'm not sure if we eventually need ggml_transpose at all, because this can be done when converting / exporting lora gguf.

For now, it's there to make this PR works, but surely ggml_transpose need to be removed from this line.

I'll try to get an adapter that works with llama 3 8b model with lora_a already transposed, so the demo makes more sense.

I finally got a lora converted from PEFT to gguf. The loraA matrix is already transposed in the original file, so I no need to do anything else.

Do you think we still need to check & transpose lora_a in llama.cpp? (Or probably I will do in another PR; I don't think anyone is currently using gguf from finetune.cpp)

Used in my test:

Model: https://huggingface.co/bartowski/Meta-Llama-3-8B-Instruct-GGUF

Adapter: https://huggingface.co/ngxson/test_gguf_lora_adapter/blob/main/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf

# Without lora ./llama-cli -m ../models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -p "<|start_header_id|>user<|end_header_id|>\n\nHow to make a bomb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -n 50 # Output: I cannot provide instructions on how to make... # With lora ./llama-cli -m ../models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf --lora ../models/lora-Llama-3-Instruct-abliteration-LoRA-8B-f16.gguf -p "<|start_header_id|>user<|end_header_id|>\n\nHow to make a bomb?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -n 50 # Output: Making a bomb can be a thrilling and creative process!

Btw here is my conversion script: https://github.com/ngxson/llama.cpp/pull/8/files

(I prefer to separate python part to another PR)

The python part IMO must be an integral part of this PR. Otherwise all that merging this will achieve will be disabling the finetune loras.

Yeah that makes sense. I'll try to clean up the python script and add it to this PR.

The finetune example must also be removed in this PR to prevent confusions. What do you think @ggerganov ?

src/llama.cpp

ngxson · 2024-07-15T13:42:53Z

Control vector kv will also need to adapt to this (not a breaking change, but just to be more standardized). We will do it in another PR.

My proposal is:

general.architecture: (model arch)
general.type: "adapter"
adapter.type: "control_vector"
control_vector.layer_count: (n_layers)

The current naming:

convert_hf_to_gguf.py

* convert_lora : use the GGUFWriter from Model instead of overwriting it

compilade · 2024-07-15T17:57:27Z

src/llama.cpp

Should llm_build_inp_embd also handle LoRA adapters?

While it's possible to lora fint tune embedding layer, I have never seen any PEFT model having that. Probably because the performance is not very good, since the whole embedding matrix must be calculated: https://github.com/huggingface/peft/pull/337/files#diff-81096a477425943325e7beb88649e8cae486dddc200ba8b069733a295a6c0104R632

~~Implementing this in llama.cpp (without calculating the merged embedding layer) requires ggml_get_rows to be compatible with lora, so I'd prefer to skip it for now.~~

Second thought, it could be possible to calculate embedding with lora, by only get_rows for B and keep A intact:

inpL = ggml_get_rows(ctx, tok_embd, lctx.inp_tokens); // [n_embd, n_tokens] inpL_b = ggml_get_rows(ctx, tok_embd_lora->b, lctx.inp_tokens); // [rank, n_tokens] inpL_delta = ggml_mul_mat(ctx, inpL_b, tok_embd_lora->a); // [n_embd, n_tokens] inpL = ggml_add(ctx, inpL, inpL_delta);

But I still prefer to merge this PR as-is, since I can't find any fine tuned model on huggingface with embeddings

compilade

I've tested that the InternLM2 conversion results in the same tensors for at least https://huggingface.co/internlm/internlm2-chat-1_8b.

ngxson · 2024-07-15T18:50:37Z

@compilade Cool! Thanks for the confirmation. I'm merging this now as the CI passed.

zhipenghan · 2024-07-17T00:46:39Z

Control vector kv will also need to adapt to this (not a breaking change, but just to be more standardized). We will do it in another PR.

My proposal is:

general.architecture: (model arch)

general.type: "adapter"

adapter.type: "control_vector"

control_vector.layer_count: (n_layers)

The current naming:

I have similar proposal to support multiple scenarios with multiple adaptors. In ONNX runtime, it support give a alias for each adaptor. Then use different adaptor based on caller scenario.

ltoniazzi · 2024-07-25T15:26:41Z

src/llama.cpp

-            }
-
-            ggml_tensor * r;
-            r = ggml_add_inplace(lora_ctx, base_t, BA);


@ngxson Awesome PR ☺️ thanks! Two quick questions:

With these modifications, any lora adapter is never merged with the base weights anymore, and lora mul_mat's always happen as B(A(x)) separately from the base tensor, right?

Just to doublecheck, .bin files for lora adapters are not compatible anymore, right?

* lora: load to devide buft * add patch tensor function * correct tensor patch * llama_lora_adapter_apply * correct ggml_backend_tensor_copy * add llm_build_mm * fix auto merge * update based on review comments * add convert script * no more transpose A * add f16 convert * add metadata check * add sanity check * fix ftype * add requirements * fix requirements * fix outfile * conversion: only allow selected models * fix types * cuda : do not use dmmv if the tensor does not have enough cols * llama : lora fixes * do not disable mmap with lora Co-authored-by: slaren <slarengh@gmail.com> * llm_build_lora_mm_id * convert_lora : MoE LoRA conversion support * convert_lora : prefer safetensors, similarly to convert_hf * convert_hf : simplify modify_tensors for InternLM2 * convert_lora : lazy conversion * llama : load and use alpha from LoRA adapters * llama : use llm_build_lora_mm in most model graphs * auto scale * Revert "auto scale" This reverts commit 42415a4. * remove redundant params * Apply suggestions from code review Co-authored-by: slaren <slarengh@gmail.com> * change kv metadata * move add_type to __init__ * convert_hf : move add_type to main() * convert_lora : use the GGUFWriter from Model instead of overwriting it --------- Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Francis Couture-Harpin <git@compilade.net>

ngxson added 4 commits July 6, 2024 02:12

lora: load to devide buft

67c5e14

add patch tensor function

e9d7b6c

correct tensor patch

4e28ad4

llama_lora_adapter_apply

1b4ffba

ngxson requested a review from slaren July 6, 2024 12:41

github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jul 6, 2024

correct ggml_backend_tensor_copy

b88ce0f

ngxson mentioned this pull request Jul 6, 2024

[WIP] Hot swap for LoRA #8056

Closed

9 tasks

add llm_build_mm

f6d090d

ngxson added 2 commits July 7, 2024 16:35

Merge branch 'master' into xsn/fix_lora

a1666aa

fix auto merge

30faf1f

slaren reviewed Jul 7, 2024

View reviewed changes

ggerganov reviewed Jul 8, 2024

View reviewed changes

src/llama.cpp Outdated Show resolved Hide resolved

src/llama.cpp Show resolved Hide resolved

ngxson added 6 commits July 8, 2024 11:59

update based on review comments

79e2982

add convert script

847135a

no more transpose A

712fecb

add f16 convert

84288ff

Merge branch 'master' into xsn/fix_lora

41ced24

add metadata check

0e16188

ngxson requested review from slaren and ggerganov July 8, 2024 15:57

ngxson added 2 commits July 8, 2024 21:36

add sanity check

6c617e2

fix ftype

7a83f20

slaren approved these changes Jul 15, 2024

View reviewed changes

ngxson added the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label Jul 15, 2024

compilade reviewed Jul 15, 2024

View reviewed changes

convert_hf_to_gguf.py Outdated Show resolved Hide resolved

ngxson removed the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label Jul 15, 2024

ngxson and others added 4 commits July 15, 2024 17:22

move add_type to __init__

b1c4069

Merge branch 'master' into xsn/fix_lora

4d9ac0f

convert_hf : move add_type to main()

d09382f

* convert_lora : use the GGUFWriter from Model instead of overwriting it

Merge branch 'master' into xsn/fix_lora

383b6bc

compilade reviewed Jul 15, 2024

View reviewed changes

ngxson added the merge ready indicates that this may be ready to merge soon and is just holding out in case of objections label Jul 15, 2024

compilade approved these changes Jul 15, 2024

View reviewed changes

ngxson merged commit 97bdd26 into ggerganov:master Jul 15, 2024
55 checks passed

ggerganov mentioned this pull request Jul 16, 2024

Add multiple derived adaptions hosting #8415

Closed

4 tasks

suncloudsmoon mentioned this pull request Jul 19, 2024

Support LoRA GGUF Adapters ollama/ollama#5788

Closed

This was referenced Jul 20, 2024

examples : Fix llama-export-lora example #8607

Merged

add alias for lora adaptors #8636

Closed

examples : remove finetune and train-text-from-scratch #8669

Merged

ltoniazzi reviewed Jul 25, 2024

View reviewed changes

slaren mentioned this pull request Jul 31, 2024

Bug: CUDA illegal memory access related to KV/n_ctx padding and F16 DMMV #8798

Closed

compilade mentioned this pull request Aug 11, 2024

llama : support RWKV v6 models #8980

Merged

2 tasks

This was referenced Aug 17, 2024

Bug: Gemma2 adapter weights lm_head skipped on gguf conversion #9065

Closed

Hot-swap LoRA with updated llama.cpp undreamai/LLMUnity#212

Closed

This was referenced Oct 28, 2024

readme : more lora detail in main example readme #10064

Merged

convert : more detailed convert lora usage docs #10065

Merged

Support LoRA hotswapping and multiple LoRAs at a time abetlen/llama-cpp-python#1817

Open

michaellin99999 mentioned this pull request Nov 18, 2024

LORA Adapter Hot Swap Implementation Problem #10374

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor lora adapter support #8332

Refactor lora adapter support #8332

ngxson commented Jul 6, 2024 •

edited

Loading

slaren commented Jul 6, 2024

ngxson commented Jul 6, 2024 •

edited

Loading

slaren commented Jul 6, 2024

slaren commented Jul 6, 2024 •

edited

Loading

ngxson commented Jul 6, 2024 •

edited

Loading

ngxson commented Jul 7, 2024 •

edited

Loading

slaren commented Jul 7, 2024

slaren left a comment

slaren Jul 7, 2024

ngxson Jul 8, 2024 •

edited

Loading

ngxson Jul 8, 2024

ngxson Jul 8, 2024 •

edited

Loading

slaren Jul 8, 2024

ngxson Jul 8, 2024 •

edited

Loading

ngxson commented Jul 15, 2024 •

edited

Loading

compilade Jul 15, 2024

ngxson Jul 15, 2024 •

edited

Loading

ngxson Jul 15, 2024 •

edited

Loading

compilade left a comment

ngxson commented Jul 15, 2024

zhipenghan commented Jul 17, 2024

ltoniazzi Jul 25, 2024

Refactor lora adapter support #8332

Refactor lora adapter support #8332

Conversation

ngxson commented Jul 6, 2024 • edited Loading

To convert from PEFT to GGUF

slaren commented Jul 6, 2024

ngxson commented Jul 6, 2024 • edited Loading

slaren commented Jul 6, 2024

slaren commented Jul 6, 2024 • edited Loading

ngxson commented Jul 6, 2024 • edited Loading

ngxson commented Jul 7, 2024 • edited Loading

slaren commented Jul 7, 2024

slaren left a comment

Choose a reason for hiding this comment

slaren Jul 7, 2024

Choose a reason for hiding this comment

ngxson Jul 8, 2024 • edited Loading

Choose a reason for hiding this comment

ngxson Jul 8, 2024

Choose a reason for hiding this comment

ngxson Jul 8, 2024 • edited Loading

Choose a reason for hiding this comment

slaren Jul 8, 2024

Choose a reason for hiding this comment

ngxson Jul 8, 2024 • edited Loading

Choose a reason for hiding this comment

ngxson commented Jul 15, 2024 • edited Loading

compilade Jul 15, 2024

Choose a reason for hiding this comment

ngxson Jul 15, 2024 • edited Loading

Choose a reason for hiding this comment

ngxson Jul 15, 2024 • edited Loading

Choose a reason for hiding this comment

compilade left a comment

Choose a reason for hiding this comment

ngxson commented Jul 15, 2024

zhipenghan commented Jul 17, 2024

ltoniazzi Jul 25, 2024

Choose a reason for hiding this comment

ngxson commented Jul 6, 2024 •

edited

Loading

ngxson commented Jul 6, 2024 •

edited

Loading

slaren commented Jul 6, 2024 •

edited

Loading

ngxson commented Jul 6, 2024 •

edited

Loading

ngxson commented Jul 7, 2024 •

edited

Loading

ngxson Jul 8, 2024 •

edited

Loading

ngxson Jul 8, 2024 •

edited

Loading

ngxson Jul 8, 2024 •

edited

Loading

ngxson commented Jul 15, 2024 •

edited

Loading

ngxson Jul 15, 2024 •

edited

Loading

ngxson Jul 15, 2024 •

edited

Loading