Attempting to merge with alpaca-lora and its quantization #172
Comments
I think this is because the model is in HF format; I ran into the same issue after fine-tuning LLaMA 7B on the Alpaca dataset. |
If anyone would like to collaborate on making the HF model work with this repo, please email me, or respond to this comment! |
I think the issue is that the tokens are embedded in the model file, whereas your code does not have the tokens embedded. @ggerganov, could you confirm? There is still a case for integrating sentencepiece. |
I was comparing the parameters of the two models and noticed that simply renaming the parameters might just work, but I don't know which parameters correspond to one another. I also think the HF model's transformer blocks are somewhat the "inverse" of the torrent model's, because the torrent model has a layer named output while the HF one has a layer named input in each block. HF model parameters: torrent model parameters: |
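For reference, here is a rough sketch (in Python) of how the names appear to correspond, based on the layout the HF conversion script uses; treat it as an assumption to verify against your own checkpoints, and note that renaming alone is not enough, because the q/k projections are also permuted (see the unpermute sketch further down):

```python
# Assumed name correspondence between HF-format and original (torrent) LLaMA
# checkpoints, per transformer block i. The q_proj/k_proj weights are also
# permuted in the HF layout, so a pure rename is not sufficient on its own.
HF_TO_LLAMA = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    "model.norm.weight": "norm.weight",
    "lm_head.weight": "output.weight",
    "model.layers.{i}.self_attn.q_proj.weight": "layers.{i}.attention.wq.weight",
    "model.layers.{i}.self_attn.k_proj.weight": "layers.{i}.attention.wk.weight",
    "model.layers.{i}.self_attn.v_proj.weight": "layers.{i}.attention.wv.weight",
    "model.layers.{i}.self_attn.o_proj.weight": "layers.{i}.attention.wo.weight",
    "model.layers.{i}.mlp.gate_proj.weight": "layers.{i}.feed_forward.w1.weight",
    "model.layers.{i}.mlp.down_proj.weight": "layers.{i}.feed_forward.w2.weight",
    "model.layers.{i}.mlp.up_proj.weight": "layers.{i}.feed_forward.w3.weight",
    "model.layers.{i}.input_layernorm.weight": "layers.{i}.attention_norm.weight",
    "model.layers.{i}.post_attention_layernorm.weight": "layers.{i}.ffn_norm.weight",
}
```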
Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :) |
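For anyone curious what the merge itself boils down to, here is a minimal sketch of the arithmetic (not the actual script; it ignores dtype handling and assumes the usual peft tensor shapes):

```python
import torch

def merge_lora_into_linear(w_base: torch.Tensor,
                           lora_a: torch.Tensor,
                           lora_b: torch.Tensor,
                           lora_alpha: float,
                           r: int) -> torch.Tensor:
    """Fold a LoRA adapter into a base linear weight.

    w_base: (out_features, in_features) frozen base weight
    lora_a: (r, in_features) down-projection
    lora_b: (out_features, r) up-projection
    """
    scaling = lora_alpha / r
    # LoRA adds scaling * B @ A @ x at inference time, so merging is just
    # adding the scaled low-rank product to the base weight matrix.
    return w_base + scaling * (lora_b @ lora_a)
```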
Thank you so much @tloen. I was trying to convert the state_dict too, but was struggling to figure out the unpermute function. |
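For reference, here is roughly what unpermute looks like for the 7B shapes (dim=4096, n_heads=32); it reverses the head-wise reshuffling the HF conversion applies to the wq/wk matrices for its rotary-embedding layout. Treat the exact reshape as an assumption to verify against the conversion scripts:

```python
import torch

dim, n_heads = 4096, 32  # 7B dimensions, assumed for illustration

def unpermute(w: torch.Tensor) -> torch.Tensor:
    """Undo the per-head permutation applied to wq/wk during HF conversion."""
    return (
        w.view(n_heads, 2, dim // n_heads // 2, dim)
        .transpose(1, 2)
        .reshape(dim, dim)
    )
```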
Wow, that script is a much more straightforward approach than the rabbit hole I was going down. Nice work. |
I just tried the alpaca-lora merged model with quantization. The results were not as good as the examples shown in the tloen repo. It might be the price of quantization, or the merge might actually have been unsuccessful. Maybe I should modify the config in llama.cpp? Anyway, thank you everyone. |
Yeh, quantization wasn't great but running it with mixed fp16/fp32 gave expected performance. |
Ha! You've got almost the same code for turning HF back into llama format as me :) And I just had a thought: with the 4-bit GPTQ-quantized 7B/13B/... model in HF format, one could unpack it to float16, turn it into a llama model, and then requantize it with llama.cpp's quantize, which would hopefully preserve the quantization. |
So we now have https://github.com/antimatter15/alpaca.cpp, but it only has an endless chat mode. |
Anyway, here's a script that also does dequantization of 4-bit models so they can be requantized later (but it would only work with q4_1, and with a fix so that the min/max is calculated over the whole row, not just the QK=32 batch): https://gist.github.com/thement/90be08be928e0489aba7ecae9f0b352a If you think this is useful, I can maybe upgrade the convert_pth script. |
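As a sketch of what that dequantization involves (the exact block layout is an assumption here and should be checked against ggml.c for the version you are using): a q4_1 block stores a scale, a minimum, and QK packed 4-bit quants, and each value is reconstructed as q * scale + min.

```python
import numpy as np

QK = 32  # q4_1 block size

def dequantize_q4_1_block(block: bytes) -> np.ndarray:
    """Unpack one q4_1 block back to float32.

    Assumed layout: float32 scale, float32 min, then QK/2 bytes holding
    two 4-bit quants each (even index in the low nibble, odd in the high).
    """
    scale, mn = np.frombuffer(block[:8], dtype="<f4")
    packed = np.frombuffer(block[8:8 + QK // 2], dtype=np.uint8)
    values = np.empty(QK, dtype=np.float32)
    values[0::2] = packed & 0x0F
    values[1::2] = packed >> 4
    return values * scale + mn
```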
@thement wait, what? It can losslessly roundtrip? |
Could we load LoRA with llama.cpp? Some languages are not well supported by the original llama, but could be provided via LoRA. |
Yes, you just need to use the script by @tloen to obtain the pytorch weights, then convert them to ggml using the steps described in the README. Note that the output from llama.cpp is not the same as with pytorch, even without quantization. Here's a simple test I did locally with the Alpaca LoRA, sending the same request to llama.cpp using the converted weights (no quantization): it started well but ended up messing up later. The funny thing is that the llama.cpp implementation would have been faster (if it were correct), since it only loops over the array once. It really got messed up when I tried the 4-bit quantized weights. The good news is that the non-quantized version was faster and used less memory than the pytorch version on CPU, though maybe the pytorch slowdown was because it loaded the fine-tuning at runtime? @tloen might know. It would be a huge win if we could get llama.cpp to produce the same output as pytorch. @ggerganov might know more about this difference in outputs. |
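One thing that makes this kind of comparison more meaningful (a sketch; the checkpoint path and prompt are placeholders): use greedy decoding on the pytorch side, so any difference from llama.cpp reflects the implementations rather than sampling randomness. Something along these lines, assuming the LoRA has already been merged into an HF checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./alpaca-7b-merged"  # placeholder: merged HF checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)
model.eval()

prompt = "Below is an instruction that describes a task. ..."  # same prompt given to llama.cpp
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # do_sample=False -> greedy decoding, so the output is deterministic
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The llama.cpp run would need its sampling pinned down in the same way, otherwise the two outputs cannot be compared token for token.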
When you see a disparity between the outputs of two engines that should be identical, and you know how to use a debugger, it's quite helpful to debug both in parallel and see where the numbers start to diverge. |
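A lightweight way to do that without stepping through everything (a sketch; the dump filenames are hypothetical) is to save the same intermediate tensor, e.g. the hidden state after the first block, from both runs as raw float32 and diff them:

```python
import numpy as np

# Hypothetical dumps of the same intermediate tensor from both runs.
a = np.fromfile("pytorch_block0_out.bin", dtype=np.float32)
b = np.fromfile("llamacpp_block0_out.bin", dtype=np.float32)

diff = np.abs(a - b)
print("max abs diff:", diff.max())
over = np.flatnonzero(diff > 1e-3)
print("first index over 1e-3:", over[0] if over.size else None)
```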
I am using text-generation-webui to successfully train LoRAs for/from llama-7b (8-bit). Is there any way to merge a trained LoRA with the llama 7B weights? My goal is to train a LoRA, merge it, train again on something else, merge again, and so on. Does someone here have an idea of how to achieve the merging part? |
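If the LoRA was trained with peft, one way to do the merge (a sketch, assuming a reasonably recent peft version; the paths are placeholders) is to reload the base model in full precision and fold the adapter in with merge_and_unload:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_path = "decapoda-research/llama-7b-hf"  # base weights the LoRA was trained against
lora_path = "./loras/my-first-lora"          # placeholder: trained adapter directory

# Merging into 8-bit weights isn't supported, so reload the base in fp16 first.
model = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, lora_path)

# Fold the adapter into the base weights and save; the merged checkpoint can
# then serve as the base model for the next training round.
merged = model.merge_and_unload()
merged.save_pretrained("./llama-7b-plus-lora-1")
```

Repeating load, train, and merge with the previous merged output as the new base would give the iterative workflow described above.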
The code doesn't work for llama-65B. It seems that the params in the code differ between 7B and 65B. How can I get the correct params? |
I was attempting to merge alpaca-lora from https://huggingface.co/tloen/alpaca-lora-7b with the original llama-7B from https://huggingface.co/decapoda-research/llama-7b-hf, and I also tried to quantize the model and run the main binary in llama.cpp.
The merge code is from https://github.com/clcarwin/alpaca-weight
It was almost successful until the final phase of running the main binary in llama.cpp. I had no problems with the merge and quantization.
Then it raised an error like this:
llama_model_load: llama_model_load: unknown tensor 'model.embed_tokens.weight' in model file
main: failed to load model from './models/7B/ggml-model-q4_0.bin'
I will share my logs in my repository. The code I used in colab to merge and quantize the model is there too: https://github.com/taiyou2000/personal_experimant
I'm not a machine learning expert and I haven't checked the entire llama.cpp code, but my theory is that the quantized model contains weights, some of which have names that main.cpp doesn't expect to see. As you can see in quantization_log.txt and pth_to_ggml_log.txt in my repository, it has names like "model.layers.0.self_attn.q_proj.weight", and they should probably look like "model.layers.0.attention.wq.weight" for main.cpp.
I can run llama.cpp without any problems on my local computer with a model quantized from the torrent version. I guess the huggingface version has something different about it.
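To check which naming convention a merged checkpoint actually ended up with before converting, a quick look at the state_dict keys (a sketch; the filename is a placeholder) makes the mismatch obvious:

```python
import torch

sd = torch.load("consolidated.00.pth", map_location="cpu")
# HF-style checkpoints show names like "model.layers.0.self_attn.q_proj.weight",
# while the original format uses names like "layers.0.attention.wq.weight".
for name, tensor in list(sd.items())[:10]:
    print(name, tuple(tensor.shape), tensor.dtype)
```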