
Attempting to merge with alpaca-lora and its quantization #172


Closed
taiyou2000 opened this issue Mar 15, 2023 · 19 comments
Labels: enhancement, help wanted

Comments

@taiyou2000

I was attempting to merge alpaca-lora from https://huggingface.co/tloen/alpaca-lora-7b with the original llama-7B from https://huggingface.co/decapoda-research/llama-7b-hf, and also tried to quantize the model and run the main binary in llama.cpp.
The merge code is from https://github.com/clcarwin/alpaca-weight

It was almost successful until the final phase of running the main binary in llama.cpp; I had no problems with the merge and quantization.

Then it raised an error like this:

llama_model_load: llama_model_load: unknown tensor 'model.embed_tokens.weight' in model file
main: failed to load model from './models/7B/ggml-model-q4_0.bin'

I will share my logs in my repository. The code I used in Colab to merge and quantize the model is there too: https://github.com/taiyou2000/personal_experimant

I'm not a machine learning expert and I haven't checked the entire llama.cpp code, but my theory is that the quantized model contains weights, some of which have names that main.cpp doesn't expect to see. As you can see in quantization_log.txt and pth_to_ggml_log.txt in my repository, it has names like "model.layers.0.self_attn.q_proj.weight", whereas main.cpp probably expects something like "model.layers.0.attention.wq.weight".
I can run llama.cpp without any problems on my local computer when the model is quantized from the torrent version. I guess the Hugging Face version differs from it somehow.

@nebulatgs
Contributor

I think this is because the model is in HF format, I ran into the same issue after fine-tuning LLaMA 7B on the Alpaca dataset.

@nebulatgs
Contributor

If anyone would like to collaborate on making the HF model work with this repo, please email me, or respond to this comment!

@beiller
Contributor

beiller commented Mar 15, 2023

I think the issue is that the tokens are embedded in the model file, whereas the file your code produces does not have tokens embedded. @ggerganov could you confirm? There's still a case for integrating sentencepiece.
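
For context, a rough sketch (my own, not the actual convert script) of how a converter can embed the sentencepiece vocabulary into the model file; the binary layout here is an assumption, the real format is in convert-pth-to-ggml.py:

```python
# Illustrative only: write the sentencepiece vocab into a binary model file so a
# llama.cpp-style loader doesn't need tokenizer.model at runtime.
import struct
from sentencepiece import SentencePieceProcessor

def embed_vocab(fout, tokenizer_path: str) -> None:
    sp = SentencePieceProcessor()
    sp.load(tokenizer_path)
    fout.write(struct.pack("i", sp.vocab_size()))        # number of tokens
    for i in range(sp.vocab_size()):
        piece = sp.id_to_piece(i).replace("\u2581", " ")  # sentencepiece marks spaces with ▁
        data = piece.encode("utf-8")
        fout.write(struct.pack("i", len(data)))           # token length, then token bytes
        fout.write(data)
```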

@gjmulder added the enhancement and help wanted labels Mar 15, 2023
@taiyou2000
Author

I was comparing the parameters of the two models, and it looks like simply renaming the parameters might just work, but I don't know which parameters correspond to one another. I also think the HF model's transformer is somewhat "inverse" of the torrent model, because the torrent model has a layer named output while the HF one has a layer named input in each layer.
Will it be as easy as renaming parameters, or do I need to code it from scratch? (See the sketch after the two lists below.)

HF model parameters:
layers.0.self_attn.q_proj.weight
layers.0.self_attn.k_proj.weight
layers.0.self_attn.v_proj.weight
layers.0.self_attn.o_proj.weight
layers.0.self_attn.rotary_emb.inv_freq
layers.0.mlp.gate_proj.weight
layers.0.mlp.down_proj.weight
layers.0.mlp.up_proj.weight
layers.0.input_layernorm.weight
layers.0.post_attention_layernorm.weight
norm.weight
lm_head.weight

torrent model parameters:
norm.weight
output.weight
layers.0.attention.wq.weight
layers.0.attention.wk.weight
layers.0.attention.wv.weight
layers.0.attention.wo.weight
layers.0.feed_forward.w1.weight
layers.0.feed_forward.w2.weight
layers.0.feed_forward.w3.weight
layers.0.attention_norm.weight
layers.0.ffn_norm.weight
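
A small sketch of that renaming idea (mine, untested); the name pairs follow the two lists above. Caveat: renaming alone is probably not enough, because the HF q_proj/k_proj weights are also permuted for HF's rotary-embedding layout and need to be un-permuted (the "unpermute function" mentioned further down in this thread):

```python
# Map HF parameter names onto the original (torrent) llama checkpoint names.
HF_TO_LLAMA = {
    "self_attn.q_proj": "attention.wq",
    "self_attn.k_proj": "attention.wk",
    "self_attn.v_proj": "attention.wv",
    "self_attn.o_proj": "attention.wo",
    "mlp.gate_proj": "feed_forward.w1",
    "mlp.down_proj": "feed_forward.w2",
    "mlp.up_proj": "feed_forward.w3",
    "input_layernorm": "attention_norm",
    "post_attention_layernorm": "ffn_norm",
}

def rename(hf_name: str) -> str | None:
    """Return the llama-style name for an HF parameter, or None to drop it."""
    name = hf_name.removeprefix("model.")
    if name == "embed_tokens.weight":
        return "tok_embeddings.weight"
    if name == "lm_head.weight":
        return "output.weight"
    if "rotary_emb.inv_freq" in name:
        return None  # derived buffer, not stored in the original checkpoint
    for hf, ll in HF_TO_LLAMA.items():
        name = name.replace(hf, ll)
    return name  # e.g. layers.0.self_attn.q_proj.weight -> layers.0.attention.wq.weight
```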

@tloen

tloen commented Mar 16, 2023

Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :)
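
For anyone who wants the gist of what such a merge does, here is a hedged sketch using peft's merge support (my own illustration, not necessarily the linked script); the model ids are the ones from the first post, and class names may differ slightly between transformers versions:

```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

# Load the base model and apply the LoRA adapter on top of it.
base = LlamaForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16, low_cpu_mem_usage=True)
model = PeftModel.from_pretrained(base, "tloen/alpaca-lora-7b")

# Fold each LoRA delta (B @ A * scaling) into the base Linear weights.
model = model.merge_and_unload()

# The resulting state_dict still uses HF names (and HF's q/k permutation), so it still
# has to be renamed/un-permuted into the original llama layout before conversion.
torch.save(model.state_dict(), "merged-hf-state_dict.pth")
```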

@taiyou2000
Author

Thank you so much @tloen. I was trying to convert the state_dict too but was struggling to figure out the unpermute function.
I'm gonna give it a try soon!

@eous

eous commented Mar 16, 2023

> Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :)

Instruction: What should I have for dinner?
Output: 153 grams of beef, plus rice and peas. And you need to eat all the food! [end of text]

Wow, that script is a much more straightforward approach than the rabbit hole I was going down. Nice work.

@taiyou2000
Author

I just tried the alpaca-lora merged model with quantization. The results were not as good as the examples in the tloen repo. It might be the price of quantization, or the merge may actually have been unsuccessful. Maybe I should modify the config in llama.cpp? Anyway, thank you everyone.

@eous

eous commented Mar 16, 2023

> I just tried the alpaca-lora merged model with quantization. The results were not as good as the examples in the tloen repo. It might be the price of quantization, or the merge may actually have been unsuccessful. Maybe I should modify the config in llama.cpp? Anyway, thank you everyone.

Yeh, quantization wasn't great but running it with mixed fp16/fp32 gave expected performance.

@thement
Contributor

thement commented Mar 16, 2023

> Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :)

Ha! You've got almost the same code for turning HF back to llama format as me :)

And I just had a thought: with the 4-bit GPTQ-quantized 7B/13B/... model in HF format, one could unpack it to float16, turn it into a llama model, and then requantize it with llama.cpp-quantize, which would hopefully preserve the quantization.

@totoCZ

totoCZ commented Mar 16, 2023

So we now have https://github.com/antimatter15/alpaca.cpp, but it only has an endless chat mode.
Someone needs to merge everything together with this repo so it can be run with dalai.

@thement
Contributor

thement commented Mar 16, 2023

> Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :)

Anyway, here's a script that also does unquantization of 4-bit models so they can be requantized later (but it would work only with q4_1, and with the fix that the min/max is calculated over the whole row, not just the QK=32 batch):

https://gist.github.com/thement/90be08be928e0489aba7ecae9f0b352a

If you think this is useful I can maybe upgrade the convert_pth script.
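
To illustrate the kind of unquantization being discussed (my own sketch, not the gist): a ggml q4_1 block stores a scale d, a minimum m and QK=32 packed 4-bit values, and each weight is reconstructed as x = m + d * q:

```python
import struct

QK = 32
BLOCK_BYTES = 4 + 4 + QK // 2  # float32 d, float32 m, 16 bytes of packed nibbles (layout assumed)

def dequant_q4_1_block(block: bytes) -> list[float]:
    """Expand one q4_1 block back to QK float values."""
    d, m = struct.unpack_from("<2f", block, 0)
    values = []
    for byte in block[8:BLOCK_BYTES]:
        values.append(m + d * (byte & 0x0F))  # low nibble
        values.append(m + d * (byte >> 4))    # high nibble
    return values
```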

@namliz

namliz commented Mar 18, 2023

> Anyway, here's a script that also does unquantization of 4-bit models so they can be requantized later (but it would work only with q4_1, and with the fix that the min/max is calculated over the whole row, not just the QK=32 batch)

@thement wait, what? It can losslessly roundtrip?

@linonetwo

Could we load a LoRA with llama.cpp? Some languages are not well supported in the original llama but might be provided via a LoRA.

@tarruda

tarruda commented Mar 21, 2023

> Could we load a LoRA with llama.cpp? Some languages are not well supported in the original llama but might be provided via a LoRA.

Yes, you just need to use the script by @tloen to obtain the pytorch weights, then convert to ggml using the steps described in the README.
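
Roughly, the pipeline looks like the following; script names, paths and the quantization type argument are illustrative and should be checked against tloen's repo and the llama.cpp README for your revision:

```python
import subprocess

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. merge the LoRA into the base model and export llama-style pytorch weights
run("python3", "export_state_dict_checkpoint.py")
# 2. convert the pytorch checkpoint to ggml f16
run("python3", "convert-pth-to-ggml.py", "models/7B-lora/", "1")
# 3. optionally quantize to 4 bits
run("./quantize", "models/7B-lora/ggml-model-f16.bin",
    "models/7B-lora/ggml-model-q4_0.bin", "2")
```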

Note that the output with llama.cpp is not the same as with pytorch, even without quantization. Here's a simple test I did locally with Alpaca lora:

[image: alpaca-lora-quicksort-pytorch]

Same request to llama.cpp using the converted weights (no quantization). I also modified alpaca.sh to pass similar arguments: ./main -m ./models/7BLora/ggml-model-f16.bin --color -f ./prompts/alpaca.txt -ins --top_k 40 --top_p 0.75 --temp 0.1 --repeat_penalty 1 -t 4 -n 2000 (not sure if I missed something; I don't really know what most of these parameters mean). Here's the result:

[image: llama.cpp output]

It started well but got messed up toward the end. The funny thing is that the llama.cpp implementation would have been faster (if it were correct), since it only loops over the array once.

It really got messed up when I tried with the 4-bit quantized weights:

[image: llama.cpp output with 4-bit quantized weights]

The good news is that the non-quantized version is faster and uses less memory than the pytorch version on CPU. Though maybe the pytorch slowdown was because it loaded the fine-tuning at runtime? @tloen might know. It would be a huge win if we could get llama.cpp to produce the same output as pytorch. @ggerganov might know more about this difference in outputs.

@tarruda

tarruda commented Mar 21, 2023

Here's a comparison with GPT 3.5:
[image: GPT 3.5 output]

I tried to raise "temp" to 0.7 to match that of GPT 3.5 but it resulted in a worse solution (even though the goal of sorting "in place" was good 😄 ):

[image: llama.cpp output at temp 0.7]

@xloem
Contributor

xloem commented Mar 26, 2023

When you see a disparity between outputs in two important engines that should be identical, if you know how to use a debugger, it’s quite helpful to debug both in parallel and see where the numbers start diverging.
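
A concrete (hypothetical) version of that idea: dump per-token logits from both implementations to .npy files and report the first position where they disagree; file names and array shapes here are assumptions:

```python
import numpy as np

def first_divergence(path_a: str, path_b: str, atol: float = 1e-3):
    """Return the first token index whose logits differ beyond atol, or None."""
    a, b = np.load(path_a), np.load(path_b)  # assumed shape: (n_tokens, n_vocab)
    for i, (row_a, row_b) in enumerate(zip(a, b)):
        if not np.allclose(row_a, row_b, atol=atol):
            return i
    return None

# e.g. first_divergence("logits_pytorch.npy", "logits_llamacpp.npy")
```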

@IIIIIIIllllllllIIIII

I am using text-generation-webui to successfully train LoRAs for/from llama-7b (8-bit). Is there any way to merge the trained LoRA with the llama 7B weights? My goal is to train a LoRA, merge, train again on something else, merge again, and so on. Does someone here have an idea how to achieve the merging part?

@ghqing0310

> Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :)

The code doesn't work for llama-65B. It seems the params in the code differ between 7B and 65B. How can I get the correct params?
