Attempting to merge with alpaca-lora and its quantization #172
Comments
I think this is because the model is in HF format; I ran into the same issue after fine-tuning LLaMA 7B on the Alpaca dataset. |
If anyone would like to collaborate on making the HF model work with this repo, please email me, or respond to this comment! |
I think the issue is that the tokens are embedded in the model file, whereas your code does not have the tokens embedded. @ggerganov, could you confirm? There is still a case for integrating sentencepiece. |
I was comparing the parameters of the two models and noticed that simply renaming the parameters might just work, but I don't know which parameters correspond to one another. I also think the HF model's transformer blocks are somewhat the "inverse" of the torrent model's, because the torrent model has a layer named output while the HF one has a layer named input in each block. HF model parameters: torrent model parameters: |
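For reference, here is a rough sketch (in Python) of how the names appear to correspond, based on the layout the HF conversion script uses; treat it as an assumption to verify against your own checkpoints, and note that renaming alone is not enough, because the q/k projections are also permuted (see the unpermute sketch further down):

```python
# Assumed name correspondence between HF-format and original (torrent) LLaMA
# checkpoints, per transformer block i. The q_proj/k_proj weights are also
# permuted in the HF layout, so a pure rename is not sufficient on its own.
HF_TO_LLAMA = {
    "model.embed_tokens.weight": "tok_embeddings.weight",
    "model.norm.weight": "norm.weight",
    "lm_head.weight": "output.weight",
    "model.layers.{i}.self_attn.q_proj.weight": "layers.{i}.attention.wq.weight",
    "model.layers.{i}.self_attn.k_proj.weight": "layers.{i}.attention.wk.weight",
    "model.layers.{i}.self_attn.v_proj.weight": "layers.{i}.attention.wv.weight",
    "model.layers.{i}.self_attn.o_proj.weight": "layers.{i}.attention.wo.weight",
    "model.layers.{i}.mlp.gate_proj.weight": "layers.{i}.feed_forward.w1.weight",
    "model.layers.{i}.mlp.down_proj.weight": "layers.{i}.feed_forward.w2.weight",
    "model.layers.{i}.mlp.up_proj.weight": "layers.{i}.feed_forward.w3.weight",
    "model.layers.{i}.input_layernorm.weight": "layers.{i}.attention_norm.weight",
    "model.layers.{i}.post_attention_layernorm.weight": "layers.{i}.ffn_norm.weight",
}
```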
Alpaca-lora author here. I've added a script to merge and convert weights to state_dict in my repo (link). Curious to see it run on llama.cpp :) |
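For anyone curious what the merge itself boils down to, here is a minimal sketch of the arithmetic (not the actual script; it ignores dtype handling and assumes the usual peft tensor shapes):

```python
import torch

def merge_lora_into_linear(w_base: torch.Tensor,
                           lora_a: torch.Tensor,
                           lora_b: torch.Tensor,
                           lora_alpha: float,
                           r: int) -> torch.Tensor:
    """Fold a LoRA adapter into a base linear weight.

    w_base: (out_features, in_features) frozen base weight
    lora_a: (r, in_features) down-projection
    lora_b: (out_features, r) up-projection
    """
    scaling = lora_alpha / r
    # LoRA adds scaling * B @ A @ x at inference time, so merging is just
    # adding the scaled low-rank product to the base weight matrix.
    return w_base + scaling * (lora_b @ lora_a)
```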
Thank you so much @tloen. I was trying to convert the state_dict too, but was struggling to figure out the unpermute function. |
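For reference, here is roughly what unpermute looks like for the 7B shapes (dim=4096, n_heads=32); it reverses the head-wise reshuffling the HF conversion applies to the wq/wk matrices for its rotary-embedding layout. Treat the exact reshape as an assumption to verify against the conversion scripts:

```python
import torch

dim, n_heads = 4096, 32  # 7B dimensions, assumed for illustration

def unpermute(w: torch.Tensor) -> torch.Tensor:
    """Undo the per-head permutation applied to wq/wk during HF conversion."""
    return (
        w.view(n_heads, 2, dim // n_heads // 2, dim)
        .transpose(1, 2)
        .reshape(dim, dim)
    )
```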
Wow, that script is a much more straightforward approach than the rabbit hole I was going down. Nice work. |
I just tried the alpaca-lora merged model with quantization. The results were not as good as the examples shown in the tloen repo. It might be the price of quantization, or the merge might actually have been unsuccessful. Maybe I should modify the config in llama.cpp? Anyway, thank you everyone. |
Yeh, quantization wasn't great but running it with mixed fp16/fp32 gave expected performance. |
Ha! You've got almost the same code for turning HF back into llama format as me :) And I just had a thought: with the 4-bit GPTQ-quantized 7B/13B/... model in HF format, one could unpack it to float16, turn it into a llama model, and then requantize it with llama.cpp's quantize, which would hopefully preserve the quantization. |
So we now have https://github.com/antimatter15/alpaca.cpp, but it only has an endless chat mode. |
Anyway, here's a script that also does dequantization of 4-bit models so they can be requantized later (but it would only work with q4_1, and with a fix so that the min/max is calculated over the whole row, not just the QK=32 batch): https://gist.github.com/thement/90be08be928e0489aba7ecae9f0b352a If you think this is useful, I can maybe upgrade the convert_pth script. |
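As a sketch of what that dequantization involves (the exact block layout is an assumption here and should be checked against ggml.c for the version you are using): a q4_1 block stores a scale, a minimum, and QK packed 4-bit quants, and each value is reconstructed as q * scale + min.

```python
import numpy as np

QK = 32  # q4_1 block size

def dequantize_q4_1_block(block: bytes) -> np.ndarray:
    """Unpack one q4_1 block back to float32.

    Assumed layout: float32 scale, float32 min, then QK/2 bytes holding
    two 4-bit quants each (even index in the low nibble, odd in the high).
    """
    scale, mn = np.frombuffer(block[:8], dtype="<f4")
    packed = np.frombuffer(block[8:8 + QK // 2], dtype=np.uint8)
    values = np.empty(QK, dtype=np.float32)
    values[0::2] = packed & 0x0F
    values[1::2] = packed >> 4
    return values * scale + mn
```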
@thement wait, what? It can losslessly roundtrip? |
Could we load LoRA with llama.cpp? Some languages are not well supported by the original llama, but could be provided via LoRA. |
Yes, you just need to use the script by @tloen to obtain the pytorch weights, then convert them to ggml using the steps described in the README. Note that the output from llama.cpp is not the same as with pytorch, even without quantization. Here's a simple test I did locally with the Alpaca LoRA, sending the same request to llama.cpp using the converted weights (no quantization): it started well but ended up messing up later. The funny thing is that the llama.cpp implementation would have been faster (if it were correct), since it only loops over the array once. It really got messed up when I tried the 4-bit quantized weights. The good news is that the non-quantized version was faster and used less memory than the pytorch version on CPU, though maybe the pytorch slowdown was because it loaded the fine-tuning at runtime? @tloen might know. It would be a huge win if we could get llama.cpp to produce the same output as pytorch. @ggerganov might know more about this difference in outputs. |
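One thing that makes this kind of comparison more meaningful (a sketch; the checkpoint path and prompt are placeholders): use greedy decoding on the pytorch side, so any difference from llama.cpp reflects the implementations rather than sampling randomness. Something along these lines, assuming the LoRA has already been merged into an HF checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./alpaca-7b-merged"  # placeholder: merged HF checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)
model.eval()

prompt = "Below is an instruction that describes a task. ..."  # same prompt given to llama.cpp
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # do_sample=False -> greedy decoding, so the output is deterministic
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The llama.cpp run would need its sampling pinned down in the same way, otherwise the two outputs cannot be compared token for token.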
When you see a disparity between the outputs of two engines that should be identical, and you know how to use a debugger, it's quite helpful to debug both in parallel and see where the numbers start to diverge. |
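A lightweight way to do that without stepping through everything (a sketch; the dump filenames are hypothetical) is to save the same intermediate tensor, e.g. the hidden state after the first block, from both runs as raw float32 and diff them:

```python
import numpy as np

# Hypothetical dumps of the same intermediate tensor from both runs.
a = np.fromfile("pytorch_block0_out.bin", dtype=np.float32)
b = np.fromfile("llamacpp_block0_out.bin", dtype=np.float32)

diff = np.abs(a - b)
print("max abs diff:", diff.max())
over = np.flatnonzero(diff > 1e-3)
print("first index over 1e-3:", over[0] if over.size else None)
```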
I am using text-generation-webui to successfully train LoRAs for/from llama-7b (8-bit). Is there any way to merge a trained LoRA with the llama 7B weights? My goal is to train a LoRA, merge it, train again on something else, merge again, and so on. Does someone here have an idea of how to achieve the merging part? |
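If the LoRA was trained with peft, one way to do the merge (a sketch, assuming a reasonably recent peft version; the paths are placeholders) is to reload the base model in full precision and fold the adapter in with merge_and_unload:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base_path = "decapoda-research/llama-7b-hf"  # base weights the LoRA was trained against
lora_path = "./loras/my-first-lora"          # placeholder: trained adapter directory

# Merging into 8-bit weights isn't supported, so reload the base in fp16 first.
model = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(model, lora_path)

# Fold the adapter into the base weights and save; the merged checkpoint can
# then serve as the base model for the next training round.
merged = model.merge_and_unload()
merged.save_pretrained("./llama-7b-plus-lora-1")
```

Repeating load, train, and merge with the previous merged output as the new base would give the iterative workflow described above.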
The code doesn't work for llama-65B. It seems that the params in the code differ between 7B and 65B. How can I get the correct params? |
I was attempting to merge alpaca-lora from https://huggingface.co/tloen/alpaca-lora-7b with the original llama-7B from https://huggingface.co/decapoda-research/llama-7b-hf, and I also tried to quantize the model and run the main binary in llama.cpp.
The merge code is from https://github.com/clcarwin/alpaca-weight
It was almost successful until the final phase of running the main binary in llama.cpp. I had no problems with the merge and quantization.
Then it raised an error like this:
llama_model_load: llama_model_load: unknown tensor 'model.embed_tokens.weight' in model file
main: failed to load model from './models/7B/ggml-model-q4_0.bin'
I will share my logs in my repository. The code I used in colab to merge and quantize the model is there too: https://github.com/taiyou2000/personal_experimant
I'm not a machine learning expert and I haven't checked the entire llama.cpp code, but my theory is that the quantized model contains weights, some of which have names that main.cpp doesn't expect to see. As you can see in quantization_log.txt and pth_to_ggml_log.txt in my repository, it has names like "model.layers.0.self_attn.q_proj.weight", and they should probably look like "model.layers.0.attention.wq.weight" for main.cpp.
I can run llama.cpp without any problems on my local computer with a model quantized from the torrent version. I guess the huggingface version has something different about it.
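To check which naming convention a merged checkpoint actually ended up with before converting, a quick look at the state_dict keys (a sketch; the filename is a placeholder) makes the mismatch obvious:

```python
import torch

sd = torch.load("consolidated.00.pth", map_location="cpu")
# HF-style checkpoints show names like "model.layers.0.self_attn.q_proj.weight",
# while the original format uses names like "layers.0.attention.wq.weight".
for name, tensor in list(sd.items())[:10]:
    print(name, tuple(tensor.shape), tensor.dtype)
```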