Merge LLM #5
Comments
Yes, I have this question too: do you simply merge the adapter weights from your fine-tuning by averaging them with other base/instruction-fine-tuned models, or do you do a weighted average with the weights tuned on a validation set? Also, did you try merging multiple LoRAs from different fine-tuned models, and does that improve or degrade performance?
Seems like they use Line 37 in 885003b, which is based on simple additive merging (from the code here): https://github.com/huggingface/peft/blob/a916465ad0970944f3241305071d9b79fae55b59/src/peft/tuners/lora.py#L794-L802 Could you please confirm this?
Thanks for your interest. That is correct, it is a simple linear merge (for now...). We played around with the different types of LoRA modules, how the training data affects the outcome of the merge, how merging fine-tunes that used different LoRA modules works, etc. From our experience, the outcome of merging two (or more) LoRA-based models is very much dependent on 1) the LoRA modules both merged models were fine-tuned with (i.e. did one model use up/down/gate proj and the other k/v/q/o proj), 2) the training data, 3) the performance of both original models on whatever benchmarks you're using, and 4) (I think, but am still working on quantitative tests to explore this) the order of the LoRA merge. I believe the order of the merge also affects the "expertise" of the model.
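For readers following along, a minimal sketch of what such a simple linear merge typically looks like with peft. The model name and adapter path are placeholders, and this is not necessarily the exact call sequence used in this repo; the point is just that the adapter's low-rank update gets folded into the base weights.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Placeholder identifiers; substitute your own base model and adapter.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf", torch_dtype=torch.float16
)

# Attach the fine-tuned LoRA adapter and fold its update into the base
# weights (W <- W + (alpha / r) * B @ A for each adapted layer).
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")
merged = model.merge_and_unload()

merged.save_pretrained("merged-model")
```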
Thanks for the prompt response. It is interesting that the order of the merge seems to play a role. I wouldn't have guessed that, since additive merging seems permutation invariant (or maybe I misunderstood something). Do you have an intuitive justification for why order seems to play a role? I would be very curious to know more about the quantitative results too!
That was my thought too, initially (that order wouldn't matter, which is why it is not discussed in the paper we recently released). I only started looking into it because when we originally merged the Platypus-70B model with Dolphin, it was the only merge we had at the time that actually did worse than its original counterpart (the rest of our merges were better than both originals). Thanks again for your interest; follow up with me in a week and hopefully I'll have additional insight and experiments to share!
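To make concrete what "order of the merge" refers to, here is a hedged sketch of a sequential merge with peft; the base model and adapter paths are hypothetical. Each adapter is folded into the running model before the next one is attached, and with purely additive updates the end result should not depend on the order, which is what makes the reported effect surprising.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Hypothetical base model and adapter paths, purely for illustration.
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.float16
)

# Order A: fold in adapter_1 first, then adapter_2 on top of the result.
merged = PeftModel.from_pretrained(base, "adapters/adapter_1").merge_and_unload()
merged = PeftModel.from_pretrained(merged, "adapters/adapter_2").merge_and_unload()

# Order B would swap adapter_1 and adapter_2; since each merge just adds
# (alpha / r) * B @ A to the same frozen weights, the two orders should
# give identical tensors up to floating-point rounding.
```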
Thanks for your great work! I am also a little confused about the merging: does it merge the LoRA modules (i.e. merge the low-rank decomposition matrices B and A separately), or does it merge the two entire fine-tuned LLMs?
I think they directly merge the LoRA modules, based on this code snippet:
Sorry, I can't see where they call the function peft.lora.merge() in this repo; am I missing anything?
They call the peft wrapper function here (Line 37 in 885003b), which then internally calls the merge function linked above, I think!
That's just a normal merge() operation for LoRA, which is used to merge the learned LoRA module into the original model. If that's the case, there doesn't seem to be anything more novel here than usual.
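For concreteness, a simplified rendering of what that merge() boils down to for a single linear layer; the shapes and LoRA hyperparameters below are made up. The scaled product of the two low-rank matrices is simply added to the frozen base weight.

```python
import torch

# Made-up shapes and LoRA hyperparameters for one linear layer.
d_out, d_in, r, alpha = 4096, 4096, 16, 32

W = torch.randn(d_out, d_in)      # frozen base weight
A = torch.randn(r, d_in) * 0.01   # lora_A (down-projection)
B = torch.randn(d_out, r) * 0.01  # lora_B (up-projection)

scaling = alpha / r
W_merged = W + scaling * (B @ A)  # the additive update that merge() applies
```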
Right, I agree with you that it is the typical merging strategy. However, I'm not sure I fully get the novelty aspect: I did not get the impression from the paper that they used a novel merging strategy, but rather that merging with already instruction-fine-tuned models brought them the gains they see. I might be mistaken though, happy to hear your perspective on this! Maybe @arielnlee could pitch in too.
Really cool paper! Regarding the merging, maybe the procedure/method from LoraHub can give some inspiration: https://github.com/sail-sg/lorahub
First of all, I love this model! Great work from your team :) I've got a dumb question about merging models and I'm wondering if someone would be able to help me.

How do you merge models when you have a LoRA adapter for one model (e.g. an adapter trained on the Platypus dataset using frozen Llama 2 weights) and only the base weights of a second model (e.g. OpenOrca)? While I understand mixing two LoRA adapters, wouldn't the relationship between the weights and the outputs that the adapter learned no longer hold when you apply it to another fine-tuned model (like OpenOrca) whose weights may differ quite a bit from Llama 2's? Your model is clearly excellent, I just want to understand how.

As a secondary side question: can you merge the weights of models without using LoRA adapters and get good results? I'd love to be able to merge Stable Platypus 2 with a checkpoint of Llama 2 that has been extensively trained on Japanese, so that it could potentially become as smart as Stable Platypus 2 but in Japanese instead of English. I know Stable Platypus 2 is already pretty damn good at Japanese, I'd just like to make it even better. Thanks again!
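On the second question, one common and very simple baseline is plain linear interpolation of the two checkpoints' parameters. A rough sketch under the assumption that both models share exactly the same architecture and tokenizer; the model identifiers and the 0.5 mixing weight are placeholders, not a recommendation.

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoints; both must have identical architectures/shapes.
model_a = AutoModelForCausalLM.from_pretrained("org/model-a", torch_dtype=torch.float16)
model_b = AutoModelForCausalLM.from_pretrained("org/model-b", torch_dtype=torch.float16)

alpha = 0.5  # mixing weight; would normally be tuned on a validation set
sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
merged_sd = {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

model_a.load_state_dict(merged_sd)
model_a.save_pretrained("averaged-model")
```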
Hi, thanks for the great work! I have some tiny questions about the approaches in the paper.
Please let me know if I misunderstand anything.
Hi, glad to see your model at the top of the Open LLM Leaderboard!
Could you please share your method for merging LLMs?
Is it just a simple mixture of weights, like https://github.com/donaldafeith/Pytorch_Merge?