
Merge LLM #5

Open
0three opened this issue Aug 11, 2023 · 14 comments

Comments

@0three

0three commented Aug 11, 2023

Hi, glad to see your model is at the top of the Open LLM Leaderboard!

Could you please share your method for merging LLMs?

Is it just a simple mixture of weights, like https://github.com/donaldafeith/Pytorch_Merge?
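
(By a "simple mixture of weights" I mean something roughly like the sketch below, assuming both checkpoints share the same architecture; the function name is just illustrative.)

```python
def mix_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Element-wise linear mixture of two checkpoints with matching keys/shapes:
    W_mixed = alpha * W_a + (1 - alpha) * W_b
    """
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}
```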

@vishaal27

Yes, I have this question too: do you simply merge the adapter weights from your fine-tuning by averaging them with other base/instruction-FT models? Or do you do a weighted average, with the weights tuned on a validation set? Also, did you try merging multiple LoRAs from different fine-tuned models, and does that improve or degrade performance?

@vishaal27

vishaal27 commented Aug 12, 2023

Seems like they use:

model = model.merge_and_unload()

which is based on simple additive merging (from the code here):
https://github.com/huggingface/peft/blob/a916465ad0970944f3241305071d9b79fae55b59/src/peft/tuners/lora.py#L794-L802
Could you please confirm this?
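
For reference, my understanding is that the additive merge in that snippet boils down to something like the sketch below (a simplified illustration, not the actual peft implementation; the names are made up):

```python
import torch

def merge_lora_weight(base_weight: torch.Tensor,
                      lora_A: torch.Tensor,
                      lora_B: torch.Tensor,
                      scaling: float) -> torch.Tensor:
    # Additive LoRA merge: W' = W + scaling * (B @ A)
    # base_weight: (out_features, in_features)
    # lora_A:      (r, in_features)
    # lora_B:      (out_features, r)
    return base_weight + scaling * (lora_B @ lora_A)
```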

@arielnlee
Owner

Thanks for your interest. That is correct, it is a simple linear merge (for now...). We played around with the different types of LoRA modules, how the training data affects the outcome of the merge, how merging fine-tunes that used different LoRA modules works, etc.

From our experience, the outcome of merging two (or more) LoRA-based models depends heavily on 1) the LoRA modules both merged models were fine-tuned with (i.e., did one model use up/down/gate proj and the other k/v/q/o proj?), 2) the training data, 3) the performance of both original models on whatever benchmarks you're using, and 4) (I think, but am still working on quantitative tests to explore this) the order of the LoRA merge. I believe the order of the merge also affects the "expertise" of the model.
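
For anyone who wants to experiment with this, sequentially folding LoRA fine-tunes into one set of weights with peft looks roughly like the sketch below (the paths are placeholders, not our exact setup):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Shared base model that both adapters were trained against (placeholder path).
model = AutoModelForCausalLM.from_pretrained("path/to/llama-2-base")

# Fold the first LoRA fine-tune into the weights.
model = PeftModel.from_pretrained(model, "path/to/lora-adapter-1")
model = model.merge_and_unload()

# Fold a second LoRA fine-tune on top of the already-merged weights;
# swapping these two steps is the "order of the merge" mentioned above.
model = PeftModel.from_pretrained(model, "path/to/lora-adapter-2")
model = model.merge_and_unload()
```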

@vishaal27

Thanks for the prompt response. It is interesting that the order of the merge seems to play a role. I wouldn't have guessed that, since additive merging seems permutation-invariant (or maybe I misunderstood something). Do you have an intuitive justification for why order seems to matter? I would be very curious to know more about the quantitative results too!

@arielnlee
Owner

That was my thought too, initially (that order wouldn't matter, which is why it is not discussed in the paper we recently released). I only started looking into it because, when we originally merged the Platypus-70B model with Dolphin, it was the only merge we had at the time that actually did worse than its original counterpart (the rest of our merges were better than both originals). Thanks again for your interest; follow up with me in a week and hopefully I'll have additional insights and experiments to share! ☺️

@A11en0

A11en0 commented Aug 14, 2023

Thanks for your great work! I am also a little confused about the merging approach: does it merge the LoRA modules (i.e., merge the low-rank decomposition matrices B and A separately), or does it merge the two entire fine-tuned LLMs?

@vishaal27

I think they directly merge the LoRA modules, based on this code snippet:
https://github.com/huggingface/peft/blob/a916465ad0970944f3241305071d9b79fae55b59/src/peft/tuners/lora.py#L794-L802

@A11en0

A11en0 commented Aug 16, 2023

Sorry, I can't see where they call the function peft.lora.merge() in this repo. Am I missing something?

@vishaal27

They call the peft wrapper function here:

model = model.merge_and_unload()

This then internally calls the merge function linked above, I think!

@A11en0

A11en0 commented Aug 16, 2023

That's just the standard merge() operation for LoRA, which folds the learned LoRA module into the original model. If that's all it is, it doesn't seem to add anything novel beyond the usual approach.

@vishaal27

Right, I agree with you that it is the typical merging strategy. However, I'm not sure I fully get the novelty concern: I did not get the impression from the paper that they used a novel merging strategy, but rather that merging with already instruction-fine-tuned models brought them the gains they see. I might be mistaken, though; happy to hear your perspective on this! Maybe @arielnlee could pitch in too.

@SivilTaram

Really cool paper! Regarding the merging, maybe the procedure/method from LoraHub can give some inspiration: https://github.com/sail-sg/lorahub

@Peter-Devine

First of all - I love this model! Great work from your team :)

I've got a dumb question about merging models and I'm wondering if someone would be able to help me.

How do you merge models when you have a LoRA adapter for one model (e.g. an adapter trained on the Platypus dataset using frozen Llama 2 weights) and only the base weights of a second model (e.g. OpenOrca)? While I understand mixing two LoRA adapters, wouldn't the relationship between the weights and the outputs that the adapter learns fail to hold when you apply it to another fine-tuned model (like OpenOrca) whose weights may be quite different from Llama 2's?
To the best of my knowledge, OpenOrca is not trained using LoRA but by directly updating the weights, so won't the weights of that model interfere with the projection that the LoRA adapter has learned? Or is the assumption that, even after fine-tuning, the weights of OpenOrca are similar enough to Llama 2's for the adapter to still work well?
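
Concretely, I imagine the merge looks something like the sketch below (the paths are placeholders and I'm guessing at the setup):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# A full fine-tune that shares Llama 2's architecture (e.g. an OpenOrca-style
# checkpoint), so the LoRA adapter's target module names still line up.
base = AutoModelForCausalLM.from_pretrained("path/to/openorca-style-model")

# A LoRA adapter trained against the original, frozen Llama 2 weights.
model = PeftModel.from_pretrained(base, "path/to/platypus-lora-adapter")

# This adds the adapter's low-rank delta on top of the fine-tuned weights.
model = model.merge_and_unload()
```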

Your model is clearly excellent, I just want to understand how.

As a secondary side question: can you merge the weights of models without using LoRA adapters and still get good results? I'd love to be able to merge Stable Platypus 2 with a checkpoint of Llama 2 that has been extensively trained on Japanese, so that it could potentially become as smart as Stable Platypus 2 but in Japanese instead of English. I know Stable Platypus 2 is already pretty damn good at Japanese; I'd just like to make it even better.

Thanks again!

@eric8607242

Hi, thanks for the great work!

I have some tiny questions about the approaches of the paper.

  1. If I do not misunderstand the paper, after fine-tuning the base model (e.g., LLaMA-v2) with LoRA, we can directly merge the adapter with another instruction-tuned model (e.g., OpenOrcaxOpenChat) to improve performance. But why not fine-tune the instruction-tuned model (e.g., OpenOrcaxOpenChat) on the proposed dataset directly? Do you have any performance comparisons between these two approaches (merging with another tuned model vs. directly fine-tuning another tuned model), under the same training budget, of course? Or any results on the performance gain when merging more than two different instruction-tuned models?
  2. Are there any performance gaps between merging the entire model weights and merging only the adapter?

Please let me know if I have misunderstood anything.
Thanks for the great work again.
