Combine multiple (IA)^3 Adapters and delete (IA)^3 adapters #980
Good catch, this is a bug in the existing code base.
I like having a separate method for this, but the name is not quite fitting. This combines different module names, right? So could you please adjust the name to reflect that? Also, please add a sentence to the docstring that explains what happens.
I think this is not correct: When using IA³, the IA³ weights have to be multiplied, not added, right? I.e. they should be initialized as 1.0 and then each IA³ weight is multiplied on top, not added. See how it's accomplished in the forward method of IA³:
peft/src/peft/tuners/ia3/layer.py
Lines 177 to 182 in 0c16918
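The multiplicative behavior can be sketched in isolation (a toy NumPy model of the IA³ forward pass, not the actual peft code):

```python
import numpy as np

# Toy model of the IA³ forward pass: each active adapter holds a learned
# scaling vector that multiplies the activations elementwise.
x = np.array([1.0, 2.0, 4.0])      # input activations
l_a = np.array([2.0, 2.0, 2.0])    # scaling vector of adapter A
l_b = np.array([3.0, 0.5, 1.0])    # scaling vector of adapter B

# With both adapters active, the scalings compose multiplicatively:
out_sequential = x * l_a * l_b

# A merged adapter therefore needs the elementwise *product* ...
l_mul = l_a * l_b
# ... not the sum of the vectors:
l_add = l_a + l_b

assert np.allclose(x * l_mul, out_sequential)      # product reproduces it
assert not np.allclose(x * l_add, out_sequential)  # sum does not
```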
If this is correct, we encounter a second problem, namely that the `weights` argument makes little sense: since we just multiply each IA³ weight with each weight from `weights`, due to commutativity the order in `weights` doesn't matter. Whether a user passes `weights=[2, 3]` or `weights=[3, 2]` makes no difference.

We could still leave it as is for consistency, but I would be afraid that this would confuse many users. Instead, we could also 1) remove the `weights` argument entirely for IA³ or 2) only pass a single scalar to `weights`, which is applied once to all weights (it could be set as the initial value). WDYT?
Thanks for the feedback and review!
This is true in the forward pass. The learned vectors $l$ are multiplied with (in the case of attention) $K^T$ and $V$. However, here we are not considering the key and value matrices, only the learned vectors $l$ (as far as I understand), so my approach here was to compute a linear combination of the vectors (which is what we do in LoRA?).

Let's assume we have two adapters that target $K$ and $V$ with associated vectors $l_K$ and $l_V$, and weights `[0.6, 0.4]`. The way I wanted to combine these adapters into a new adapter was $l_{new} = 0.6 \cdot l_K + 0.4 \cdot l_V$. If we also target the FF layers, we would compute the resulting vector using the same procedure.

If we multiply the vectors, yes. However, that would not result in a linear combination of the vectors, which was my goal.

Let me know if this makes sense!
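A minimal sketch of that linear combination (NumPy; `combine_linear` is a hypothetical helper, not part of peft):

```python
import numpy as np

def combine_linear(vectors, weights):
    """Weighted linear combination of IA³ scaling vectors (the approach
    proposed above, analogous to combination_type='linear' for LoRA)."""
    return sum(w * v for w, v in zip(weights, vectors))

l_k = np.array([1.5, 0.8])   # learned vector of the first adapter
l_v = np.array([0.5, 2.0])   # learned vector of the second adapter

# New adapter vector: 0.6 * l_k + 0.4 * l_v
l_new = combine_linear([l_k, l_v], [0.6, 0.4])
```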
Hmm, not sure. Let's work with scalars for a second. Let's say we have one IA³ weight with value 2 and one with value 3. As they are multiplied consecutively on the input, I would expect that we should multiply by 6, not by their sum 5. Am I missing something?
Anyway, I thought: why not just test whether the results are right or not? For this, I changed the test you added to do this instead:
As you can see, we test the outputs of an IA³ model with the 3 adapters active but unmerged, vs merged, vs merged using `add_weighted_adapter` (your implementation), vs merged using `add_weighted_adapter_mul` (my implementation using multiply). When I run the tests, the multiply version passes but the addition version fails, which makes me think that multiplying is the way to go.

If you want to replicate this result, it will require a few steps, because our code isn't really set up to work with multiple active adapters yet, so I had to make a few ad hoc changes to even get this far. I created a PR on top of your branch containing those changes:
https://github.com/alexrs/peft/pull/1/files
Obviously, it should not be merged, it's just to show you what steps I took. WDYT, is this plausible?
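Stripped of the model plumbing, the comparison in that test reduces to the following check (a toy NumPy version, independent of peft):

```python
import numpy as np

# Three IA³ adapters, each reduced to an elementwise scaling vector.
x = np.array([1.0, 2.0, 3.0, 4.0])
adapters = [np.full(4, 2.0), np.full(4, 3.0), np.full(4, 0.5)]

# Reference: all three adapters active but unmerged (applied in sequence).
out_unmerged = x.copy()
for l in adapters:
    out_unmerged = out_unmerged * l

merged_mul = np.prod(adapters, axis=0)  # "multiply" merge
merged_add = np.sum(adapters, axis=0)   # "addition" merge

assert np.allclose(x * merged_mul, out_unmerged)      # multiply matches
assert not np.allclose(x * merged_add, out_unmerged)  # addition does not
```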
I see your point! However, I'm not sure this is consistent with the LoRA implementation. As far as I understand, there are two different scenarios here:
1. Stacking adapters: When using `set_adapter` with multiple adapters, what we are doing is stacking them. That's how it works right now, and how it works in LoRA (I think!). This is equivalent to using `combination_type="cat"` in LoRA's `add_weighted_adapter` (see `peft/tests/test_custom_models.py`, lines 819 to 827 in e98df91).
2. Linear combination of adapters: In this case, we are not stacking adapters but combining them to create a new adapter that is a linear combination of the input adapters and the input weights. This is equivalent to `combination_type="linear"` in LoRA's `add_weighted_adapter`. If we change the code linked above to use `linear`, the test fails, and the same happens if we give both adapters equal weights that sum to 1.
I guess a solution is to add the different $(IA)^3$ `combination_type`s to `add_weighted_adapter`. Does this sound reasonable? Or do I have the wrong understanding of how this works?
Yes, you're right in the sense that for IA³, it is not quite clear how to interpret the combination of results. Unfortunately, I don't think there is any existing evidence for what the best way of combining IA³ adapters is. I agree that we could offer multiple methods and that hopefully, with time, the best method will emerge. When it comes to choosing a default, I'd argue it's a nice property for the combined adapter to produce the same output as if all adapters were active at once, WDYT?
Another possibility that comes to mind would be the geometric mean, which seems appropriate for a multiplicative operation, but it wouldn't work for negative numbers, so it has to be ruled out.
When it comes to naming the combination types, the analogy to LoRA is a bit difficult, because the mathematical operation is different. I think for IA³ it is necessary to think from first principles.
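The sign problem with the geometric mean mentioned above is easy to see (NumPy sketch):

```python
import numpy as np

# Two IA³ scaling vectors, one with a negative entry.
l_a = np.array([4.0, -2.0])
l_b = np.array([1.0, 8.0])

prod = l_a * l_b  # elementwise product: [4.0, -16.0]

# The geometric mean prod ** (1/2) is undefined for the negative entry:
with np.errstate(invalid="ignore"):
    geo = prod ** 0.5  # second entry comes out as NaN

assert np.isnan(geo[1])
```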
Agreed.
That makes sense! But as discussed above, it is not how it works in LoRA by default, is it?
I guess the way to proceed is to allow both the multiplication and the linear combination methods via different `combination_type`s, and to set the default to multiplication?

All in all, given that there is no evidence for the best way of combining adapters, I will try to run some experiments using both methods to get more clarity on this topic. Let me know if you have any suggestions or ideas for this!
Yes, but we cannot really compare the two as I mentioned. E.g. it would not make sense to have an "svd" method for IA³, so I think we shouldn't put too much stress on consistency here.
That would be fantastic. Loading and combining multiple LoRAs is mostly a thing in image generation AFAIK, so that's probably what I would investigate, but I'm not sure how well IA³ lends itself to image generation in general.
Linear combination for the `get_delta_weight` method. How should we test that the result is correct?