
ENH Support Conv2d layers for IA³ #972

Merged

Conversation

BenjaminBossan
Member

Adds support for Conv2d layers to the IA³ tuner. Tests are added to check that they work.

Notes:

Unfortunately, when unmerging the Conv2d IA³ layers, there is quite a bit of rounding error. I had to increase the tolerances for this specific test case to make the tests pass. I'm not 100% sure why this is, but I could imagine that for Conv2d, small errors accumulate because of the convolution operation.

I also added tests for IA³ Linear layers for the custom models, which also pass. However, there is an error when using Conv1D. The reason is that merging fails because there is a shape mismatch when fan_in_fan_out=True (which is set automatically for Conv1D). I'm not sure how this should be fixed. For the time being, I have commented these tests out.

I also noticed that I don't understand what the feedforward_modules parameter does in IA³. AFAICT, it always has to be set for all IA³ layers, i.e. self.is_feedforward always must be True. If it cannot be False, we could as well remove feedforward_modules and automatically set is_feedforward to True.
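To illustrate what this enables, here is a minimal usage sketch (the toy model and module names are made up for the example, not taken from this PR):

import torch.nn as nn
from peft import IA3Config, get_peft_model

class TinyCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(8, 16, kernel_size=3, padding=1)
        self.head = nn.Linear(16, 2)

    def forward(self, x):
        x = self.conv2(self.conv1(x)).mean(dim=(2, 3))  # global average pooling
        return self.head(x)

config = IA3Config(
    target_modules=["conv1", "conv2"],
    feedforward_modules=["conv2"],  # these layers scale the input instead of the output
)
model = get_peft_model(TinyCNN(), config)
model.print_trainable_parameters()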

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Sep 27, 2023

The documentation is not available anymore as the PR was closed or merged.

@BenjaminBossan
Member Author

Ping @SumanthRH

@SumanthRH
Contributor

SumanthRH commented Sep 28, 2023

Hi @BenjaminBossan, regarding the feedforward layers, this detail is in a gray area, as we can only go by the T-Few paper.
TL;DR: is_feedforward is needed because the IA³ weights get multiplied in different places for attention vs. feedforward blocks. The original T-Few implementation doesn't make this distinction, mainly because they use T0, which is based on T5 1.1 and has a unique 3-layer feedforward block.

In the original IA³ paper, here is the snippet where the equation for the modified model is shown:

[image: screenshot of the IA³ equations from the paper]

Following the two equations, the current implementation does the following (see the sketch below):

  1. If is_feedforward is False, the IA³ weights are multiplied with the activations of the linear layer (i.e. after the matmul).
  2. If is_feedforward is True, the IA³ weights are multiplied with the inputs of the linear layer.

This follows the two equations given in the paper (I didn't think we could merge this, because multiplying in the input space gives you a vector with a different dimension than multiplying after the matmul).
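To make the two cases concrete, here is a minimal sketch of what the forward pass does in each case (simplified and illustrative, not the actual PEFT layer code; assuming a Linear base layer):

import torch
import torch.nn as nn

def ia3_forward(base_layer: nn.Linear, ia3_l: torch.Tensor, x: torch.Tensor,
                is_feedforward: bool) -> torch.Tensor:
    # Simplified IA³ forward pass for a Linear base layer (illustrative only).
    if is_feedforward:
        # case 2: scale the *input* of the layer; ia3_l has shape (in_features,)
        return base_layer(x * ia3_l)
    # case 1: scale the *output* of the layer (after the matmul);
    # ia3_l has shape (out_features,)
    return base_layer(x) * ia3_l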

How do we know this is right?

I initially went over the original implementation from the authors here. Their implementation is for T5 v1.1 only, and the feedforward logic for all the layers is actually the same: the activations (after the matmul) are multiplied by the IA³ weights (line of interest). Now, with T5 1.1, the feedforward block is unique: we have three matmuls instead of the usual two (hf source code). If you look at the configuration for IA³ in T-Few, they've applied the weights to the wi_1 layer. If you write down the final equation, you get $W_2 (IA^3_l \cdot W_1(x) \cdot \gamma (W_0 x))$. For all other architectures, I did not see how we could avoid having an is_feedforward flag. This is why the feedforward and the non-feedforward layers are in separate mappings, and the mapping is different for MT5 (based on v1.1) vs T5.
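To make the equation concrete, here is a simplified sketch of such a gated feedforward block with the IA³ vector applied to the wi_1 branch (illustrative, not the actual T5 or T-Few code; T5 v1.1 uses a slightly different GELU variant):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFFNWithIA3(nn.Module):
    # Simplified T5 v1.1-style gated feedforward block,
    # wo(gelu(wi_0(x)) * wi_1(x)), with the IA³ vector scaling the wi_1 branch.
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.wi_0 = nn.Linear(d_model, d_ff, bias=False)
        self.wi_1 = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)
        self.ia3_l = nn.Parameter(torch.ones(d_ff))  # learned scaling vector

    def forward(self, x):
        gate = F.gelu(self.wi_0(x))          # gamma(W_0 x)
        scaled = self.ia3_l * self.wi_1(x)   # IA3_l * W_1(x)
        return self.wo(scaled * gate)        # W_2 (IA3_l * W_1(x) * gamma(W_0 x))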

Also, I'm happy to take a look at the Conv1D shape issue.

@BenjaminBossan
Member Author

BenjaminBossan commented Sep 28, 2023

Thanks @SumanthRH for the detailed explanation. Mathematically, it was clear to me what the difference is, but there were still sources of confusion. I think the first one stems from the name "feedforward", which can mean different things depending on the circumstances. The way it is used in the paper is to designate a very specific part of the model architecture, but this doesn't necessarily generalize to other architectures. In hindsight, I think we should have named the variable for what it actually does, namely deciding whether the weight is multiplied before or after the layer.

I think it would also have been better to separate the target_modules and feedforward_modules cleanly, as in the example below:

# currently:
IA3Config(
    target_modules=["layer1", "layer2", "layer3", "layer4"],
    feedforward_modules=["layer2", "layer4"],
)
# might be better:
IA3Config(
    target_modules_before=["layer1", "layer3"],
    target_modules_after=["layer2", "layer4"],
)

I think this could have made things more obvious, but changing it now would be backwards breaking, so we should keep it as is.

  1. If is_feedforward is False, the IA³ weights are multiplied with the activations of the linear layer (i.e. after the matmul).

I think a crucial difference is that it is multiplied after matmul and after the non-linearity, right? This means that if we merge the IA³ weights into the normal weights, the output is incorrect because the merge does not (and cannot) take the non-linearity into account. @SumanthRH do you agree with this?

If true, this would explain why the tests fail when not all modules are included in the feedforward_modules argument, which is what I meant when I wrote earlier:

AFAICT, it always has to be set for all IA³ layers, i.e. self.is_feedforward always must be True. If it cannot be False, we could as well remove feedforward_modules and automatically set is_feedforward to True.

So this statement is not quite correct: is_feedforward can be False, but then we should not allow merging. I'll update this PR to raise an error when trying to merge with self.is_feedforward=False to avoid returning wrong results.

Edit: Thinking a bit more about this, isn't our implementation even potentially incorrect for is_feedforward=False? We multiply the weight with the output of the Linear layer, but according to the paper, it should be multiplied with the output after applying the non-linearity. This is mathematically not the same. Technically, this cannot be fixed easily because the Linear layer does not know what non-linearity will be applied, so we would have to inspect the graph (not really possible with PyTorch) and check what the subsequent layer is, and, if it's a non-linearity, apply IA³ to that layer instead. I hope that I made a mistake, but if not, we should not allow is_feedforward=False.
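As a quick toy illustration of this point (plain PyTorch, not PEFT code): scaling the Linear output before the non-linearity is not the same as scaling after it.

import torch

torch.manual_seed(0)
x = torch.randn(4, 8)
linear = torch.nn.Linear(8, 6)
ia3_l = torch.rand(6) + 0.5              # a positive IA³-style scaling vector
act = torch.nn.GELU()

scaled_before = act(ia3_l * linear(x))   # scale the Linear output, then apply the non-linearity
scaled_after = ia3_l * act(linear(x))    # what the paper's equation describes
print(torch.allclose(scaled_before, scaled_after))  # False in general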

- correct merging of conv2d for is_feedforward=True vs False
- correct merging for is_feedforward=False and Linear layer (take bias
  into account)
- extend test examples for custom models
- extend test to include merge_adapter and unmerge_adapter
@BenjaminBossan
Member Author

Okay, I made some progress:

Merging IA³ with is_feedforward=False was incorrect: since the IA³ weight is applied after calling the layer, merging has to take the bias into account. Therefore, the bias is now also updated when merging with is_feedforward=False. There was also a small mistake with transposing when merging Conv2d.
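As a rough sketch of the corrected merge behaviour for a Linear layer (simplified and illustrative; the actual implementation also handles Conv2d, dtypes, and transposition):

from typing import Optional
import torch

@torch.no_grad()
def merge_linear(weight: torch.Tensor, bias: Optional[torch.Tensor],
                 ia3_l: torch.Tensor, is_feedforward: bool):
    # Fold the IA³ vector into a Linear layer's parameters (sketch only).
    # nn.Linear weight has shape (out_features, in_features).
    if is_feedforward:
        # output = W @ (l * x) + b  ->  scale the columns of W, bias untouched
        weight = weight * ia3_l.view(1, -1)
    else:
        # output = l * (W @ x + b)  ->  scale the rows of W AND the bias
        weight = weight * ia3_l.view(-1, 1)
        if bias is not None:
            bias = bias * ia3_l
    return weight, bias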

I also noticed that we never tested merge_adapter and unmerge_adapter. The former didn't matter so much, because we tested merging elsewhere, but we didn't actually test the case of merge + unmerge. I extended _test_merge_layers to do this now.

I'm still getting errors for Conv1D so if you could take a look @SumanthRH that would be great.

Regarding the question of whether the case is_feedforward=False is implemented correctly: My current understanding is that it is correct only if there is no non-linearity. This may be the case for the architectures studied in the paper, I don't know. If there is a non-linearity, it still works, but it's not the same as what is shown in the screenshot that Sumanth posted. We could maybe document it as such and leave it in for now. But I think that if we don't apply it after the non-linearity, there is not actually much point in using is_feedforward=False compared to True, as it only affects whether the bias is also scaled or not.

@SumanthRH
Contributor

SumanthRH commented Sep 29, 2023

Hi @BenjaminBossan,

Regarding your comment

isn't our implementation even potentially incorrect for is_feedforward=False? We multiply the weight to the output of the Linear layer, but according to the paper, it should be multiplied to the output after applying the non-linearity. This is mathematically not the same.

I believe the current implementation is fine as long as you use the right set of target/feedforward modules! IA³ weights are added to the second feedforward layer, not the first! So, when you pass is_feedforward=True, the IA³ vector is multiplied with the input of $W_2$; if you write that down, it boils down to $W_2 \, [IA^3_l \cdot \gamma(W_1 x + b)]$. And then when we're merging this layer, the learned vector can equivalently be broadcast and element-wise multiplied with $W_2$.
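A tiny check of that equivalence (illustrative toy code): multiplying the input by the vector is the same as folding the vector into the columns of $W_2$.

import torch

torch.manual_seed(0)
W2 = torch.randn(6, 8)        # (out_features, in_features)
ia3_l = torch.rand(8) + 0.5   # IA³ vector over the input dimension of W2
h = torch.randn(8)            # stands in for gamma(W_1 x + b)

out_unmerged = W2 @ (ia3_l * h)   # scale the input at runtime
out_merged = (W2 * ia3_l) @ h     # fold the vector into the columns of W2
print(torch.allclose(out_unmerged, out_merged))  # True (up to float error)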

And you're right about the bug on merging! I should have written out tests right away then! We had tested the merge functionality on a GPT-2 model, but I think for the specific fine-tuned model we used, the logits were unaffected even if you ignored merging the bias term.

I have boiled down the bug to this implementation detail:

  1. is_feedforward=False. In this case, the output is $IA^3_l \cdot (W x + b)$. Thus, when you merge and unmerge, you have to scale/unscale the bias term as well by $IA^3_l$.
  2. is_feedforward=True. In this case, the output is $W (IA^3_l \cdot x) + b$. Thus, when you merge/unmerge, the bias term is unaffected.

Is this even right? From what I see with the T-Few implementation, this looks to be the case. T-Few adds trainable parameters to the activations after the $W_{i1}$ layer in T0, so if you write down the equation with bias, you get $W_o \, [\, IA^3_l \cdot (W_{i1} x + b_{i1}) \cdot \gamma(W_{i0} x + b_{i0}) \,] + b_o$. This is equivalent to our code. This definitely makes the implementation a bit non-intuitive, sorry about that! Since we can only modify the Linear layer, I feel like there's no other option if we want to replicate the paper. Maybe we need an experiment to see if this really matters or if we can just stick the learned vector after the activations everywhere. But if we're replicating the paper itself, it is very clear where the vector should go:

[image: snippet from the paper showing where the IA³ vectors are applied]

The solution for the merge bug would be to have different merging/unmerging depending on the is_feedforward flag, which I see you've already done in this PR!

@BenjaminBossan
Member Author

I believe the current implementation is fine as long as you use the right set of target/feedforward modules! IA³ weights are added to the second feedforward layer, not the first! So, when you pass is_feedforward=True, the IA³ vector is multiplied with the input of $W_2$; if you write that down, it boils down to $W_2 \, [IA^3_l \cdot \gamma(W_1 x + b)]$. And then when we're merging this layer, the learned vector can equivalently be broadcast and element-wise multiplied with $W_2$.

Ah yes, I see, it's a matter of perspective I guess :) My issue, which maybe I didn't put clearly, is rather that when a user wants to use IA³ on a different model architecture with a layer that is not listed in feedforward_modules (i.e. is_feedforward=False), they may think that we apply the weight after the non-linearity, which is not true (and would be hard to implement). Yes, they should target the next layer to get the same effect, but it's not obvious IMHO.

Also, I think my point stands that is_feedforward=False, as is, isn't very useful, as the only difference is whether the bias is scaled or not, which intuitively doesn't feel helpful. But as you said, more testing would be required to really make sure.

Maybe this can all be solved by better documentation, but I also think the API is not optimal. I would think that most users want is_feedforward=True, but that is not the default, and it requires passing the same list of target modules twice. The most desired setting should be the default, not vice versa. I'm not sure if we want to change that, as it's backwards breaking. Let's keep it for another PR if we do decide to change it.

The solution for the merge bug would be to have different merging/unmerging depending on the is_feedforward flag, which I see you've already done in this PR!

Yes, I think merging should be correct now and tests seem to support this. The only setting that is still failing is when using Conv1D, which I would also be fine with fixing in a separate PR (tests are there but commented out).

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Sep 29, 2023
This should resolve the failing slow test
test_4bit_merge_and_disable_lora.

While investigating, I also noticed that merging multiple adapters was
not correct for IA³. I added a test that should catch this bug and
provided a fix for it too. However, the test does not check IA³ at the
moment because the test parameters do not contain IA³. For this, huggingface#972
needs to be merged too, which adds IA³ to the test parameters.
pacman100 pushed a commit that referenced this pull request Oct 3, 2023
* Fix issues with merging multiple adapters

* Small adjustments to tests

Previously, tests had some exploding gradients, making them unstable.
@pacman100
Contributor

pacman100 commented Oct 3, 2023

Hello, insightful discussion above. I want to add a few points on why the current API is correct.

  1. Let's take Llama as an example:
    target_modules = ["k_proj", "v_proj", "down_proj"] and feedforward_modules=["down_proj"] (see the config sketch after this list).
    Here, note that the attention submodules k_proj and v_proj don't have biases, so merging is unaffected by the bias. Also, note that down_proj is the layer following the non-linearity applied to the output of up_proj. Here, the adapter weights are multiplied with the input because it is part of the feedforward sub-block, so again the biases are not an issue.

  2. When working with custom models, if someone applies adapters to layers with biases and does not list them in feedforward_modules, they would hit the bias bug. In practice, however, such layers are from the attention submodule, which doesn't have biases, so they work as expected.

  3. To put it simply, attention sub-module layers needn't be part of feedforward_modules. Only the second layer of the FFN submodule needs to be part of feedforward_modules.
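For concreteness, the Llama setup from point 1 corresponds to a config along these lines (a sketch; the task_type shown is an assumption for a causal LM):

from peft import IA3Config

# The Llama setup from point 1: attention projections plus the second FFN
# layer as targets, with only down_proj treated as feedforward.
config = IA3Config(
    task_type="CAUSAL_LM",
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],
)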

I think the target and feedforward modules for the recently added Falcon model (in the other.py file) are incorrect: they only target the attention submodule and miss the feedforward submodule, so this needs correction.

- learning rate for IA³ EmbConv1D needs to be increased for changes in
  the output to become detectable
- transpose now works correctly with nn.Parameter

Note: IA³ + Conv1D tests are still commented out because they fail for
different reasons (shape mismatch).
@BenjaminBossan
Member Author

Thanks @pacman100 for investigating. To summarize for myself: the way IA³ is used in the specific transformers architectures we explicitly added, it works correctly because there is no bias. The exception is Falcon, which needs an adjustment of the target layers (could you make a PR for that?).

When it comes to applying IA³ to other layers, merging could be incorrect when the layer has a bias, which is fixed by this PR.

There is still the issue of Conv1D not working when merging due to a shape mismatch, but this can be addressed in a separate PR.

@BenjaminBossan
Member Author

@pacman100 @younesbelkada I think the open questions have been discussed, so this PR should be ready for review.

Contributor

@pacman100 pacman100 left a comment


Thank you @BenjaminBossan for adding Conv2d layer support for IA³, adding all the related tests, and fixing the (un)merge when using IA³ for layers with bias 🚀. LGTM!

Just a mild concern about the very high thresholds for tests when using IA³ for Conv2d layers, but that can be investigated later on.

@BenjaminBossan
Member Author

Just a mild concern about the very high thresholds for tests when using IA³ for Conv2d layers, but that can be investigated later on.

Yes, for sure. I did a manual check that the results are correct, in the sense that they're highly correlated with the expected output (not just some random values). I suspect that the convolution operation, as it is applied repeatedly, accumulates errors, leading to a greater total deviation.
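For reference, a sketch of the kind of manual sanity check described here (a hypothetical helper, not part of the test suite): compare the merged and unmerged outputs by correlation rather than by strict element-wise tolerance.

import torch

def outputs_highly_correlated(a: torch.Tensor, b: torch.Tensor,
                              threshold: float = 0.999) -> bool:
    # Check that two outputs are near-perfectly correlated, even if strict
    # element-wise tolerances fail because of accumulated numerical error.
    stacked = torch.stack([a.flatten().float(), b.flatten().float()])
    corr = torch.corrcoef(stacked)[0, 1]
    return corr.item() > threshold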

@BenjaminBossan BenjaminBossan merged commit d17266d into huggingface:main Oct 9, 2023
@BenjaminBossan BenjaminBossan deleted the enh-ia3-conv2d-support branch October 9, 2023 10:20