[Add] DoRA Embedding #2006
Conversation
Thanks @ariG23498, this looks good so far. Probably there could be some refactoring. Unfortunately, I don't think there is any reference that we can compare this to, as the DoRA paper does not mention application to embedding layers (unless I missed it). Probably a good next step would be to take one of the existing examples and check what happens when the new DoRA embedding layer is used. I would expect results to be very similar to LoRA embeddings, with (hopefully) a slight edge in scores for DoRA.
We could also request a review from the DoRA author (sorry, I forgot their name), like we did when we added Conv DoRA support.
@sayakpaul I agree. That would be great. I looked into the previous PRs and got hold of the author's GitHub username. On the other hand, I have used the QDoRA finetuning example and added the embedding layer in the LoRA config. Here is the colab notebook. Let me know what you all think.
Yes, we could ping nbasyl but let's wait until the PR is ready for review or until we have concrete questions to ask. Thanks for providing the example notebook. I left it running for a bit longer (also increased batch size to 2) and it appears that the presence of the DoRA embedding layer destabilizes the loss.
I'm not exactly sure why that is but I checked the adapter weights on the layers and found that the L2 norm of the embedding layer is larger than that of the other layers:
Maybe this means that the lr for embedding layers should be lowered? I haven't checked further, but LoRA+ could be an option here. Also, it could just be a fluke on this dataset.
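As a minimal illustration of the first idea (lowering the learning rate only for the embedding parameters), plain PyTorch parameter groups are enough. The model and learning rates below are hypothetical stand-ins, not values from this PR:

```python
import torch

# Toy model standing in for a PEFT-wrapped network (names are illustrative).
model = torch.nn.ModuleDict({
    "emb": torch.nn.Embedding(100, 16),
    "proj": torch.nn.Linear(16, 32),
})

# Give embedding parameters a lower learning rate than the remaining layers.
emb_params = [p for n, p in model.named_parameters() if "emb" in n]
other_params = [p for n, p in model.named_parameters() if "emb" not in n]
optimizer = torch.optim.AdamW([
    {"params": emb_params, "lr": 1e-5},   # lowered lr for embeddings
    {"params": other_params, "lr": 1e-4},
])
```

The same mechanism is what LoRA+-style optimizers build on: different parameter groups get different learning rates.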
@BenjaminBossan could the L2 norm of the embedding weights be large due to the size of the embedding weights? I created a simple base model with torch and printed the L2 norm of the weights as follows:

```python
import torch

class BaseModel(torch.nn.Module):
    def __init__(self, vocab_size, emb_dims, out_dims):
        super().__init__()
        self.emb = torch.nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dims)
        self.proj = torch.nn.Linear(in_features=emb_dims, out_features=out_dims)

    def forward(self, x):
        x = self.emb(x)
        x = self.proj(x)
        return x

model = BaseModel(vocab_size=100, emb_dims=16, out_dims=32)

for name, param in model.named_parameters():
    print(name, torch.linalg.norm(param, ord=2).item())
```

```
emb.weight 13.42896842956543   # significantly larger than the projection layer
proj.weight 1.2026439905166626
proj.bias 0.7463299036026001
```

Do you want me to train the model using the LoRA+ optimizer?
Thanks for checking. Note that I didn't check the norm of the embedding layer itself but of the associated LoRA weights. Still, the norm could of course be related to the overall shape of the layer. Maybe this check makes no sense, but I just wanted to get a hint on what could be going wrong. I did another run with LoRA instead of DoRA applied to the embedding and linear layers, and there the loss looks similarly bad as for DoRA on embedding. So it's most likely more of an issue with adapting the embedding layer and less with DoRA. Still, it would be nice if we could find an example of using DoRA embedding that indeed improves model performance before merging the PR. LoRA+ was just one idea.
@BenjaminBossan I noticed that @pacman100 had a working E2E example on training an embedding layer with LoRA here. Unfortunately, I do not have the GPU compute to run the example. Would you mind running it with the current DoRA implementation?
Good point. I used this notebook instead (with some minor changes like using bfloat16), which is similar but a bit more up-to-date. Below are some results: As you can see, LoRA, DoRA, and DoRA & LoRA+ are all very similar. I also noticed that re-running the same notebook with a different seed yielded quite different results, so these differences are not significant. I think we can proceed with this PR, as we have no indication so far that DoRA works worse on embeddings than LoRA.
@BenjaminBossan I have marked the PR for review.
Would you mind adding some tests for DoRA embeddings? There are already some LoRA embedding test cases in
src/peft/tuners/lora/layer.py
Outdated
```python
if not self.use_dora[active_adapter]:
    after_A = self._embed(x, embedding_A)
    result = result + (after_A @ embedding_B) * scaling
else:
    result = result + self.lora_magnitude_vector[active_adapter](
        x,
        lora_A=embedding_A,
        lora_B=embedding_B,
        scaling=scaling,
        base_layer=self.get_base_layer(),
        embed_fn=self._embed,
    )
```
With the previous commits, the DoRA embedding layer did not forward propagate the inputs. With this code snippet, we are not forward propagating the inputs through the DoRAEmbedding layer.
@BenjaminBossan for visibility.
With this code snippet, we are not forward propagating the inputs through the DoRAEmbedding layer.
I don't understand.
I apologise for not being able to communicate well.
I meant that the code snippet added in this commit (the one that is linked) now lets the inputs propagate into the DoRA embedding layer. Before this, the DoRA embedding layer was being created but not used; the inputs were being forward propagated through the DoRA Linear layer (as it was the parent class).
This also means that the results you have noted in this comment should be re-run in order to make use of the embedding adapters properly.
@BenjaminBossan with the current implementation, I have a test case failing. I have isolated the test case and am running it individually like so:
Somehow the test fails only intermittently. What are your thoughts on this?
Thanks for working on the tests, and good job isolating this flaky one. I think this is a case of the learning rate being too small, so the changes are too small to exceed the tolerance level of the check. The solution is just to increase the learning rate in this specific setting. I tried with 1.0 and the test passed 10 times. You can check other tests in this file and you'll see that the learning rate has to be adjusted a couple of times; it's just not possible to find one that fits all cases.
Updated the learning rate. The tests pass on my system. Note: I have kept the lr considerably high.
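To make the flakiness mechanism concrete, here is a hedged, self-contained sketch (not the actual PEFT test) of the kind of check involved: train for one step and assert that the parameters moved beyond a tolerance. With a very small lr the update can fall inside the tolerance and the assertion fails spuriously:

```python
import copy

import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 4)          # stand-in for the adapted test model
model_before = copy.deepcopy(model)

# If the lr were tiny, the weight change could fall inside atol and the check
# below would flakily report "parameters were not updated"; raising the lr
# (as done in the PR) makes the change reliably detectable.
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
model.train()
loss = model(torch.randn(8, 4)).sum()
loss.backward()
optimizer.step()

for (name, p0), (_, p1) in zip(
    model_before.named_parameters(), model.named_parameters()
):
    assert not torch.allclose(p0, p1, atol=1e-4), f"{name} did not update"
```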
Thanks for the updates, this looks pretty good already. I have some minor comments regarding the tests, please check.
tests/test_custom_models.py (Outdated)
```diff
@@ -874,7 +887,8 @@ def test_only_params_are_updated(self, test_name, model_id, config_cls, config_k
     model_before = copy.deepcopy(model)

     model.train()
-    optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
+    lr = 100.0 if config_kwargs.get("use_dora") else 0.5
```
Wow, is 100 really necessary? For me, even 1.0 seems to work (tested 10x on CPU and GPU).
Oh yeah! I ran the 1.0 experiment 1000x on CPU and GPU, and it did not go well. I tried increasing the lr in the following manner: 1 -> 2 -> 4 -> 10 -> 50 -> 100. 100 was the only lr that passed all 1000 runs.
That's wild. Hopefully such a high lr is only needed for this dummy model and not an indicator for real world usage (which does not seem so based on the tests above).
This is only required for dora + embedding, right? So let's adjust the check, same as below. And let's add a comment that this high learning rate was found through testing to be necessary to avoid flakiness.
Updated!
@nbasyl This PR adds DoRA to embedding layers. It's pretty similar to how DoRA is implemented for linear layers. If you have time, it would be great if you could give this a look.
@BenjaminBossan @ariG23498 Thanks for tagging me and for the implementation; everything seems correct, just a few minor suggestions. From what I understand about adapting the embedding layer, there is no dropout on the input, so this part could potentially be optimized further to improve inference speed, since we no longer need to tackle the dropout alignment issue that the linear layer adaptation has to handle; the forward pass of the embedding layer can then be simplified accordingly.
I'm not quite sure if the speed-up is significant or not, as we are only skipping one forward call of the base layer.
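The concrete snippets from this suggestion were not captured in the thread. As a hedged illustration of the underlying idea (when dropout is a no-op, reuse the base layer's output for the DoRA term instead of running the base layer a second time), here is a minimal stand-alone sketch for a linear layer without bias; function and variable names are hypothetical, not PEFT API:

```python
import torch
from torch import nn


def dora_linear_forward(base, x, lora_A, lora_B, magnitude, scaling):
    """Hypothetical DoRA fast path for the no-dropout case.

    Computes m * (W0 + scaling * B @ A) x / ||W0 + scaling * B @ A|| while
    reusing the base layer's output instead of recomputing W0 x. Assumes
    `base` is an nn.Linear with bias=False, lora_A has shape (r, in) and
    lora_B has shape (out, r).
    """
    base_out = base(x)                                   # reused, not recomputed
    lora_out = (x @ lora_A.T) @ lora_B.T * scaling
    merged = base.weight + scaling * (lora_B @ lora_A)   # (out, in)
    weight_norm = merged.norm(p=2, dim=1)                # per-row norm
    mag_norm_scale = magnitude / weight_norm             # (out,)
    return mag_norm_scale * (base_out + lora_out)
```

With dropout active, the base layer must instead be applied to the dropped-out input for the DoRA term, which is exactly the alignment issue the comment refers to.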
@nbasyl thank you for reviewing the PR. I had totally overlooked the optimization opportunity at inference time. @BenjaminBossan I have made the suggested changes, and all the tests pass on my end.
Yes, good catch @nbasyl. I haven't tested it specifically for DoRA embeddings, but in the past, I did test what would happen if we had that optimization on linear layers. There, it would specifically help reduce memory usage, so it is indeed good to have. I was thinking about adding this optimization to linear layers when:
The disadvantage is, of course, that it makes the code more complicated, and it would not help in most training cases, where it matters most. Btw, I think this part of the DoRA implementation is often wrong when looking at how other packages implement it, but it makes their DoRA layers more efficient than PEFT :-/
Thanks for updating the code with the suggested optimization.
I just found a tiny issue in one of the tests; otherwise, this is good to be merged.
```python
    For DoRA, calculate the extra output from LoRA with DoRA applied. This should be added on top of the base
    layer output.
    """
    lora_weight = (lora_A @ lora_B).T
```
Ah, too bad we can't use the same approach as in the other lora layers but instead have to multiply the parameters directly. This may cause trouble in some situations like with FSDP. But still, it's better than not having support for embeddings at all.
I agree!
I tried with a linear layer to mimic the embedding weights, but the code was getting a little complicated.
Thanks a lot, LGTM.
Hi @BenjaminBossan, I've been experimenting with DoRA finetuning on tasks that weren't covered in the paper, like Function Calling and Role Play. Based on my results, it doesn't seem to matter much whether dropout is applied or not. Given our previous discussions, I was wondering if we could integrate this more efficient forward pass into the codebase when the dropout rate is set to zero. Let me know your thoughts and how I can assist!
Also tagging @ariG23498 to see if you'd be interested in helping with this. I'm not entirely sure if making these changes to the forward pass will require significant modifications to other parts of the code. |
I would ask @BenjaminBossan to take the lead on this one! I would be more than happy to contribute if we see that this could potentially lead to a speedup as mentioned here.
@nbasyl Thanks for the additional information. Just to clarify, do you mean that dropout in general is not that helpful, or just specifically with regard to the embedding layer (since you posted in this PR)? I think we can take a look at implementing the more efficient path where we check if the model is in eval mode or if the dropout is 0 (i.e. using
@BenjaminBossan, I'm talking about dropout in general. You're right; we should reuse the computed output from the base model when dropout is set to 0 or when in evaluation mode. Feel free to let me know if there's any way I can help!
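The condition discussed here (dropout rate of 0, or the model being in evaluation mode) can be expressed as a small predicate. This is a hypothetical sketch with illustrative names, not the PEFT API:

```python
from torch import nn


def can_use_fast_dora_path(model: nn.Module, dropout: nn.Module) -> bool:
    """Return True when the dropout module is effectively a no-op, so the
    extra base-layer forward pass in DoRA can be skipped (names are
    illustrative, not taken from PEFT)."""
    if isinstance(dropout, nn.Identity):
        return True
    if isinstance(dropout, nn.Dropout):
        # Dropout does nothing when p == 0 or when the model is in eval mode.
        return dropout.p == 0 or not model.training
    return False
```

The trade-off noted above still applies: this check adds a second code path, and it only helps at inference or with p == 0, not in the common training configuration.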
Great. I created an issue to track this (#2107). |
This PR adds DoRA for the embedding layer.
TODOs:
@BenjaminBossan can you give me an initial review to set the tone of the PR?
Fixes: #1677