
FEAT: Support quantization for VeRA using bitsandbytes (#2070) #2076

Merged

Conversation


@ZiadHelal ZiadHelal commented Sep 18, 2024

This PR introduces support for 4-bit and 8-bit quantization in the VeRA method, leveraging bitsandbytes.

Addresses #2070

Changes made:

  • Created bnb.py for both 8-bit and 4-bit linear layers
  • Updated model.py for quantization handling and refactored _find_dim
  • Modified __init__.py for the new imports
  • Ensured the VeRA-specific logic is applied correctly in the quantized setting
  • Added new documentation
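
For illustration, here is a minimal usage sketch of what this enables: VeRA applied on top of a 4-bit bitsandbytes-quantized base model. The model id and hyperparameters are placeholders, not taken from this PR.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import VeraConfig, get_peft_model

# 4-bit NF4 quantization for the base model (illustrative settings)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=bnb_config,
)

# VeRA adapter on top of the quantized linear layers
vera_config = VeraConfig(r=64, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, vera_config)
model.print_trainable_parameters()  # only the VeRA parameters are trainable
```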

@ZiadHelal ZiadHelal marked this pull request as draft September 18, 2024 16:09
@ZiadHelal

Hey @BenjaminBossan, could you take a look at this draft PR? I will remove the separate (quantization) config from config.py in the vera folder later on; for now I just want to know whether I'm going in the right direction.


@BenjaminBossan BenjaminBossan left a comment


Thanks a lot for working on adding 8-bit quantization to VeRA. Super nice!

I haven't done an in-depth review, only a high-level check. Here are some observations:

  1. The most important next step IMHO is to add one or a few tests, similar to what we have for LoRA + bnb (see tests/test_common_gpu.py), so we can quickly validate that the approach works; a rough sketch of such a test follows this list.
  2. I see you copied dispatch_bnb_8bit similar to what we have in LoRA. Note, however, that as of now, LoRA is the only PEFT method that uses that dispatch approach; it exists only because there are now so many LoRA "backends" that doing otherwise became unwieldy. VeRA is not at that stage yet, so I would remove it. Here is how the code looked before the dispatch refactor. I would also be fine with adding the dispatch refactor in this PR too, but then we'd need the full refactor, i.e. including dispatch_default.
  3. Do you plan on adding 4 bit as well?
  4. Regarding what you mentioned, yes, I agree:

I will remove the separate (quantization) config from config.py in vera folder later on
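
A rough sketch of the kind of test meant in point 1, loosely modeled on the LoRA + bnb tests in tests/test_common_gpu.py. The model id, target modules, and assertion are placeholders rather than the actual test added in this PR, and it assumes a GPU with bitsandbytes installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import VeraConfig, get_peft_model


def test_vera_bnb_8bit_forward():
    # load the base model in 8-bit with bitsandbytes
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/opt-125m",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )
    config = VeraConfig(r=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, config)

    device = next(model.parameters()).device
    dummy_input = torch.randint(0, 100, (1, 10), device=device)
    # a single forward pass through the quantized VeRA model should not raise
    output = model(dummy_input)
    assert output.logits.shape[:2] == (1, 10)
```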

@ZiadHelal

Hey @BenjaminBossan,

I've finished the 8-bit quantization and all tests now pass. The 4-bit case is trickier, though, because of how bitsandbytes packs the weights. I've added a workaround (which I don't think is correct) in the forward method; with it, any model can be trained, but it still fails the test_vera_bnb_quantization_from_pretrained_safetensors test in 4-bit.
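
For context, a small sketch of the packing issue: bitsandbytes stores Params4bit in a packed layout, so the parameter no longer has the original (out_features, in_features) shape and has to be dequantized with its stored quant_state to recover a full-shape weight. This only illustrates the problem; it is not the workaround used in this PR.

```python
import bitsandbytes as bnb


def unpacked_weight(layer: bnb.nn.Linear4bit):
    # assumes the layer has already been quantized (e.g. moved to GPU),
    # so weight.quant_state is populated
    weight = layer.weight  # Params4bit: packed, flattened storage
    # returns a dense tensor with the original linear layer's shape
    return bnb.functional.dequantize_4bit(weight.data, weight.quant_state)
```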

I would very much appreciate it if you could direct me in which way to proceed for the 4-bit implementation.

@ZiadHelal ZiadHelal marked this pull request as ready for review September 22, 2024 08:11
@BenjaminBossan

Thanks for the updates.

I've added a workaround (which I don't think is correct) in the forward method; with it, any model can be trained

Could you give me a pointer what lines of code exactly you mean here?

it still fails the test_vera_bnb_quantization_from_pretrained_safetensors test in 4-bit.

For me the test fails too, albeit already during initialization, not in the forward pass.

Also pinging @vvvm23 and @dkopi for awareness.

@ZiadHelal ZiadHelal changed the title (#2070) Draft PR for 8-bit quantization support for VeRA using bitsandbytes FEAT: Support quantization for VeRA using bitsandbytes (#2070) Sep 23, 2024
@ZiadHelal

Hey @BenjaminBossan, it now works on my side! Thanks for your help.

Could you check whether it works for you as well?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan

Thanks @ZiadHelal, I can confirm that the tests are now passing on my machine. What would be great is if we could take one of the VeRA examples and verify that it works with 4-bit and 8-bit bnb. Of course, the results won't be exactly the same, but we should expect roughly similar outcomes, probably slightly worse. That would be a nice confirmation that the implementation is correct. Is this something you would be willing to tackle?

Apart from that, please run make style on your PR, so that our linter is happy and tests can be run.

@ZiadHelal

Hi @BenjaminBossan, sorry for my late reply!

I've run make style and the code should now satisfy the linter. Regarding the tests, I've added several, primarily for CausalLM models; I'd be happy to add more for the audio and seq2seq models, or if you have any other tests in mind, please let me know.

@ZiadHelal

Are the 11 failing tests related to VeRA?

@BenjaminBossan

No, I don't think they're related. This could be caused by the latest transformers release; I'm not sure. I'll investigate tomorrow.

@ZiadHelal

Okay, thanks!

@BenjaminBossan

Small update: it is indeed unrelated; the tests started breaking due to a recent change in transformers. I'm trying to get to the bottom of it.

@ZiadHelal

Ok got it, thanks for the update!

@BenjaminBossan

The fix is now merged. Once you merge with/rebase on main, the tests should pass.

@ZiadHelal

Synced with main upstream.


@BenjaminBossan BenjaminBossan left a comment


Thanks for the latest updates. I somehow forgot about this PR, sorry for my silence; don't hesitate to ping me if you don't hear back for a few days.

Overall, this looks really good and is ready to be merged except for two small issues. One I have commented on. The other is that the docs need updates to reflect that VeRA now supports bnb quantization:

- Quantized layers are not supported.

Besides that, I think we should also extend quantization.md. I'm not 100% sure what the best way to update the document is, but maybe the easiest is to add a new section called "Other supported PEFT methods" before the "Next steps" section. There, you could mention that besides LoRA, bnb also works with VeRA.

@@ -0,0 +1,408 @@
# Copyright 2023-present the HuggingFace Inc. team.


Let's update the date to 2024.

@BenjaminBossan

Ah, forgot to mention: I ran a test modified from the VeRA.ipynb notebook, comparing no quantization vs 8-bit vs 4-bit. Overall, the results were quite similar, with a bit of degradation for 4-bit, but that's expected, so I think the implementation works as intended.

@ZiadHelal

Sorry for the late reply. I've made the requested changes and also added to the docs the other PEFT methods that support quantization, which weren't listed there in the first place.

@ZiadHelal

Hey @BenjaminBossan, any update regarding the new documentation changes?


@BenjaminBossan BenjaminBossan left a comment


Thanks for the ping. I was out of office for a few days. This PR is almost good to go; you were just a bit overeager when removing the VeRA constraints in the docs.

@@ -22,13 +22,6 @@ When saving the adapter parameters, it's possible to eschew storing the low rank

To handle different shapes of adapted layers, VeRA initializes shared A and B matrices with the largest required size for each dimension. During the forward pass, submatrices A and B for a given layer are sliced out from these shared matrices and used as described in the paper. For example, adapting two linear layers of shapes (100, 20) and (80, 50) will create A and B matrices of shapes (rank, 50) and (100, rank) respectively. Then, to adapt a layer of shape (100, 20), submatrices A and B of shapes (rank, 20) and (100, rank) will be extracted.

VeRA currently has the following constraints:

- Only `nn.Linear` layers are supported.


The constraint about linear layers still holds true, so let's not delete this line.

@ZiadHelal

Done @BenjaminBossan


@BenjaminBossan BenjaminBossan left a comment


Thanks a lot for adding bitsandbytes quantization to VeRA, very nice addition!

@BenjaminBossan BenjaminBossan merged commit 859fd88 into huggingface:main Oct 7, 2024
14 checks passed
@ZiadHelal

Thanks for your help and effort during this PR!

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Oct 18, 2024
The shared buffers vera_A and vera_B could be on the wrong device when
using multiple GPUs, resulting in an error. This PR moves them to the
correct device to fix the error.

Since these buffers are shared, I chose *not* to move the whole buffer
to the device. Instead, when we create the slices from those buffers
during the forward pass, I move only the slices to the correct device.
This could be inefficient in terms of runtime, but IIUC, the alternative
would be to create new copies of these buffers per device, using more memory.

The failing tests were introduced in huggingface#2076 but the error was already
there beforehand.

I did not discover these failing tests earlier because we had a
concurrent error caused by a transformers issue which looked very
similar and I wrongly assumed that the VeRA error was caused by the same
issue. But now that the issue has been fixed, the error still persists,
prompting me to investigate.
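
A minimal sketch of the approach described above; this is an assumed shape of the fix for illustration, not the exact diff. The function name and argument list are hypothetical.

```python
import torch.nn.functional as F


def vera_delta(x, vera_A, vera_B, lambda_d, lambda_b, in_features, out_features):
    # slice the shared buffers to this layer's shape, then move only the small
    # slices to the input's device; the shared buffers themselves stay put
    sliced_A = vera_A[:, :in_features].to(x.device)   # (r, in_features)
    sliced_B = vera_B[:out_features, :].to(x.device)  # (out_features, r)
    # delta = lambda_b * B(lambda_d * A(x)), added to the base layer's output
    return lambda_b * F.linear(lambda_d * F.linear(x, sliced_A), sliced_B)
```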
BenjaminBossan pushed a commit to BenjaminBossan/peft that referenced this pull request Oct 22, 2024
…gface#2076)

VeRA can now be used with 4bit and 8bit bnb quantization.
BenjaminBossan added a commit that referenced this pull request Oct 25, 2024