FEAT: VeRA quantization using bitsandbytes (huggingface#2070) (huggingface#2076)

VeRA can now be used with 4bit and 8bit bnb quantization.
ZiadHelal authored and BenjaminBossan committed Oct 22, 2024
1 parent 5a560da commit d10151e
Showing 8 changed files with 840 additions and 12 deletions.
10 changes: 9 additions & 1 deletion docs/source/developer_guides/quantization.md
@@ -187,9 +187,17 @@ peft_config = LoraConfig(...)
quantized_model = get_peft_model(quantized_model, peft_config)
```

## Other Supported PEFT Methods

Besides LoRA, the following PEFT methods also support quantization:

- **VeRA** (supports bitsandbytes quantization)
- **AdaLoRA** (supports both bitsandbytes and GPTQ quantization)
- **(IA)³** (supports bitsandbytes quantization)

## Next steps

If you're interested in learning more about quantization, the following may be helpful:

-* Learn more about details about QLoRA and check out some benchmarks on its impact in the [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) blog post.
+* Learn more details about QLoRA and check out some benchmarks on its impact in the [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes) blog post.
* Read more about different quantization schemes in the Transformers [Quantization](https://hf.co/docs/transformers/main/quantization) guide.
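
As context for the docs change above, here is a minimal sketch of the newly supported combination, VeRA on a bitsandbytes-quantized model. The model name, rank, and target modules are illustrative assumptions, not taken from this commit:

```python
# Hedged sketch: VeRA on a 4-bit bitsandbytes-quantized model.
# Model name, rank, and target modules are placeholders, not from this commit.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import VeraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # placeholder model
    quantization_config=bnb_config,
)
vera_config = VeraConfig(r=256, target_modules=["q_proj", "v_proj"])
peft_model = get_peft_model(quantized_model, vera_config)
peft_model.print_trainable_parameters()
```
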
5 changes: 1 addition & 4 deletions docs/source/package_reference/vera.md
@@ -22,12 +22,9 @@ When saving the adapter parameters, it's possible to eschew storing the low rank

To handle different shapes of adapted layers, VeRA initializes shared A and B matrices with the largest required size for each dimension. During the forward pass, submatrices A and B for a given layer are sliced out from these shared matrices and used as described in the paper. For example, adapting two linear layers of shapes (100, 20) and (80, 50) will create A and B matrices of shapes (rank, 50) and (100, rank) respectively. Then, to adapt a layer of shape (100, 20), submatrices A and B of shapes (rank, 20) and (100, rank) will be extracted.
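
The slicing described in the paragraph above can be sketched in a few lines of tensor code (shapes follow the example in the text; this is an illustration, not PEFT's internal implementation):

```python
# Illustration of VeRA's shared-matrix slicing, not PEFT's internal code.
import torch

rank = 4
# Shared matrices sized for the largest dimension across adapted layers:
# adapting layers of shape (100, 20) and (80, 50) gives A: (rank, 50), B: (100, rank).
vera_A = torch.randn(rank, 50)
vera_B = torch.randn(100, rank)

# To adapt the (100, 20) layer, slice out the submatrices that fit it:
out_features, in_features = 100, 20
A = vera_A[:, :in_features]   # shape (rank, 20)
B = vera_B[:out_features, :]  # shape (100, rank)
print(A.shape, B.shape)  # torch.Size([4, 20]) torch.Size([100, 4])
```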

-VeRA currently has the following constraints:
+VeRA currently has the following constraint:

- Only `nn.Linear` layers are supported.
-- Quantized layers are not supported.
-
-If these constraints don't work for your use case, use LoRA instead.

The abstract from the paper is:

3 changes: 2 additions & 1 deletion src/peft/helpers.py
@@ -168,7 +168,8 @@ def rescale_adapter_scale(model, multiplier):
Args:
model: The model containing `LoraLayer` modules whose scaling is to be adjusted.
-multiplier (float or int): The multiplier that rescales the `scaling` attribute. Must be of type float or int.
+multiplier (float or int):
+    The multiplier that rescales the `scaling` attribute. Must be of type float or int.
Raises:
ValueError: If the model does not contain any `LoraLayer`
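
For readers unfamiliar with this helper: a hedged usage sketch, assuming the context-manager form of `rescale_adapter_scale` in `peft.helpers` (the `model` and `inputs` objects are placeholders):

```python
# Sketch: temporarily rescale every LoraLayer's `scaling` attribute.
from peft.helpers import rescale_adapter_scale

with rescale_adapter_scale(model, multiplier=0.5):
    outputs = model(**inputs)  # runs with halved adapter scaling
# Original scaling values are restored on exit; a ValueError is raised
# if the model contains no LoraLayer modules.
```
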
16 changes: 16 additions & 0 deletions src/peft/tuners/vera/__init__.py
@@ -12,9 +12,25 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from peft.import_utils import is_bnb_4bit_available, is_bnb_available

from .config import VeraConfig
from .layer import Linear, VeraLayer
from .model import VeraModel


__all__ = ["VeraConfig", "VeraLayer", "Linear", "VeraModel"]


def __getattr__(name):
if (name == "Linear8bitLt") and is_bnb_available():
from .bnb import Linear8bitLt

return Linear8bitLt

if (name == "Linear4bit") and is_bnb_4bit_available():
from .bnb import Linear4bit

return Linear4bit

raise AttributeError(f"module {__name__} has no attribute {name}")
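
The `__getattr__` above is a module-level hook (PEP 562): the bitsandbytes-backed classes are imported lazily, so the quantized layer types only load when requested and only if bitsandbytes is available. For example:

```python
# Triggers vera.__getattr__("Linear8bitLt") and imports .bnb lazily,
# assuming bitsandbytes is installed.
from peft.tuners.vera import Linear8bitLt
```
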
