Add support for Task Arithmetics #698

Merged (20 commits, Aug 2, 2024)
README.md: 9 changes (6 additions & 3 deletions)

@@ -36,9 +36,9 @@ A Unified Library for Parameter-Efficient and Modular Transfer Learning
[![GitHub](https://img.shields.io/github/license/adapter-hub/adapters.svg?color=blue)](https://github.com/adapter-hub/adapters/blob/main/LICENSE)
[![PyPI](https://img.shields.io/pypi/v/adapters)](https://pypi.org/project/adapters/)

_Adapters_ is an add-on library to [HuggingFace's Transformers](https://github.com/huggingface/transformers), integrating [10+ adapter methods](https://docs.adapterhub.ml/overview.html) into [20+ state-of-the-art Transformer models](https://docs.adapterhub.ml/model_overview.html) with minimal coding overhead for training and inference. _Adapters_ provides a unified interface for efficient fine-tuning and modular transfer learning, supporting a diverse range of features: full-precision or quantized training (e.g. [Q-LoRA, Q-Bottleneck Adapters, or Q-PrefixTuning](https://github.com/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)), [adapter merging via task arithmetics](https://docs.adapterhub.ml/adapter_composition.html#merging-adapters), and the composition of multiple adapters via [composition blocks](https://docs.adapterhub.ml/adapter_composition.html), enabling advanced research in parameter-efficient transfer learning for NLP tasks.



## Installation

@@ -57,6 +57,9 @@ cd adapters
pip install .
```

> **Note**: The _Adapters_ library has replaced the [`adapter-transformers`](https://github.com/adapter-hub/adapter-transformers-legacy) package. All previously trained adapters are compatible with the new library. For transitioning, please read: https://docs.adapterhub.ml/transitioning.html.


## Quick Tour

#### Load pre-trained adapters:
@@ -156,7 +159,7 @@ Currently, adapters integrates all architectures and methods listed below:
| UniPELT | [Mao et al. (2022)](https://arxiv.org/pdf/2110.07577.pdf) | [Docs](https://docs.adapterhub.ml/method_combinations.html#unipelt) |
| Prompt Tuning | [Lester et al. (2021)](https://aclanthology.org/2021.emnlp-main.243/) | [Docs](https://docs.adapterhub.ml/methods.html#prompt-tuning) |
| QLoRA | [Dettmers et al. (2023)](https://arxiv.org/pdf/2305.14314.pdf) | [Notebook](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb) |
| ReFT | [Wu et al. (2024)](https://arxiv.org/pdf/2404.03592) | [Docs](https://docs.adapterhub.ml/methods.html#reft) |

## Supported Models

docs/adapter_composition.md: 115 changes (82 additions & 33 deletions)

@@ -40,17 +40,17 @@ The basic building blocks of the more advanced setups are objects derived from `AdapterCompositionBlock`,
each representing a different possibility to combine single adapters.
The following table gives an overview on the supported composition blocks and their support by different adapter methods.

| Block | Bottleneck<br> Adapters | Prefix<br> Tuning | Compacter | LoRA | (IA)³ | Prompt Tuning |
| ------------------------------------------- | ----------------------- | ----------------- | --------- | ---- | ----- | ------------- |
| [`Stack`](#stack) | ✅ | ✅ | ✅ | ✅(*) | ✅(*) | |
| [`Fuse`](#fuse) | ✅ | | ✅ | | | |
| [`Split`](#split) | ✅ | | ✅ | | | |
| [`BatchSplit`](#batchsplit) | ✅ | ✅ | ✅ | ✅(*) | ✅(*) | |
| [`Parallel`](#parallel) | ✅ | ✅ | ✅ | ✅(*) | ✅(*) | |
| [Output averaging](#output-averaging) | ✅ | | ✅ | ✅(*) | ✅(*) | |
| [Parameter averaging](#parameter-averaging) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

(*) except for Deberta and GPT-2.
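For instance, once individual adapters have been added to a model, a composition is activated by assigning a block to `active_adapters`. A minimal sketch (the adapter names "a" and "b" are placeholders):

```python
import adapters.composition as ac

# Assumes adapters "a" and "b" were previously added via model.add_adapter().
# Stack passes each layer's output of adapter "a" through adapter "b".
model.active_adapters = ac.Stack("a", "b")
```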

Next, we present all composition blocks in more detail.

Expand Down Expand Up @@ -240,7 +240,7 @@ print("MRPC adapter output:", bool(torch.argmax(output2[0]).item()))

Following approaches of ensembling full models at inference time for better generalization, recent work on adapters has explored methods of averaging pre-trained adapters.
This includes averaging output representations of adapters ([Wang et al., 2021](https://arxiv.org/pdf/2109.04877.pdf)) as well as averaging adapter parameters ([Wang et al., 2022](https://arxiv.org/pdf/2205.12410.pdf), [Chronopoulou et al., 2023](https://aclanthology.org/2023.findings-eacl.153.pdf)).
_Adapters_ provides built-in support for both types of inference time averaging methods.

### Output averaging

@@ -260,25 +260,74 @@ model.add_adapter("o")
model.active_adapters = ac.Average("m", "n", "o", weights=[0.1, 0.6, 0.3])
```

### Merging Adapters

We can create new adapters by combining the parameters of multiple trained adapters, i.e. merging multiple existing adapters into a new one. The `average_adapter()` method provides this functionality:
```python
model.add_adapter("bottleneck_1", "seq_bn")
model.add_adapter("bottleneck_2", "seq_bn")
model.add_adapter("bottleneck_3", "seq_bn")

model.average_adapter(adapter_name="avg", adapter_list=["bottleneck_1", "bottleneck_2", "bottleneck_3"], weights=[-1, 1.2, 0.8])
```
In this example, the parameters of the three added bottleneck adapters are merged (with weights `-1`, `1.2` and `0.8`, respectively) to create a new adapter `avg`.
Note that for this to succeed, all averaged adapters must use the same adapter configuration. Compared to output averaging, parameter averaging of adapters has the advantage of not inducing any additional inference time relative to using a single adapter.

All [adapter methods](https://docs.adapterhub.ml/overview.html#table-of-adapter-methods) support linear merging, in which the weights of the trained adapters are linearly combined: let there be $N$ adapters, let $\Phi_i$ be all the parameters of adapter $i$, and let $\lambda_i$ be the corresponding weight. The merged adapter parameters $\Phi_{merged}$ are calculated as:

$$
\Phi_{merged} = \sum_{i=1}^{N} \lambda_i \Phi_i
$$
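Conceptually, this is just a weighted sum over aligned parameter tensors. A minimal PyTorch sketch of the idea (illustrative only, not the library's internal implementation; assumes all state dicts share identical keys and shapes):

```python
import torch

def linear_merge(state_dicts, weights):
    """Weighted sum of the parameter tensors of several adapters."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged
```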

The `average_adapter` method only merges the weights of the adapters but does not create a new head. To average the weights of heads, use the `average_head` method.
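A minimal sketch of head averaging (assuming a model with two classification heads of identical configuration, and that `average_head` mirrors the `average_adapter` call signature):

```python
model.add_classification_head("head_1", num_labels=2)
model.add_classification_head("head_2", num_labels=2)

# Creates a new head "avg_head" as the weighted average of both heads.
model.average_head(head_name="avg_head", head_list=["head_1", "head_2"], weights=[0.4, 0.6])
```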

#### Merging LoRA Adapters
LoRA introduces low-rank matrices $A$ and $B$ with $\Delta W = BA$. Since the $B$ and $A$ matrices strongly depend on each other, there are several ways to merge the weights of LoRA adapters. You can choose the combination method by passing the `combine_strategy` parameter to the `average_adapter` method:

1. `combine_strategy = "linear"`: Linear Combination (default). This has been proposed for LoRA by [Chronopoulou et al. (2023)](https://arxiv.org/abs/2311.09344). With $\Phi = \{A, B\}$:

$$
\Phi_{merged} = \sum_{i=1}^{N} \lambda_i \Phi_i
$$

2. `combine_strategy = "lora_linear_only_negate_b"`: Following [Zhang et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/hash/299a08ee712d4752c890938da99a77c6-Abstract-Conference.html), this method applies negative weights only to the $B$ matrix, while the $A$ matrix always uses the absolute value of the weights:
> **Review comment (Member):** The "only" in the name is redundant. I would remove it to make it shorter.
>
> **Author reply (Member):** I believe it is best to keep it. If we simply called it `lora_linear_negate_b`, it would sound like the B matrix is always negated. But this method means that when a weight is negative, we only negate the B matrix and not the A matrix.

$$
\begin{aligned}
A_{merged} &= \sum_{i=1}^{N} |\lambda_i| A_i \\
B_{merged} &= \sum_{i=1}^{N} \lambda_i B_i
\end{aligned}
$$

3. `combine_strategy = "lora_delta_w_svd"`: This method merges the $\Delta W_i$ of each adapter and then performs a singular value decomposition (SVD) to obtain the $A$ and $B$ LoRA matrices:
    1. For every adapter $i$, calculate: $\Delta W_i = B_i \cdot A_i$
    2. $\Delta W_{new} = \sum_{i=1}^{N} \lambda_i \cdot \Delta W_i$
    3. Perform $\text{SVD}(\Delta W_{new})$ to obtain $A_{new}$ and $B_{new}$

`lora_delta_w_svd` is not supported by Deberta and GPT-2. Example usage of these LoRA-specific merging strategies:

```python
model.add_adapter("lora_1", config="lora")
model.add_adapter("lora_2", config="lora")
model.add_adapter("lora_3", config="lora")

model.average_adapter(
adapter_name="lora_avg",
adapter_list=["lora_1", "lora_2", "lora_3"],
weights=[1, -1, 1],
combine_strategy="lora_delta_w_svd",
svd_rank=8
)
# Note that "lora_delta_w_svd" requires the "svd_rank" parameter, which determines the r (rank) of the resulting LoRA adapter after SVD
```
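To make the `lora_delta_w_svd` steps above concrete, here is a stand-alone sketch of the underlying computation in plain PyTorch (illustrative only, not the library's implementation; names and shapes are assumptions):

```python
import torch

def merge_lora_delta_w_svd(A_list, B_list, weights, svd_rank):
    # Steps 1 & 2: weighted sum of the reconstructed updates delta_W_i = B_i @ A_i
    delta_w = sum(w * (B @ A) for w, A, B in zip(weights, A_list, B_list))
    # Step 3: truncated SVD to recover rank-r LoRA factors
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    B_new = U[:, :svd_rank] * S[:svd_rank]  # shape (d_out, r); singular values folded into B
    A_new = Vh[:svd_rank, :]                # shape (r, d_in)
    return A_new, B_new
```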

For both output and parameter averaging, passed weights are normalized by default. To disable normalization, pass `normalize_weights=False`.
For more detailed examples and explanations, refer to our [Task Arithmetic notebook](https://github.com/adapter-hub/adapters/tree/main/notebooks/task_arithmetics_in_adapter.ipynb).


```{eval-rst}
.. tip::
    Adding more adapter merging methods is easy: simply modify the ``average_adapter`` method. Most adapter methods use the default implementation in `model_mixin.py <https://github.com/adapter-hub/adapters/blob/main/src/adapters/model_mixin.py>`_, which only supports linear merging. Others, like LoRA, override this method to add merging strategies such as ``lora_delta_w_svd``; have a look at `lora.py <https://github.com/adapter-hub/adapters/blob/main/src/adapters/methods/lora.py>`_.
```


## Nesting composition blocks

Expand All @@ -293,13 +342,13 @@ model.active_adapters = ac.Stack("a", ac.Split("b", "c", splits=60))

However, combinations of adapter composition blocks cannot be arbitrarily deep. All currently supported possibilities are visualized in the table below.

| Block | Supported Nesting |
| ------------------------------ | ------------------------------------------------- |
| [`Stack`](#stack) | [str, Fuse, Split, Parallel, BatchSplit, Average] |
| [`Fuse`](#fuse) | [str, Stack] |
| [`Split`](#split) | [str, Split, Stack, BatchSplit, Average] |
| [`Parallel`](#parallel) | [str, Stack, BatchSplit, Average] |
| [`BatchSplit`](#batchsplit) | [str, Stack, Split, BatchSplit, Average] |
| [`Average`](#output-averaging) | [str, Stack, Split, BatchSplit] |

In the table, `str` represents an adapter, e.g. adapter "a" in the nesting example above. Depending on the individual model, some nested compositions might not be possible.
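For example, nesting a `BatchSplit` inside a `Stack` is supported according to the table above. A minimal sketch with placeholder adapter names:

```python
import adapters.composition as ac

# Adapter "a" is applied first; then the first two rows of the batch
# pass through adapter "b" and the remaining four rows through "c".
model.active_adapters = ac.Stack("a", ac.BatchSplit("b", "c", batch_sizes=[2, 4]))
```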