Add support for Task Arithmetics #698

Merged (20 commits, Aug 2, 2024)
README.md: 9 changes (6 additions & 3 deletions)

@@ -36,9 +36,9 @@ A Unified Library for Parameter-Efficient and Modular Transfer Learning
[![GitHub](https://img.shields.io/github/license/adapter-hub/adapters.svg?color=blue)](https://github.com/adapter-hub/adapters/blob/main/LICENSE)
[![PyPI](https://img.shields.io/pypi/v/adapters)](https://pypi.org/project/adapters/)

_Adapters_ is an add-on library to [HuggingFace's Transformers](https://github.com/huggingface/transformers), integrating [10+ adapter methods](https://docs.adapterhub.ml/overview.html) into [20+ state-of-the-art Transformer models](https://docs.adapterhub.ml/model_overview.html) with minimal coding overhead for training and inference. _Adapters_ provides a unified interface for efficient fine-tuning and modular transfer learning, supporting a diverse range of features: full-precision or quantized training (e.g. [Q-LoRA, Q-Bottleneck Adapters, or Q-PrefixTuning](https://github.com/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb)), [adapter merging via task arithmetics](https://docs.adapterhub.ml/adapter_composition.html#merging-adapters), and the composition of multiple adapters via [composition blocks](https://docs.adapterhub.ml/adapter_composition.html), enabling advanced research in parameter-efficient transfer learning for NLP tasks.



## Installation

@@ -57,6 +57,9 @@ cd adapters
pip install .
```

> **Note**: The _Adapters_ library has replaced the [`adapter-transformers`](https://github.com/adapter-hub/adapter-transformers-legacy) package. All previously trained adapters are compatible with the new library. For transitioning, please read: https://docs.adapterhub.ml/transitioning.html.


## Quick Tour

#### Load pre-trained adapters:
@@ -156,7 +159,7 @@ Currently, adapters integrates all architectures and methods listed below:
| UniPELT | [Mao et al. (2022)](https://arxiv.org/pdf/2110.07577.pdf) | [Docs](https://docs.adapterhub.ml/method_combinations.html#unipelt) |
| Prompt Tuning | [Lester et al. (2021)](https://aclanthology.org/2021.emnlp-main.243/) | [Docs](https://docs.adapterhub.ml/methods.html#prompt-tuning) |
| QLoRA | [Dettmers et al. (2023)](https://arxiv.org/pdf/2305.14314.pdf) | [Notebook](https://colab.research.google.com/github/Adapter-Hub/adapters/blob/main/notebooks/QLoRA_Llama_Finetuning.ipynb) |
| ReFT | [Wu et al. (2024)](https://arxiv.org/pdf/2404.03592) | [Docs](https://docs.adapterhub.ml/methods.html#reft) |

## Supported Models

docs/adapter_composition.md: 115 changes (82 additions & 33 deletions)

@@ -40,17 +40,17 @@ The basic building blocks of the more advanced setups are objects derived from `AdapterCompositionBlock`,
each representing a different possibility to combine single adapters.
The following table gives an overview on the supported composition blocks and their support by different adapter methods.

| Block | Bottleneck<br> Adapters | Prefix<br> Tuning | Compacter | LoRA | (IA)³ | Prompt Tuning |
| ------------------------------------------- | ----------------------- | ----------------- | --------- | ---- | ----- | ------------- |
| [`Stack`](#stack) | ✅ | ✅ | ✅ | ✅(*) | ✅(*) | |
| [`Fuse`](#fuse) | ✅ | | ✅ | | | |
| [`Split`](#split) | ✅ | | ✅ | | | |
| [`BatchSplit`](#batchsplit) | ✅ | ✅ | ✅ | ✅(*) | ✅(*) | |
| [`Parallel`](#parallel) | ✅ | ✅ | ✅ | ✅(*) | ✅(*) | |
| [Output averaging](#output-averaging) | ✅ | | ✅ | ✅(*) | ✅(*) | |
| [Parameter averaging](#parameter-averaging) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

(*) except for Deberta and GPT-2.
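For instance, once individual adapters have been added to a model, a composition is activated by assigning a block to `active_adapters`. A minimal sketch (the adapter names "a" and "b" are placeholders):

```python
import adapters.composition as ac

# Assumes adapters "a" and "b" were previously added via model.add_adapter().
# Stack passes each layer's output of adapter "a" through adapter "b".
model.active_adapters = ac.Stack("a", "b")
```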

Next, we present all composition blocks in more detail.

Expand Down Expand Up @@ -240,7 +240,7 @@ print("MRPC adapter output:", bool(torch.argmax(output2[0]).item()))

Following approaches of ensembling full models at inference time for better generalization, recent work on adapters has explored methods of averaging pre-trained adapters.
This includes averaging output representations of adapters ([Wang et al., 2021](https://arxiv.org/pdf/2109.04877.pdf)) as well as averaging adapter parameters ([Wang et al., 2022](https://arxiv.org/pdf/2205.12410.pdf), [Chronopoulou et al., 2023](https://aclanthology.org/2023.findings-eacl.153.pdf)).
_Adapters_ provides built-in support for both types of inference time averaging methods.

### Output averaging

@@ -260,25 +260,74 @@ model.add_adapter("o")
model.active_adapters = ac.Average("m", "n", "o", weights=[0.1, 0.6, 0.3])
```

### Merging Adapters

We can create new adapters by combining the parameters of multiple trained adapters, i.e. merging multiple existing adapters into a new one. The `average_adapter()` method provides this functionality:
```python
model.add_adapter("bottleneck_1", "seq_bn")
model.add_adapter("bottleneck_2", "seq_bn")
model.add_adapter("bottleneck_3", "seq_bn")

model.average_adapter(adapter_name="avg", adapter_list=["bottleneck_1", "bottleneck_2", "bottleneck_3"], weights=[-1, 1.2, 0.8])
```
In this example, the parameters of the three added bottleneck adapters are merged (with weights `-1`, `1.2` and `0.8`, respectively) to create a new adapter `avg`.
Note that for this to succeed, all averaged adapters must use the same adapter configuration. Compared to output averaging, parameter averaging of adapters has the advantage of not inducing any additional inference time relative to using a single adapter.

All [adapter methods](https://docs.adapterhub.ml/overview.html#table-of-adapter-methods) support linear merging, in which the weights of the trained adapters are linearly combined: let there be $N$ adapters, let $\Phi_i$ be all the parameters of adapter $i$, and let $\lambda_i$ be the corresponding weight. The merged adapter parameters $\Phi_{merged}$ are calculated as:

$$
\Phi_{merged} = \sum_{i=1}^{N} \lambda_i \Phi_i
$$
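Conceptually, this is just a weighted sum over aligned parameter tensors. A minimal PyTorch sketch of the idea (illustrative only, not the library's internal implementation; assumes all state dicts share identical keys and shapes):

```python
import torch

def linear_merge(state_dicts, weights):
    """Weighted sum of the parameter tensors of several adapters."""
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged
```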

The `average_adapter` method only merges the weights of the adapters but does not create a new head. To average the weights of heads, use the `average_head` method.
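A minimal sketch of head averaging (assuming a model with two classification heads of identical configuration, and that `average_head` mirrors the `average_adapter` call signature):

```python
model.add_classification_head("head_1", num_labels=2)
model.add_classification_head("head_2", num_labels=2)

# Creates a new head "avg_head" as the weighted average of both heads.
model.average_head(head_name="avg_head", head_list=["head_1", "head_2"], weights=[0.4, 0.6])
```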

#### Merging LoRA Adapters
LoRA introduces low-rank matrices $A$ and $B$ with $\Delta W = BA$. Since the $B$ and $A$ matrices strongly depend on each other, there are several ways to merge the weights of LoRA adapters. You can choose the combination method by passing the `combine_strategy` parameter to the `average_adapter` method:

1. `combine_strategy = "linear"`: Linear Combination (default). This has been proposed for LoRA by [Chronopoulou et al. (2023)](https://arxiv.org/abs/2311.09344). With $\Phi = \{A, B\}$:

$$
\Phi_{merged} = \sum_{i=1}^{N} \lambda_i \Phi_i
$$

2. `combine_strategy = "lora_linear_only_negate_b"`: Following [Zhang et al. (2023)](https://proceedings.neurips.cc/paper_files/paper/2023/hash/299a08ee712d4752c890938da99a77c6-Abstract-Conference.html), this method applies negative weights only to the $B$ matrix, while the $A$ matrix always uses the absolute value of the weights:
> **Review comment (Member):** The "only" in the name is redundant. I would remove it to make it shorter.
>
> **Author reply (Member):** I believe it is best to keep it. If we simply called it `lora_linear_negate_b`, it would sound like the B matrix is always negated. But this method means that when a weight is negative, we only negate the B matrix and not the A matrix.

$$
\begin{aligned}
A_{merged} &= \sum_{i=1}^{N} |\lambda_i| A_i \\
B_{merged} &= \sum_{i=1}^{N} \lambda_i B_i
\end{aligned}
$$

3. `combine_strategy = "lora_delta_w_svd"`: This method merges the $\Delta W_i$ of each adapter and then performs a singular value decomposition (SVD) to obtain the $A$ and $B$ LoRA matrices:
    1. For every adapter $i$, calculate: $\Delta W_i = B_i \cdot A_i$
    2. $\Delta W_{new} = \sum_{i=1}^{N} \lambda_i \cdot \Delta W_i$
    3. Perform $\text{SVD}(\Delta W_{new})$ to obtain $A_{new}$ and $B_{new}$

`lora_delta_w_svd` is not supported by Deberta and GPT-2. Example usage of these LoRA-specific merging strategies:

```python
model.add_adapter("lora_1", config="lora")
model.add_adapter("lora_2", config="lora")
model.add_adapter("lora_3", config="lora")

model.average_adapter(
adapter_name="lora_avg",
adapter_list=["lora_1", "lora_2", "lora_3"],
weights=[1, -1, 1],
combine_strategy="lora_delta_w_svd",
svd_rank=8
)
# Note that "lora_delta_w_svd" requires the "svd_rank" parameter, which determines the r (rank) of the resulting LoRA adapter after SVD
```
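To make the `lora_delta_w_svd` steps above concrete, here is a stand-alone sketch of the underlying computation in plain PyTorch (illustrative only, not the library's implementation; names and shapes are assumptions):

```python
import torch

def merge_lora_delta_w_svd(A_list, B_list, weights, svd_rank):
    # Steps 1 & 2: weighted sum of the reconstructed updates delta_W_i = B_i @ A_i
    delta_w = sum(w * (B @ A) for w, A, B in zip(weights, A_list, B_list))
    # Step 3: truncated SVD to recover rank-r LoRA factors
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    B_new = U[:, :svd_rank] * S[:svd_rank]  # shape (d_out, r); singular values folded into B
    A_new = Vh[:svd_rank, :]                # shape (r, d_in)
    return A_new, B_new
```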

For both output and parameter averaging, passed weights are normalized by default. To disable normalization, pass `normalize_weights=False`.
For more detailed examples and explanations, refer to our [Task Arithmetic notebook](https://github.com/adapter-hub/adapters/tree/main/notebooks/task_arithmetics_in_adapter.ipynb).


```{eval-rst}
.. tip::
    Adding more adapter merging methods is easy: simply modify the ``average_adapter`` method. Most adapter methods use the default implementation in `model_mixin.py <https://github.com/adapter-hub/adapters/blob/main/src/adapters/model_mixin.py>`_, which only supports linear merging. Others, like LoRA, override this method to add merging strategies such as ``lora_delta_w_svd``; have a look at `lora.py <https://github.com/adapter-hub/adapters/blob/main/src/adapters/methods/lora.py>`_.
```


## Nesting composition blocks

Expand All @@ -293,13 +342,13 @@ model.active_adapters = ac.Stack("a", ac.Split("b", "c", splits=60))

However, combinations of adapter composition blocks cannot be arbitrarily deep. All currently supported possibilities are visualized in the table below.

| Block | Supported Nesting |
| ------------------------------ | ------------------------------------------------- |
| [`Stack`](#stack) | [str, Fuse, Split, Parallel, BatchSplit, Average] |
| [`Fuse`](#fuse) | [str, Stack] |
| [`Split`](#split) | [str, Split, Stack, BatchSplit, Average] |
| [`Parallel`](#parallel) | [str, Stack, BatchSplit, Average] |
| [`BatchSplit`](#batchsplit) | [str, Stack, Split, BatchSplit, Average] |
| [`Average`](#output-averaging) | [str, Stack, Split, BatchSplit] |

In the table, `str` represents an adapter, e.g. adapter "a" in the nesting example above. Depending on the individual model, some nested compositions might not be possible.
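For example, nesting a `BatchSplit` inside a `Stack` is supported according to the table above. A minimal sketch with placeholder adapter names:

```python
import adapters.composition as ac

# Adapter "a" is applied first; then the first two rows of the batch
# pass through adapter "b" and the remaining four rows through "c".
model.active_adapters = ac.Stack("a", ac.BatchSplit("b", "c", batch_sizes=[2, 4]))
```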