Merged

113 commits
cff31a4
add alora dir
Jun 18, 2025
beb92e4
Add files via upload
kgreenewald Jun 18, 2025
0b445a4
initial alora-peft integration
kgreenewald Jun 19, 2025
3625096
fix init files
Jun 19, 2025
bde5021
bugfixes
Jun 23, 2025
c43a6e1
Update layer.py
kgreenewald Jun 24, 2025
063716d
Update __init__.py
kgreenewald Jun 24, 2025
8cde2c0
Update model.py
kgreenewald Jun 24, 2025
4ce95c7
Merge branch 'huggingface:main' into main
kgreenewald Jun 30, 2025
00f9cff
Add tokenized invocation_tokens to config
kgreenewald Jun 30, 2025
b94faa7
Get rid of tokenizer argument, now use invocation_tokens
kgreenewald Jun 30, 2025
26f5cf7
tokenizer code in warning
kgreenewald Jun 30, 2025
dcd78c5
Update test_custom_models.py
kgreenewald Jun 30, 2025
087781f
alora tests
kgreenewald Jul 1, 2025
89fd2b1
test debugging
Jul 1, 2025
b4b3465
refactor alora as lora variant
kgreenewald Jul 3, 2025
e9c0568
Merge pull request #1 from kgreenewald/codex/refactor-alora-method-to…
kgreenewald Jul 3, 2025
4de01c7
add alora to lora config
kgreenewald Jul 3, 2025
f7cb9d8
Update model.py
kgreenewald Jul 3, 2025
10d660f
Update model.py
kgreenewald Jul 3, 2025
e61a7b2
Update layer.py for alora
kgreenewald Jul 3, 2025
94ebcb3
Update layer.py for use_alora
kgreenewald Jul 3, 2025
99046cc
Update __init__.py
kgreenewald Jul 3, 2025
628a84d
Update peft_model.py
kgreenewald Jul 3, 2025
c2f83f1
Update config.py
kgreenewald Jul 3, 2025
9c73782
alora_offsets forward hook
kgreenewald Jul 3, 2025
516d563
Merge branch 'huggingface:main' into main
kgreenewald Jul 3, 2025
4a47414
Delete src/peft/tuners/alora directory
kgreenewald Jul 3, 2025
4608351
Check use_alora flag for aLoRA
kgreenewald Jul 3, 2025
3b2466e
Merge pull request #2 from kgreenewald/codex/update-tuners/peft_model…
kgreenewald Jul 3, 2025
faeadb1
Merge branch 'huggingface:main' into main
kgreenewald Jul 8, 2025
191a605
inference working
Jul 13, 2025
475ee8f
update tests
kgreenewald Jul 13, 2025
3781a81
tests passing
Jul 14, 2025
b9d1745
whitespace
Jul 14, 2025
22177f3
whitespace
Jul 14, 2025
498bdb1
format
Jul 14, 2025
6bb1d5b
make quality
Jul 14, 2025
9f26600
Update pyproject.toml
kgreenewald Jul 14, 2025
e613d02
streamline config
kgreenewald Jul 17, 2025
9a0f9d9
decoder tests
Jul 18, 2025
3d1284a
make quality
Jul 18, 2025
3734623
moving more alora_offsets to variants.py
kgreenewald Jul 18, 2025
fa59f51
fixes
Jul 18, 2025
814b895
Merge branch 'huggingface:main' into main
kgreenewald Jul 18, 2025
8e418c0
Update peft_model.py
kgreenewald Jul 30, 2025
dd6b670
Update config.py
kgreenewald Jul 30, 2025
183c6a6
Update variants.py
kgreenewald Jul 30, 2025
1df3c9c
Update variants.py
kgreenewald Jul 30, 2025
6c129c0
Update testing_common.py
kgreenewald Jul 30, 2025
61b44a1
Merge branch 'huggingface:main' into main
kgreenewald Jul 30, 2025
9de9c18
Update testing_common.py
kgreenewald Jul 30, 2025
6b7242c
Update config.py
kgreenewald Jul 30, 2025
21a4054
Update model.py
kgreenewald Jul 30, 2025
c9fb085
Update peft_model.py
kgreenewald Jul 30, 2025
3bde17d
Update lora.md
kgreenewald Jul 30, 2025
0e50475
variants tests and example
kgreenewald Aug 5, 2025
9d6e90f
Merge branch 'main' of https://github.com/kgreenewald/peft_alora
kgreenewald Aug 5, 2025
e77ab4a
Merge branch 'huggingface:main' into main
kgreenewald Aug 5, 2025
0a24c72
fixes
Aug 5, 2025
6fe25db
amend
Aug 5, 2025
06bf2a2
new changes
Aug 6, 2025
2d49f38
Update lora.md
kgreenewald Aug 6, 2025
180d9f5
Merge branch 'huggingface:main' into main
kgreenewald Aug 6, 2025
cff5b07
Update lora.md
kgreenewald Aug 6, 2025
7924039
Update docs/source/developer_guides/lora.md
kgreenewald Aug 15, 2025
ac82acd
Merge branch 'huggingface:main' into main
kgreenewald Aug 15, 2025
6f1e284
Update lora.md
kgreenewald Aug 15, 2025
21ceb56
Update variants.py
kgreenewald Aug 15, 2025
d1d31e7
Update src/peft/tuners/lora/variants.py
kgreenewald Aug 15, 2025
076411f
Update variants.py
kgreenewald Aug 15, 2025
3c267ce
Update tests/test_lora_variants.py
kgreenewald Aug 15, 2025
56455a8
Update test_custom_models.py
kgreenewald Aug 15, 2025
14752be
Update model.py
kgreenewald Aug 15, 2025
57313a5
Update testing_common.py
kgreenewald Aug 15, 2025
089d304
Update bnb.py
kgreenewald Aug 18, 2025
35c7aae
Update test_lora_variants.py
kgreenewald Aug 18, 2025
cb79411
Update test_lora_variants.py
kgreenewald Aug 18, 2025
99dc4fa
Update test_lora_variants.py
kgreenewald Aug 18, 2025
133183a
workaround for new tokens
kgreenewald Aug 18, 2025
0b7b164
Update test_lora_variants.py
kgreenewald Aug 19, 2025
7d05034
Update variants.py
kgreenewald Aug 19, 2025
1d16e13
Update lora.md
kgreenewald Aug 19, 2025
31fcfcc
tests and example
Aug 19, 2025
45be768
Update test_lora_variants.py
kgreenewald Aug 19, 2025
de4b886
offsets_change
Aug 20, 2025
0033694
Merge remote-tracking branch 'upstream/main'
kgreenewald Aug 20, 2025
26287e8
Merge branch 'main' of https://github.com/kgreenewald/peft_alora
kgreenewald Aug 20, 2025
b2d16e5
Merge branch 'huggingface:main' into main
kgreenewald Sep 1, 2025
5bba212
Update pyproject.toml
kgreenewald Sep 1, 2025
3bd6196
Update test_lora_variants.py
kgreenewald Sep 1, 2025
b541cff
Update test_lora_variants.py
kgreenewald Sep 1, 2025
783cf90
Update test_lora_variants.py
kgreenewald Sep 1, 2025
92e1305
Update test_custom_models.py
kgreenewald Sep 1, 2025
e536b1a
Update test_decoder_models.py
kgreenewald Sep 1, 2025
43a2fc2
Update variants.py
kgreenewald Sep 1, 2025
ea964fd
Update variants.py
kgreenewald Sep 1, 2025
7bf2943
Update test_gpu_examples.py
kgreenewald Sep 1, 2025
c1e6a39
latest requests
Sep 2, 2025
af76162
latest requests
Sep 2, 2025
1ae7155
Update variants.py
kgreenewald Sep 2, 2025
bd15f77
Update variants.py
kgreenewald Sep 2, 2025
4641d60
make test
Sep 2, 2025
c99cd90
Update variants.py
kgreenewald Sep 2, 2025
0128793
Update lora.md
kgreenewald Sep 2, 2025
4e79da0
make style
Sep 2, 2025
f2ab507
Update lora.md
kgreenewald Sep 3, 2025
b27a6dc
Update docs/source/developer_guides/lora.md
kgreenewald Sep 3, 2025
082d417
Update docs/source/developer_guides/lora.md
kgreenewald Sep 3, 2025
582b043
Update docs/source/developer_guides/lora.md
kgreenewald Sep 3, 2025
2d3fadf
Update docs/source/developer_guides/lora.md
kgreenewald Sep 3, 2025
9a20744
Update docs/source/developer_guides/lora.md
kgreenewald Sep 3, 2025
dbd56e7
Update lora.md
kgreenewald Sep 3, 2025
102 changes: 102 additions & 0 deletions docs/source/developer_guides/lora.md
@@ -173,6 +173,108 @@ from peft import LoraConfig

config = LoraConfig(use_rslora=True, ...)
```
### Activated LoRA (aLoRA)

Activated LoRA (aLoRA) is a low-rank adapter architecture for causal LMs that allows reusing the base model's existing KV cache for more efficient inference. This approach is best suited for inference pipelines that rely on the base model for most tasks/generations but use aLoRA adapter(s) to perform specialized tasks within the chain, for example checking or correcting the base model's generated outputs. In these settings, inference times can be sped up by an order of magnitude or more. For more information on aLoRA and many example use cases, see https://huggingface.co/papers/2504.12397.

This technique scans each input for the last occurrence of an invocation sequence (`alora_invocation_tokens`), which can be as short as one token, and activates the adapter weights starting at the beginning of that invocation sequence (any input tokens after the invocation sequence are also adapted, and all generated tokens use the adapted weights). Weights on prior tokens are left un-adapted, making the cache for those tokens interchangeable with base model cache thanks to the causal attention mask in causal LMs. Usage is very similar to standard LoRA, with the key difference that the invocation sequence must be specified when the adapter is created:

```py
from peft import LoraConfig

config = LoraConfig(alora_invocation_tokens=alora_invocation_tokens, task_type="CAUSAL_LM", ...)
```

where `alora_invocation_tokens` is a list of integer token ids. Given a desired invocation string, this can be obtained as
```py
invocation_string = "placeholder"
alora_invocation_tokens = tokenizer.encode(invocation_string, add_special_tokens=False)
```
where the tokenizer is the tokenizer of the base model. Note that `add_special_tokens=False` is used to avoid adding BOS/EOS tokens to the search sequence, which would most likely cause the invocation sequence to never be found.
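
To build intuition for how activation works, here is a minimal conceptual sketch of locating the last occurrence of the invocation sequence in a tokenized input. This is not the PEFT internals; `find_activation_start` is a hypothetical helper used only for illustration:

```py
# Conceptual sketch only -- PEFT performs this scan internally.
def find_activation_start(input_ids: list[int], alora_invocation_tokens: list[int]) -> int | None:
    n = len(alora_invocation_tokens)
    # Scan backwards so the *last* occurrence of the invocation sequence wins.
    for start in range(len(input_ids) - n, -1, -1):
        if input_ids[start : start + n] == alora_invocation_tokens:
            return start  # adapter weights apply from this index onward
    return None  # invocation not found -> the adapter is not activated for this input
```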

**Notes**
* aLoRA is only supported for `task_type=CAUSAL_LM` tasks due to its focus on cache reuse.
* Since the weights are adapted on fewer tokens, often (not always) aLoRA requires higher rank (`r`) than LoRA. `r=32` can be a good starting point.
* aLoRA weights cannot be merged into the base model by definition, since the adapter weights are selectively applied to a subset of tokens. Attempts to merge will throw errors.
* Beam search is not yet supported.
* It is generally not recommended to add new tokens to the tokenizer that are not present in the base model, as this can complicate the target use case of both the base model and adapter model operating on overlapping context. That said, there is a possible workaround by first efficiently adding [trainable tokens](https://huggingface.co/docs/peft/en/package_reference/trainable_tokens) to the base model prior to training the adapter.

#### Choice of invocation sequence and SFT design

Each input must contain the `alora_invocation_tokens` sequence; it is not added automatically. To maximize model performance without compromising cache reuse, it is recommended to activate the adapter weights early, i.e. at the start of any adapter-specific prompting, but after any long inputs such as prior generations or documents. As with any model, formatting should be consistent between train and test.

Consider the following example, where the base model has a chat template,
and the goal is to train the adapter to generate a desired output.

* Option 1: If there is no task-specific prompt, i.e. the input is a chat history ending with the `assistant` generation prompt, then the chat template's `assistant` prompt (e.g. `<|start_of_role|>assistant<|end_of_role|>`) is a natural choice for the invocation string. Check the model's chat template to find this prompt for your model.
* Option 2: If there is a task-specific prompt for the adapter that describes the task the adapter is learning, and that prompt is put as a `user` turn immediately prior to the generation, then the chat template's `user` prompt (e.g. `<|start_of_role|>user<|end_of_role|>`) is a natural choice for the invocation string.

Once the invocation string is decided, get the model tokenizer and obtain `alora_invocation_tokens` as
```py
alora_invocation_tokens = tokenizer.encode(invocation_string, add_special_tokens=False)
```
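
For Option 1 above, one hedged way to derive the assistant invocation string programmatically is to render the chat template with and without the generation prompt and take the difference. This is a sketch under the assumption that the template appends the assistant prompt at the end when `add_generation_prompt=True`:

```py
chat = [{"role": "user", "content": "placeholder"}]
without_prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=False)
with_prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
invocation_string = with_prompt[len(without_prompt):]  # e.g. "<|start_of_role|>assistant<|end_of_role|>"
alora_invocation_tokens = tokenizer.encode(invocation_string, add_special_tokens=False)
```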

An example inference setup is at [alora finetuning](https://github.com/huggingface/peft/blob/main/examples/alora_finetuning/alora_finetuning.py).

**Note** If using a custom invocation string, make sure that it begins and ends with special tokens to avoid tokenization issues at the boundaries.

To see why, imagine that 'a', 'b', 'c', and 'ab' are tokens in your tokenizer (with ids 1, 2, 3, and 4, respectively) and that `alora_invocation_tokens = [2, 3]`. Now suppose your input string is "abc". Because "ab" is a token, "abc" is tokenized as [4, 3], so the `alora_invocation_tokens` sequence is not found even though the string "bc" is present. If the start and end of the invocation string are special tokens, this failure case can never happen, since special tokens are never merged into the same token as other characters.
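
The following toy snippet (hypothetical vocabulary, plain Python lists) illustrates the failure case described above:

```py
# Hypothetical vocabulary: "a" -> 1, "b" -> 2, "c" -> 3, "ab" -> 4
alora_invocation_tokens = [2, 3]  # intended invocation string "bc"
input_ids = [4, 3]                # "abc" tokenizes as ["ab", "c"], not ["a", "b", "c"]

def contains(haystack, needle):
    return any(haystack[i : i + len(needle)] == needle for i in range(len(haystack) - len(needle) + 1))

print(contains(input_ids, alora_invocation_tokens))  # False -- the invocation sequence is never found
```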

#### Using (and reusing) cache for generation
The main purpose of Activated LoRA is to make the KV cache interchangeable between the base model and aLoRA adapter models **prior to the invocation sequence**, since base and adapted KV values are not compatible with each other. Specifically, keys and values stored during one model generation can be reused in subsequent generations to avoid expensive prefill operations for context tokens. When sharing cache between the base model and aLoRA adapters, there are two main patterns:
1. The base model has generated something, and an aLoRA adapter is then called to do a followup generation. Example: the base model answers a question, and an aLoRA trained to detect hallucinations checks the base model response.
2. An aLoRA adapter has generated something, and the base model or a different aLoRA adapter is called to do a followup generation where there is partial context overlap with the original aLoRA. Example: The user provides a query, and an aLoRA rewrites the query to be more self-contained and improve retrieval in a RAG system. Then, documents are retrieved and loaded into context, an aLoRA checks if these documents are indeed relevant to the question, and then the base model generates an answer.


To demonstrate the above behaviors when using caching, we're using [DynamicCache](https://huggingface.co/docs/transformers/en/kv_cache) from `transformers`. Care must be taken to ensure that adapted cache values are not mixed with base cache values. In particular, an extra step is required for sharing the cache when there is partial context overlap (pattern 2).

**Pattern 1: Base model followed by aLoRA** Here, the entire input and generation from the base model is input into the aLoRA adapter, along with the invocation sequence:
```py
from transformers import DynamicCache
...
cache = DynamicCache()
inputs_base = tokenizer(prompt_base, return_tensors="pt")
# Generate from base model and save cache
with model_alora.disable_adapter():
    output = model_alora.generate(
        inputs_base["input_ids"].to(device),
        attention_mask=inputs_base["attention_mask"].to(device),
        past_key_values=cache,
        return_dict_in_generate=True,
    )
output_text_base = tokenizer.decode(output.sequences[0])
cache = output.past_key_values

# Generate with aLoRA adapter from cache
prompt_alora = output_text_base + INVOCATION_STRING
inputs_alora = tokenizer(prompt_alora, return_tensors="pt").to(device)
output = model_alora.generate(**inputs_alora, past_key_values=cache)
output_text_alora = tokenizer.decode(output[0])

# Note: the cache is now tainted with adapter values and cannot be used in the base model from here on!
```

**Pattern 2: aLoRA generation followed by base model (or another aLoRA) with partial context overlap** Here, we prefill the shared context using the base model, and then generate.
```py
from transformers import DynamicCache
import copy
import torch
...
cache = DynamicCache()
inputs_shared = tokenizer(prompt_shared, return_tensors="pt").to(device)

# Prefill from base model and save cache
with model_alora.disable_adapter():
    with torch.no_grad():
        model_alora(**inputs_shared, past_key_values=cache)
cache_copy = copy.deepcopy(cache)

# Generate from aLoRA using prefilled cache
prompt_alora = prompt_shared + INVOCATION_STRING
inputs_alora = tokenizer(prompt_alora, return_tensors="pt").to(device)
output = model_alora.generate(**inputs_alora, past_key_values=cache)
output_text_alora = tokenizer.decode(output[0])

# Generate from base model using saved cache not tainted by aLoRA KV values
prompt_base = prompt_shared
inputs_base = tokenizer(prompt_base, return_tensors="pt").to(device)
with model_alora.disable_adapter():
    output = model_alora.generate(**inputs_base, past_key_values=cache_copy)
output_text_base = tokenizer.decode(output[0])
```

### Weight-Decomposed Low-Rank Adaptation (DoRA)

76 changes: 76 additions & 0 deletions examples/alora_finetuning/README.md
@@ -0,0 +1,76 @@
# Activated LoRA (aLoRA)

## Introduction
Activated LoRA (aLoRA) is an adapter that selectively activates its weights only after a given invocation sequence, ensuring that hidden states match the base model prior to that point. This allows the base model's KVs (stored in the KV cache) to be reused for tokens before the invocation, enabling much faster real-world inference (e.g. with vLLM) when switching between generation with the base model and generation with adapters.
See the [paper](https://huggingface.co/papers/2504.12397) for more details.

## Quick start (shown for Mistral 7B)
```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
dataset = load_dataset("Lots-of-LoRAs/task1660_super_glue_question_generation", split="train")

invocation_string = "[/INST]" # End of user turn in Mistral chat template
invocation_tokens = tokenizer.encode(invocation_string, add_special_tokens=False)

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    alora_invocation_tokens=invocation_tokens,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
)

peft_model = get_peft_model(model, lora_config)
training_args = SFTConfig(
    output_dir="alora-mistral-7b",
    dataset_text_field="text",
    max_seq_length=2048,
)
trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
peft_model.save_pretrained("alora-mistral-7b")
```
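
After training, a minimal inference sketch could look like the following. It assumes the adapter saved above and uses the chat template so that the prompt ends with the `[/INST]` invocation string; names such as the passage text are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3", device_map="cuda")
model = PeftModel.from_pretrained(base_model, "alora-mistral-7b")

# The rendered user turn ends with "[/INST]", i.e. the invocation string used at training time.
messages = [{"role": "user", "content": "Generate a question about the following passage: ..."}]
inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", return_dict=True
).to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```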

### Use the training example script directly
Pass the invocation string with `--invocation_string` when running the training example
script. For Mistral 7B, do:
```bash
python examples/alora_finetuning/alora_finetuning.py --base_model mistralai/Mistral-7B-Instruct-v0.3 --data_path Lots-of-LoRAs/task1660_super_glue_question_generation --invocation_string "[/INST]"
```
and similarly for Llama-3.2-3B-Instruct:
```bash
python examples/alora_finetuning/alora_finetuning.py --base_model meta-llama/Llama-3.2-3B-Instruct --data_path Lots-of-LoRAs/task1660_super_glue_question_generation --invocation_string "<|start_header_id|>assistant<|end_header_id|>"
```

### Full example of the script
```bash
python alora_finetuning.py \
--base_model "PATH_TO_MODEL" \
--data_path "PATH_TO_DATASET" \
--output_dir "PATH_TO_OUTPUT_DIR" \
--batch_size 1 \
--num_epochs 3 \
--learning_rate 3e-4 \
--cutoff_len 512 \
--val_set_size 500 \
--invocation_string "[/INST]" \
--quantize \
--eval_step 10 \
--save_step 100 \
--device "cuda:0" \
--lora_r 32 \
--lora_alpha 32 \
--lora_dropout 0.05 \
--lora_target_modules "q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj" \
--hub_model_id "YOUR_HF_REPO" \
--push_to_hub
```