FIX Account for attention mask being a dict, fix generate issues with gemma #2579
Conversation
Resolves CI errors such as this one: https://github.com/huggingface/peft/actions/runs/15481482956/job/43588020111#step:5:53182

After resolving that error, other errors can occur, but they're unrelated and are being investigated independently.

After the transformers change in huggingface/transformers#37866, it can happen that:

> Models using different types of attention in different layers (i.e. gemma3) will now have a dict returned by prepare_inputs_for_generation (one dict entry per attention type)

As PEFT operates on the attention mask for prompt learning methods, we need to adjust the code for the possibility of attention_mask being a dict. Right now, I simply extract the single value if the dict has just one element. For other sizes, I just raise an error, as I don't know how to deal with that. For our tests, this is enough, but we might need to find a better solution in the future.
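A minimal sketch of that dict handling (illustrative only, not the literal diff; the helper name is hypothetical):

```python
def _unwrap_attention_mask(attention_mask):
    # Illustrative helper, not the literal code in this PR: after
    # huggingface/transformers#37866, the attention mask may be a dict with one entry
    # per attention type instead of a single tensor.
    if isinstance(attention_mask, dict):
        if len(attention_mask) == 1:
            # only one attention type: unwrap the single mask and proceed as before
            return next(iter(attention_mask.values()))
        # several attention types (e.g. full vs. sliding window): unclear how PEFT
        # should handle this for prompt learning, so raise instead of guessing
        raise ValueError("Expected a single attention mask, but got multiple attention types.")
    return attention_mask
```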
Avoid regression, even though I'm not quite sure if the old behavior is technically correct.
Thanks for taking this on :)
LGTM, but it was a bit hard to understand. I added comments on what explanations would have helped to understand the code faster.
        
          
src/peft/peft_model.py (Outdated)
```python
model_kwargs["attention_mask"] = torch.ones((bs, total_seq_len), dtype=attention_mask.dtype)
attention_mask_2d = torch.ones((bs, total_seq_len), dtype=attention_mask.dtype)

# heuristic to determine if we're in prefill stage
```
in the comment: please explain what the prefill stage is for and what it relates to (kv cache)
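For instance, the comment could say something along these lines (just a sketch):

```python
# "Prefill" is the first forward pass of generate(): the whole prompt is processed at
# once and its key/value states are written to the kv cache. Later decoding steps only
# process the newly generated token and reuse that cache, which is why cache_position
# starts at 0 only during prefill.
```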
        
          
src/peft/peft_model.py (Outdated)
```python
# if in prefill stage, for prompt learning methods that are not prefix tuning, new tokens
# (embeddings) are inserted
```
in the comment: please explain why prefix tuning is exempt here
Currently it reads as if prefix tuning doesn't insert inputs at all, which is confusing because it does, but, IIUC, into the kv cache of all layers.
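For example, the comment could spell this out along these lines (sketch, following the reviewer's reading of prefix tuning):

```python
# Prefix tuning is exempt because it does not prepend virtual token embeddings to the
# input sequence; instead, its learned prefixes are injected into the kv cache of every
# layer, so the length of the inputs_embeds does not grow here.
```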
        
          
src/peft/peft_model.py (Outdated)
```python
# if cache_position exists and if we're in the prefill stage
if (
    (model_kwargs.get("cache_position") is not None)
    and (model_kwargs["cache_position"][0] == 0)
```
it might make sense to move the "cache_position is not None" and "cache_position[0] == 0" checks higher up and turn them into an is_prefill variable with a proper comment, to reduce repetition and add a bit of semantic context.
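A rough sketch of that suggestion (names illustrative, not the actual diff):

```python
# cache_position starts at 0 exactly once, namely during prefill (the initial prompt
# pass that populates the kv cache); later decoding steps continue from the cache length.
cache_position = model_kwargs.get("cache_position")
is_prefill = (cache_position is not None) and (cache_position[0] == 0)

if is_prefill and peft_config.peft_type != PeftType.PREFIX_TUNING:
    # prompt learning (except prefix tuning) prepends virtual token embeddings here,
    # so the attention mask needs to grow accordingly
    ...
```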
Note: I checked locally and some Gemma3 tests are failing when run on GPU due to compile errors. The reason is that Gemma uses some code that is not torch.compile-friendly. I could confirm that these failures are not caused by this PR, but rather that these failures were masked by the error that is fixed in this PR.
Perfect, it reads a lot clearer now. Thanks :)
- Bump versions
- Update a comment to point to the new PR
- Remove a test skip that is obsolete after #2579
See also #2580
Resolves CI errors such as this one:
https://github.com/huggingface/peft/actions/runs/15481482956/job/43588020111#step:5:53182
This PR resolves 2 issues:
1. attention mask being a dict
After the transformers change in huggingface/transformers#37866, it can happen that:

> Models using different types of attention in different layers (i.e. gemma3) will now have a dict returned by prepare_inputs_for_generation (one dict entry per attention type)
As PEFT operates on the attention mask for prompt learning methods, we need to adjust the code for the possibility of attention_mask being a dict. Right now, I simply extract the single value if the dict has just one element. For other sizes, I just raise an error, as I don't know how to deal with that. For our tests, this is enough, but we might need to find a better solution in the future.

2. torch.compile errors during generation
#2458 fixed an issue with 4d attention masks and added gemma3, which uses 4d attention masks, to the test suite. However, the solution was insufficient, as it involved replacing the 4d attention mask with a 2d mask and handing it off to the model to create the correct 4d attention mask. The problem is that mask creation triggers an error with torch.compile and thus needs to be performed outside of the compile context, i.e. during prepare_inputs_for_generation. This PR now uses the same logic as transformers to do exactly that.
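Illustrative pattern only (the helpers below are simplified stand-ins, not the actual transformers/PEFT code): the 4d mask is built eagerly in prepare_inputs_for_generation, so the compiled forward only consumes it and never traces its construction.

```python
import torch

def build_4d_causal_mask(attention_mask_2d: torch.Tensor, dtype=torch.float32) -> torch.Tensor:
    # (bs, seq_len) -> (bs, 1, seq_len, seq_len): 0.0 where attention is allowed, a large
    # negative value on future and padded positions (prefill-only simplification, no kv-cache offset)
    bs, seq_len = attention_mask_2d.shape
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool, device=attention_mask_2d.device))
    allowed = causal[None, None, :, :] & attention_mask_2d[:, None, None, :].bool()
    mask = torch.full((bs, 1, seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype,
                      device=attention_mask_2d.device)
    return mask.masked_fill(allowed, 0.0)

def prepare_inputs_for_generation(input_ids, attention_mask_2d, **model_kwargs):
    # mask creation happens here, in eager mode, outside any torch.compile'd region;
    # the compiled forward then receives a ready-made 4d mask
    model_kwargs["input_ids"] = input_ids
    model_kwargs["attention_mask"] = build_4d_causal_mask(attention_mask_2d)
    return model_kwargs
```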
There are still issues with prefix tuning and incorrect shapes, which may be solvable but require further work. Similarly, there is an issue with VBLoRA, because this line is not torch.compile-friendly: peft/src/peft/tuners/vblora/layer.py, line 191 (at e67052b).
The corresponding tests are skipped for now.
Finally, for these fixes to work, two more changes are needed on the transformers side:
- create_masks_for_generate. For prompt learning, we remove the cache_position argument; I'm not quite sure if there is not a better solution. Anyway, because of this it needs to be recomputed, but models like gemma recompute it in a way that is not torch.compile-friendly. They should use a compile-friendly method instead (see the sketch below). When I locally patch transformers to do so, the tests pass.
- cache_position is no longer being removed from the model_kwargs, thus the aforementioned problem does not occur.

For these reasons, this PR stays in draft status for now and #2580 is used to make the CI green for the time being.
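As an illustration of the first point (an assumption about the typical failure mode, not the actual gemma code): recomputing cache_position stays torch.compile-friendly as long as it avoids pulling values out of tensors (e.g. via .item()) inside the compiled region, for example:

```python
import torch

# Hypothetical sketch: `past_seen_tokens` is assumed to be a plain Python int
# (e.g. from the cache API), so the arange below does not trigger a graph break.
def recompute_cache_position(inputs_embeds: torch.Tensor, past_seen_tokens: int) -> torch.Tensor:
    seq_len = inputs_embeds.shape[1]
    return torch.arange(past_seen_tokens, past_seen_tokens + seq_len, device=inputs_embeds.device)
```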