Generate: handle cache_position update in generate #29467
Conversation
Force-pushed from f5c91b9 to 572ca8e
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Alright, I think Llama is already testing this. Moving fast here
# TODO: This is error prone, a filled cache may be `0.0`. Let's use a stateless integer instead, after
# https://github.com/pytorch/pytorch/issues/120248 is fixed
return (self.key_cache[0, 0].any(dim=-1)).sum()
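For context, a hedged standalone sketch of what that heuristic computes; the tensor shapes and values below are made up for illustration, not taken from the PR:

```python
import torch

# Illustration only: a static cache is pre-allocated with zeros, so "filled" positions
# are inferred from key vectors that contain at least one non-zero value.
max_cache_len, head_dim = 8, 4
key_cache = torch.zeros(1, 1, max_cache_len, head_dim)  # (batch, heads, seq, head_dim)
key_cache[0, 0, :3] = torch.randn(3, head_dim)          # pretend 3 tokens were written

seen_tokens = key_cache[0, 0].any(dim=-1).sum()
print(seen_tokens)  # tensor(3)

# The failure mode flagged in the TODO: a token whose key vector happens to be all
# zeros would not be counted, hence the plan to switch to a stateless integer.
```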
alright, we are deprecating this anyway
@@ -663,7 +662,8 @@ def _update_model_kwargs_for_generation(
            dim=-1,
        )

        model_kwargs["cache_position"] = model_inputs.get("cache_position", None)
        if "cache_position" in model_kwargs and model_kwargs["cache_position"] is not None:
            model_kwargs["cache_position"] = model_kwargs["cache_position"][-1:] + 1
my single worry here is potential stride, adding a `.contiguous()` might be needed
I've double-checked, the stride is always (1,) 🤗 (which makes sense, since it's a 1D tensor)
Its shape will indeed be different, at least between prefill and subsequent generation
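A quick check backing that observation (illustrative snippet, not from the PR):

```python
import torch

# Both the prefill positions and the sliced/advanced decode positions are 1D and
# contiguous, so their stride is (1,) and no `.contiguous()` call should be needed.
prefill = torch.arange(5)
decode = prefill[-1:] + 1

print(prefill.stride(), decode.stride())  # (1,) (1,)
print(prefill.shape, decode.shape)        # torch.Size([5]) torch.Size([1])
```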
We should also set the dtype of the cache positions to int32
wdyt?
Our integer inputs (`input_ids`, `attention_mask`, ...) are all `int64`, I think we should keep a consistent type :p
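A minimal dtype check illustrating that point; the tensors below are stand-ins, not real model inputs:

```python
import torch

# Tokenizers emit int64 input_ids / attention_mask, and torch.arange also defaults
# to int64, so the cache positions stay consistent without an explicit cast.
input_ids = torch.tensor([[101, 2023, 2003, 102]])  # stand-in for tokenizer output
attention_mask = torch.ones_like(input_ids)
cache_position = torch.arange(input_ids.shape[1])

print(input_ids.dtype, attention_mask.dtype, cache_position.dtype)
# torch.int64 torch.int64 torch.int64
```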
@@ -790,6 +790,10 @@ def _reset_cache(self):
                more detail.
            return_dict (`bool`, *optional*):
                Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
            cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
we have correct long typing here!
(see int64 comment above)
Force-pushed from 58660e2 to 10360b3
(rebasing and rerunning tests, just in case 🙃)
To resolve the error `TypeError: LlavaLlamaForCausalLM.forward() got an unexpected keyword argument 'cache_position'` introduced by huggingface/transformers#29467
What does this PR do?

Updates `cache_position` in `generate`, and makes it the primary source for the input position in the models that support it, `llama` and `gemma` (as opposed to relying on `past_key_values.seen_tokens`).

The PR also adds the following related changes:
- `StaticCache` now supports `get_seq_length()`. This was drawn from "Static Cache: no mandatory `cache_positions` input" (#29221), and is needed for `.prepare_inputs_for_generation()` retrocompatibility;
- the `seen_tokens` attribute enters a deprecation cycle, as it is redundant with `cache_positions` (and doesn't work with compilation).

This PR is drawn from the diff in #29374, i.e. it is a requirement for `generate` compilation with `fullgraph=True` 🙌

👉 Llama, Gemma, and Cache slow tests ran, no new failures
👉 FWD compilation benchmarks ran, no throughput change
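As a usage-level illustration of what this enables, here is a hedged sketch; the model name, dtype, and the `cache_implementation = "static"` setting are assumptions for illustration rather than part of this PR's diff:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: a Llama checkpoint with the static cache enabled, so that generate
# tracks positions through cache_position instead of past_key_values.seen_tokens.
model_id = "meta-llama/Llama-2-7b-hf"  # illustrative model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

model.generation_config.cache_implementation = "static"  # opt into StaticCache

inputs = tokenizer("The capital of France is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```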