[Whisper] Use Attention Cache #28931
Conversation
self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
# add present self-attn cache to positions 1,2 of present_key_value tuple
# decoder uni-directional self-attention cached key/values states are at position 0
self_attn_past_key_value = past_key_value[0] if past_key_value is not None else None
The difficulty here comes from the fact that we're dealing with two sets of past key-values per decoder layer: one from the self-attention, and one from the cross-attention. The current solution uses a separate cache for each.
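To make the indexing change above concrete, here is a small, purely illustrative sketch (not the PR's actual code): it contrasts the legacy per-layer key/value tuple with a layout that keeps one cache object per attention type, which is what the `past_key_value[0]` indexing assumes. The stand-in "cache objects" and all shapes below are examples only.

```python
# Illustrative sketch only -- the cache "objects" are stand-in dicts, and all shapes are examples.
import torch

def kv(seq_len):
    return torch.randn(1, 8, seq_len, 64)  # (batch, heads, seq_len, head_dim)

# Legacy layout: each decoder layer carries a 4-tuple (self_k, self_v, cross_k, cross_v).
legacy_past_key_value = (kv(5), kv(5), kv(1500), kv(1500))
self_attn_past_key_value = legacy_past_key_value[:2]    # first two entries -> self-attention k/v
cross_attn_past_key_value = legacy_past_key_value[-2:]  # last two entries  -> cross-attention k/v

# Proposed layout: one cache object per attention type, so indexing picks a whole cache.
self_attn_cache = {"key": kv(5), "value": kv(5)}         # dynamic, grows during generation
cross_attn_cache = {"key": kv(1500), "value": kv(1500)}  # fixed, computed once from the encoder
past_key_value = (self_attn_cache, cross_attn_cache)
self_attn_past_key_value = past_key_value[0]   # self-attention cache now sits at position 0
cross_attn_past_key_value = past_key_value[1]  # cross-attention cache at position 1
```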
@@ -801,7 +799,7 @@ def forward(
# Copied from transformers.models.mbart.modeling_mbart.MBartDecoderLayer with MBart->Whisper, MBART->WHISPER
I'll propagate the changes to all other MBart-derived modules once we're happy with the design.
I have a fully functional solution for static-shaped Whisper, which we have tested extensively on the LibriSpeech dataset, getting the same accuracy as the original model.
@sanchit-gandhi FYI, I'm going to make the change in the PR in progress here: #29005. Once that PR is done, we can expand its usage through the same interface, e.g. for encoder-decoder models 🤗
Thanks for the context @gante! Is there anything I can do to help with the static cache refactor? Pretty keen to implement a compile-compatible cache for Whisper!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?

Refactors the Whisper model to use the attention cache abstraction proposed in #26681. This is required for consistency with the `StaticCache` attention class proposed in #27931.

The complexity with the current `Cache` abstraction comes from the fact that Whisper is an encoder-decoder model, meaning each decoder attention layer consists of two blocks: a self-attention block, whose key/value cache grows with each generated token, and a cross-attention block, whose key/values are computed once from the encoder hidden states. The problematic layer for static generation is the dynamic k/v cache in the self-attention layer. In anticipation of using a static cache for this module, the proposed design uses a separate cache for each. We can't build the k/v cache into a single `Cache` abstraction, as the shapes for the self-attention and cross-attention key-values are different (which would break compile).

The design is therefore to pass `past_self_attn_key_values` and `past_cross_attn_key_values`, where each is a `Cache` abstraction (a rough sketch is given below). This is not the most elegant design, but it is compatible with the current `Cache` abstraction. Another option would be to refactor the `Cache`/`DynamicCache`/`StaticCache` classes for better compatibility with encoder-decoder models.

cc @ArthurZucker @tomaarsen @gante
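As a rough illustration of the interface described above: the argument names `past_self_attn_key_values` and `past_cross_attn_key_values` come from the PR description, but everything else, including the use of `DynamicCache` as a stand-in for both caches, is an assumption rather than the PR's actual implementation.

```python
# Illustrative sketch, not the PR's implementation. One Cache object per attention type,
# each holding entries for every decoder layer. The self-attention cache grows with each
# generated token (the candidate for StaticCache); the cross-attention cache is filled
# once from the encoder output and keeps a fixed shape.
import torch
from transformers.cache_utils import DynamicCache  # available in recent transformers versions

num_layers, num_heads, head_dim, encoder_len = 2, 8, 64, 1500

past_self_attn_key_values = DynamicCache()
past_cross_attn_key_values = DynamicCache()

for step in range(3):  # three toy decoding steps
    for layer_idx in range(num_layers):
        # Self-attention: append the new token's key/value states for this layer.
        self_k, self_v = past_self_attn_key_values.update(
            torch.randn(1, num_heads, 1, head_dim),
            torch.randn(1, num_heads, 1, head_dim),
            layer_idx,
        )
        # Cross-attention: compute the key/value states from the encoder output only once.
        if step == 0:
            past_cross_attn_key_values.update(
                torch.randn(1, num_heads, encoder_len, head_dim),
                torch.randn(1, num_heads, encoder_len, head_dim),
                layer_idx,
            )
        cross_k, cross_v = past_cross_attn_key_values[layer_idx]

print(self_k.shape)   # torch.Size([1, 8, 3, 64])    -> dynamic shape, grows per step
print(cross_k.shape)  # torch.Size([1, 8, 1500, 64]) -> fixed shape, set by the encoder
```

Keeping the two caches in separate objects is what allows the self-attention side to be swapped for a static, compile-friendly cache later without forcing the (differently shaped) cross-attention key/values into the same container.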