
[WIP] [Whisper] Fix generate and tokenizer behavior with added tokens #33512

Open

eustlb wants to merge 11 commits into main from fix-whisper-generate-added-tokens

Conversation

@eustlb (Contributor) commented Sep 16, 2024

What does this PR do?

Fixes #33082.

Description of the issue

Transformers' PreTrainedTokenizer._add_tokens adds tokens at the end of the vocabulary. Likewise, PreTrainedModel's _get_resized_embeddings appends newly initialized tokens at the end. This behavior is incompatible with how Whisper identifies timestamp tokens: both the official implementation and the Transformers one treat any token with an id > timestamp_begin as a timestamp. As a result, newly added tokens are falsely considered timestamps, which breaks decoding.
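A minimal sketch of the failure mode, assuming a checkpoint whose tokenizer includes the timestamp tokens (e.g. openai/whisper-tiny); exact ids vary by checkpoint:

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
timestamp_begin = tokenizer.convert_tokens_to_ids("<|0.00|>")  # first timestamp token

# add_tokens appends at the very end of the vocabulary, i.e. after the
# timestamp tokens.
tokenizer.add_tokens(["<my_token>"])
new_id = tokenizer.convert_tokens_to_ids("<my_token>")

# Any decoding rule of the form `id > timestamp_begin` now misclassifies
# the new text token as a timestamp.
print(new_id > timestamp_begin)  # True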

Implementation possibilities

Two possibilities here:

  1. change Whisper's decoding logic using a timestamp_end
  2. overwrite _add_tokens and _get_resized_embeddings so that new tokens are added before the first timestamp token (rather than at the end, as the current implementation does)

I chose option 1 to avoid overwriting the default methods of the Transformers library, focusing instead on modifying the Whisper-specific generation method and tokenizer logic.
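As a sketch, the option-1 check could look as follows (names follow this discussion and are not necessarily those of the final implementation):

def is_timestamp(token_id: int, timestamp_begin: int, timestamp_end: int) -> bool:
    # Old rule: token_id >= timestamp_begin, which breaks once new tokens
    # are appended after the timestamp range.
    # New rule: bound the range on both sides; timestamp_end here denotes
    # the id of the last timestamp token.
    return timestamp_begin <= token_id <= timestamp_end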

Implementation decision (see the discussion below for details)

In the updated implementation, Whisper will now use both timestamp_begin and timestamp_end, instead of just timestamp_begin as in OpenAI’s original version.


Who can review?

@ylacombe

TODO

  • discuss different implementations and choose
  • add tests before merging

@eustlb eustlb changed the title Fix whisper generate added tokens [Whisper] Fix generate and tokenizer behavior with added tokens Sep 16, 2024
@ylacombe (Contributor) left a comment

Thanks for this PR @eustlb, it's already in great shape!

I left a few comments here and there, but the most important ones are on the tokenizer side.

What you proposed looks great to me; the only downside I see is that users would have to be careful about how they add new tokens, especially if they want to introduce new timestamp tokens:

  1. first they'd have to add timestamp tokens to the tokenizer and add number_timestamp_tokens to the generation_config
  2. then they can add new tokens as usual

I don't see any clear way to avoid this downside, and I don't think this will be a frequent use-case anyway, so I'm happy to keep your current way of doing it.
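For illustration, that two-step flow might look like this (a sketch: number_timestamp_tokens follows this PR and is not part of a released API, and the added token strings are made up):

from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# 1. Add the extra timestamp tokens first, then record the new total count
#    in the generation config.
extra_timestamps = [f"<|{30.02 + 0.02 * i:.2f}|>" for i in range(50)]
processor.tokenizer.add_tokens(extra_timestamps)
model.generation_config.number_timestamp_tokens = 1501 + len(extra_timestamps)

# 2. Only then add ordinary text tokens as usual and resize the embeddings.
processor.tokenizer.add_tokens(["<my_token>"])
model.resize_token_embeddings(len(processor.tokenizer))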


Besides the comments, it'd be great to add some tests to test_modeling_whisper.py and test_tokenization_whisper.py to make sure generation and tokenization work when you add new tokens!

@@ -347,7 +347,9 @@ def generate(
     synced_gpus (`bool`, *optional*, defaults to `False`):
         Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
     return_timestamps (`bool`, *optional*):
-        Whether to return the timestamps with the text. This enables the `WhisperTimestampsLogitsProcessor`.
+        Whether to return the timestamps with the text. This enables the `WhisperTimeStampLogitsProcessor`.
+        By default, Whisper uses 1501 timestamp tokens. If a custom number of timestamp tokens is needed,
Contributor:

This is nice! Should we also write this somewhere in whisper.md to further highlight it since it's quite hidden from the user?

Comment on lines 332 to 335
@property
def timestamp_end(self) -> int:
timestamp_ids = [value for key, value in self.added_tokens_encoder.items() if self.timestamp_pat.match(key)]
return max(timestamp_ids)
Contributor:

It's on this point that I'll be the most cautious, as we introduced two ways of computing timestamp_end:

  1. In the modeling file, with timestamp_begin + generation_config.get("number_timestamp_tokens", 1501).
  2. In the tokenizer, with the current approach: taking the maximum id of all tokens that match the timestamp pattern.

@itazap and @ArthurZucker, WDYT of this?

To give a bit of context, the Whisper model has its usual vocabulary, to which a series of timestamp token ids (usually 1501 tokens) is appended. When a user appends a new token to the tokenizer, it is wrongly identified as a timestamp token since its id is greater than the original vocabulary size.
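For concreteness, the two computations look roughly like this (a sketch; attribute names follow the PR discussion):

# 1. Generation side: derived from the config, with a hardcoded default.
timestamp_end = timestamp_begin + getattr(generation_config, "number_timestamp_tokens", 1501)

# 2. Tokenizer side: the largest added-token id whose surface form matches
#    the timestamp pattern, e.g. <|12.34|>.
timestamp_end = max(
    token_id
    for token, token_id in tokenizer.added_tokens_encoder.items()
    if tokenizer.timestamp_pat.match(token)
)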

Contributor:

Would it make sense to avoid using timestamp_begin and timestamp_end here, and to simply verify, in the rest of the file, whether a token is a timestamp via self.timestamp_pat.match(key)?

Contributor:

This could be a way to be more rigorous about how we treat timestamps in the tokenizer, but it would introduce a difference between how we compute timestamps in the generation file and here.

@eustlb (Contributor Author):

Following up on this! I agree that the introduced inconsistency is not convenient. Since I do not see another way of identifying timestamp_end, and since the solution I proposed is computationally suboptimal (it is recomputed at each call, which could be improved with an LRU cache), I agree that identifying timestamp tokens on the tokenizer's path would be better done with a regex match. This would remove the need for timestamp_end; nevertheless, we can't avoid using timestamp_begin, since it is required to compute timestamp values (here).
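A minimal sketch of the cached, regex-based identification discussed here (the helper name is hypothetical; the pattern mirrors the tokenizer's timestamp_pat):

import re
from functools import lru_cache

TIMESTAMP_PAT = re.compile(r"<\|(\d+\.\d+)\|>")

@lru_cache(maxsize=None)
def is_timestamp_token(token: str) -> bool:
    # Identify timestamps by surface form (e.g. <|12.34|>) rather than by id,
    # so newly added text tokens are never misclassified; caching avoids
    # re-running the regex for tokens seen before.
    return TIMESTAMP_PAT.match(token) is not None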

Collaborator:

Overall, let's aim for simplicity:

  • what is the best for the users (least amount of work for them)
  • what is the simplest code for this.

@eustlb eustlb force-pushed the fix-whisper-generate-added-tokens branch from 9b88b2c to e78bbed on September 23, 2024 at 15:25
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@eustlb (Contributor Author) commented Sep 24, 2024

→ Let's wrap up a bit and justify the choices further:

In the updated implementation, Whisper will now use both timestamp_begin and timestamp_end, instead of just timestamp_begin as in OpenAI’s original version.

→ Justification:

what is the best for the users (least amount of work for them)

We have two scenarios:

  1. Adding text tokens.
  2. Adding timestamp tokens.

These cases overlap, as both types of tokens must be added via the add_tokens method. For the system to work seamlessly:

  • Text tokens should be inserted before the first timestamp token (Case 1).
  • Timestamp tokens should be added at the end of the vocabulary (Case 2).

To handle Case 1, modifying the add_tokens method would also require changes to resize_token_embeddings. From a simplicity perspective, deviating from the standard implementation of these methods complicates things unnecessarily.

Since Case 1 is more common, we should prioritize simplifying it for the user. Hardcoding a default number of timestamp tokens achieves this, and the number_timestamp_tokens config option easily accommodates Case 2 for the few users who need it.

As for the tokenizer, the regex pattern currently used to detect timestamps prevents the use of custom timestamp tokens (that is, tokens with a custom syntax that does not match the pattern). However, since this is a rare use case and the current implementation already depends on regex, it's best to keep things simple and do it this way.
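To illustrate the limitation (the pattern shown mirrors the one in tokenization_whisper.py):

import re

timestamp_pat = re.compile(r"<\|(\d+\.\d+)\|>")
print(bool(timestamp_pat.match("<|12.34|>")))   # True: standard syntax is detected
print(bool(timestamp_pat.match("<ts_12.34>")))  # False: custom syntax would be missed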

@eustlb eustlb changed the title [Whisper] Fix generate and tokenizer behavior with added tokens [WIP] [Whisper] Fix generate and tokenizer behavior with added tokens Oct 13, 2024
@eustlb eustlb mentioned this pull request Oct 14, 2024
Successfully merging this pull request may close these issues:

Whisper generate return a slice of result if result have more than one added token