
[Whisper Tokenizer] Make decoding faster after adding timestamps #26299

Merged 1 commit into huggingface:main on Sep 28, 2023

Conversation

@sanchit-gandhi (Contributor) commented on Sep 20, 2023

What does this PR do?

Following the update to the Whisper tokenizer to handle encoding/decoding timestamps (#26054), there is one line in the decoding that is extremely slow:

```python
token_ids = [token for token in token_ids if token not in timestamp_ids]
```

Here we do an O(N × M) operation to filter out all the timestamp tokens, where N is the length of the token ids and M is the number of timestamp tokens: for each token, we check whether it's in the timestamp token list.

In practice, this makes decoding extremely slow on typical validation sets: e.g. decoding LibriSpeech test-clean took ~30 mins on a TPU v3 (which has plenty of CPU power to run this operation).

This PR switches the timestamp filtering to a regex string operation, which in a toy benchmark was a factor of > 2000 faster. Would love to hear from @ArthurZucker whether we're happy to sacrifice a bit of readability for this speed-up!
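For context, here is a minimal sketch of the regex idea: decode the ids to text first, then strip the rendered timestamp markers in a single pass. The pattern and helper name are illustrative, not the exact PR diff.

```python
import re

# Whisper renders timestamp tokens as strings like "<|0.00|>", "<|0.02|>",
# ..., "<|30.00|>". The idea is to decode to text first and then strip the
# markers in one regex pass, instead of filtering token ids one by one.
# (Pattern and helper name are illustrative, not the exact PR diff.)
TIMESTAMP_PAT = re.compile(r"<\|\d+\.\d+\|>")

def filter_timestamps(text: str) -> str:
    # A single scan over the decoded string replaces the O(N * M)
    # per-token membership checks.
    return TIMESTAMP_PAT.sub("", text)

print(filter_timestamps("<|0.00|> Hello world.<|5.42|>"))  # " Hello world."
```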

@HuggingFaceDocBuilderDev commented on Sep 20, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker (Collaborator) left a comment

Is this faster than storing all the timestamp tokens as a property and using it to filter the tokens?

@sanchit-gandhi (Contributor, Author) replied:

Yep, it was significantly faster once scaled to large datasets and long sequence lengths (>5k samples with 256 sequence length)
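For reference, a toy sketch of the comparison (synthetic token ids, illustrative id range, and a hypothetical decoded string; not the PR's own benchmark): even with the timestamp ids precomputed as a set, per-token filtering still runs a Python-level loop over every id, whereas the regex makes a single pass over the already-decoded string.

```python
import re
import timeit

# Toy comparison of the alternatives discussed above (synthetic ids,
# illustrative id range -- not the PR's benchmark).
timestamp_ids = list(range(50365, 51866))   # ~1500 timestamp ids as a list
timestamp_set = set(timestamp_ids)          # same ids, precomputed as a set
token_ids = list(range(50000, 51000)) * 20  # 20k ids, many of them timestamps

def filter_list():
    # Original behaviour: O(N * M), every id checked against a list.
    return [t for t in token_ids if t not in timestamp_ids]

def filter_set():
    # Reviewer's suggestion: O(N) membership checks against a stored set,
    # but still a Python-level loop over every id.
    return [t for t in token_ids if t not in timestamp_set]

decoded = "<|0.00|> hello world <|4.20|>" * 2000
pat = re.compile(r"<\|\d+\.\d+\|>")

def filter_regex():
    # This PR's approach: one pass over the already-decoded string,
    # with the scan running inside the regex engine.
    return pat.sub("", decoded)

for fn in (filter_list, filter_set, filter_regex):
    print(f"{fn.__name__}: {timeit.timeit(fn, number=3):.4f}s")
```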

@ArthurZucker (Collaborator) left a comment

I see! Thanks for fixing this then 😉

@sanchit-gandhi merged commit 211f93a into huggingface:main on Sep 28, 2023
@sanchit-gandhi deleted the whisper-tok-decode branch on Sep 28, 2023