
[Whisper Tokenizer] Make decoding faster after adding timestamps #26299

Merged 1 commit into huggingface:main on Sep 28, 2023

Conversation

@sanchit-gandhi (Contributor) commented on Sep 20, 2023

What does this PR do?

Following the update to the Whisper tokenizer to handle encoding/decoding timestamps (#26054), there is one line in the decoding that is extremely slow:

```python
token_ids = [token for token in token_ids if token not in timestamp_ids]
```

Here we do an O(N × M) operation to filter out all the timestamp tokens, where N is the length of the token ids and M is the number of timestamp tokens: for each token, we check whether it's in the timestamp token list.

In practice, this makes decoding extremely slow on typical validation sets: e.g. decoding LibriSpeech test-clean took ~30 mins on a TPU v3 (which has plenty of CPU power to run this operation).

This PR switches the timestamp filtering to a regex string operation, which in a toy benchmark was a factor of > 2000 faster. Would love to hear from @ArthurZucker whether we're happy to sacrifice a bit of readability for this speed-up!
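For context, here is a minimal sketch of the regex idea: decode the ids to text first, then strip the rendered timestamp markers in a single pass. The pattern and helper name are illustrative, not the exact PR diff.

```python
import re

# Whisper renders timestamp tokens as strings like "<|0.00|>", "<|0.02|>",
# ..., "<|30.00|>". The idea is to decode to text first and then strip the
# markers in one regex pass, instead of filtering token ids one by one.
# (Pattern and helper name are illustrative, not the exact PR diff.)
TIMESTAMP_PAT = re.compile(r"<\|\d+\.\d+\|>")

def filter_timestamps(text: str) -> str:
    # A single scan over the decoded string replaces the O(N * M)
    # per-token membership checks.
    return TIMESTAMP_PAT.sub("", text)

print(filter_timestamps("<|0.00|> Hello world.<|5.42|>"))  # " Hello world."
```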

@HuggingFaceDocBuilderDev commented on Sep 20, 2023

The documentation is not available anymore as the PR was closed or merged.

@ArthurZucker (Collaborator) left a comment

Is this faster than storing all the timestamp tokens as a property and using it to filter the tokens?

@sanchit-gandhi (Contributor, Author) replied:

Yep, it was significantly faster once scaled to large datasets and long sequence lengths (>5k samples with 256 sequence length)
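For reference, a toy sketch of the comparison (synthetic token ids, illustrative id range, and a hypothetical decoded string; not the PR's own benchmark): even with the timestamp ids precomputed as a set, per-token filtering still runs a Python-level loop over every id, whereas the regex makes a single pass over the already-decoded string.

```python
import re
import timeit

# Toy comparison of the alternatives discussed above (synthetic ids,
# illustrative id range -- not the PR's benchmark).
timestamp_ids = list(range(50365, 51866))   # ~1500 timestamp ids as a list
timestamp_set = set(timestamp_ids)          # same ids, precomputed as a set
token_ids = list(range(50000, 51000)) * 20  # 20k ids, many of them timestamps

def filter_list():
    # Original behaviour: O(N * M), every id checked against a list.
    return [t for t in token_ids if t not in timestamp_ids]

def filter_set():
    # Reviewer's suggestion: O(N) membership checks against a stored set,
    # but still a Python-level loop over every id.
    return [t for t in token_ids if t not in timestamp_set]

decoded = "<|0.00|> hello world <|4.20|>" * 2000
pat = re.compile(r"<\|\d+\.\d+\|>")

def filter_regex():
    # This PR's approach: one pass over the already-decoded string,
    # with the scan running inside the regex engine.
    return pat.sub("", decoded)

for fn in (filter_list, filter_set, filter_regex):
    print(f"{fn.__name__}: {timeit.timeit(fn, number=3):.4f}s")
```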

@ArthurZucker (Collaborator) left a comment

I see! Thanks for fixing this then 😉

@sanchit-gandhi merged commit 211f93a into huggingface:main on Sep 28, 2023
@sanchit-gandhi deleted the whisper-tok-decode branch on Sep 28, 2023