Fix `skip_special_tokens` for `Wav2Vec2CTCTokenizer._decode` #29311

msublee · 2024-02-27T06:22:32Z

What does this PR do?

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@ArthurZucker

ArthurZucker

Thanks! Do you mind adding a small test to make sure all special tokens are now skipped?

msublee · 2024-03-01T05:47:05Z

I'm unfamiliar with making this kind of PR to a big repo. Where and how should I add a small test?

ArthurZucker

There is a test here:

transformers/tests/models/wav2vec2/test_tokenization_wav2vec2.py

Line 453 in d03de1c

def test_tokenizer_decode_added_tokens(self):

but it doesn't seem to test this properly, so let's update it 😉

msublee · 2024-03-06T07:34:48Z

I updated the test!
Also, I excluded `pad_token` from the filtering in `_decode`, since `pad_token` is used as the CTC blank token and is already filtered out in `convert_tokens_to_string`:

transformers/src/transformers/models/wav2vec2/tokenization_wav2vec2.py

Line 321 in b27aa20

processed_chars = list(filter(lambda char: char != self.pad_token, chars))

Finally, I applied this update to both Wav2Vec2Tokenizer and Wav2Vec2CTCTokenizer.
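To make the reasoning about the blank token concrete, here is a minimal standalone sketch. It is not the library code: `toy_ctc_decode`, `PAD`, and `SPECIAL_TOKENS` are hypothetical names invented for illustration, and the real tokenizer may order its steps differently. It only shows why the blank should survive the special-token skip and be dropped after repeated characters are collapsed.

```python
from itertools import groupby

PAD = "<pad>"  # assumed to double as the CTC blank, as described above
SPECIAL_TOKENS = {"<s>", "</s>", "<unk>", PAD}  # hypothetical special-token set


def toy_ctc_decode(tokens, skip_special_tokens=False):
    """Toy CTC decoding: skip specials (except the blank), collapse repeats, drop the blank."""
    kept = []
    for token in tokens:
        # The blank is deliberately NOT skipped here; it is still needed
        # to separate genuinely repeated characters in the next step.
        if skip_special_tokens and token != PAD and token in SPECIAL_TOKENS:
            continue
        kept.append(token)

    # Collapse consecutive duplicates, the usual CTC merge step.
    collapsed = [token for token, _ in groupby(kept)]

    # Only now remove the blank, similar in spirit to the filter quoted above.
    chars = [token for token in collapsed if token != PAD]
    return "".join(chars)


frames = ["C", "C", PAD, "A", "A", "T", "<s>"]
print(toy_ctc_decode(frames))                            # CAT<s>
print(toy_ctc_decode(frames, skip_special_tokens=True))  # CAT
print(toy_ctc_decode(["L", "L", PAD, "L"]))              # LL
```

In this sketch the last call keeps both `L` characters because the blank still sits between them when repeats are collapsed.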

ArthurZucker

Overall LGTM. We need to test a bit more, and I'll ask @sanchit-gandhi for his expertise on this model and a second look 👀

ArthurZucker · 2024-03-07T05:41:45Z

tests/models/wav2vec2/test_tokenization_wav2vec2.py


        self.assertEqual(batch_tokens, ["HELLO<unk>!?!?$$$", "BYE BYE<unk>$$$"])
+        self.assertEqual(batch_tokens_2, ["HELO!?!?", "BYE BYE"])


Let's add a new token, like "<new_tokens>", and test that we can encode and decode as we expect for both fast and slow tokenizers! 🤗

msublee · 2024-03-07

Is there a fast Wav2Vec2CTCTokenizer? I can't find it!

ArthurZucker · 2024-03-25

Sorry, there is no fast tokenizer here, my bad! Let's just use tokens other than <unk>!

msublee · 2024-04-02

Sorry for the delay. I've updated it.
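The kind of check being asked for can be sketched as follows. This assumes that a regular added token such as "<new_tokens>" should survive decoding while a token registered as special (here "$$$") is dropped when `skip_special_tokens=True`. `ToyTokenizer` is a hypothetical stand-in written for this illustration; its method signatures are simplified and do not match the `Wav2Vec2CTCTokenizer` API.

```python
class ToyTokenizer:
    """Hypothetical stand-in tokenizer; not the Wav2Vec2CTCTokenizer API."""

    def __init__(self, vocab):
        self.id_to_token = dict(enumerate(vocab))
        self.special_tokens = set()

    def add_tokens(self, tokens):
        # Regular added tokens get new ids but are not marked as special.
        for token in tokens:
            self.id_to_token[len(self.id_to_token)] = token

    def add_special_tokens(self, tokens):
        # Special tokens are added like any other token, then flagged.
        self.add_tokens(tokens)
        self.special_tokens.update(tokens)

    def decode(self, ids, skip_special_tokens=False):
        tokens = [self.id_to_token[i] for i in ids]
        if skip_special_tokens:
            tokens = [t for t in tokens if t not in self.special_tokens]
        return "".join(tokens)


tokenizer = ToyTokenizer(["B", "Y", "E"])
tokenizer.add_tokens(["<new_tokens>"])  # receives id 3
tokenizer.add_special_tokens(["$$$"])   # receives id 4
ids = [0, 1, 2, 3, 4]

print(tokenizer.decode(ids))                            # BYE<new_tokens>$$$
print(tokenizer.decode(ids, skip_special_tokens=True))  # BYE<new_tokens>
```

Checking both a regular added token and a special one in the same test makes a regression in either direction visible: skipping too much would drop "<new_tokens>", and skipping too little would leave "$$$" in the output.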

HuggingFaceDocBuilderDev · 2024-03-30T15:39:09Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker

Perfect thanks!

* Fix skip_special_tokens process for Wav2Vec2CTCTokenizer._decode
* Fix skip_special_tokens for Wav2Vec2CTCTokenizer._decode
* Exclude pad_token filtering since it is used as CTC-blank token
* Add small test for skip_special_tokens
* Update decoding test for added new token

msublee changed the title ~~Fix skip_special_tokens process for Wav2Vec2CTCTokenizer._decode~~ Fix skip_special_tokens for Wav2Vec2CTCTokenizer._decode Feb 28, 2024

ArthurZucker reviewed Feb 29, 2024

ArthurZucker reviewed Mar 4, 2024

Fix skip_special_tokens process for Wav2Vec2CTCTokenizer._decode

faa316a

msublee force-pushed the fix_wav2vec2ctctokenizer_skip_special_tokens branch from 5b0057e to faa316a on March 6, 2024 06:20

msublee added 3 commits March 6, 2024 15:42

Fix skip_special_tokens for Wav2Vec2CTCTokenizer._decode

796b731

Exclude pad_token filtering since it is used as CTC-blank token

ea24cda

Add small test for skip_special_tokens

204c29b

ArthurZucker reviewed Mar 7, 2024

ArthurZucker requested a review from sanchit-gandhi March 7, 2024 05:42

ArthurZucker removed the request for review from sanchit-gandhi March 25, 2024 09:54

Update decoding test for added new token

d01e421

ArthurZucker approved these changes Apr 2, 2024

ArthurZucker merged commit 15cd687 into huggingface:main Apr 2, 2024
18 checks passed

msublee deleted the fix_wav2vec2ctctokenizer_skip_special_tokens branch April 3, 2024 01:12
