-
Notifications
You must be signed in to change notification settings - Fork 27.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix skip_special_tokens
for Wav2Vec2CTCTokenizer._decode
#29311
Fix skip_special_tokens
for Wav2Vec2CTCTokenizer._decode
#29311
Conversation
skip_special_tokens
for Wav2Vec2CTCTokenizer._decode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Do you mind adding a small test to make sure all special tokens are now skipped?
I'm unfamiliar with this kind of PR (to big repo). Where and How should I add a small test? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a test here:
def test_tokenizer_decode_added_tokens(self): |
but it seems to not properly test so let's update it 😉
5b0057e
to
faa316a
Compare
I updated the test!
Finally, I applied this update to both Wav2Vec2Tokenizer and Wav2Vec2CTCTokenizer .
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM, we need to test a bit more, and I'll ask @sanchit-gandhi his expertise on this model and a second look 👀
|
||
self.assertEqual(batch_tokens, ["HELLO<unk>!?!?$$$", "BYE BYE<unk>$$$"]) | ||
self.assertEqual(batch_tokens_2, ["HELO!?!?", "BYE BYE"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a new token, like "<new_tokens>"
and test that we can encode decode as we expect for both fast and slow tokenizers! 🤗
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a fast Wav2Vec2CTCTokenizer
? I can't find it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry there is no fast tokenizer here! my bad. Let's just use other tokens than <unk>
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late. I updated it.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect thanks!
* Fix skip_special_tokens process for Wav2Vec2CTCTokenizer._decode * Fix skip_special_tokens for Wav2Vec2CTCTokenizer._decode * Exclude pad_token filtering since it is used as CTC-blank token * Add small test for skip_special_tokens * Update decoding test for added new token
* Fix skip_special_tokens process for Wav2Vec2CTCTokenizer._decode * Fix skip_special_tokens for Wav2Vec2CTCTokenizer._decode * Exclude pad_token filtering since it is used as CTC-blank token * Add small test for skip_special_tokens * Update decoding test for added new token
What does this PR do?
Fixes #29243
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker