wav2vec2: adding single-char tokens to tokenizer causes tokenization mistakes #10622
Comments
My workaround right now is to keep a reference to the original `unique_no_split_tokens` and restore it after calling `add_tokens`:
from transformers import Wav2Vec2Processor
tokenizer = Wav2Vec2Processor.from_pretrained('facebook/wav2vec2-base').tokenizer
unique_no_split_tokens = tokenizer.unique_no_split_tokens
tokenizer.add_tokens('x')
tokenizer.unique_no_split_tokens = unique_no_split_tokens
token_ids = tokenizer('C x A').input_ids
decoded = tokenizer.decode(token_ids)
print(decoded, token_ids)
# C x A [19, 4, 32, 4, 7]
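The workaround above can be wrapped in a small helper so the save-and-restore step is not forgotten. This is only a sketch: `add_tokens_keep_no_split` is a hypothetical name, not part of transformers, and it assumes the tokenizer exposes `add_tokens()` and a `unique_no_split_tokens` attribute as in the snippet above.

```python
def add_tokens_keep_no_split(tokenizer, new_tokens):
    """Hypothetical helper: add tokens without growing unique_no_split_tokens.

    Assumes `tokenizer` has an `add_tokens()` method and a
    `unique_no_split_tokens` attribute, as in the workaround above.
    """
    # Take a copy so the saved list survives even if add_tokens()
    # were to mutate the original list in place.
    saved = list(tokenizer.unique_no_split_tokens)
    num_added = tokenizer.add_tokens(new_tokens)
    # Restore the original no-split list so text containing the new
    # tokens still goes through the tokenizer's own _tokenize().
    tokenizer.unique_no_split_tokens = saved
    return num_added
```

With the tokenizer from the workaround, the call would be `add_tokens_keep_no_split(tokenizer, 'x')` in place of the three middle lines.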
Hey @elgeish, Sorry for replying so late! Yes, you are absolutely right here :-)
def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
    # copy-paste the function from `src/transformers/tokenization_utils.py`
    # + add the "hack":
    unique_no_split_tokens = self.unique_no_split_tokens
    self.unique_no_split_tokens = unique_no_split_tokens
If you want and have some time, it would be amazing if you could open a PR :-)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey @patrickvonplaten,
Hey @Muktan, yes this would be great :-)
Environment info
transformers version: 4.4.0.dev0
Who can help
@patrickvonplaten and @LysandreJik
The issue is probably related to interactions among the following:
transformers/src/transformers/tokenization_utils.py, line 213 in 9a8c168
transformers/src/transformers/tokenization_utils.py, line 352 in 11fdde0
transformers/src/transformers/models/wav2vec2/tokenization_wav2vec2.py, line 184 in cb38ffc
This is a corner case: `add_tokens` adds new tokens to `self.unique_no_split_tokens`, causing `tokenize()` to skip calling `Wav2Vec2CTCTokenizer._tokenize()`.
This is probably not the case with most tokenizers, since their vocab includes most, if not all, commonly used single-character tokens without including them in `self.unique_no_split_tokens`. I faced this while debugging my code for #10581 to add support for Buckwalter Arabic transliteration. The issue is not limited to adding single-char tokens; it occurs whenever a space-separated word starts or ends with a newly added token.
Information
Model I am using (Bert, XLNet ...): wav2vec2
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
Expected behavior
Should have printed
C x A [19, 4, 32, 4, 7]
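The corner case described in this issue can be illustrated with a toy model. This is a simplified sketch, not the library's code: `toy_tokenize` and `char_tokenize` are hypothetical names, and the whitespace stripping around no-split tokens is an assumption about how the base `tokenize()` behaves. The point is that text next to a no-split token never reaches the character-level step that turns spaces into the word delimiter.

```python
WORD_DELIMITER = "|"  # wav2vec2-style word delimiter token


def char_tokenize(text):
    """Toy stand-in for a character-level _tokenize(): spaces become '|'."""
    return list(text.replace(" ", WORD_DELIMITER))


def toy_tokenize(text, no_split_tokens):
    """Toy stand-in for the base tokenize(): cut the text on no-split tokens,
    strip whitespace around them (assumed behavior), then char-tokenize
    whatever is left."""
    pieces = [text]
    for tok in no_split_tokens:
        next_pieces = []
        for piece in pieces:
            if piece in no_split_tokens:
                next_pieces.append(piece)  # already isolated, keep as-is
                continue
            parts = piece.split(tok)
            for i, part in enumerate(parts):
                part = part.strip()  # the neighbouring spaces are lost here
                if part:
                    next_pieces.append(part)
                if i < len(parts) - 1:
                    next_pieces.append(tok)
        pieces = next_pieces

    tokens = []
    for piece in pieces:
        if piece in no_split_tokens:
            tokens.append(piece)  # bypasses char_tokenize entirely
        else:
            tokens.extend(char_tokenize(piece))
    return tokens


print(toy_tokenize("C x A", []))     # ['C', '|', 'x', '|', 'A']
print(toy_tokenize("C x A", ["x"]))  # ['C', 'x', 'A']
```

In the second call the two word delimiters disappear, which in this toy model corresponds to losing the two `4` ids from the expected `[19, 4, 32, 4, 7]`.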