wav2vec2: adding single-char tokens to tokenizer causes tokenization mistakes #10622

Closed
elgeish opened this issue Mar 10, 2021 · 5 comments · Fixed by #11538

Comments

@elgeish
Contributor

elgeish commented Mar 10, 2021

Environment info

  • transformers version: 4.4.0.dev0
  • Platform: Linux-5.8.0-44-generic-x86_64-with-glibc2.10
  • Python version: 3.8.8
  • PyTorch version (GPU?): 1.8.0 (True)
  • Tensorflow version (GPU?): 2.4.1 (False)
  • Using GPU in script?: N/A
  • Using distributed or parallel set-up in script?: N/A

Who can help

@patrickvonplaten and @LysandreJik

The issue is probably related to the interaction between the following lines:

self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(tokens_to_add)))

self._tokenize(token) if token not in self.unique_added_tokens_encoder else [token]

return list(text.replace(" ", self.word_delimiter_token))

This is a corner case: add_tokens adds new tokens to self.unique_no_split_tokens, causing tokenize() to skip calling Wav2Vec2CTCTokenizer._tokenize()

This is probably not an issue for most tokenizers, since their vocab already includes most, if not all, commonly used single-character tokens without adding them to self.unique_no_split_tokens. I ran into this while debugging my code for #10581, which adds support for Buckwalter Arabic transliteration. The issue is not limited to adding single-char tokens; it shows up whenever a (space-separated) word starts or ends with a newly added token.
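
For illustration, here is a minimal sketch of the failing path. It is not the library's actual code: split_on_no_split_token and ctc_tokenize are hypothetical stand-ins for the splitting logic in tokenization_utils.py and for Wav2Vec2CTCTokenizer._tokenize(), assuming that whitespace adjacent to a no-split token is stripped during the split:

def split_on_no_split_token(text, token):
    # stand-in for the splitting done around `unique_no_split_tokens`:
    # whitespace next to the no-split token is stripped, so it never
    # reaches _tokenize() and the word delimiter is lost
    pieces = []
    for i, part in enumerate(text.split(token)):
        if i > 0:
            pieces.append(token)
        pieces.append(part.strip())
    return [p for p in pieces if p]

def ctc_tokenize(fragment, word_delimiter_token="|"):
    # stand-in for Wav2Vec2CTCTokenizer._tokenize()
    return list(fragment.replace(" ", word_delimiter_token))

tokens = []
for piece in split_on_no_split_token("C x A", "x"):
    tokens += [piece] if piece == "x" else ctc_tokenize(piece)
print(tokens)  # ['C', 'x', 'A'] -- the '|' around 'x' is gone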

Information

Model I am using (Bert, XLNet ...): wav2vec2

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: adding tokens to ASR vocab

The tasks I am working on are:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: training an ASR with extended vocab

To reproduce

Steps to reproduce the behavior:

from transformers import Wav2Vec2Processor

tokenizer = Wav2Vec2Processor.from_pretrained('facebook/wav2vec2-base').tokenizer
tokenizer.add_tokens('x')
token_ids = tokenizer('C x A').input_ids
decoded = tokenizer.decode(token_ids)
print(decoded, token_ids)
# prints: CxA [19, 32, 7] -- the spaces around 'x' are lost

Expected behavior

It should have printed C x A [19, 4, 32, 4, 7]

@elgeish
Contributor Author

elgeish commented Mar 10, 2021

My workaround right now is to keep a reference to the original tokenizer.unique_no_split_tokens before adding tokens, then restore it afterwards:

from transformers import Wav2Vec2Processor

tokenizer = Wav2Vec2Processor.from_pretrained('facebook/wav2vec2-base').tokenizer
unique_no_split_tokens = tokenizer.unique_no_split_tokens
tokenizer.add_tokens('x')
tokenizer.unique_no_split_tokens = unique_no_split_tokens
token_ids = tokenizer('C x A').input_ids
decoded = tokenizer.decode(token_ids)
print(decoded, token_ids)
# C x A [19, 4, 32, 4, 7]

@patrickvonplaten
Contributor

patrickvonplaten commented Mar 29, 2021

Hey @elgeish,

Sorry for the late reply! Yes, you are absolutely right here :-)
I think we should override the _add_tokens(self, ...) function in
src/transformers/models/wav2vec2/tokenization_wav2vec2.py with the "hack", just as you did:

def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
    # reuse the parent implementation from `src/transformers/tokenization_utils.py` (here via super())
    # + add the "hack": keep a reference to the original no-split tokens ...
    unique_no_split_tokens = self.unique_no_split_tokens
    num_added_tokens = super()._add_tokens(new_tokens, special_tokens=special_tokens)
    # ... and restore it afterwards, so newly added single-char tokens still go through `_tokenize()`
    self.unique_no_split_tokens = unique_no_split_tokens
    return num_added_tokens

If you want and have some time, it would be amazing if you could open a PR :-)

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Muktan
Contributor

Muktan commented Apr 28, 2021

Hey @patrickvonplaten,
Shall I open a PR for this issue?

@patrickvonplaten
Contributor

Hey @Muktan,

Yes, this would be great :-)
