wav2vec2: adding single-char tokens to tokenizer causes tokenization mistakes #10622

Closed
elgeish opened this issue Mar 10, 2021 · 5 comments · Fixed by #11538

Comments

@elgeish
Contributor

elgeish commented Mar 10, 2021

Environment info

  • transformers version: 4.4.0.dev0
  • Platform: Linux-5.8.0-44-generic-x86_64-with-glibc2.10
  • Python version: 3.8.8
  • PyTorch version (GPU?): 1.8.0 (True)
  • Tensorflow version (GPU?): 2.4.1 (False)
  • Using GPU in script?: N/A
  • Using distributed or parallel set-up in script?: N/A

Who can help

@patrickvonplaten and @LysandreJik

The issue is probably related to the interaction between the following lines:

self.unique_no_split_tokens = sorted(set(self.unique_no_split_tokens).union(set(tokens_to_add)))

self._tokenize(token) if token not in self.unique_added_tokens_encoder else [token]

return list(text.replace(" ", self.word_delimiter_token))

This is a corner case: add_tokens adds new tokens to self.unique_no_split_tokens, causing tokenize() to skip calling Wav2Vec2CTCTokenizer._tokenize()

This is probably not an issue for most tokenizers, since their vocab already includes most, if not all, commonly used single-character tokens without adding them to self.unique_no_split_tokens. I ran into this while debugging my code for #10581, which adds support for Buckwalter Arabic transliteration. The issue is not limited to adding single-char tokens; it shows up whenever a (space-separated) word starts or ends with a newly added token.
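
For illustration, here is a minimal sketch of the failing path. It is not the library's actual code: split_on_no_split_token and ctc_tokenize are hypothetical stand-ins for the splitting logic in tokenization_utils.py and for Wav2Vec2CTCTokenizer._tokenize(), assuming that whitespace adjacent to a no-split token is stripped during the split:

def split_on_no_split_token(text, token):
    # stand-in for the splitting done around `unique_no_split_tokens`:
    # whitespace next to the no-split token is stripped, so it never
    # reaches _tokenize() and the word delimiter is lost
    pieces = []
    for i, part in enumerate(text.split(token)):
        if i > 0:
            pieces.append(token)
        pieces.append(part.strip())
    return [p for p in pieces if p]

def ctc_tokenize(fragment, word_delimiter_token="|"):
    # stand-in for Wav2Vec2CTCTokenizer._tokenize()
    return list(fragment.replace(" ", word_delimiter_token))

tokens = []
for piece in split_on_no_split_token("C x A", "x"):
    tokens += [piece] if piece == "x" else ctc_tokenize(piece)
print(tokens)  # ['C', 'x', 'A'] -- the '|' around 'x' is gone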

Information

Model I am using (Bert, XLNet ...): wav2vec2

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: adding tokens to ASR vocab

The tasks I am working on are:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: training an ASR with extended vocab

To reproduce

Steps to reproduce the behavior:

from transformers import Wav2Vec2Processor

tokenizer = Wav2Vec2Processor.from_pretrained('facebook/wav2vec2-base').tokenizer
tokenizer.add_tokens('x')
token_ids = tokenizer('C x A').input_ids
decoded = tokenizer.decode(token_ids)
print(decoded, token_ids)
# prints: CxA [19, 32, 7] -- the spaces around 'x' are lost

Expected behavior

It should have printed C x A [19, 4, 32, 4, 7]

@elgeish
Contributor Author

elgeish commented Mar 10, 2021

My workaround right now is to keep a reference to the original tokenizer.unique_no_split_tokens before adding tokens, then restore it afterwards:

from transformers import Wav2Vec2Processor

tokenizer = Wav2Vec2Processor.from_pretrained('facebook/wav2vec2-base').tokenizer
unique_no_split_tokens = tokenizer.unique_no_split_tokens
tokenizer.add_tokens('x')
tokenizer.unique_no_split_tokens = unique_no_split_tokens
token_ids = tokenizer('C x A').input_ids
decoded = tokenizer.decode(token_ids)
print(decoded, token_ids)
# C x A [19, 4, 32, 4, 7]

@patrickvonplaten
Contributor

patrickvonplaten commented Mar 29, 2021

Hey @elgeish,

Sorry for the late reply! Yes, you are absolutely right here :-)
I think we should override the _add_tokens(self, ...) function in
src/transformers/models/wav2vec2/tokenization_wav2vec2.py with the "hack", just as you did:

def _add_tokens(self, new_tokens: Union[List[str], List[AddedToken]], special_tokens: bool = False) -> int:
    # reuse the parent implementation from `src/transformers/tokenization_utils.py` (here via super())
    # + add the "hack": keep a reference to the original no-split tokens ...
    unique_no_split_tokens = self.unique_no_split_tokens
    num_added_tokens = super()._add_tokens(new_tokens, special_tokens=special_tokens)
    # ... and restore it afterwards, so newly added single-char tokens still go through `_tokenize()`
    self.unique_no_split_tokens = unique_no_split_tokens
    return num_added_tokens

If you want and have some time, it would be amazing if you could open a PR :-)

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@Muktan
Contributor

Muktan commented Apr 28, 2021

Hey @patrickvonplaten,
Shall I open a PR for this issue?

@patrickvonplaten
Contributor

Hey @Muktan,

Yes, this would be great :-)
