
Tokenizer use_fast=True encode has fatal bug #29483

Closed

sydney0zq opened this issue Mar 6, 2024 · 4 comments

sydney0zq commented Mar 6, 2024

System Info

Transformers: 4.37
PyTorch: 2.1.1

  • My own modified scripts
  • My own task or dataset (give details below)

Reproduction

For the Yi-34B model, after SFT we save the model without tokenizer.json but with all of the other tokenizer files. Running tokenizer.convert_ids_to_tokens([64001]) (with use_fast left at its default value of True) then returns <|endoftext|>. However, the vocabulary only contains 64000 entries, so passing this id to the model triggers a CUDA device-side assertion error.

This bug is easy to reproduce; a minimal sketch is given below. Please fix it.
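
A minimal sketch of the reproduction (model_path is a placeholder for the SFT checkpoint directory saved without tokenizer.json):

from transformers import AutoTokenizer

model_path = "path/to/sft-checkpoint"  # hypothetical path; the directory has no tokenizer.json

# use_fast defaults to True, so the fast tokenizer is rebuilt from the remaining tokenizer files
tokenizer = AutoTokenizer.from_pretrained(model_path)

print(tokenizer.vocab_size)                      # 64000
print(tokenizer.convert_ids_to_tokens([64001]))  # ['<|endoftext|>'] -- an id beyond vocab_size
# Feeding id 64001 into the model's embedding layer is what triggers the CUDA device-side assertion.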

Expected behavior

convert_ids_to_tokens should behave consistently whether use_fast is True or False, and should not return a token for an id outside the vocabulary. Please fix this as soon as possible.

@sydney0zq sydney0zq changed the title Tokenizer use_fast=False encode error Tokenizer use_fast=False encode has fatal bug Mar 6, 2024
@sydney0zq sydney0zq changed the title Tokenizer use_fast=False encode has fatal bug Tokenizer use_fast=True encode has fatal bug Mar 6, 2024
@ArthurZucker
Collaborator

Hey! The token was probably added 🤗

vocab has only 64000

What happens when you print the tokenizer?
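
For example (a hypothetical check, reusing the model_path placeholder from above), the added tokens can be inspected directly:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)  # fast tokenizer by default

print(tokenizer)                    # the repr lists any added special tokens
print(tokenizer.vocab_size)         # size of the base vocabulary (64000 here)
print(len(tokenizer))               # base vocabulary plus added tokens
print(tokenizer.get_added_vocab())  # mapping of added tokens to ids, e.g. {'<|endoftext|>': 64001}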

@sydney0zq
Author

sydney0zq commented Mar 12, 2024

tokenizer.convert_ids_to_tokens([64001])

@ArthurZucker

Here is the output:

>>> tokenizer.convert_ids_to_tokens([64001])
['<|endoftext|>']
>>> tokenizer.convert_ids_to_tokens([64001])
['<|endoftext|>']
>>> tokenizer.convert_ids_to_tokens([64002])
[None]
>>> tokenizer.convert_ids_to_tokens([1])
['<s>']
>>> tokenizer.convert_ids_to_tokens([0])
['<unk>']
>>> tokenizer.vocab_size
64000

The results seem harmless; however, with use_fast=False the same call raises an error:

>>> tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
>>> tokenizer.convert_ids_to_tokens([64001])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tiger/anaconda3/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 982, in convert_ids_to_tokens
    tokens.append(self._convert_id_to_token(index))
  File "/home/tiger/anaconda3/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 280, in _convert_id_to_token
    token = self.sp_model.IdToPiece(index)
  File "/home/tiger/anaconda3/lib/python3.9/site-packages/sentencepiece/__init__.py", line 1179, in _batched_func
    return _func(self, arg)
  File "/home/tiger/anaconda3/lib/python3.9/site-packages/sentencepiece/__init__.py", line 1172, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.
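
As a hedged workaround sketch (not the fix referenced later in this thread): the CUDA assertion in the original report comes from the checkpoint's embedding matrix being smaller than the range of ids the fast tokenizer can emit, so resizing the embeddings to cover len(tokenizer) avoids the device-side assert:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_path)  # model_path: the placeholder used above
tokenizer = AutoTokenizer.from_pretrained(model_path)     # fast tokenizer by default

print(model.get_input_embeddings().num_embeddings)  # 64000 for this checkpoint
print(len(tokenizer))                                # larger than 64000 once tokens were added

# Make the embedding table cover every id the tokenizer can produce.
if len(tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))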

@ArthurZucker
Collaborator

Alright I think yi is a special case, #29797 will be fixing this!

@sydney0zq
Author

Alright I think yi is a special case, #29797 will be fixing this!

Thanks, closing the issue~
