
Tokenizer use_fast=True encode has fatal bug #29483

Closed

sydney0zq opened this issue Mar 6, 2024 · 4 comments

sydney0zq commented Mar 6, 2024

System Info

Transformers: 4.37
PyTorch: 2.1.1

  • My own modified scripts
  • My own task or dataset (give details below)

Reproduction

For the Yi-34B model, after SFT we save the model without tokenizer.json but with all of the other tokenizer files. Running tokenizer.convert_ids_to_tokens([64001]) (with use_fast left at its default value of True) then returns <|endoftext|>. However, the vocabulary only contains 64000 entries, so passing this id to the model triggers a CUDA device-side assertion error.

This bug is easy to reproduce; a minimal sketch is given below. Please fix it.
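
A minimal sketch of the reproduction (model_path is a placeholder for the SFT checkpoint directory saved without tokenizer.json):

from transformers import AutoTokenizer

model_path = "path/to/sft-checkpoint"  # hypothetical path; the directory has no tokenizer.json

# use_fast defaults to True, so the fast tokenizer is rebuilt from the remaining tokenizer files
tokenizer = AutoTokenizer.from_pretrained(model_path)

print(tokenizer.vocab_size)                      # 64000
print(tokenizer.convert_ids_to_tokens([64001]))  # ['<|endoftext|>'] -- an id beyond vocab_size
# Feeding id 64001 into the model's embedding layer is what triggers the CUDA device-side assertion.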

Expected behavior

convert_ids_to_tokens should behave consistently whether use_fast is True or False, and should not return a token for an id outside the vocabulary. Please fix this as soon as possible.

@sydney0zq sydney0zq changed the title Tokenizer use_fast=False encode error Tokenizer use_fast=False encode has fatal bug Mar 6, 2024
@sydney0zq sydney0zq changed the title Tokenizer use_fast=False encode has fatal bug Tokenizer use_fast=True encode has fatal bug Mar 6, 2024
@ArthurZucker
Collaborator

Hey! The token was probably added 🤗

vocab has only 64000

What happens when you print the tokenizer?
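
For example (a hypothetical check, reusing the model_path placeholder from above), the added tokens can be inspected directly:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)  # fast tokenizer by default

print(tokenizer)                    # the repr lists any added special tokens
print(tokenizer.vocab_size)         # size of the base vocabulary (64000 here)
print(len(tokenizer))               # base vocabulary plus added tokens
print(tokenizer.get_added_vocab())  # mapping of added tokens to ids, e.g. {'<|endoftext|>': 64001}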

@sydney0zq
Author

sydney0zq commented Mar 12, 2024

tokenizer.convert_ids_to_tokens([64001])

@ArthurZucker

Here is the output:

>>> tokenizer.convert_ids_to_tokens([64001])
['<|endoftext|>']
>>> tokenizer.convert_ids_to_tokens([64001])
['<|endoftext|>']
>>> tokenizer.convert_ids_to_tokens([64002])
[None]
>>> tokenizer.convert_ids_to_tokens([1])
['<s>']
>>> tokenizer.convert_ids_to_tokens([0])
['<unk>']
>>> tokenizer.vocab_size
64000

The results seem harmless; however, with use_fast=False the same call raises an error:

>>> tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
>>> tokenizer.convert_ids_to_tokens([64001])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/tiger/anaconda3/lib/python3.9/site-packages/transformers/tokenization_utils.py", line 982, in convert_ids_to_tokens
    tokens.append(self._convert_id_to_token(index))
  File "/home/tiger/anaconda3/lib/python3.9/site-packages/transformers/models/llama/tokenization_llama.py", line 280, in _convert_id_to_token
    token = self.sp_model.IdToPiece(index)
  File "/home/tiger/anaconda3/lib/python3.9/site-packages/sentencepiece/__init__.py", line 1179, in _batched_func
    return _func(self, arg)
  File "/home/tiger/anaconda3/lib/python3.9/site-packages/sentencepiece/__init__.py", line 1172, in _func
    raise IndexError('piece id is out of range.')
IndexError: piece id is out of range.
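
As a hedged workaround sketch (not the fix referenced later in this thread): the CUDA assertion in the original report comes from the checkpoint's embedding matrix being smaller than the range of ids the fast tokenizer can emit, so resizing the embeddings to cover len(tokenizer) avoids the device-side assert:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_path)  # model_path: the placeholder used above
tokenizer = AutoTokenizer.from_pretrained(model_path)     # fast tokenizer by default

print(model.get_input_embeddings().num_embeddings)  # 64000 for this checkpoint
print(len(tokenizer))                                # larger than 64000 once tokens were added

# Make the embedding table cover every id the tokenizer can produce.
if len(tokenizer) > model.get_input_embeddings().num_embeddings:
    model.resize_token_embeddings(len(tokenizer))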

@ArthurZucker
Collaborator

Alright I think yi is a special case, #29797 will be fixing this!

@sydney0zq
Author

Alright I think yi is a special case, #29797 will be fixing this!

Thanks, closing the issue~
