Inconsistent behavior between tokenizer and fast tokenizer #28577

xuzhenqi · 2024-01-18T11:10:11Z

System Info

transformers version: 4.36.2
Platform: Linux-4.18.0-193.6.3.el8_2.v1.4.x86_64-x86_64-with-glibc2.29
Python version: 3.8.10
Huggingface_hub version: 0.19.4
Safetensors version: 0.4.0
Accelerate version: 0.25.0
Accelerate config: not found
PyTorch version (GPU?): 2.0.1+cu118 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: No
Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Reproduction

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf", trust_remote_code=True, use_fast=False)
fast_tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf", trust_remote_code=True, use_fast=True)
prompt = "▁<PRE>//"
inputs = tokenizer(prompt, return_tensors="pt")
print(f"tokenizer ids: {inputs.input_ids}")
inputs = fast_tokenizer(prompt, return_tensors="pt")
print(f"fast tokenizer ids: {inputs.input_ids}")

This script outputs:

tokenizer ids: tensor([[    1, 32007,   458]])
fast tokenizer ids: tensor([[    1, 32007,   849]])

In the tokenizer.json from the model folder, we can see:

"//": 458,
"▁//": 849,

The fast tokenizer probably ignores the <PRE> token. Is this the correct behavior?
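One possible reading of the ids above, sketched as a toy model: the two results differ only in whether "//" or "▁//" is looked up after the special token. The snippet below assumes (this is not confirmed anywhere in this thread) that id 32007 corresponds to "▁<PRE>" and that the fast path prepends "▁" to the text following a special token. The `encode` function and its `prepend_after_special` flag are hypothetical names used for illustration, not part of transformers.

```python
# Toy model only: the ids are the ones quoted in this issue. The mapping of
# 32007 to "▁<PRE>" and of 1 to "<s>" is an assumption made for illustration.
VOCAB = {"<s>": 1, "▁<PRE>": 32007, "//": 458, "▁//": 849}
SPECIAL = "▁<PRE>"


def encode(text, prepend_after_special):
    """Split on SPECIAL and look up each remaining piece as a whole token.

    prepend_after_special=True mimics the ASSUMED fast-tokenizer behavior of
    adding "▁" in front of the text that follows a special token.
    """
    ids = [VOCAB["<s>"]]  # BOS id, as in the reported output
    before, found, after = text.partition(SPECIAL)
    if before:
        ids.append(VOCAB[before])
    if found:
        ids.append(VOCAB[SPECIAL])
    if after:
        piece = "▁" + after if prepend_after_special else after
        ids.append(VOCAB[piece])
    return ids


print(encode("▁<PRE>//", prepend_after_special=False))  # [1, 32007, 458]
print(encode("▁<PRE>//", prepend_after_special=True))   # [1, 32007, 849]
```

If this reading is right, the <PRE> token is not ignored; the text after it is simply normalized differently by the two implementations.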

Expected behavior

The fast tokenizer should be consistent with the normal (slow) tokenizer.
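A minimal sketch of how this expectation could be checked over several prompts at once. `find_mismatches` is a hypothetical helper, not a transformers API; it takes any two callables that map a string to a list of ids. The stand-in encoders below only replay the ids reported in this issue so the snippet runs offline.

```python
def find_mismatches(slow_encode, fast_encode, prompts):
    """Return (prompt, slow_ids, fast_ids) for every prompt whose ids differ."""
    mismatches = []
    for prompt in prompts:
        slow_ids = list(slow_encode(prompt))
        fast_ids = list(fast_encode(prompt))
        if slow_ids != fast_ids:
            mismatches.append((prompt, slow_ids, fast_ids))
    return mismatches


# Stand-in encoders that replay the ids reported in this issue.
reported_slow = {"▁<PRE>//": [1, 32007, 458]}
reported_fast = {"▁<PRE>//": [1, 32007, 849]}

print(find_mismatches(reported_slow.get, reported_fast.get, ["▁<PRE>//"]))
```

With the two tokenizers from the reproduction script, the callables could be `lambda p: tokenizer(p).input_ids` and `lambda p: fast_tokenizer(p).input_ids`; an empty result would mean the two agree on every prompt tested.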

ArthurZucker · 2024-01-18T11:17:32Z

Thanks for reporting! This is pretty much a known bug but will be fixed by the likes of #26678 (when propagated to Llama)

ArthurZucker · 2024-03-25T07:08:55Z

#28881 will fix this issue!

ArthurZucker mentioned this issue Mar 26, 2024: Inconsistent tokenization between fast and slow tokenizers for sentencepiece user-defined tokens. #29868 (Closed)

ArthurZucker mentioned this issue Apr 22, 2024: [LlamaTokenizerFast] Refactor default llama #28881 (Merged)

ArthurZucker closed this as completed in #28881 Apr 23, 2024
