Inconsistency between fast and slow codellama tokenizers #26455

Closed
UniverseFly opened this issue Sep 28, 2023 · 2 comments · Fixed by #26678

Comments


UniverseFly commented Sep 28, 2023

System Info

  • transformers version: 4.33.2
  • Platform: Linux-5.15.0-82-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.3
  • Accelerate version: 0.22.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

A minimal reproduction:

from transformers import AutoTokenizer

# Load the fast (Rust-backed) and slow (sentencepiece-based) tokenizers
# for the same checkpoint
t_fast = AutoTokenizer.from_pretrained("codellama/codellama-7B-Instruct-Hf", use_fast=True)
t_slow = AutoTokenizer.from_pretrained("codellama/codellama-7B-Instruct-Hf", use_fast=False)

# Encode a prompt that contains the special token <s> inline in the text
ids_fast = t_fast.encode("<s>[INST]", add_special_tokens=False)
ids_slow = t_slow.encode("<s>[INST]", add_special_tokens=False)

assert ids_fast == ids_slow, f"Fast: {ids_fast}, Slow: {ids_slow}"
# AssertionError: Fast: [1, 518, 25580, 29962], Slow: [1, 25580, 29962]
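
Mapping the ids back to token strings shows where the extra id comes from (a minimal sketch continuing the snippet above; the token strings in the comments are inferred from the decoded outputs reported below):

# The fast tokenizer emits an extra id 518 after <s>, which corresponds to
# '▁[' (a leading space plus '['); in the slow output the '[' is missing.
print(t_fast.convert_ids_to_tokens(ids_fast))  # expected: ['<s>', '▁[', 'INST', ']']
print(t_slow.convert_ids_to_tokens(ids_slow))  # expected: ['<s>', 'INST', ']']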

Expected behavior

I'm not sure which one is correct. Decoding the fast tokenizer's output gives '<s> [INST]', while the slow tokenizer's gives '<s>INST]'; neither matches the original string.
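
For reference, the decode calls that produce those strings (continuing the snippet from the reproduction):

print(t_fast.decode(ids_fast))  # '<s> [INST]'
print(t_slow.decode(ids_slow))  # '<s>INST]'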

@ArthurZucker
Collaborator

Hey! This was already reported: it is a duplicate of #25881 and related to #26318 as well. It will be fixed in tokenizers soon!
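
Until then, a possible workaround (a sketch, not from this thread, and assuming the prompt can be restructured so that <s> is not embedded in the text) is to let add_special_tokens prepend the BOS token instead:

# Workaround sketch: avoid inline special tokens and let the tokenizer
# prepend <s> itself. Since the divergence above is tied to text that
# follows an inline special token, both tokenizers should agree here.
ids_fast = t_fast.encode("[INST]", add_special_tokens=True)
ids_slow = t_slow.encode("[INST]", add_special_tokens=True)
assert ids_fast == ids_slow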

@ArthurZucker
Copy link
Collaborator

The PR is looking good. It had a few delays, but I was able to reproduce 1-to-1 with the fast tokenizer, and also with sentencepiece's actual token addition.
