
Auto-Converted Fast Tokenizer Producing Incorrect Results #24233

Closed
young-geng opened this issue Jun 13, 2023 · 12 comments · Fixed by #24266

Comments

young-geng commented Jun 13, 2023

System Info

  • transformers version: 4.30.1
  • Platform: Linux-5.15.107+-x86_64-with-glibc2.31
  • Python version: 3.10.12
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu118 (False)
  • Tensorflow version (GPU?): 2.12.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.6.9 (cpu)
  • Jax version: 0.4.10
  • JaxLib version: 0.4.10
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The auto-converted fast tokenizer for the LLaMA model sometimes does not produce the same tokenization results as the original SentencePiece tokenizer. This affects the OpenLLaMA models. Here's the code to reproduce it:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b', use_fast=False)
fast_tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b')

text = 'thermal'
print(tokenizer.encode(text))
print(fast_tokenizer.encode(text))

The code produces the following output:

[1, 14412]
[1, 31822, 496, 12719]

Expected behavior

The auto-converted fast tokenizer should produce the exact same tokens as the original sentencepiece tokenizer.
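To hunt for more mismatching words, a small hypothetical helper (my own sketch, not part of transformers) can diff two encoders over a list of probe strings; it is shown here with toy encode functions so it runs without downloading a model:

```python
# Hypothetical helper (not from transformers) that returns the probe strings
# on which two encoders disagree; a real check would pass tokenizer.encode
# and fast_tokenizer.encode from the snippet above.
def find_mismatches(encode_a, encode_b, texts):
    """Return the probe strings on which the two encoders disagree."""
    return [t for t in texts if encode_a(t) != encode_b(t)]

# Toy stand-ins mimicking the 'thermal' mismatch reported above.
slow_encode = lambda t: [1, len(t)]
fast_encode = lambda t: [1, 31822, 496, 12719] if t == 'thermal' else [1, len(t)]

print(find_mismatches(slow_encode, fast_encode, ['thermal', 'theory']))  # ['thermal']
```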

ArthurZucker (Collaborator) commented Jun 13, 2023

Hey! Thanks for reporting. I am investigating this!

stephantul (Contributor)

Hi, I have a fix. It also makes the conversion process a lot faster (it is super slow on my machine right now for some reason). Is it ok if I make a PR?

@young-geng do you have other examples of words that go wrong? I think I've fixed it, but more evidence would also be nice 😸

young-geng (Author)

@stephantul I can dig into it more to find some more examples. Could you tell me why this happens?

stephantul (Contributor)

I'm still a bit confused as to the exact cause of the issue. I think it has to do with the way the merges are ordered. I'm now running the slow conversion process, which takes a long time, but the new fast conversion process at least fixes the "thermal" example you had above.

After that, I can compare and give you a proper analysis, should be done later today.

stephantul (Contributor)

The issue was that your tokenizer has a merge with a score of 0, namely _t. This merge wasn't properly recorded, because the conversion code checked the merge score for falsiness rather than for existence.

That is, it checked if vocab_score:, when it should have checked if vocab_score is None:. Because a score of 0 is falsy, _t was removed as a possible merge, which breaks _thermal and other words starting with a lowercase t.
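A minimal illustration of that falsiness pitfall (my own sketch, not the actual transformers conversion code): in Python a score of 0 is falsy, so a truthiness check treats a valid zero-score merge as if the score were missing.

```python
# Illustration only: why `if vocab_score:` misbehaves for a score of 0.
def merge_dropped_buggy(vocab_score):
    # Buggy: `not vocab_score` is True for 0, so the merge is dropped.
    return not vocab_score

def merge_dropped_fixed(vocab_score):
    # Fixed: only an absent score (None) means there is no merge.
    return vocab_score is None

zero_score = 0.0  # score of the "_t" merge described above
print(merge_dropped_buggy(zero_score))  # True  -- "_t" wrongly dropped
print(merge_dropped_fixed(zero_score))  # False -- "_t" kept
```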

ArthurZucker (Collaborator)

Great work @stephantul! Will review your PR and merge it ASAP!

dsdanielpark

@ArthurZucker I have encountered the same inconsistency. For various reasons it is difficult for us to always use the latest version. Could you tell me in which version of transformers this issue was fixed?

ArthurZucker (Collaborator)

Hey! This was available in the following releases: v4.35.2 v4.35.1 v4.35.0 v4.34.1 v4.34.0 v4.33.3 v4.33.2 v4.33.1 v4.33.0 v4.32.1 v4.32.0 v4.31.0

dsdanielpark commented Dec 4, 2023

@ArthurZucker Thank you for your response.

In the case of the llama2 tokenizer, I have confirmed on transformers 4.31.0 that all 8.56 billion tokens across the datasets of several well-known LLMs are tokenized identically by the fast and slow tokenizers.


ArthurZucker (Collaborator)

Awesome 🚀

hr0nix commented Jun 13, 2024

Hey, I think the bug might be back.

I've just updated to the most recent versions of transformers and tokenizers, and my slow-fast equivalence test started failing for dinhanhx/llama-tokenizer-hf and mistralai/Mistral-7B-v0.3.

ArthurZucker (Collaborator)

Hey! Can you either share a small reproducer or share the tests you are running?
