
Auto-Converted Fast Tokenizer Producing Incorrect Results #24233

Closed
young-geng opened this issue Jun 13, 2023 · 12 comments · Fixed by #24266

Comments

young-geng commented Jun 13, 2023

System Info

  • transformers version: 4.30.1
  • Platform: Linux-5.15.107+-x86_64-with-glibc2.31
  • Python version: 3.10.12
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu118 (False)
  • Tensorflow version (GPU?): 2.12.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.6.9 (cpu)
  • Jax version: 0.4.10
  • JaxLib version: 0.4.10
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The auto-converted fast tokenizer for the LLaMA model sometimes does not produce the same tokenization results as the original SentencePiece tokenizer. This affects the OpenLLaMA models. Here's the code to reproduce it:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b', use_fast=False)
fast_tokenizer = AutoTokenizer.from_pretrained('openlm-research/open_llama_7b')

text = 'thermal'
print(tokenizer.encode(text))
print(fast_tokenizer.encode(text))

The code produces the following output:

[1, 14412]
[1, 31822, 496, 12719]

Expected behavior

The auto-converted fast tokenizer should produce the exact same tokens as the original sentencepiece tokenizer.
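To hunt for more mismatching words, a small hypothetical helper (my own sketch, not part of transformers) can diff two encoders over a list of probe strings; it is shown here with toy encode functions so it runs without downloading a model:

```python
# Hypothetical helper (not from transformers) that returns the probe strings
# on which two encoders disagree; a real check would pass tokenizer.encode
# and fast_tokenizer.encode from the snippet above.
def find_mismatches(encode_a, encode_b, texts):
    """Return the probe strings on which the two encoders disagree."""
    return [t for t in texts if encode_a(t) != encode_b(t)]

# Toy stand-ins mimicking the 'thermal' mismatch reported above.
slow_encode = lambda t: [1, len(t)]
fast_encode = lambda t: [1, 31822, 496, 12719] if t == 'thermal' else [1, len(t)]

print(find_mismatches(slow_encode, fast_encode, ['thermal', 'theory']))  # ['thermal']
```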

ArthurZucker (Collaborator) commented Jun 13, 2023

Hey! Thanks for reporting. I am investigating this!

stephantul (Contributor)

Hi, I have a fix. It also makes the conversion process a lot faster (it is super slow on my machine right now for some reason). Is it ok if I make a PR?

@young-geng do you have other examples of words that go wrong? I think I've fixed it, but more evidence would also be nice 😸

young-geng (Author)

@stephantul I can dig into it more to find some more examples. Could you tell me why this happens?

stephantul (Contributor)

I'm still a bit confused as to the exact cause of the issue. I think it has to do with the way the merges are ordered. I'm now running the slow conversion process, which takes a long time, but the new fast conversion process at least fixes the "thermal" example you had above.

After that, I can compare and give you a proper analysis, should be done later today.

stephantul (Contributor)

The issue was that your tokenizer has a merge with a score of 0, namely _t. This merge wasn't properly recorded, because the conversion code checked the merge score for falsiness rather than for existence.

That is, it checked if vocab_score:, when it should have checked if vocab_score is None:. Because a score of 0 is falsy, _t was removed as a possible merge, which breaks _thermal and other words starting with a lowercase t.
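A minimal illustration of that falsiness pitfall (my own sketch, not the actual transformers conversion code): in Python a score of 0 is falsy, so a truthiness check treats a valid zero-score merge as if the score were missing.

```python
# Illustration only: why `if vocab_score:` misbehaves for a score of 0.
def merge_dropped_buggy(vocab_score):
    # Buggy: `not vocab_score` is True for 0, so the merge is dropped.
    return not vocab_score

def merge_dropped_fixed(vocab_score):
    # Fixed: only an absent score (None) means there is no merge.
    return vocab_score is None

zero_score = 0.0  # score of the "_t" merge described above
print(merge_dropped_buggy(zero_score))  # True  -- "_t" wrongly dropped
print(merge_dropped_fixed(zero_score))  # False -- "_t" kept
```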

ArthurZucker (Collaborator)

Great work @stephantul! Will review your PR and merge it ASAP!

dsdanielpark

@ArthurZucker I have encountered the same inconsistency. For various reasons it is difficult for us to always use the latest version. Could you tell me in which version of transformers this issue was fixed?

ArthurZucker (Collaborator)

Hey! This was available in the following releases: v4.35.2 v4.35.1 v4.35.0 v4.34.1 v4.34.0 v4.33.3 v4.33.2 v4.33.1 v4.33.0 v4.32.1 v4.32.0 v4.31.0

dsdanielpark commented Dec 4, 2023

@ArthurZucker Thank you for your response.

In the case of the llama2 tokenizer, I have confirmed on transformers 4.31.0 that all 8.56 billion tokens across the datasets of several well-known LLMs are tokenized identically by the fast and slow tokenizers.


ArthurZucker (Collaborator)

Awesome 🚀

hr0nix commented Jun 13, 2024

Hey, I think the bug might be back.

I've just updated to the most recent versions of transformers and tokenizers, and my slow-fast equivalence test started failing for dinhanhx/llama-tokenizer-hf and mistralai/Mistral-7B-v0.3.

ArthurZucker (Collaborator)

Hey! Can you either share a small reproducer or share the tests you are running?
