Auto-Converted Fast Tokenizer Producing Incorrect Results #24233
Hey! Thanks for reporting. I am investigating this!
Hi, I have a fix. It also makes the conversion process a lot faster (it is super slow on my machine right now for some reason). Is it ok if I make a PR? @young-geng do you have other examples of words that go wrong? I think I've fixed it, but more evidence would also be nice 😸
@stephantul I can dig into it more to find some more examples. Could you tell me why this happens?
I'm still a bit confused as to the exact cause of the issue. I think it has to do with the way the merges are ordered. I'm now running the slow conversion process, which takes a long time, but the new fast conversion process at least fixes the "thermal" example you had above. After that, I can compare and give you a proper analysis; should be done later today.
The issue was that your tokenizer has a merge which has a score of 0, and the conversion handled the ordering of that merge incorrectly.
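To see why merge ordering matters, here is a self-contained toy sketch (not the actual converter code; the `bpe` helper and the tiny merge tables are invented for illustration). Two merge tables containing the same pairs, but with different ranks, segment the same input differently:

```python
# Toy greedy BPE: repeatedly apply the best-ranked adjacent merge.
# `merges` maps a pair of tokens to its rank (lower rank = higher priority).

def bpe(tokens, merges):
    tokens = list(tokens)
    while True:
        # Find the best-ranked pair currently present in the sequence.
        best = None
        for i in range(len(tokens) - 1):
            rank = merges.get((tokens[i], tokens[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            return tokens
        _, i = best
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]

# Same merges, but ("th", "e") and ("e", "r") swap priority:
merges_a = {("t", "h"): 0, ("th", "e"): 1, ("e", "r"): 2}
merges_b = {("t", "h"): 0, ("e", "r"): 1, ("th", "e"): 2}

print(bpe("ther", merges_a))  # ['the', 'r']
print(bpe("ther", merges_b))  # ['th', 'er']
```

If the conversion assigns the wrong rank to a merge (e.g. one whose score is 0), the fast tokenizer ends up with a different segmentation than the original sentencepiece model, as in the "thermal" example above.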
Great work @stephantul! Will review your PR to merge it asap!
I have encountered the same inconsistency. For various reasons it is difficult for us to use the latest version. Could you please let me know in which version of transformers this issue was fixed?
Awesome 🚀 |
Hey, I think the bug might be back. I've just updated to the most recent version of transformers and tokenizers and my slow-fast equivalence test started failing for
Hey! Can you either share a small reproducer or share the tests you are running? |
System Info

- `transformers` version: 4.30.1

Who can help?

@ArthurZucker

Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
The auto-converted fast tokenizer for the LLaMA model sometimes does not produce the same tokenization results as the original sentence piece tokenizer. This is affecting the OpenLLaMA models. Here's the code to reproduce it:
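The original code block was not preserved in this scrape. A minimal reconstruction along these lines (the checkpoint name and the test word "thermal" are assumptions, the latter taken from the discussion above) compares the slow sentencepiece tokenizer with the auto-converted fast one:

```python
# Hedged sketch: compare slow (sentencepiece) vs. auto-converted fast
# tokenization for an OpenLLaMA checkpoint. Checkpoint name is assumed.
from transformers import AutoTokenizer

checkpoint = "openlm-research/open_llama_7b"
slow = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)
fast = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)

text = "thermal"  # one word reported to diverge
print("slow:", slow.tokenize(text))
print("fast:", fast.tokenize(text))
```

On affected versions the two printed token lists differ; after the fix they should be identical.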
The code produces the following output:
Expected behavior
The auto-converted fast tokenizer should produce the exact same tokens as the original sentencepiece tokenizer.