Difference between slow and fast GPT2 tokenizers #1363
Comments
Hey! Could you share a full reproducer? It would help me a lot to have the output of your environment info as well.
https://github.com/jploski/llama.cpp/tree/hf-issue-1363/tests/hf-issue-1363 (env output included in README.md)
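For reference, a minimal sketch of the side-by-side comparison such a reproducer boils down to (the checkpoint name and sample text below are placeholders, not taken from the linked repo):

```python
# Compare the slow (pure-Python) and fast (Rust-backed) GPT2 tokenizers on
# the same input. Checkpoint and sample string are illustrative only.
from transformers import GPT2Tokenizer, GPT2TokenizerFast

text = "Hello world"

slow = GPT2Tokenizer.from_pretrained("gpt2")
fast = GPT2TokenizerFast.from_pretrained("gpt2")

slow_ids = slow.encode(text)
fast_ids = fast.encode(text)

print("slow:", slow_ids)
print("fast:", fast_ids)
assert slow_ids == fast_ids, "slow and fast tokenizers disagree"
```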
This comment explains the problem. Not sure if I should close this issue then?
Probably yes! Unless the changes need to be propagated to the slow tokenizer?
Thanks!
Hm, as a transformers user I would normally assume that the slow and fast tokenizers are both correct. If only the fast tokenizer is correct, this should be documented somewhere. (Maybe it is, and I'm just not aware.)
I have been thinking about that too, but I accepted that the fast GPT2 tokenizer offers more features than the original one, and Falcon used them (unfortunately for us). The remark about the documentation is correct (and I certainly would like to ask a lot more questions about the serialization formats of HF ;)).
I agree: these differences in functionality are confusing and deserve at least a mention in the documentation (https://huggingface.co/docs/transformers/main_classes/tokenizer). I note that for the *TokenizerFast classes, even the tokenizer_file parameter is currently entirely undocumented.
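As a rough illustration of that parameter (a sketch only; the file path is a placeholder):

```python
# Load a fast tokenizer directly from a serialized `tokenizers` JSON file,
# bypassing the vocab/merges files the slow tokenizer is built from.
# The path below is a placeholder.
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast(tokenizer_file="path/to/tokenizer.json")
print(tok.encode("Hello world"))
```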
We are trying to get the same results for fast and slow, but in this specific case Falcon used the GPT2TokenizerFast with a custom template processor. We can add a
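For context, a custom template processor in the tokenizers library looks roughly like the following (a minimal sketch; the bare BPE model, template, and special token are illustrative, not Falcon's actual configuration):

```python
# Attach a template post-processor to a bare tokenizer. This post-processing
# hook exists only in the fast (Rust-backed) tokenizer; the slow Python
# GPT2Tokenizer has no equivalent, which is one way the two can diverge.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.processors import TemplateProcessing

tok = Tokenizer(BPE())
tok.post_processor = TemplateProcessing(
    single="[BOS] $A",              # prepend a special token to each sequence
    pair="[BOS] $A $B",
    special_tokens=[("[BOS]", 0)],  # (token string, token id) - illustrative
)
```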
Would any of you like to open a PR to update the documentation however you see fit? You can ping me for review 🤗
Please see this comment by @jploski.