[Potential Bug] Mistral Tokenizer Inconsistencies #1448
Comments
Hey! Thanks, a fix can be derived from #1357 and huggingface/transformers#26678.
I have not had the time to change the default llama fast tokenizer, will try to do asap
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I think this is still relevant
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This was fixed in
I have downloaded the Mistral 7B tokenizer locally and tried to compare different combinations of the `legacy` and `use_fast` options:
Which yields:
You can find the full code here.
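The original snippet and its output are not preserved on this page. As a rough sketch only, a comparison along these lines could be written as follows. The local path `./mistral-7b-tokenizer` and the sample text `Hello<unk>world` are hypothetical placeholders, not values taken from the issue, and the helper names are made up for illustration.

```python
from itertools import product


def option_grid():
    """Return every (legacy, use_fast) combination as keyword dicts."""
    return [
        {"legacy": legacy, "use_fast": use_fast}
        for legacy, use_fast in product((True, False), repeat=2)
    ]


def tokenize_all(path, text):
    """Tokenize `text` once per option combination.

    `path` is a hypothetical local directory holding the tokenizer files.
    """
    # Imported lazily so the pure helpers work without transformers installed.
    from transformers import AutoTokenizer

    results = {}
    for opts in option_grid():
        tok = AutoTokenizer.from_pretrained(path, **opts)
        results[(opts["legacy"], opts["use_fast"])] = tok.tokenize(text)
    return results


def group_by_output(results):
    """Group option combinations that produce identical token sequences."""
    groups = {}
    for key, tokens in results.items():
        groups.setdefault(tuple(tokens), []).append(key)
    return groups


if __name__ == "__main__":
    # Placeholder path and input; substitute your own tokenizer directory.
    results = tokenize_all("./mistral-7b-tokenizer", "Hello<unk>world")
    for tokens, configs in group_by_output(results).items():
        print(configs, "->", list(tokens))
```

Grouping by output makes it easy to see at a glance whether one combination disagrees with the other three, which is the pattern described below.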
There seem to be inconsistencies in how `legacy=False, use_fast=False` tokenizes input compared to the other options.
If either option is set to `True`, an extra space is added after tokens like `<unk>` or other special tokens.
It seems to me that only `legacy=False, use_fast=False` tokenizes this input correctly.
We have a production app that extends Mistral with other special tokens besides `<unk>`, and extra spaces are added after those too.
So right now, we have switched over to `legacy=False, use_fast=False`, which means we get none of the speed advantages of the Rust implementation.
Would appreciate any insight into what we are missing! And thank you for the enormous amount of work you guys have put into this library 🙏