Tokenizer use_fast=True encode has fatal bug #29483
Comments
sydney0zq changed the title from "Tokenizer use_fast=False encode error" to "Tokenizer use_fast=True encode has fatal bug" on Mar 6, 2024
Hey! The token was probably added 🤗
What happens when you print the tokenizer?
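The maintainer's hypothesis is that the token was registered as an "added token" on top of the base vocabulary. A minimal pure-Python simulation of that mechanism (hypothetical names, not the transformers API) shows why such a token's id necessarily lands past the base vocab size, and therefore past the rows of an embedding matrix sized to the base vocab:

```python
# Toy simulation of the suspected cause: an added token receives the
# next free id, which starts right AFTER the base vocabulary. Names
# here (ToyTokenizer, add_token) are illustrative, not transformers API.
class ToyTokenizer:
    def __init__(self, base_vocab_size):
        self.base_vocab_size = base_vocab_size
        self.added_tokens = {}  # token string -> assigned id

    def add_token(self, token):
        # Ids 0..base_vocab_size-1 belong to the base vocab, so the
        # first added token gets id == base_vocab_size.
        new_id = self.base_vocab_size + len(self.added_tokens)
        self.added_tokens[token] = new_id
        return new_id

tok = ToyTokenizer(base_vocab_size=64000)
eos_id = tok.add_token("<|endoftext|>")
print(eos_id)  # 64000 -- already out of range for a 64000-row embedding
```

Printing a real transformers tokenizer shows its added-tokens table, which is why the maintainer asked for that output: it reveals whether ids beyond the base vocab have been registered.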
Here is the output:
The results seem harmless, however, when use
Alright I think
Thanks, closing the issue~
System Info
Transformers = 4.37
PyTorch = 2.1.1
Reproduction
For the Yi-34B model, after SFT, we save the model without tokenizer.json but with all the other tokenizer files. Then we run tokenizer.convert_ids_to_tokens([64001]) (with the tokenizer's use_fast left at its default value), and it returns <|endoftext|>. But the vocab has only 64000 entries, so the out-of-range id triggers a CUDA device-side assertion error. This bug is easy to reproduce. Please fix it.
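A CUDA device-side assert from an out-of-range embedding index is notoriously opaque. One defensive workaround (a sketch under assumed names, not a transformers feature) is to validate the ids on the CPU before moving them to the GPU, so the failure is a clear Python exception instead:

```python
# Hedged sketch: check token ids against the model's vocab size before
# they reach the embedding layer. `vocab_size` would normally come from
# the model config; the function name is illustrative.
def check_token_ids(ids, vocab_size):
    bad = [(pos, i) for pos, i in enumerate(ids) if not 0 <= i < vocab_size]
    if bad:
        raise ValueError(
            f"out-of-range token ids (position, id): {bad}; "
            f"vocab_size={vocab_size}"
        )
    return ids

check_token_ids([0, 1, 63999], 64000)  # in range, passes through
try:
    check_token_ids([64001], 64000)    # the id from this issue
except ValueError as e:
    print(e)
```

Running such a check on the tokenizer's output would surface the vocab/tokenizer mismatch described above at encode time rather than deep inside a CUDA kernel.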
Expected behavior
Fix the bug ASAP