System Info

transformers version: 4.36.2

Who can help?

@ArthurZucker

Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
from transformers import AutoTokenizer

# Slow (SentencePiece-based) tokenizer
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf", trust_remote_code=True, use_fast=False)
# Fast (Rust-based) tokenizer
fast_tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf", trust_remote_code=True, use_fast=True)

prompt = "▁<PRE>//"

inputs = tokenizer(prompt, return_tensors="pt")
print(f"tokenizer ids: {inputs.input_ids}")

inputs = fast_tokenizer(prompt, return_tensors="pt")
print(f"fast tokenizer ids: {inputs.input_ids}")
This script will output:
tokenizer ids: tensor([[ 1, 32007, 458]])
fast tokenizer ids: tensor([[ 1, 32007, 849]])
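To pinpoint where the two encodings diverge, the differing ids can be mapped back to token strings with convert_ids_to_tokens (a quick check, not part of the original report; the token strings in the comments are what the vocabulary entries shown below suggest):

# Map the differing ids back to token strings (reusing the tokenizers above).
print(tokenizer.convert_ids_to_tokens([1, 32007, 458]))
# expected: ['<s>', '▁<PRE>', '//']
print(fast_tokenizer.convert_ids_to_tokens([1, 32007, 849]))
# expected: ['<s>', '▁<PRE>', '▁//']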
In the tokenizer.json from the model folder, we can see:

"//": 458,
"▁//": 849,
The fast tokenizer probably ignores the <PRE> token; is this the correct behavior?
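One way to probe that hypothesis (an extra experiment, not part of the original report) is to drop the special token and compare again; if both tokenizers agree on plain "//", the divergence is specific to text that follows a special token:

# Without the special token, both tokenizers receive the same plain text.
print(tokenizer("//", return_tensors="pt").input_ids)
print(fast_tokenizer("//", return_tensors="pt").input_ids)
# If these two match, the inconsistency only appears after "▁<PRE>".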
Expected behavior

The fast tokenizer should be consistent with the slow tokenizer.
Thanks for reporting! This is pretty much a known bug but will be fixed by the likes of #26678 (when propagated to Llama)
#28881 (which touches LlamaTokenizerFast) will fix this issue!
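Until that fix is available in a release, a possible workaround, a sketch based on the assumption that the installed transformers version exposes add_prefix_space on LlamaTokenizerFast (this thread does not confirm it), is to disable the automatic prefix space when loading the fast tokenizer:

from transformers import AutoTokenizer

# Assumption: a transformers version where LlamaTokenizerFast accepts
# add_prefix_space and converts from the slow tokenizer to honor it.
fast_tokenizer = AutoTokenizer.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    use_fast=True,
    add_prefix_space=False,  # do not insert "▁" before the text
)
print(fast_tokenizer("▁<PRE>//", return_tensors="pt").input_ids)
# expected, if this mirrors the slow tokenizer: tensor([[ 1, 32007, 458]])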