Inconsistency between `CodeLlamaTokenizer` and `CodeLlamaTokenizerFast` #25881
Comments
Yep, this is a known bug and the correct output is the

The fix is in #26678!

@ArthurZucker When can we expect the fix to go through?

Maybe a week or so; this needs a release in tokenizers and a merge in transformers!

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

The PR in tokenizers is ready, I'll try to do a release today or tomorrow. The fix will need a release in transformers but should follow quickly 😉

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Actually, I'm not sure I'll ship it fast enough; it needs additional testing.

@ArthurZucker Just to confirm: in that PR, although you override the base `SpmConverter` class, the `LlamaConverter` itself overrides the normalizer (here) and pre_tokenizer (here), so the changes made there won't fix this problem.
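For context, the converter arrangement described in that comment looks roughly like the sketch below. This is a paraphrase, not the exact transformers source; the method bodies are illustrative only.

```python
# Rough, paraphrased sketch of transformers/convert_slow_tokenizer.py around the
# time of this issue; bodies are illustrative, not verbatim.
from tokenizers import normalizers


class SpmConverter:
    def normalizer(self, proto):
        # Base behavior shared by sentencepiece-backed tokenizers; per the
        # discussion above, #26678 changes things at this level.
        ...

    def pre_tokenizer(self, proto):
        ...


class LlamaConverter(SpmConverter):
    def normalizer(self, proto):
        # Overridden, so a fix applied only in SpmConverter.normalizer is never
        # reached when converting Llama/CodeLlama tokenizers.
        return normalizers.Sequence(
            [
                normalizers.Prepend(prepend="▁"),
                normalizers.Replace(pattern=" ", content="▁"),
            ]
        )

    def pre_tokenizer(self, proto):
        # Also overridden: the Llama converter uses no pre-tokenizer.
        return None
```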
Yes, a separate PR will deal with the Llama converter!

There were delays again but this is not stale!
System Info

transformers version: 4.33.0.dev0

Who can help?

@ArthurZucker
Information

Tasks

An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
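A minimal reproduction sketch of the comparison described under "Expected behavior" below. The checkpoint name is an assumption; any CodeLlama checkpoint that ships both tokenizer variants should behave the same way.

```python
from transformers import CodeLlamaTokenizer, CodeLlamaTokenizerFast

# Assumed checkpoint; the original report may have used a different one.
checkpoint = "codellama/CodeLlama-7b-hf"

slow = CodeLlamaTokenizer.from_pretrained(checkpoint)
fast = CodeLlamaTokenizerFast.from_pretrained(checkpoint)

text = "<PRE>"
print(slow.encode(text, add_special_tokens=False))  # slow, sentencepiece-based tokenizer
print(fast.encode(text, add_special_tokens=False))  # fast, tokenizers-based tokenizer
# The two id lists differ, which is the inconsistency reported here.
```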
Expected behavior
The two tokenizers should have the same behavior.
There's no exact equivalent of `add_special_tokens=False` in the original facebookresearch/codellama repo, but the following (sketched below) seems roughly equivalent for the `"<PRE>"` case; it agrees with `CodeLlamaTokenizer` and disagrees with `CodeLlamaTokenizerFast`.¹
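A rough sketch of that comparison, assuming the `Tokenizer` class from `llama/tokenizer.py` in facebookresearch/codellama and a local `tokenizer.model` path; both are assumptions, not taken verbatim from the original report.

```python
from llama.tokenizer import Tokenizer  # from the facebookresearch/codellama repo

# Assumed local path to the sentencepiece model shipped with the weights.
tok = Tokenizer(model_path="tokenizer.model")

# bos=False, eos=False seems to be the closest analogue of add_special_tokens=False.
ids = tok.encode("<PRE>", bos=False, eos=False)
print(ids)  # "<PRE>" is encoded to its FIM prefix id, matching CodeLlamaTokenizer

# Per the footnote below: "<s>" / "</s>" passed as text are treated as ordinary
# strings by this tokenizer, but FIM tokens such as "<PRE>" map to their special ids.
```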
Footnotes
1. I realize that one isn't supposed to directly encode `"<PRE>"` with the HF tokenizer; I'm just using it to construct a case where the HF and Facebook tokenizers can be compared. The Facebook tokenizer won't encode the EOS or BOS tokens to their corresponding IDs -- it treats them as an ordinary string of 3 characters. But it does encode the FIM tokens to their IDs, as used above with `"<PRE>"`.