-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 #3170
Conversation
Work on compiler warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like a more complete understanding of why the API change is needed. The fact that sentencepiece accepts it isn't good enough - does the text
parameter represent a UTF-8 string, or something else?
As far as I understand it, |
Yes, but it is a non-printable control character. As far as POSIX is concerned, text files must not contain NUL bytes - therefore, text should not contain NUL bytes. |
The small print says: 'Although POSIX.1-2017 does not distinguish between text files and binary files (see the ISO C standard)...' |
If the POSIX description isn't sufficient, what about MIME types? $ printf 'hello\n' >text.bin
$ printf 'hello\0\n' >not_text.bin
$ file -i text.bin
text.bin: text/plain; charset=us-ascii
$ file -i not_text.bin
not_text.bin: application/octet-stream; charset=binary |
@cebtenzzre : I believe that the NUL character is improbable and badly supported by software around. But this PR is aiming to improve |
I am not sure why POSIX is relevant here. Is there any reason to believe that the tokenizer needs to respect any part of the POSIX spec? I think this is simpler that, if the sentencepiece tokenizer can encode NUL characters, but our implementation can't, then it is a bug in our implementation and it should be fixed. Supporting non-NUL terminated strings seems the natural way to do it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll defer to your judgement in this case.
…izer-1 (ggerganov#3170) * Fix für ggerganov#2721 * Reenable tokenizer test for LLaMa * Add `console.cpp` dependency * Fix dependency to `common` * Fixing wrong fix. * Make console usage platform specific Work on compiler warnings. * Adapting makefile * Remove trailing whitespace * Adapting the other parts of the makefile * Fix typo. * Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 * Simplify logic * Add missing change... * Fix ugly compiler warning * llama_tokenize should accept strings containing NUL now * Adding huichen's test case
The last deviations are fixed now, too.
We don't seem to need Unicode normalization immediately.