
Fixing the last deviations from sentencepiece indicated by test-tokenizer-1 #3170

Merged: 19 commits, Sep 16, 2023

Conversation

goerch (Collaborator) commented Sep 14, 2023

The last deviations are fixed now, too.

We don't seem to need Unicode normalization immediately.

Review thread on common/common.cpp (outdated, resolved)
cebtenzzre (Collaborator) left a comment

I would like a more complete understanding of why the API change is needed. The fact that sentencepiece accepts it isn't good enough - does the text parameter represent a UTF-8 string, or something else?

goerch (Collaborator, Author) commented Sep 14, 2023

I would like a more complete understanding of why the API change is needed. The fact that sentencepiece accepts it isn't good enough - does the text parameter represent a UTF-8 string, or something else?

As far as I understand it, 0x00 is a valid Unicode code point which converts to 0x00 in UTF-8 and vice versa.
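
A minimal sketch of that point (illustration only, not code from this PR): U+0000 encodes to the single byte 0x00 in UTF-8, so a std::string can carry it, but any code path that measures the text with strlen() silently stops at that byte.

```cpp
// Minimal sketch: U+0000 is one 0x00 byte in UTF-8. A std::string keeps it,
// but strlen() treats it as the end of the string.
#include <cstring>
#include <iostream>
#include <string>

int main() {
    std::string text = std::string("a") + '\0' + "b";  // bytes: 'a', 0x00, 'b'
    std::cout << "std::string size: " << text.size() << '\n';                // 3
    std::cout << "strlen(c_str()) : " << std::strlen(text.c_str()) << '\n';  // 1
    return 0;
}
```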

cebtenzzre (Collaborator) commented Sep 14, 2023

As far as I understand it, 0x00 is a valid Unicode code point which converts to 0x00 in UTF-8.

Yes, but it is a non-printable control character. As far as POSIX is concerned, text files must not contain NUL bytes - therefore, text should not contain NUL bytes.

goerch (Collaborator, Author) commented Sep 14, 2023

As far as I understand it, 0x00 is a valid Unicode code point which converts to 0x00 in UTF-8.

Yes, but it is a non-printable control character. As far as POSIX is concerned, text files must not contain NUL bytes - therefore, text should not contain NUL bytes.

The small print says: 'Although POSIX.1-2017 does not distinguish between text files and binary files (see the ISO C standard)...'

cebtenzzre (Collaborator) commented

If the POSIX description isn't sufficient, what about MIME types?

$ printf 'hello\n' >text.bin
$ printf 'hello\0\n' >not_text.bin
$ file -i text.bin
text.bin: text/plain; charset=us-ascii
$ file -i not_text.bin
not_text.bin: application/octet-stream; charset=binary

goerch (Collaborator, Author) commented Sep 14, 2023

@cebtenzzre: I believe the NUL character is unlikely to occur in practice and is poorly supported by surrounding software. But this PR aims to improve sentencepiece compatibility for a specific test case, so either we go with this PR or we change the test case. How do you propose I change the test case? Ignore token 3 and code point 0? Then I can simply change the test case and retract all changes to the llama.cpp core.
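
To make that option concrete, a rough sketch of a round-trip check that ignores code point 0; codepoint_to_utf8, tokenize and detokenize are placeholder helpers, and the real test-tokenizer-1 differs in detail:

```cpp
// Sketch only: skipping U+0000 in the loop is the alternative to changing
// llama_tokenize itself.
for (uint32_t cp = 1; cp < 0x110000; ++cp) {           // start at 1, i.e. ignore code point 0
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        continue;                                        // surrogate range is not valid text
    }
    const std::string str = codepoint_to_utf8(cp);                         // placeholder helper
    const std::string round_trip = detokenize(ctx, tokenize(ctx, str));    // placeholder wrappers
    if (round_trip != str) {
        fprintf(stderr, "round-trip failed for U+%04X\n", cp);
        return 1;
    }
}
```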

slaren (Collaborator) commented Sep 14, 2023

I am not sure why POSIX is relevant here. Is there any reason to believe that the tokenizer needs to respect any part of the POSIX spec? I think it is simpler than that: if the sentencepiece tokenizer can encode NUL characters but our implementation can't, then that is a bug in our implementation and it should be fixed. Supporting strings that are not NUL-terminated seems the natural way to do it.
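
A minimal sketch of that idea, assuming a C-style entry point that takes an explicit byte length plus a std::string wrapper; the names and signatures here are illustrative, not the actual llama.cpp API:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Stand-in for the C API: tokenizes text[0 .. text_len) and never calls
// strlen(), so an embedded 0x00 byte is just another byte of input.
static std::vector<int32_t> tokenize_raw(const char * text, size_t text_len) {
    std::vector<int32_t> tokens;
    for (size_t i = 0; i < text_len; ++i) {
        tokens.push_back((unsigned char) text[i]);  // byte-level stand-in for real tokenization
    }
    return tokens;
}

// Convenience wrapper: std::string carries its own length, so NULs survive.
static std::vector<int32_t> tokenize(const std::string & text) {
    return tokenize_raw(text.data(), text.size());
}

int main() {
    const std::string text = std::string("a") + '\0' + "b";
    printf("%zu tokens\n", tokenize(text).size());  // prints "3 tokens"
}
```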

cebtenzzre (Collaborator) left a comment

Okay, I'll defer to your judgement in this case.

Review thread on common/common.cpp (outdated, resolved)
goerch mentioned this pull request Sep 16, 2023
slaren merged commit b08e75b into ggerganov:master Sep 16, 2023
pkrmf pushed a commit to morlockstudios-com/llama.cpp that referenced this pull request Sep 26, 2023
…izer-1 (ggerganov#3170)

* Fix for ggerganov#2721

* Reenable tokenizer test for LLaMa

* Add `console.cpp` dependency

* Fix dependency to `common`

* Fixing wrong fix.

* Make console usage platform specific

Work on compiler warnings.

* Adapting makefile

* Remove trailing whitespace

* Adapting the other parts of the makefile

* Fix typo.

* Fixing the last deviations from sentencepiece indicated by test-tokenizer-1

* Simplify logic

* Add missing change...

* Fix ugly compiler warning

* llama_tokenize should accept strings containing NUL now

* Adding huichen's test case