Fix the tokenizer #2023
Related to #1931 - Some other changes to EOS token behaviour, maybe hidden behind a command-line switch, would be nice to have too. Right now models that use the EOS token as a separator (e.g. Vicuna 1.1/1.3 from lmsys, Nous-Hermes-13b) are somewhat broken: the current code doesn't tokenize EOS, and it replaces a generated EOS with a newline in interactive mode.
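For illustration only, a minimal sketch of what such a switch could look like in the generation loop. The `--keep-eos` idea, `gen_params`, and `print_piece` are hypothetical placeholders, not the actual llama.cpp API:

```cpp
// Sketch of EOS handling behind a hypothetical --keep-eos switch.
// All names here (gen_params, keep_eos, print_piece) are placeholders.
#include <cstdio>

struct gen_params {
    bool interactive = true;
    bool keep_eos    = false; // hypothetical switch: pass EOS through unchanged
};

// Placeholder for converting a token id to text and printing it.
static void print_piece(int token) { printf("<token %d>", token); }

static void handle_generated_token(int token, int eos_token, const gen_params & params) {
    if (token == eos_token && params.interactive && !params.keep_eos) {
        // Current behaviour described above: replace a generated EOS with a newline.
        printf("\n");
        return;
    }
    // With the switch enabled, models that use EOS as a turn separator
    // (Vicuna 1.1/1.3, Nous-Hermes-13b, ...) see the token unchanged.
    print_piece(token);
}

int main() {
    gen_params params;
    params.keep_eos = true;
    handle_generated_token(/*token=*/2, /*eos_token=*/2, params); // prints "<token 2>"
    printf("\n");
    return 0;
}
```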
I will try this on the weekend.
Have you thought about supporting text normalization in the tokenizer, like https://github.com/google/sentencepiece/blob/master/doc/normalization.md ? This is essential to get correct encoding (NFKC) for Unicode languages like Chinese. For example, encoding and then decoding the full-width '（' wouldn't yield the original string.
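For reference, a small sketch of the NFKC mapping behind that example, assuming ICU is available (the build command and library names are assumptions, not anything used by the project): NFKC folds the full-width parenthesis U+FF08 to the ASCII '(' U+0028, which is why the round trip does not return the original character.

```cpp
// NFKC normalization of a full-width parenthesis using ICU (illustrative sketch).
// Build with something like: g++ nfkc.cpp $(pkg-config --cflags --libs icu-uc icu-i18n)
#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <cstdio>
#include <string>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2 * nfkc = icu::Normalizer2::getNFKCInstance(status);
    if (U_FAILURE(status)) return 1;

    // "\xEF\xBC\x88" is the UTF-8 encoding of U+FF08 FULLWIDTH LEFT PARENTHESIS.
    icu::UnicodeString input = icu::UnicodeString::fromUTF8("\xEF\xBC\x88");
    icu::UnicodeString normalized = nfkc->normalize(input, status);
    if (U_FAILURE(status)) return 1;

    std::string out;
    normalized.toUTF8String(out);
    printf("NFKC(U+FF08) = '%s'\n", out.c_str()); // prints the ASCII '(' U+0028
    return 0;
}
```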
Spot on: somebody probably should have thought about this in the past. Not sure about the best way forward here.
I don't think so.
@slaren: as soon as #3170 is merged I'm happy (for now) with the character encoding/decoding behavior of the …

@huichen: I just checked your example of '（' in …

@ggerganov: Should we test the character behavior of the Falcon tokenizer the same way as for the Llama one? Do we have a strategy for how to cope with Unicode normalization if necessary?
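A character round-trip test of the kind raised here could look roughly like the sketch below. The byte-level `tokenize`/`detokenize` below are trivial stand-ins, not the project's actual API; a real test would call the Llama or Falcon tokenizer instead.

```cpp
// Sketch of a tokenize/detokenize round-trip check over strings with
// multi-byte UTF-8 characters. The stand-ins below tokenize one byte per
// "token"; replace them with the real tokenizer under test.
#include <cstdio>
#include <string>
#include <vector>

static std::vector<int> tokenize(const std::string & text) {
    std::vector<int> tokens;
    for (unsigned char c : text) tokens.push_back(c);
    return tokens;
}

static std::string detokenize(const std::vector<int> & tokens) {
    std::string text;
    for (int t : tokens) text.push_back(static_cast<char>(t));
    return text;
}

int main() {
    const std::vector<std::string> cases = {
        "Hello world", "  leading spaces", "日本語", "Привет", "\xEF\xBC\x88", // U+FF08
    };
    int failures = 0;
    for (const std::string & text : cases) {
        const std::string out = detokenize(tokenize(text));
        if (out != text) {
            printf("round-trip mismatch: '%s' -> '%s'\n", text.c_str(), out.c_str());
            failures++;
        }
    }
    printf("%d failure(s)\n", failures);
    return failures != 0;
}
```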
@goerch If I am not mistaken, …
We should fix the issues that @vjeux found in the llama tokenizer. They are explained in detail here: #252 (comment), and a rough verification sketch follows below.
Might be a good first issue.
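One way a fix could be verified (a sketch under assumptions, not the project's actual test code): compare the candidate tokenizer's output against the reference SentencePiece library for the same model file. `llama_cpp_tokenize` below is a hypothetical stand-in for the tokenizer being fixed; the `SentencePieceProcessor` calls are the real sentencepiece C++ API, and the `tokenizer.model` path is assumed.

```cpp
// Sketch: compare a candidate tokenization against the SentencePiece reference.
#include <sentencepiece_processor.h>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the tokenizer under test; replace with the real
// llama.cpp tokenization call. As written it returns nothing, so every case
// below is reported as a mismatch.
static std::vector<int> llama_cpp_tokenize(const std::string & /*text*/) {
    return {};
}

int main() {
    sentencepiece::SentencePieceProcessor sp;
    if (!sp.Load("tokenizer.model").ok()) { // the model's SentencePiece file (path assumed)
        fprintf(stderr, "failed to load tokenizer.model\n");
        return 1;
    }

    const std::vector<std::string> cases = {
        "Hello world",
        " leading and  double  spaces",
        "こんにちは",
    };

    int mismatches = 0;
    for (const std::string & text : cases) {
        std::vector<int> reference;
        sp.Encode(text, &reference); // reference token ids from SentencePiece
        const std::vector<int> candidate = llama_cpp_tokenize(text);
        if (candidate != reference) {
            printf("mismatch on '%s'\n", text.c_str());
            mismatches++;
        }
    }
    printf("%d mismatch(es)\n", mismatches);
    return mismatches != 0;
}
```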