Add support for tokenize and untokenize of UTF-8 encoding in prompt/output #87
Conversation
make doesn't compile;
Need to add the sentencepiece library manually
@wizd I'm using your fork, but interactive mode doesn't work: it gets stuck in an infinite loop.
Resolved in #79
Oh wait, did I get confused?
I think it does. Are you still able to reproduce the issues?
I reran the
There are still two problems:
Can you try running from a shell script encoded as UTF-8 and redirecting the output to a text file? Your terminal might not be handling Unicode correctly. You'll also need to re-generate your models from scratch, since this PR changes how the ggml files are created.
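One way to make token-by-token printing robust against terminals that choke on partial characters (a minimal sketch, not code from this PR): buffer the model's output bytes and only flush complete UTF-8 sequences, since byte-fallback tokens can split a multi-byte character across tokens.

```cpp
#include <cstdio>
#include <string>

// Length of a UTF-8 sequence from its lead byte, or -1 if the byte
// is a continuation byte or invalid as a sequence start.
static int utf8_seq_len(unsigned char lead) {
    if (lead < 0x80)        return 1; // ASCII
    if ((lead >> 5) == 0x6) return 2; // 110xxxxx
    if ((lead >> 4) == 0xE) return 3; // 1110xxxx
    if ((lead >> 3) == 0x1E) return 4; // 11110xxx
    return -1;
}

// Append a token's bytes to `pending` and print only the complete
// UTF-8 sequences; an incomplete tail stays buffered for the next token.
void emit_utf8(std::string & pending, const std::string & piece) {
    pending += piece;
    size_t i = 0;
    while (i < pending.size()) {
        int n = utf8_seq_len((unsigned char) pending[i]);
        if (n < 0) { i++; continue; }      // drop a stray invalid byte
        if (i + n > pending.size()) break; // incomplete: wait for more bytes
        fwrite(pending.data() + i, 1, n, stdout);
        i += n;
    }
    fflush(stdout);
    pending.erase(0, i);
}
```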
@ggerganov just in case: did you re-run the quantization script as well? |
Oops .. all good now 🦙 |
Suggestion: can we add a magic version number? I suspect there will be further format updates.
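A rough sketch of what this suggestion could look like (the constant names and values here are hypothetical, not taken from llama.cpp): store a magic number plus a version field at the start of the ggml file, so an incompatible file is rejected with a clear error instead of misbehaving.

```cpp
#include <cstdint>
#include <cstdio>

const uint32_t GGML_FILE_MAGIC   = 0x67676d6c; // "ggml" in ASCII (hypothetical)
const uint32_t GGML_FILE_VERSION = 1;          // bump whenever the format changes

// Read and validate the header of a model file opened for binary reading.
bool check_header(FILE * f) {
    uint32_t magic = 0, version = 0;
    if (fread(&magic,   sizeof(magic),   1, f) != 1) return false;
    if (fread(&version, sizeof(version), 1, f) != 1) return false;
    if (magic != GGML_FILE_MAGIC) {
        fprintf(stderr, "not a ggml model file\n");
        return false;
    }
    if (version != GGML_FILE_VERSION) {
        fprintf(stderr, "unsupported ggml file version %u\n", version);
        return false;
    }
    return true;
}
```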
Does this merge into master? How do I test it? wizd's branch doesn't work well with interactive mode.
@ggerganov
The LLaMA tokenization process is filled with magic numbers and is not easily replicable. However, I've found that using the SentencePiece library works well; it's possible the original LLaMA model also used SentencePiece for its tokenization.
Test prompt: '我静静的坐在雨中,思考着' ("I sit quietly in the rain, thinking")
This sentence was tokenized mostly into <0x??> byte-fallback tokens, making the process very difficult to replicate by hand.
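For reference, a minimal sketch of round-tripping that prompt through the SentencePiece C++ API (the library this PR pulls in); the tokenizer model path is a placeholder, not a path from this repo.

```cpp
#include <sentencepiece_processor.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    sentencepiece::SentencePieceProcessor sp;
    // Path is an assumption; point it at the LLaMA tokenizer.model file.
    if (!sp.Load("models/tokenizer.model").ok()) {
        fprintf(stderr, "failed to load tokenizer model\n");
        return 1;
    }

    // Encode a UTF-8 prompt into token ids...
    std::vector<int> ids = sp.EncodeAsIds("我静静的坐在雨中,思考着");

    // ...and decode the ids back to the original UTF-8 text.
    std::string text = sp.DecodeIds(ids);
    printf("%s\n", text.c_str());
    return 0;
}
```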