sentencepiece bpe compatible tokenizer #252
Conversation
Force-pushed from c67e5ef to 448f398
Breaks quantize.cpp currently; the tokenizer part needs to be updated to add the score.
doh, thanks for pointing that out, I've only been using fp16 =) will fix.
Force-pushed from 9711125 to de29a00
Neat! I think this is pretty close but the Unicode handling isn’t quite right. In particular I don’t believe the tokenizer should be UTF-8 aware, since LLaMA should be perfectly capable of handling invalid UTF-8 strings. It seems to operate on the byte level so I believe this PR as-is will prevent characters that are not in the token dataset from being tokenized. Unrecognized characters are currently represented using their UTF-8 bytes as separate tokens.
The handling of UTF-8 here is exactly the same as in SentencePiece: multi-byte characters that don't form tokens will be output one byte at a time.
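To make that byte-level fallback concrete, here is a minimal sketch, assuming a SentencePiece-style vocab where the byte tokens <0x00>..<0xFF> sit at a fixed offset (ids 3..258 in the LLaMA vocab); the helper name and map type are illustrative, not the actual llama.cpp code:

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Illustrative byte-level fallback: pieces found in the vocab map to their
    // token id; anything else (e.g. part of a multi-byte UTF-8 character that
    // is not in the token dataset) is emitted one byte at a time.
    static void append_piece(const std::string & piece,
                             const std::unordered_map<std::string, int> & token_to_id,
                             std::vector<int> & output) {
        auto it = token_to_id.find(piece);
        if (it != token_to_id.end()) {
            output.push_back(it->second);
            return;
        }
        for (unsigned char byte : piece) {
            output.push_back((int) byte + 3); // assumes <0x00>..<0xFF> occupy ids 3..258
        }
    }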
That’s what happens when I do code review at 1am! Everything looks great now (but it’s still 1am so I am not going to approve it until tomorrow morning when I can take a proper look).
Force-pushed from 51d3b4a to 0acb5f5
@eiz Also, let's start bumping the magic when the file format changes. Or add a version field. P.S. I need a few more days before I start looking into the details, so I appreciate all the help from the collaborators so far.
The tokenization looks great; I couldn't find any differences from the original LLaMA tokenizer.
@ggerganov I would suggest a version number. That allows for better error messages along the lines of "unsupported version 2, expected 1".
"why not both?"
    finp.read((char *) &format_version, sizeof(format_version));

    if (format_version != 1) {
        fprintf(stderr, "%s: invalid model file '%s' (unsupported format version %" PRIu32 ")\n",
Suggestion: move the format_version to a shared header file of some sort, and then say "(unsupported version 2, expected 1)".
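A rough sketch of what that could look like, assuming a shared constant; the header placement, function name, and surrounding stream handling are illustrative, not the actual llama.cpp code:

    #include <cinttypes>
    #include <cstdint>
    #include <cstdio>
    #include <fstream>
    #include <string>

    // Hypothetical shared constant; in practice it would live in a common header
    // so that main.cpp and quantize.cpp report the same expected version.
    static const uint32_t GGML_FORMAT_VERSION = 1;

    // Reads the version field and reports both the found and the expected value.
    static bool read_format_version(std::ifstream & finp, const std::string & fname) {
        uint32_t format_version = 0;
        finp.read((char *) &format_version, sizeof(format_version));
        if (format_version != GGML_FORMAT_VERSION) {
            fprintf(stderr, "%s: invalid model file '%s' (unsupported format version %" PRIu32 ", expected %" PRIu32 ")\n",
                    __func__, fname.c_str(), format_version, GGML_FORMAT_VERSION);
            return false;
        }
        return true;
    }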
@eiz Apologies for the
Good news: I ran this using the scoring logic in #270 and saw an improvement in perplexity for the 7B FP16 on wikitext-2/test from
* potential out of bounds read
* fix quantize
* style
* Update convert-pth-to-ggml.py
* mild cleanup
* don't need the space-prefixing here rn since main.cpp already does it
* new file magic + version header field
* readme notice
* missing newlines

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>
Force-pushed from 5d54356 to 649cee5
Just wanted to note that this change had the positive side effect of the model now producing most common English words as a single token. Before, words were pieced together. This results in a significant speed up (~2x?) of generated text, even if the tokens/sec stayed the same. 🎉
I believe that there's an issue with this algorithm. It can only merge tokens if they are a composite of existing tokens, and on top of that, the combination that works must be the highest-scoring one. I was curious to see how many tokens in the LLaMA vocabulary would not tokenize to themselves, and I got 1631 out of 32000. Note that I reimplemented the algorithm in this commit, so I may have made a mistake. I'm trying to figure out right now if this is the same algorithm as in the sentencepiece project, in which case it's an issue with the original tokenizer, or if it's an issue with this implementation only. Edit: this is the output of the sentencepiece Python implementation, so it looks like it's able to print the tokens. So they must be using a different algorithm than this one (or I messed up the implementation).
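For reference, here is a simplified sketch of the greedy merge loop being discussed (an illustrative reimplementation under assumptions, not the exact llama.cpp code): starting from single characters, the adjacent pair whose concatenation has the highest score in the vocab is merged repeatedly. A vocab entry that cannot be reached through such merges will not tokenize to itself, which would explain the count above.

    #include <string>
    #include <unordered_map>
    #include <vector>

    // Greedy BPE-style merging: repeatedly merge the adjacent pair of symbols
    // whose concatenation exists in the vocab with the highest score, until no
    // adjacent pair forms a known token.
    static std::vector<std::string> greedy_bpe(const std::string & text,
                                               const std::unordered_map<std::string, float> & vocab_score) {
        std::vector<std::string> symbols;
        for (char c : text) {
            symbols.push_back(std::string(1, c));
        }
        while (symbols.size() > 1) {
            int   best_i     = -1;
            float best_score = -1e30f;
            for (size_t i = 0; i + 1 < symbols.size(); ++i) {
                auto it = vocab_score.find(symbols[i] + symbols[i + 1]);
                if (it != vocab_score.end() && it->second > best_score) {
                    best_score = it->second;
                    best_i     = (int) i;
                }
            }
            if (best_i < 0) {
                break; // no adjacent pair forms a known token
            }
            symbols[best_i] += symbols[best_i + 1];
            symbols.erase(symbols.begin() + best_i + 1);
        }
        return symbols;
    }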
@bobakfb asked ChatGPT to find the differences between this algorithm and the SentencePiece one: https://github.com/google/sentencepiece/blob/master/src/bpe_model.cc
It looks correct. There's a "resegment" step at the end of the SentencePiece algorithm (https://github.com/google/sentencepiece/blob/master/src/bpe_model.cc#L175-L200) that isn't present in this implementation (lines 1870 to 1883 in 7487137).
So we should add it as well.
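A rough sketch of what such a resegment step does, assuming each merged symbol records the two symbols it was built from (names and data layout are illustrative, based on a reading of the linked bpe_model.cc, not the actual SentencePiece or llama.cpp code): if the final merged piece is not itself in the vocab, it is recursively split back into its children, so that only known tokens are emitted.

    #include <string>
    #include <unordered_map>
    #include <vector>

    struct bpe_symbol {
        std::string piece;
        int left  = -1; // index of the left child symbol, -1 if none
        int right = -1; // index of the right child symbol, -1 if none
    };

    // Recursively re-splits a merged piece that is not in the vocab back into
    // its children; single characters not in the vocab would fall through to a
    // byte/unknown fallback (omitted here).
    static void resegment(const std::vector<bpe_symbol> & symbols, int idx,
                          const std::unordered_map<std::string, int> & token_to_id,
                          std::vector<int> & output) {
        const bpe_symbol & sym = symbols[idx];
        auto it = token_to_id.find(sym.piece);
        if (it != token_to_id.end()) {
            output.push_back(it->second);
            return;
        }
        if (sym.left < 0 || sym.right < 0) {
            return; // leaf symbol not in vocab: handled by the byte fallback
        }
        resegment(symbols, sym.left,  token_to_id, output);
        resegment(symbols, sym.right, token_to_id, output);
    }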
I believe this largely fixes the tokenization issues. The example mentioned in #167, as well as my local tests (e.g. "accurately" should tokenize as [7913, 2486]), are fixed by it. Have not tested extensively though, especially with Unicode. I saw some discussion around file format updates, so just take this as an RFC; I just hacked something in.
sorry if my coding style is not to your liking ;)