
sentencepiece bpe compatible tokenizer #252

Merged: 1 commit merged into ggerganov:master on Mar 20, 2023

Conversation

@eiz (Contributor) commented Mar 18, 2023

I believe this largely fixes the tokenization issues. The example mentioned in #167, as well as my local tests (e.g. "accurately" should tokenize as [7913, 2486]), is fixed by it. I have not tested extensively though, especially with Unicode.

I saw some discussion around file format updates, so just take this as an RFC; I just hacked something in.

sorry if my coding style is not to your liking ;)
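
(For anyone skimming: the approach is SentencePiece-style BPE driven by per-token scores. Below is a minimal, self-contained C++ sketch of the idea; the toy_vocab struct, the bpe_encode function and the tiny vocabulary are invented for illustration and are not the PR's code. The actual implementation stores the scores in the model file and, like SentencePiece's bpe_model.cc, appears to merge via a priority queue of candidate bigrams rather than this simple rescanning loop.)

// Sketch only: greedy, score-based BPE merging in the spirit of SentencePiece.
// All names and the vocabulary below are invented for illustration.
#include <cstdio>
#include <string>
#include <unordered_map>
#include <vector>

struct toy_vocab {
    std::unordered_map<std::string, int>   token_to_id;
    std::unordered_map<std::string, float> token_to_score; // higher score = merged earlier
};

static std::vector<int> bpe_encode(const toy_vocab & vocab, const std::string & text) {
    // start from single characters (the real code starts from whole UTF-8
    // characters rather than bytes, so multi-byte characters stay intact)
    std::vector<std::string> symbols;
    for (char c : text) {
        symbols.push_back(std::string(1, c));
    }

    // repeatedly merge the adjacent pair with the highest score in the vocabulary
    while (true) {
        int   best_i     = -1;
        float best_score = -1e30f;
        for (int i = 0; i + 1 < (int) symbols.size(); ++i) {
            const std::string merged = symbols[i] + symbols[i + 1];
            auto it = vocab.token_to_score.find(merged);
            if (it != vocab.token_to_score.end() && it->second > best_score) {
                best_score = it->second;
                best_i     = i;
            }
        }
        if (best_i < 0) {
            break; // no adjacent pair forms a known token
        }
        symbols[best_i] += symbols[best_i + 1];
        symbols.erase(symbols.begin() + best_i + 1);
    }

    // look up the ids of the final pieces (byte fallback for unknown pieces omitted here)
    std::vector<int> output;
    for (const auto & s : symbols) {
        auto it = vocab.token_to_id.find(s);
        if (it != vocab.token_to_id.end()) {
            output.push_back(it->second);
        }
    }
    return output;
}

int main() {
    toy_vocab vocab;
    // tiny invented vocabulary; ids and scores are placeholders
    const char * pieces[] = { "h", "e", "l", "o", "ll", "ell", "hell", "hello" };
    for (int i = 0; i < (int) (sizeof(pieces) / sizeof(pieces[0])); ++i) {
        vocab.token_to_id[pieces[i]]    = i;
        vocab.token_to_score[pieces[i]] = (float) i; // later entries win merges
    }
    for (int id : bpe_encode(vocab, "hello")) {
        printf("%d ", id); // prints 7, the id of "hello" in this toy vocab
    }
    printf("\n");
    return 0;
}

(Compiles with any C++11 compiler; the scores decide which merge wins whenever several adjacent pairs exist in the vocabulary.)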

@eiz force-pushed the mack/sentencepiece-bpe branch 6 times, most recently from c67e5ef to 448f398 on March 18, 2023 04:03
@slaren (Collaborator) commented Mar 18, 2023

This currently breaks quantize.cpp; its tokenizer-reading part needs to be updated to handle the added score.

@eiz (Contributor, Author) commented Mar 18, 2023

doh, thanks for pointing that out, I've only been using fp16 =) will fix.

Review thread on convert-pth-to-ggml.py (outdated, resolved)
@j-f1 (Collaborator) left a comment

Neat! I think this is pretty close but the Unicode handling isn’t quite right. In particular I don’t believe the tokenizer should be UTF-8 aware, since LLaMA should be perfectly capable of handling invalid UTF-8 strings. It seems to operate on the byte level so I believe this PR as-is will prevent characters that are not in the token dataset from being tokenized. Unrecognized characters are currently represented using their UTF-8 bytes as separate tokens.

@eiz (Contributor, Author) commented Mar 18, 2023

The handling of UTF-8 here is exactly the same as what SentencePiece does: multi-byte characters that don't form tokens will be output one byte at a time.
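
(To make the byte fallback concrete, here is a tiny hedged sketch, not the PR's code: a piece that never formed a vocabulary token is emitted one token per UTF-8 byte, using the +3 offset seen in the llama.cpp snippet quoted later in this thread, presumably because the first few ids are reserved for special tokens. The output_bytes helper is invented for illustration and assumes a UTF-8 source file.)

// Sketch: emit a piece that is not in the vocabulary as one token per byte.
#include <cstdio>
#include <string>
#include <vector>

static void output_bytes(const std::string & piece, std::vector<int> & output) {
    for (unsigned char c : piece) {
        output.push_back((int) c + 3); // one token per UTF-8 byte, offset by 3
    }
}

int main() {
    std::vector<int> output;
    output_bytes("é", output); // 2-byte UTF-8 character -> 2 byte-level tokens
    for (int id : output) {
        printf("%d ", id); // prints 198 172 (0xC3 0xA9, each + 3), assuming UTF-8 source
    }
    printf("\n");
    return 0;
}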

@j-f1 (Collaborator) commented Mar 18, 2023

That’s what happens when I do code review at 1am! Everything looks great now (but it’s still 1am so I am not going to approve it until tomorrow morning when I can take a proper look)

@ggerganov (Owner) commented
@eiz
Before you merge this, add a temporary notice at the top of the README to regen the models (like this: 8a01f56)

Also, let's start bumping the magic when the ggml models change:
https://github.com/ggerganov/llama.cpp/blob/master/convert-pth-to-ggml.py#L96

Or add a ggml version number, whichever you prefer. Check it during loading, and if it does not match, print a message asking the user to regen the models.

P.S. I need a few more days before I start looking into the details, so I appreciate all the help from the collaborators so far.

@slaren (Collaborator) commented Mar 18, 2023

The tokenization looks great, I couldn't find any differences from the original llama tokenizer.

@Ronsor (Contributor) commented Mar 18, 2023

@ggerganov I would suggest a version number. That allows for better error messages, like "unsupported version" versus something like "invalid model file".

@eiz (Contributor, Author) commented Mar 18, 2023

"why not both?"

  • changed the file magic so existing unversioned files don't misparse (ggml -> ggmf, "gg model file")
  • added a version number to the header

finp.read((char *) &format_version, sizeof(format_version));

if (format_version != 1) {
    fprintf(stderr, "%s: invalid model file '%s' (unsupported format version %" PRIu32 ")\n",
A collaborator left a review comment on this snippet:

Suggestion: move the format_version to a shared header file of some sort, and then say (unsupported version 2, expected 1)
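
(A hedged sketch of what that suggestion might look like; the LLAMA_FILE_MAGIC and LLAMA_FILE_VERSION constants, the check_header helper and the exact messages are placeholders, not the actual llama.cpp code.)

// Sketch of the suggested shared-constant approach; all names are placeholders.
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <fstream>

// would live in a shared header so the converter and the loaders agree
static const uint32_t LLAMA_FILE_MAGIC   = 0x67676d66; // "ggmf", placeholder value
static const uint32_t LLAMA_FILE_VERSION = 1;

static bool check_header(std::ifstream & finp, const char * fname) {
    uint32_t magic = 0, version = 0;
    finp.read((char *) &magic,   sizeof(magic));
    finp.read((char *) &version, sizeof(version));

    if (magic != LLAMA_FILE_MAGIC) {
        fprintf(stderr, "%s: invalid model file (bad magic); please regenerate the model files\n", fname);
        return false;
    }
    if (version != LLAMA_FILE_VERSION) {
        fprintf(stderr, "%s: unsupported format version %" PRIu32 ", expected %" PRIu32 "\n",
                fname, version, LLAMA_FILE_VERSION);
        return false;
    }
    return true;
}

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model-file>\n", argv[0]);
        return 1;
    }
    std::ifstream finp(argv[1], std::ios::binary);
    return check_header(finp, argv[1]) ? 0 : 1;
}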

@ggerganov (Owner) commented Mar 19, 2023

@eiz Apologies for the convert-pth-to-ggml.py refactoring; please resolve the conflicts and merge when you are ready.

@bakkot (Contributor) commented Mar 20, 2023

Good news: I ran this using the scoring logic in #270 and saw an improvement in perplexity for the 7B FP16 on wikitext-2/test from 10.4625 before this PR to 5.8149 after. That's a huge improvement.

@eiz eiz merged commit 074bea2 into ggerganov:master Mar 20, 2023

Commit message of the merged commit:

* potential out of bounds read

* fix quantize

* style

* Update convert-pth-to-ggml.py

Co-authored-by: slaren <2141330+slaren@users.noreply.github.com>

* mild cleanup

* don't need the space-prefixing here rn since main.cpp already does it

* new file magic + version header field

* readme notice

* missing newlines
@Green-Sky (Collaborator) commented Mar 21, 2023

Just wanted to note that this change had the positive side effect that the model now produces most common English words as a single token; before, words were pieced together from several tokens. This results in a significant speed-up (~2x?) of generated text, even though the tokens/sec stayed the same. 🎉

mudler pushed a commit to go-skynet/llama that referenced this pull request Mar 21, 2023
@anzz1 anzz1 mentioned this pull request Mar 22, 2023
@vjeux commented Jun 23, 2023

I believe that there's an issue with this algorithm. It can only produce a token if that token is a composite of existing tokens, and not only that: the combination that builds it must also be the highest-scoring one.

I was curious how many tokens in the llama vocabulary would not tokenize to themselves, and I got 1631 out of 32000.
https://gist.github.com/vjeux/5a466b4c47dc19ec9630f6fbf0cc3a1b

Note that I reimplemented the algorithm in this commit, so I may have made a mistake.

I'm trying to figure out right now whether this is the same algorithm as in the sentencepiece project, in which case it's an issue with the original tokenizer, or whether it's an issue with this implementation only.

Edit: below is the output of the sentencepiece Python implementation. It looks like it is able to produce those tokens, so it must be using a different algorithm than this one (or I messed up my reimplementation).

>> print(tt.encode("""▁–"""))

[1, 56805, 702, 2]
1 <s>
56805 ▁
702 ▁–
2 </s>
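
(For reference, the round-trip experiment described above, i.e. checking whether each vocabulary piece tokenizes back to its own single id, can be sketched as below. The count_non_roundtrip helper, the tokenize_fn callback and the toy tokenizer in main are invented for illustration; the gist linked above remains the source of the 1631 figure, and you would plug in the real tokenizer and the 32000-entry vocabulary to reproduce it.)

// Sketch of the "does every vocab token tokenize to itself?" check.
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

using tokenize_fn = std::function<std::vector<int>(const std::string &)>;

static int count_non_roundtrip(const std::vector<std::string> & id_to_piece,
                               const tokenize_fn & tokenize) {
    int mismatches = 0;
    for (int id = 0; id < (int) id_to_piece.size(); ++id) {
        const std::vector<int> toks = tokenize(id_to_piece[id]);
        if (toks.size() != 1 || toks[0] != id) {
            ++mismatches; // this piece is never produced by the merge procedure
        }
    }
    return mismatches;
}

int main() {
    // toy demonstration: a "tokenizer" that only knows single characters fails
    // the round trip for every multi-character piece
    std::vector<std::string> id_to_piece = { "a", "b", "ab", "abc" };
    tokenize_fn tokenize = [&](const std::string & text) {
        std::vector<int> out;
        for (char c : text) {
            out.push_back(c == 'a' ? 0 : 1); // characters map to "a"/"b" only
        }
        return out;
    };
    printf("%d pieces do not tokenize to themselves\n",
           count_non_roundtrip(id_to_piece, tokenize)); // prints 2 here
    return 0;
}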

@vjeux commented Jun 23, 2023

@bobakfb asked ChatGPT to find the differences between this algorithm and the SentencePiece one: https://github.com/google/sentencepiece/blob/master/src/bpe_model.cc

Both of these algorithms appear to be implementations of the Byte-Pair Encoding (BPE) algorithm, which is often used for tokenizing text in natural language processing (NLP) tasks.

The key difference between the two is the post-processing step, specifically how they handle symbols or characters that were not matched to any token in the vocabulary.

Here are the notable differences:

Post-processing:
In the first algorithm (which is implemented within the SentencePiece framework), if a symbol sequence doesn't exist in the vocabulary, the algorithm performs a recursive process of segmentation (resegmentation), trying to break down the sequence into smaller subwords or symbols that are present in the vocabulary. This means it attempts to encode unrecognized sequences to known pieces.

In the second algorithm (the llama_tokenizer), if a symbol sequence is not found in the vocabulary, it treats each individual character as a separate token and outputs their corresponding byte values (plus 3, as the first three positions seem reserved for special tokens). This approach treats unrecognized sequences as a series of individual characters.

It looks correct. There's a "resegment" step at the end of the sentencepiece algorithm ( https://github.com/google/sentencepiece/blob/master/src/bpe_model.cc#L175-L200 ) that isn't present in this implementation (

llama.cpp/llama.cpp, lines 1870 to 1883 at commit 7487137:
for (int i = 0; i != -1; i = symbols_[i].next) {
    auto & symbol = symbols_[i];
    auto token = vocab_.token_to_id.find(std::string(symbol.text, symbol.n));
    if (token == vocab_.token_to_id.end()) {
        // output any symbols that did not form tokens as bytes.
        for (int j = 0; j < (int) symbol.n; ++j) {
            llama_vocab::id token_id = static_cast<uint8_t>(symbol.text[j]) + 3;
            output.push_back(token_id);
        }
    } else {
        output.push_back((*token).second);
    }
}
).

So we should add it as well.
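
(A hedged sketch of what a resegment step could look like, following the recursive idea in SentencePiece's bpe_model.cc linked above: each merge records how the merged span was split, so a piece that is not itself a vocabulary token can be broken back down into known sub-pieces before falling back to bytes. The bpe_state struct, the rev_merge map and the toy vocabulary are illustrative only, not the actual llama.cpp code.)

// Sketch of a SentencePiece-style "resegment" step.
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct bpe_state {
    std::map<std::string, int>                                 token_to_id; // vocabulary
    std::map<std::string, std::pair<std::string, std::string>> rev_merge;   // merged -> (left, right)
};

// Recursively break a piece back down into vocabulary tokens; fall back to
// byte tokens (byte value + 3, as in the snippet above) only at the leaves.
static void resegment(const bpe_state & st, const std::string & piece, std::vector<int> & output) {
    auto tok = st.token_to_id.find(piece);
    if (tok != st.token_to_id.end()) {
        output.push_back(tok->second);
        return;
    }
    auto rm = st.rev_merge.find(piece);
    if (rm == st.rev_merge.end()) {
        // not in the vocab and never produced by a merge: emit raw bytes
        for (unsigned char c : piece) {
            output.push_back((int) c + 3);
        }
        return;
    }
    resegment(st, rm->second.first,  output);
    resegment(st, rm->second.second, output);
}

int main() {
    bpe_state st;
    st.token_to_id = { { "▁", 10 }, { "–", 20 }, { "x", 30 } }; // toy ids
    // suppose encoding merged "▁" and "–" into "▁–", which itself is not a token
    st.rev_merge["▁–"] = { "▁", "–" };

    std::vector<int> out;
    resegment(st, "▁–", out); // splits back into the two known pieces
    for (int id : out) {
        printf("%d ", id); // prints 10 20 with this toy vocab
    }
    printf("\n");
    return 0;
}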

@slaren slaren mentioned this pull request Jun 27, 2023
This was referenced Jul 21, 2023
AAbushady pushed a commit to AAbushady/llama.cpp that referenced this pull request Jan 27, 2024
…F16/KQuants per iter. (ggerganov#252)

* Fix hordeconfig maxcontext setting.

* cuda: Bring DMMV_F16 and KQUANTS_ITER Makefile flags over from llama.