Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tokenization is not equal to Meta's tokenization. #2310

Closed
viniciusarruda opened this issue Jul 21, 2023 · 24 comments
Closed

Tokenization is not equal to Meta's tokenization. #2310

viniciusarruda opened this issue Jul 21, 2023 · 24 comments
Labels
help wanted Extra attention is needed high priority Very important issue

Comments

@viniciusarruda
Copy link

I'm comparing the tokenization between original Meta repo and llama.cpp with LLaMA (also had same issue with LLaMA v2).

For example, tokenizing the prompt "Hello world" and " Hello world" gives the following:

For prompt "Hello world":
llama.cpp tokenizer: [10994, 3186]
Meta tokenizer: [15043, 3186]

For prompt " Hello world":
llama.cpp tokenizer: [15043, 3186]
Meta tokenizer: [29871, 15043, 3186]

Exploring the tokens, doing the detokenization, I got:

For tokens "[10994, 3186]":
llama.cpp tokenizer: |b'Hello world'|
Meta tokenizer: |Hello world|

For tokens "[15043, 3186]":
llama.cpp tokenizer: |b' Hello world'|
Meta tokenizer: |Hello world|

For tokens "[29871, 15043, 3186]":
llama.cpp tokenizer: |b' Hello world'|
Meta tokenizer: | Hello world|

*Adding | to ease visualization.

Exploring each token above with the id_to_piece functionality:

Looking the id_to_piece for llama.cpp:
id 10994 |b'Hello'|
id 3186 |b' world'|
id 15043 |b' Hello'|
id 29871 |b' '|

Looking the id_to_piece for Meta:
id 10994 |Hello|
id 3186 |▁world|
id 15043 |▁Hello|
id 29871 |▁|

*Adding | to ease visualization.
Note, the 29871 token is not the underline character but "\u2581" (See more about this here).

But, using the detokenizer in each id individually:

Using the llama.cpp detokenizer:
id 10994 |b'Hello'|
id 3186 |b' world'|
id 15043 |b' Hello'|
id 29871 |b' '|

Using the Meta detokenizer:
id 10994 |Hello|
id 3186 |world|
id 15043 |Hello|
id 29871 ||

The code used to produce this results can be seen here.
Use this file for the Meta tokenizer.
The model ggml-model-f16.bin is the 7B LLaMA model after using the convert.py script as mentioned here.

@ggerganov ggerganov added help wanted Extra attention is needed high priority Very important issue labels Jul 21, 2023
@goerch
Copy link
Collaborator

goerch commented Jul 21, 2023

There are known problems with the tokenizer,

@jxy
Copy link
Contributor

jxy commented Jul 22, 2023

because Tokenizer.tokenize() always add a space in front of the string passed in.

29871 is the space

@jxy
Copy link
Contributor

jxy commented Jul 22, 2023

just to explain what I meant by "space", see here: https://github.com/google/sentencepiece/blob/master/README.md#whitespace-is-treated-as-a-basic-symbol

@goerch
Copy link
Collaborator

goerch commented Jul 22, 2023

because Tokenizer.tokenize() always add a space in front of the string passed in.

I don't think it does that, because I just tested it here. But may be this happens elsewhere.

@goerch
Copy link
Collaborator

goerch commented Jul 22, 2023

For example, tokenizing the prompt "Hello world" and " Hello world" gives the following:

For prompt "Hello world":
llama.cpp tokenizer: [10994, 3186]
Meta tokenizer: [15043, 3186]
For prompt " Hello world":
llama.cpp tokenizer: [15043, 3186]
Meta tokenizer: [29871, 15043, 3186]

Running the tests I see the Meta tokens now. But I surely need guidance on how to integrate the necessary changes into llama.cpp .

BTW: I'd expect a couple of more problems for other tests.

goerch added a commit to goerch/llama.cpp that referenced this issue Jul 23, 2023
Waiting for the fallout ...
@goerch
Copy link
Collaborator

goerch commented Jul 23, 2023

@viniciusarruda and others ; nice find, indeed! Now that I've learned a bit about the tokenizer: this kind of problem could not only occur with llama but with derived models using different tokenizers (w.r.t. vocabulary for example) ? But the common theme here is sentencepiece compatibility? You made a big step with your notebook (which I used for testing already, thanks!), could you advise us regarding our future testing strategy?

@callMeMakerRen
Copy link

i am curious, from openai repo, How_to_count_tokens_with_tiktoken,it says Encodings specify how text is converted into tokens. Different models use different encodings,so i wonder if we are on the right way? did we only need to match the meta?

@henk717
Copy link

henk717 commented Jul 24, 2023

I'd like to add to this discussion that the real llama tokenizer eats spaces at the beginning making it very bad for continuous generation. Llamacpp's tokenizer was previously tested to give the same results as the HF tokenizer with our workaround. So for continuous generation the behavior might be better. If this is changed in the code it would be nice if this would be an optional change since if the tokenizer begins eating spaces this breaks use cases.

The tokenizer eating spaces is only desirable for applications with a seperate output window.

@goerch
Copy link
Collaborator

goerch commented Jul 24, 2023

I'd like to add to this discussion that the real llama tokenizer eats spaces at the beginning making it very bad for continuous generation. Llamacpp's tokenizer was previously tested to give the same results as the HF tokenizer with our workaround.

Do you have a possible test scenario for the proposed PR in mind? I don't know how to check that otherwise.

@henk717
Copy link

henk717 commented Jul 24, 2023

Yes, and the upstream tokenizer workarounds are not perfect so I will give you a method to test the ideal scenario.
Our UI strips a space at the end of a generation because most models can not handle it well if its already there because their tokens often have spaces in front of them, so we let the model decide if there should be a space or not.

So imagine a sentence that does not end with a space such as "This is an example." in a UI that lets users add to that text. The correct scenario would be this generation " It works very well." which combined forms "This is an example. This works very well."

The original Llama tokenizer at least on HF doesn't do this because they trained sentencepiece in a mode where it swallows the first space. So you'd get this "It works very well." resulting in "This is an example.It works very well".

Likewise you also don't want to always insert the space if you don't have to, "My favorite sport is foot" should auto complete to "ball and I play it every week" not " ball and I play it every week".

Adhering to that behavior would make it usable for continuous generation use cases because then the text flows naturally and the tokenizer picks the most appropriate token for the end result, rather than never picking a space or always picking a space.

On the huggingface side we force the tokenizers hand, we insert a random token such as a comma at the front of the input, and then after its tokenized we remove the first fake token which results in it doing a proper continuous generation. The behavior we typically observed from Llamacpp is very similar to that workaround. And we originally implemented workarounds like this because Llamacpp did a better job and its helped us in the past to identify things were wrong.

@goerch
Copy link
Collaborator

goerch commented Jul 24, 2023

Ad 1.

.\build\bin\Release\main.exe -m models\llama-7B\ggml-model-q4_0.bin -p "This is an example." --temp 0 --color

With main:

 This is an example. It's a bit of a mess, but it shows how to use the new features in the latest version of the library.
The first thing you need to do is create a new project and add the library to it. You can find out more about this here.

With PR:

 This is an example. It's a bit of a mess, but it shows how to use the new features in the latest version of the library.
The first thing you need to do is create a new project and add the library to it. You can find out more about this here.
[...]

Ad 2.

.\build\bin\Release\main.exe -m models\llama-7B\ggml-model-q4_0.bin -p "My favorite sport is foot" --temp 0 --color

With main:

 My favorite sport is foot ball. I like to play it and watch it on TV.
I like to play football, too. It's my favourite sport. [end of text]

With PR:

 My favorite sport is foot ball. I like to play it and watch it on TV.
I like to play football, too. It's my favourite sport. [end of text]

@goerch
Copy link
Collaborator

goerch commented Jul 24, 2023

I'm comparing the tokenization between original Meta repo and llama.cpp with LLaMA (also had same issue with LLaMA v2).

I believe to understand the current llama.cpp tokenizer compromises between LLaMA compatibility and the ability to process similar LLMs. I also think we can increase compatibility (slightly?), but that will need some more testing for the other LLMs?

@goerch
Copy link
Collaborator

goerch commented Jul 25, 2023

Maybe this warrants further investigation. I just ran

.\build\bin\release\perplexity.exe -m models\llama-7B\ggml-model-q4_0.bin -f datasets\wikitext-2-raw\wiki.test.raw --chunks 3
.\build\bin\release\perplexity.exe -m models\llama-7B\ggml-model-q4_1.bin -f datasets\wikitext-2-raw\wiki.test.raw --chunks 3
.\build\bin\release\perplexity.exe -m models\llama-7B\ggml-model-q5_0.bin -f datasets\wikitext-2-raw\wiki.test.raw --chunks 3
.\build\bin\release\perplexity.exe -m models\llama-7B\ggml-model-q5_1.bin -f datasets\wikitext-2-raw\wiki.test.raw --chunks 3
.\build\bin\release\perplexity.exe -m models\llama-7B\ggml-model-q8_0.bin -f datasets\wikitext-2-raw\wiki.test.raw --chunks 3

Main:

perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 71.03 seconds per pass - ETA 3 minutes
[1]4.4063,[2]4.9216,[3]5.8127,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 76.90 seconds per pass - ETA 3 minutes
[1]4.3950,[2]4.8454,[3]5.7376,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 102.27 seconds per pass - ETA 5 minutes
[1]4.2354,[2]4.7553,[3]5.6449,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 102.33 seconds per pass - ETA 5 minutes
[1]4.2480,[2]4.7375,[3]5.6165,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 121.01 seconds per pass - ETA 6 minutes
[1]4.2251,[2]4.7076,[3]5.5812,

With PR:

perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 72.39 seconds per pass - ETA 3 minutes
[1]4.4161,[2]4.9270,[3]5.8170,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 77.84 seconds per pass - ETA 3 minutes
[1]4.3436,[2]4.8170,[3]5.7152,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 105.67 seconds per pass - ETA 5 minutes
[1]4.2477,[2]4.7622,[3]5.6503,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 99.47 seconds per pass - ETA 4 minutes
[1]4.2295,[2]4.7272,[3]5.6083,
perplexity: calculating perplexity over 3 chunks, batch_size=512
perplexity: 120.07 seconds per pass - ETA 6 minutes
[1]4.2187,[2]4.7040,[3]5.5784,

@viniciusarruda
Copy link
Author

Found this which seems to fix at least the test cases I have. I don't know why and I didn't test setting bos to True.

I updated the code to produce the results by adding the b' '.


It might be helpful to understand the behavior of sentencepiece. In this notebook cell ("Load model from byte stream"), it includes a "▁" (U+2581) even before the first word. So, if the reverse mapping requires a space before the token, it will not work.

@goerch
Copy link
Collaborator

goerch commented Aug 21, 2023

@viniciusarruda : tokenizer fixes are in via the gguf branch. Does this work for you?

@goerch
Copy link
Collaborator

goerch commented Sep 4, 2023

In the meantime I tried to understand possible invariants for sentencepiece tokenizer with the following tests:

#define _CRT_SECURE_NO_WARNINGS

#include <cassert>
#include <iostream>
#include <algorithm>

#include "sentencepiece_processor.h"

#include "console.h"
#include "utf.h"

std::string random_string(size_t length) {
    auto randchar = []() -> char {
        const char charset[] =
        " \t\n"
        "0123456789"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz";
        const size_t max_index = (sizeof(charset) - 1);
        return charset[rand() % max_index];
    };
    std::string str(length, 0);
    std::generate_n(str.begin(), length, randchar);
    return str;
}

int main(int, char**) {
    console_init();
    auto spp = sentencepiece::SentencePieceProcessor();
    auto result = spp.Load("lama/llama-7B/tokenizer.model");
    if (result.ok()) {
        for(codepoint cp = 0x0000; cp <= 0xffff; ++ cp) {
            if(cp < 0xd800 || cp > 0xdfff) {
                auto utf8 = to_utf8(cp);
                auto ids = spp.EncodeAsIds(utf8);
                auto check = spp.DecodeIds(ids);
                if(utf8 != check)
                    std::cout << cp << " >" << utf8 << "< >" << check << "<" << std::endl;
            }
        }
        for(codepoint cp = 0x10000; cp <= 0x10ffff; ++ cp) {
            auto utf8 = to_utf8(cp);
            auto ids = spp.EncodeAsIds(utf8);
            auto check = spp.DecodeIds(ids);
            if(utf8 != check)
                 std::cout << cp << " >" << utf8 << "< >" << check << "<" << std::endl;
        }
    }
    for(auto i = 0; i < 1024; ++ i) {
        auto str = random_string(64);
        for(auto j = 0; j < str.length(); ++ j) {
            auto ids1 = spp.EncodeAsIds(str.substr(0, j));
            auto ids2 = spp.EncodeAsIds(str.substr(j));
            auto str1 = spp.DecodeIds(ids1);
            auto str2 = spp.DecodeIds(ids2);
            if (str1 + str2 != str)
                 std::cout << "1: >" << str1 << "< >" << str2 << "< >" << str << "<" << std::endl;
        }
    }
    for(auto i = 0; i < 1024; ++ i) {
        auto str = random_string(64);
        auto ids = spp.EncodeAsIds(str);
        for(auto j = 0; j < ids.size(); ++j) {
            auto str1 = spp.DecodeIds(std::vector(ids.begin(), ids.begin() + j));
            auto str2 = spp.DecodeIds(std::vector(ids.begin() + j, ids.end()));
            if (str1 + str2 != str)
                 std::cout << "2: >" << str1 << "< >" << str2 << "< >" << str << "<" << std::endl;
        }
    }
    return 0;
}

What would you expect to see?

9601 >▁< > < 2: >tcNSM8fcwo< >nmDL5G 2xEwwnFiX9zbp4E XZlKuKbViNTGRNMcd yI31iwR6LKa6< >tcNSM8fcwo nmDL5G 2xEwwnFiX9zbp4E XZlKuKbViNTGRNMcd yI31iwR6LKa6< [snip]

When rechecking test-tokenizer-1 I noticed someone decided to let these tests run only for BPE tokenizers and enabling them for sentencepiece based ones seems broken.

@ggerganov
Copy link
Owner

When I added the BPE tokenizer for Falcon, I disabled test-tokenizer-1 since my understanding was that it is supposed to be used only with BPE tokenizers and the current implementation of the BPE tokenizer is known to not work. Didn't think it was supposed to work with SPM

@goerch
Copy link
Collaborator

goerch commented Sep 5, 2023

When I added the BPE tokenizer for Falcon, I disabled test-tokenizer-1 since my understanding was that it is supposed to be used only with BPE tokenizers and the current implementation of the BPE tokenizer is known to not work. Didn't think it was supposed to work with SPM

A fast moving project, I see. I'm working on fixes, now that I have some kind of test bed. But this looks like a tough one, so don't expect results immediately.

@ggerganov
Copy link
Owner

@goerch Thank you for looking into it. Things are indeed moving fast and mistakes can happen.

The top priority is having the SPM tokenizer work 100% correct. After #2810 I thought we have achieved this, but based on your findings, it might not be the case yet. The test-tokenizer-0 should be expanded with examples where we currently fail. The test-tokenizer-0.py script can be used to generated ground truth.

@goerch
Copy link
Collaborator

goerch commented Sep 9, 2023

@ggerganov : I believe some of the remaining issues could be related to this comment. @howard0su , @klosax : AFAIU nobody looked into this so far, any reason for it?

@goerch
Copy link
Collaborator

goerch commented Sep 13, 2023

The test-tokenizer-0 should be expanded with examples where we currently fail. The test-tokenizer-0.py script can be used to generated ground truth.

Nice. I'll try this one next to cover the Unicode character set (that should at least demonstrate my 'high byte'(>0xff) problem) . Any thoughts on Unicode normalization as mentioned here?

@ggerganov
Copy link
Owner

Let's create a PR with a few failing cases and will try to give it more visibility so we get some help with it. I'm afraid I'm out of my depth here and won't be able to understand how the normalization works and what it takes to implement it, so hopefully people familiar with it will help out

@goerch goerch mentioned this issue Sep 14, 2023
34 tasks
@goerch
Copy link
Collaborator

goerch commented Sep 14, 2023

Nice. I'll try this one next to cover the Unicode character set (that should at least demonstrate my 'high byte'(>0xff) problem) . Any thoughts on Unicode normalization as mentioned here?

I was mistaken: testing high byte output with

for tok in range(0, 3 + 256):
    print("'" + tokenizer.decode([tok]) + "'")

and

    for (int i = 0; i < 3 + 256; ++i) {
        std::string str = llama_detokenize_spm(ctx, std::vector<int>(1, i));
        fprintf(stdout, "%s\n", str.c_str());
    }

shows identical output.

@goerch
Copy link
Collaborator

goerch commented Sep 19, 2023

If @viniciusarruda and @ggerganov don't disagree : I think we should close this issue and wait for possible Unicode normalization issues to show up (otherwise I need help to design a test case).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed high priority Very important issue
Projects
None yet
Development

No branches or pull requests

6 participants