convert.py : handle special tokens #2820

Closed
ggerganov opened this issue Aug 26, 2023 · 44 comments
Labels
enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@ggerganov
Owner

Here we need to start handling special tokens in convert.py:

llama.cpp/convert.py

Lines 790 to 800 in e4324cb

def add_meta_vocab(self, vocab: Vocab) -> None:
    tokens = []
    scores = []
    toktypes = []
    # NOTE: `all_tokens` returns the base vocabulary and added tokens
    # TODO: add special tokens?
    for text, score, toktype in vocab.all_tokens():
        tokens.append(text)
        scores.append(score)
        toktypes.append(toktype)

An example is shown in convert-llama-7b-pth-to-gguf.py:

if Path(dir_model + "/tokenizer.json").is_file():
    # Look for special tokens in tokenizer.json if it exists
    with open(dir_model + "/tokenizer.json", "r", encoding="utf-8") as f:
        tokenizer = json.load(f)

    if "added_tokens" in tokenizer and Path(dir_model + "/tokenizer_config.json").is_file():
        with open(dir_model + "/tokenizer_config.json", "r", encoding="utf-8") as f:
            tokenizer_config = json.load(f)

        if "bos_token" in tokenizer_config and tokenizer_config["bos_token"] != None:
            for key in tokenizer["added_tokens"]:
                if key["content"] == tokenizer_config["bos_token"]["content"]:
                    gguf_writer.add_bos_token_id(key["id"])

        if "eos_token" in tokenizer_config and tokenizer_config["eos_token"] != None:
            for key in tokenizer["added_tokens"]:
                if key["content"] == tokenizer_config["eos_token"]["content"]:
                    gguf_writer.add_eos_token_id(key["id"])

        if "unk_token" in tokenizer_config and tokenizer_config["unk_token"] != None:
            for key in tokenizer["added_tokens"]:
                if key["content"] == tokenizer_config["unk_token"]["content"]:
                    gguf_writer.add_unk_token_id(key["id"])

        if "sep_token" in tokenizer_config and tokenizer_config["sep_token"] != None:
            for key in tokenizer["added_tokens"]:
                if key["content"] == tokenizer_config["sep_token"]["content"]:
                    gguf_writer.add_sep_token_id(key["id"])

        if "pad_token" in tokenizer_config and tokenizer_config["pad_token"] != None:
            for key in tokenizer["added_tokens"]:
                if key["content"] == tokenizer_config["pad_token"]["content"]:
                    gguf_writer.add_pad_token_id(key["id"])
else:
    # If no tokenizer.json: Look for special tokens in config.json
    if "bos_token_id" in hparams and hparams["bos_token_id"] != None:
        gguf_writer.add_bos_token_id(hparams["bos_token_id"])

    if "eos_token_id" in hparams and hparams["eos_token_id"] != None:
        gguf_writer.add_eos_token_id(hparams["eos_token_id"])

    if "unk_token_id" in hparams and hparams["unk_token_id"] != None:
        gguf_writer.add_unk_token_id(hparams["unk_token_id"])

    if "sep_token_id" in hparams and hparams["sep_token_id"] != None:
        gguf_writer.add_sep_token_id(hparams["sep_token_id"])

    if "pad_token_id" in hparams and hparams["pad_token_id"] != None:
        gguf_writer.add_pad_token_id(hparams["pad_token_id"])
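
(For reference, when this logic gets ported into convert.py it could be collapsed into a loop. The helper below is only a sketch under the assumptions visible in the snippet above: a gguf_writer exposing the add_*_token_id methods, plus the already-loaded tokenizer and tokenizer_config dicts. It is not existing code in the repo.)

# Hypothetical helper (not existing code): collapse the repeated blocks above
# into one loop over the special token types that have add_*_token_id writers.
SPECIAL_TOKEN_TYPES = ("bos", "eos", "unk", "sep", "pad")

def write_special_token_ids(gguf_writer, tokenizer, tokenizer_config):
    for name in SPECIAL_TOKEN_TYPES:
        entry = tokenizer_config.get(f"{name}_token")
        if entry is None:
            continue
        # tokenizer_config entries may be plain strings or dicts with a "content" field
        content = entry["content"] if isinstance(entry, dict) else entry
        for added in tokenizer.get("added_tokens", []):
            if added["content"] == content:
                # dispatches to add_bos_token_id(), add_eos_token_id(), ...
                getattr(gguf_writer, f"add_{name}_token_id")(added["id"])
                break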

ggerganov added the enhancement (New feature or request) and good first issue (Good for newcomers) labels on Aug 26, 2023
@KerfuffleV2
Collaborator

I can look at this after #2753 (hopefully) gets merged. I was planning on doing some cleanup work and fixing the type annotations, seems like the kind of thing that would be reasonable to throw into that kind of pull as well.

@ggerganov
Owner Author

I think we need to use a model that utilizes special tokens to test this with. I see people mentioning "OpenChat V2 x OpenOrca" when they need to handle special tokens - maybe we can try to make those work

@KerfuffleV2
Collaborator

I think we need to use a model that utilizes special tokens to test this with.

Using a model with special tokens to test handling special tokens is an idea just crazy enough to work!

@klosax
Contributor

klosax commented Aug 26, 2023

For BPE to work with llama models (Aquila?) convert.py should also add the merges like it is done in the falcon conversion script.
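
A rough sketch of what that could look like on the convert.py side, assuming the Hugging Face tokenizer.json layout (merges stored under model.merges) and assuming the GGUF writer exposes an add_token_merges() method like the Falcon conversion path uses:

# Sketch only: read BPE merges from an HF-style tokenizer.json and hand them to
# the GGUF writer (writer method name assumed from the Falcon conversion script).
import json
from pathlib import Path

def write_bpe_merges(gguf_writer, model_dir: Path) -> None:
    tokenizer_json = model_dir / "tokenizer.json"
    if not tokenizer_json.is_file():
        return
    with open(tokenizer_json, encoding="utf-8") as f:
        tokenizer = json.load(f)
    # BPE "fast" tokenizers store merges as a list of "left right" strings
    merges = tokenizer.get("model", {}).get("merges", [])
    if merges:
        gguf_writer.add_token_merges(merges)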

@KerfuffleV2
Collaborator

In progress over here: #2842

@ggerganov
Owner Author

The next step is using the special tokens in llama.cpp - any ideas what needs to be done?

My guess is we need to just update the id_to_token and token_to_id maps:

llama.cpp/llama.cpp

Lines 947 to 950 in dc07dc4

std::unordered_map<token, id> token_to_id;
std::vector<token_data> id_to_token;

@KerfuffleV2
Collaborator

I'm not sure where discussion about this should be.

For BPE to work with llama models (Aquila?)

I've been doing some testing with https://huggingface.co/BAAI/Aquila-7B and https://huggingface.co/kfkas/Llama-2-ko-7b-Chat trying to get the BPE stuff to work.

First, it seems like all these BPE models just die in llama.cpp without #2889. Little surprised that pull has gotten no attention so far.

It also seems like the stuff in convert.py is still pretty far off even with merges being handled now. I started to try to fix some stuff in #2938

@rajveer43

Is it available for working?

@ggerganov
Owner Author

We now have to use these special tokens in llama.cpp

Can somebody confirm that the following is correct:

  • we load the following special tokens (e.g. open llama):
{
	"bos_token": {
		"content": "<s>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	},
	"eos_token": {
		"content": "</s>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	},
	"unk_token": {
		"content": "<unk>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	}
}
  • we now tokenize the following string <s>hello world</s>
  • the result is that <s> and </s> are no longer tokenized as strings, but instead they are tokenized to the special tokens BOS and EOS. So we get for example the tokens: [1, 22172, 3186, 2]

@KerfuffleV2
Collaborator

I don't know what behavior is considered correct, but it seems like in that particular case it means you can't talk about HTML strikethrough tags anymore, i.e. a prompt like "Dear LLaMA model, please make a list according to such and such rules. Surround elements that meet a certain criteria with strikethrough like <s>item</s>." You'll pretty much immediately get nonsense if <s> and </s> are tokenized to BOS/EOS; the same applies whenever the special tokens can conflict with something else that could plausibly appear in a prompt.

It's less of an issue when the special tokens are something like <|endoftext|>, since that is less likely to be something a user would write.

@klosax
Contributor

klosax commented Sep 3, 2023

Normally the special token strings should not be recognized as special tokens in the user prompt. Better to have a CLI parameter for users who need to use them. Instead of using the model vocab these tokens should be user-configurable. Something like --bos-token "<|my-bos-token|>" should work.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 3, 2023

What if it was something like --set-token "1=<|my-bos-token|>"? Then there would be a general facility to override any token id, sort of like setting the logit overrides. (Maybe --set-token isn't great, could be --override-token, --assign-token, whatever.)

@klosax
Contributor

klosax commented Sep 3, 2023

Then there would be a general facility to override any token id, sort of like setting the logit overrides.

The token ids for the special tokens may differ from model to model, as may the default mapping strings. So any external use of the special tokens should not depend on knowing the token ids.

@klosax
Contributor

klosax commented Sep 3, 2023

In addition, a CLI parameter for enabling or disabling printing the special tokens in the output from the model would be good.

@KerfuffleV2
Collaborator

So any external use of the special tokens should not depend on knowing the token ids.

Decent point. What I was talking about could still work with a small adaptation so that you could use names like bos, unk, etc. in addition to ids, e.g. --override-token "bos=<|my-bos-token|>".

@klosax
Contributor

klosax commented Sep 3, 2023

What I was talking about could still work with a small adaptation that you could use stuff like bos, unk, etc in addition to ids.

Yes, that could also work.

Here is a snippet of the tinystories dataset. To correctly tokenize this dataset independent of model, a parameter for setting the EOS token is needed.

@l3utterfly
Contributor

Hi, is handling special tokens working in the latest master branch? I tested with https://huggingface.co/openchat/openchat_v3.2_super

It doesn't seem to work. I added a print and exit in convert.py to log SpecialVocab; the special tokens don't seem to be picked up yet.

@ggerganov
Owner Author

ggerganov commented Sep 13, 2023

I need to understand how special tokens work. If they are not parsed during prompt processing, then I don't understand what their purpose is at all.

@l3utterfly
Contributor

l3utterfly commented Sep 15, 2023

From my understanding:

Special tokens are used in finetunes to provide better structure in LLM's output.

  1. They are custom-defined for each finetune (for example, the OpenChat finetune uses the <|end_of_turn|> token after each person in a conversation), so they are guaranteed not to be present in the base model.
  2. Training on data formatted to use these tokens will generally give better results, because the model will know to activate weights that are related to the finetune when it sees the special tokens as part of the input. This coerces the model into output structures more related to the training format.
  3. It helps end-user applications in parsing the output of the LLM. For example, when any BOS or EOT (end of turn) token is hit, the end-user application can apply logic such as stopping the output and waiting for more input. Kind of the same way as how "reverse prompt" works in llama.cpp, but more generalised. For example, "pass" tokens can be used to "pass" the conversation to other agents in multi-agent conversations.
  4. Users can also use the special tokens as part of their prompt. A tokeniser that supports the special tokens will automatically parse them correctly. For example, a prompt could be: User: Hello<|end_of_turn|>Assistant: (see the sketch at the end of this comment).

Special tokens are defined by the organisers of each dataset during their finetune respectively, so what their uses are depends. So I think it's a good feature to support arbitrary special tokens in the llama.cpp convert script by reading added_tokens.json and adding them to the GGUF. Users of those finetunes will know how to use the special tokens at their end as long as those tokens are outputted by the LLM.

A drawback of the special tokens is that yes, when defined thoughtlessly, they will conflict with the output, as in the case of </s>, which means the model cannot talk about HTML strikethroughs. This tradeoff is usually handled by the finetune-ers themselves. As in the example of OpenChat again, the <|end_of_turn|> token is chosen so the probability of it coming up in conversations is astronomically low, and the finetune-ers consider that acceptable.
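
To illustrate point 4 above: this is just an illustration with the transformers library (not llama.cpp code), using the model mentioned earlier in this thread; a tokenizer that knows about <|end_of_turn|> keeps it as a single token instead of splitting it into <, |, end, ... pieces.

# Illustration only: requires the `transformers` package and assumes the model's
# tokenizer defines <|end_of_turn|> as an added/special token.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openchat/openchat_v3.2_super")
ids = tok.encode("User: Hello<|end_of_turn|>Assistant:", add_special_tokens=False)
print(tok.convert_ids_to_tokens(ids))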

@ggerganov
Owner Author

@l3utterfly Thank you - I think this description gives an answer to my question earlier.

So based on this, I think the only part that is currently missing is to add the special token pieces (i.e. the text such as <s>, <|end_of_turn|>, etc.) to the KEY_TOKENIZER_LIST before writing the vocab with gguf.py, and it should work.

@KerfuffleV2 Are you interested in looking into this? Probably just SpecialVocab::add_to_gguf has to be updated
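
Conceptually, something along these lines on the Python side; a minimal sketch of the idea (merge_added_tokens is a hypothetical helper, not the actual gguf.py/SpecialVocab API beyond the names already quoted in this thread), so that pieces such as <s> or <|end_of_turn|> actually appear in the written token list at their ids:

# Hypothetical sketch: ensure added/special token pieces end up in the token
# list that gets written to the GGUF, at the ids the tokenizer assigns them.
def merge_added_tokens(base_tokens: list[bytes], added: dict[str, int]) -> list[bytes]:
    tokens = list(base_tokens)
    for text, idx in sorted(added.items(), key=lambda kv: kv[1]):
        piece = text.encode("utf-8")
        if idx < len(tokens):
            tokens[idx] = piece  # replace the placeholder entry at that id
        else:
            # pad any gap, then place the added token at its id
            tokens.extend(b"" for _ in range(idx - len(tokens)))
            tokens.append(piece)
    return tokens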

To test this, after updating gguf.py and converting a model that has the following special tokens:

{
	"bos_token": {
		"content": "<s>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	},
	"eos_token": {
		"content": "</s>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	},
	"unk_token": {
		"content": "<unk>",
		"lstrip": false,
		"normalized": true,
		"rstrip": false,
		"single_word": false
	}
}

main should tokenize the string Hello world</s> as [1, 22172, 3186, 2].

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 15, 2023

Sure, I can look at it. Not sure I'm 100% clear on what needs to happen on the conversion side, so I may need to ask some follow-up questions. I'll see what I can figure out on my own first.

edit: I think it's definitely going to be a lot more complicated than just changing add_to_gguf, though. That function doesn't have access to the full vocab list; it just calls add_bos_token_id, etc. Also, SpecialVocab only handles a fixed list of special tokens like bos that have an add_BLAH_token_id function in GGUFWriter, but presumably we want to support arbitrary special tokens that may not fall into that set.

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 15, 2023

@ggerganov Actually, I'm confused. We already write the text content for special tokens like BOS and llama.cpp seems to already know what the content is for the tokens. For example, when starting up:

llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'

So I think the issue is that the C++ tokenizer side is not using the token content of tokens like BOS when tokenizing, rather than this being something that could be fixed on the model conversion side. Or am I misunderstanding something?

edit: Not sure if it's significant for this, but BOS and EOS get added with token type control (3) and UNK gets added with token type unknown (2). Possibly that's why they're getting ignored when tokenizing.

@ggerganov
Owner Author

We already write the text content for special tokens like BOS and llama.cpp seems to already know what the content is for the tokens.

Ah, I guess the Python classes somehow already took care of that. Then I think we are done and special tokens should already work. Can you check how Hello world</s> tokenizes with main --verbose-prompt?

@KerfuffleV2
Collaborator

KerfuffleV2 commented Sep 15, 2023

It tokenizes like:

     1 -> ''
 16644 -> ' Hello'
   924 -> ' world'
  1089 -> '</'
 31829 -> 's'
 31901 -> '>'

I've been messing around with the C++ side and I don't really understand what's going on. I thought maybe it was because the </s> token had a score of 0.0 but setting it to 10000000.0 doesn't do anything. Setting its token type to 1 (normal) doesn't do anything when tokenizing either (setting BOS to normal makes it render as <s> though).

Dumping the text in llama_tokenizer_spm::tokenize looks like:

-- ▁Hello▁world</s>
-- Hello▁world</s>
-- ello▁world</s>
-- llo▁world</s>
-- lo▁world</s>
-- o▁world</s>
-- ▁world</s>
-- world</s>
-- orld</s>
-- rld</s>
-- ld</s>
-- d</s>
-- </s>
-- /s>
-- s>
-- >

I also added some debug prints to just before resegment gets called and in resegment:

0: 8 -- ▁Hello▁world</s>
** 16644 -- [▁Hello]
6: 8 -- ▁world</s>
** 924 -- [▁world]
12: 2 -- </s>
** 1089 -- [</]
14: 1 -- s>
** 31829 -- [s]
15: 1 -- >
** 31901 -- [>]
Patch
--- a/llama.cpp
+++ b/llama.cpp
@@ -3578,6 +3585,7 @@ struct llm_tokenizer_spm {
             llm_symbol sym;
             size_t len = utf8_len(text[offs]);
             sym.text = text.c_str() + offs;
+            printf("\n-- %s\n", text.c_str() + offs);
             sym.n = std::min(len, text.size() - offs);
             offs += sym.n;
             sym.prev = index - 1;
@@ -3624,6 +3632,7 @@ struct llm_tokenizer_spm {
 
         for (int i = 0; i != -1; i = symbols[i].next) {
             auto & symbol = symbols[i];
+            printf("%d: %zu -- %s\n", i, symbol.n, symbol.text);
             resegment(symbol, output);
         }
     }
@@ -3635,9 +3644,11 @@ private:
 
         // Do we need to support is_unused?
         if (token != vocab.token_to_id.end()) {
+            printf("** %d -- [%s]\n", token->second, text.c_str());
             output.push_back((*token).second);
             return;
         }
+        printf("!! [%s]\n", text.c_str());
 
         const auto p = rev_merge.find(text);

I also tried adding some debug output to try_add_bigram:

BIG: Found 0,1: 349 -- ▁H
BIG: Found 1,2: 4301 -- He
BIG: Found 2,3: 307 -- el
BIG: Found 3,4: 608 -- ll
BIG: Found 4,5: 4685 -- lo
BIG: Not found 5,6: o▁
BIG: Found 6,7: 271 -- ▁w
BIG: Found 7,8: 679 -- wo
BIG: Found 8,9: 272 -- or
BIG: Found 9,10: 13468 -- rl
BIG: Found 10,11: 395 -- ld
BIG: Not found 11,12: d<
BIG: Found 12,13: 1089 -- </
BIG: Not found 13,14: /s
BIG: Not found 14,15: s>
left = '▁world</s>' size = 4
BIG: Not found 5,6: o▁w
BIG: Not found 6,8: ▁wo
left = 'orld</s>' size = 2
BIG: Found 6,8: 456 -- ▁wor
BIG: Not found 8,10: orl
left = 'ello▁world</s>' size = 2
BIG: Found 1,2: 13588 -- Hel
BIG: Found 2,4: 452 -- ell
left = '▁Hello▁world</s>' size = 4
BIG: Bail: -1, 0
BIG: Found 0,2: 4161 -- ▁Hel
left = 'ld</s>' size = 2
BIG: Found 8,10: 12863 -- orld
BIG: Not found 10,12: ld<
left = 'ello▁world</s>' size = 3
BIG: Found 0,2: 10555 -- ▁Hell
BIG: Found 2,5: 7090 -- ello
left = '▁world</s>' size = 6
BIG: Not found 5,6: o▁wor
BIG: Found 6,10: 924 -- ▁world
left = '▁world</s>' size = 8
BIG: Not found 5,6: o▁world
BIG: Not found 6,12: ▁world<
left = '</s>' size = 2
BIG: Not found 6,12: ▁world</
BIG: Not found 12,14: </s
left = 'ello▁world</s>' size = 4
BIG: Found 0,2: 16644 -- ▁Hello
BIG: Not found 2,6: ello▁world
left = '▁Hello▁world</s>' size = 8
BIG: Bail: -1, 0
BIG: Not found 0,6: ▁Hello▁world

It doesn't look like it tried a combination with </s>. I don't really understand how that works so maybe that's expected.

@l3utterfly
Contributor

l3utterfly commented Sep 15, 2023

I took a look at the tokenising code in C++: llm_tokenizer_spm::tokenize. I'm not that familiar with the code, but from my limited understanding, it seems to be doing this:

  1. splitting the text into utf8 chars
  2. for each character, attempting to create a bi-gram out of it by combining two adjacent chars and looking for it in the vocab
  3. recursively combining bi-grams to look for longer matches in the vocab
  4. resegment seems to recursively go back up the tree finding matches (?). To be honest, I'm a little unclear on the purpose of this at the moment.

Debugging with the prompt Hello</s> (the BOS string is automatically added by main), I can see it splits the </s> because the logic identifies and merges the </ token and the s token first.

From my understanding, this bi-gram-focused tokenisation may skip over long tokens (tokens spanning multiple characters) because they may not merge correctly, perhaps because the shorter tokens identified within the long token just happen not to be divisible by two (it seems </s> gets split into 3 tokens).

My thought is to use a greedy search on the tokens (n-grams), attempting to match tokens starting from the longest possible length. Regarding the token vocab, we can use a retrieval tree (trie) for prefix matching to speed up the search.

struct TrieNode {
    bool is_end = false;
    std::unordered_map<char, std::unique_ptr<TrieNode>> children;
    llama_vocab::id token_id = -1; // only meaningful when is_end is true
};

class Trie {
public:
    TrieNode* root = new TrieNode();

    void insert(const std::string &word, llama_vocab::id id) {
        TrieNode* node = root;
        for (char c : word) {
            if (node->children.find(c) == node->children.end()) {
                node->children[c] = std::make_unique<TrieNode>();
            }
            node = node->children[c].get();
        }
        node->is_end = true;
        node->token_id = id;
    }

    std::pair<bool, llama_vocab::id> search(const std::string &word) {
        TrieNode* node = root;
        for (char c : word) {
            if (node->children.find(c) == node->children.end()) {
                return {false, -1};
            }
            node = node->children[c].get();
        }
        if (node->is_end) {
            return {true, node->token_id};
        }
        return {false, -1};
    }
};

The tokenize function would then be:

void tokenize(const std::string & text, std::vector<llama_vocab::id> & output) {
        Trie vocabTrie;

        // Populate trie with vocabulary
        for (const auto &pair : vocab.token_to_id) {
            const llama_vocab::token &token = pair.first;
            const llama_vocab::id &id = pair.second;
            vocabTrie.insert(token, id);
        }

        size_t pos = 0;
        while (pos < text.size()) {
            size_t max_len = 0;
            llama_vocab::id max_token_id = -1; // set whenever a match is found (max_len > 0)
            
            // Check all possible sub-strings starting from pos, favoring the longest possible tokens
            for (size_t len = text.size() - pos; len >= 1; --len) {
                std::pair<bool, llama_vocab::id> search_result = vocabTrie.search(text.substr(pos, len));
                if (search_result.first) {
                    max_len = len;
                    max_token_id = search_result.second;
                    break;
                }
            }
            
            if (max_len > 0) {
                output.push_back(max_token_id);
                pos += max_len;
            } else {
                // TODO: add logic to handle the case where no token is found, 
                // such as adding individual characters to the output or advancing by 
                // the length of the next UTF-8 character.
                pos += utf8_len(text[pos]); // advances by the length of the next UTF-8 character
            }
        }
    }

I tested with the prompt Hello</s>; it seems to tokenize correctly into:

main: prompt: 'Hello</s>'
main: number of tokens in prompt = 3
     1 -> ''
 15043 -> ' Hello'
     2 -> ''

Using a model that supports the EOS token, it correctly passes the conversation to the "Assistant" after </s> is reached.

A few things to note in my implementation of the tokeniser:

  1. This is a proof of concept I wrote up in a few hours after trying to understand the current tokeniser; the greedy search could be very inefficient here as it checks all possible substrings of the prompt
  2. I am not sure about the implications of this new tokeniser returning the longest possible token matches. Also, it seems @KerfuffleV2 got a different token from me for " Hello", but that could just be because we are testing with different models.
  3. My tokeniser doesn't handle utf8 chars at all at the moment
  4. My tokeniser ignores all invalid tokens at the moment
  5. It's constructing a new retrieval tree every time tokenise is called. We should probably make the retrieval tree the default way to store the vocab if this method is to go forward

I did a few short tests with my models, the coherence of the LLM output seems normal to me.

I am wondering if this is the right direction to head in? @ggerganov

@ggerganov
Owner Author

Looks like the right direction, although I'm not 100% sure as I don't have a deep understanding of how the tokenizer works.
It is important to make sure that test-tokenizer-0-llama and test-tokenizer-1-llama still pass after this change:

./bin/test-tokenizer-0-llama ../models/ggml-vocab-llama.gguf
./bin/test-tokenizer-1-llama ../models/ggml-vocab-llama.gguf

Tagging @goerch in case they might have some insight.

@ggerganov
Owner Author

I just remembered that some time ago #1931 was proposed, but the PR remained unmerged as it came during a big refactoring effort. It looks like @Igoorx proposed changes to the tokenizer to handle special tokens. Might be worth looking into that and resurrecting the PR.

@KerfuffleV2
Collaborator

Looks like that creates a special-token-to-id map and special-cases checking it: https://github.com/ggerganov/llama.cpp/pull/1931/files#diff-150dc86746a90bad4fc2c3334aeb9b5887b3adad3cc1459446717638605348efR2098-R2136

I guess it would be possible to take that approach without fully rewriting the tokenizer; see the sketch below. (Also, wasn't the tokenizer initially the greedy type a long time ago and then got changed, or am I remembering incorrectly?)
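
In concrete terms, that approach boils down to something like the following; shown here as a Python sketch of the idea rather than the actual C++ from that PR: partition the raw text on known special-token pieces first, emit their ids directly, and run the ordinary tokenizer only on the plain segments in between.

# Conceptual sketch of pre-splitting on special tokens before normal tokenization.
# `special_to_id` maps piece text (e.g. "</s>") to its token id; `tokenize_plain`
# stands in for the existing SPM/BPE tokenizer for ordinary text.
def tokenize_with_specials(text, special_to_id, tokenize_plain):
    output = []
    while text:
        hits = [(text.find(piece), piece) for piece in special_to_id]
        hits = [(pos, piece) for pos, piece in hits if pos != -1]
        if not hits:
            output.extend(tokenize_plain(text))
            break
        # earliest match wins; on ties, prefer the longest piece
        pos, piece = min(hits, key=lambda h: (h[0], -len(h[1])))
        if pos > 0:
            output.extend(tokenize_plain(text[:pos]))
        output.append(special_to_id[piece])
        text = text[pos + len(piece):]
    return output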

@goerch
Collaborator

goerch commented Sep 16, 2023

I also added some debug prints to just before resegment gets called and in resegment

I took resegment from here.

@ggerganov , @klosax : are we talking about sentencepiece or gpt2-like tokenization here (which I only tested once with unconvincing results)? Do we have a reference model for gpt2-like tokenization like Aquila or Baichuan already under test?

If we are talking about special tokens for sentencepiece, do you mean user-defined tokens or is this a different extension mechanism? And indeed, in that case we should try to revive #1931.

@goerch
Collaborator

goerch commented Sep 19, 2023

(which I only tested once with unconvincing results)

I looked into some of the open issues at #3252. @KerfuffleV2 : which models are we testing here?

@goerch
Collaborator

goerch commented Sep 20, 2023

I'm staring at the following code in BpeVocab:

    def __init__(self, fname_tokenizer: Path, fname_added_tokens: Path | None) -> None:
        self.bpe_tokenizer = json.loads(open(str(fname_tokenizer), encoding="utf-8").read())
        added_tokens: dict[str, int]
        if fname_added_tokens is not None:
            # FIXME: Verify that added tokens here _cannot_ overlap with the main vocab.
            added_tokens = json.load(open(fname_added_tokens, encoding="utf-8"))
        else:
            # Fall back to trying to find the added tokens in tokenizer.json
            tokenizer_json_file = fname_tokenizer.parent / 'tokenizer.json'
            if not tokenizer_json_file.is_file():
                added_tokens = {}
            else:
                tokenizer_json = json.load(open(tokenizer_json_file, encoding="utf-8"))
                added_tokens = dict(
                    (item['content'], item['id'])
                    for item in tokenizer_json.get('added_tokens', [])
                    # Added tokens here can be duplicates of the main vocabulary.
                    if item['content'] not in self.bpe_tokenizer )

Are there known cases where fname_tokenizer differs from fname_tokenizer.parent / 'tokenizer.json' (that would seem illogical to me)? Otherwise we are reading the same file twice for no reason.

@KerfuffleV2
Collaborator

fname_tokenizer would be vocab.json for BPE, and fname_added_tokens would possibly be added_tokens.json. fname_tokenizer.parent is pretty much just dirname; it strips off the last element in the path. So /blah/blah/vocab.json's "parent" is /blah/blah/ and fname_tokenizer.parent / 'tokenizer.json' is just the tokenizer.json in the same directory as vocab.json.

Does the way it works make sense? Who knows! I think I was the one that added some fallback logic there, but I mostly left it the way it was originally when I was messing with that part.
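
A quick pathlib check of what that means (plain Python, nothing llama.cpp-specific):

from pathlib import Path

fname_tokenizer = Path("/blah/blah/vocab.json")
print(fname_tokenizer.parent)                     # /blah/blah
print(fname_tokenizer.parent / "tokenizer.json")  # /blah/blah/tokenizer.json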

@goerch
Collaborator

goerch commented Sep 20, 2023

Does the way it works make sense? Who knows!

I certainly don't. But here is my current understanding (partly based on #3252 (comment)):

We try to support two classes of tokenizers:

  • SPM (sentencepiece)
    • SPM splits input into pieces and tokenizes, somewhere in this process we have Unicode normalization
    • SPM differentiates token types (most important ones being UNKNOWN, CONTROL, BYTE, NORMAL)
    • SPM supports pad_token, unk_token, bos_token and eos_token
  • BPE (GPT-2-like)
    • BPE splits input by a magic regexp (not supported in C++; see the example a few lines below), byte-encodes the pieces with some more magic and then tokenizes
    • BPE does not directly support token types, but considers some Unicode character types in the magic regexp
    • BPE does not directly support pad_token, unk_token, bos_token and eos_token, but has something like <|endoftext|> for most of them

Both tokenizers use some kind of byte pair encoding to tokenize.
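
For reference, the "magic regexp" in question is GPT-2's pre-tokenization pattern. A small example using the third-party regex module (Python's standard re lacks the \p{...} classes, which is essentially why it is also awkward to express in C++):

# GPT-2 pre-tokenization split, as used by GPT-2-style BPE tokenizers.
import regex

GPT2_SPLIT = regex.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(GPT2_SPLIT.findall("Hello world, it's 2023!"))
# ['Hello', ' world', ',', ' it', "'s", ' 2023', '!']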

Regarding the source of complete models we have

  • Original LLaMa models
    • Tokenizer file is tokenizer.model, can be loaded by sentencepiece
    • Token types, pad_token, unk_token, bos_token and eos_token are determined by SPM
  • Huggingface models
    • Huggingface adds some cognitive burden with APIs
    • We could have at least a SPM or BPE tokenizer, determined by tokenizer_config.json (if existent?)
    • tokenizer_config.json contains information about pad_token, unk_token, bos_token and eos_token.
    • Our tokenizer file currently seems to be vocab.json, although for Aquila and Falcon I see a more complete tokenizer.json
    • We have added tokens in tokenizer.json, which may or may not be part of the vocabulary and look a lot like CONTROL tokens to me
    • Added tokens can additionally(?) be described in added_tokens.json
    • We optionally have special_tokens_map.json which contains a mix of information about CONTROL tokens and pad_token, unk_token, bos_token and eos_token
    • I don't have the slightest idea about Huggingface API revisions.

We invented something like linefeed_token additionally.

On the implementation side it seems we have tokenizer handling split across a couple of conversion scripts, gguf.py and the corresponding llama.cpp code.

Here are my most urgent questions:

  • Is there any good source of documentation for HF tokenizer (or model) files or API revisions?
  • What am I missing in the description of the requirements?
  • Any way to simplify the requirements (my first idea would be to require the existence of tokenizer_config.json and tokenizer.json for HF models and disregard added_tokens.json and special_tokens_map.json if possible)?
  • Where should we consolidate tokenizer handling on our conversion side?

@nlpcat

nlpcat commented Sep 24, 2023

It still has problems supporting special tokens in StarCoder, like <fim_prefix>, with BPE.

@jploski
Contributor

jploski commented Oct 5, 2023

The lack of handling for special tokens in llm_tokenizer_spm also affects Mistral Orca.

In SentencePiece's original implementation there is something called PrefixMatcher, which is initialized with user_defined_symbols (as the special tokens are called there). This PrefixMatcher is then used to split the input into "character sequence" in the BPE tokenizer. I suppose it skips right over the atomic/unsplittable special tokens before the main BPE algorithm begins.

The llama.cpp implementation (which is apparently a port of/inspired by the bpe_model.cc from SentencePiece linked above) instead "splits the input into utf8 chars", but without the matcher part, i.e. disregarding the atomic special tokens.

@staviq
Contributor

staviq commented Oct 5, 2023

In SentencePiece's original implementation there is something called PrefixMatcher

Thank you, I was thinking the same thing yesterday, and couldn't find any confirmation.

I already found a way to extract unsplittable tokens directly in the tokenizer without any model/convert.py changes; I'm gonna play with this some more. I have a general idea of how to solve this with minor changes in the tokenizer function.

I also found a separate approach for tokenizing in O(log N), while solving this problem in the process, by building a tree structure of token/"subtokens" and matching downwards instead of upwards (matching full long tokens first). I have to try this to see how consistent it would be with the current tokenizer.

@jploski
Contributor

jploski commented Oct 5, 2023

I also found a separate approach for tokenizing in O(log N), while solving this problem in the process, by building a tree structure of token/"subtokens" and matching downwards instead of upwards (matching full long tokens first).

I assume you are familiar with the trie data structure? I think this is what PrefixMatcher uses. Although it may be an overkill for finding all occurrences of a couple substrings in a short body of text. Apart from that, regular expressions come to mind. (I don't know how important it is for the implementation to stay similar to SentencePiece's for comparability.)

@staviq
Contributor

staviq commented Oct 5, 2023

trie

Oh, it has a name :) That's the exact thing I had in mind. I've been using it since uni for text sorting and searching; I just didn't know what it's called in English :)

@l3utterfly
Contributor

l3utterfly commented Oct 5, 2023

@staviq

Here's a proof of concept I wrote a while ago that uses Trie: #2820 (comment)

Hope it helps. It tokenises special tokens correctly, but I haven't had the time to add support for UTF8 chars and edge cases yet.

@goerch
Collaborator

goerch commented Oct 6, 2023

In SentencePiece's original implementation there is something called PrefixMatcher, which is initialized with user_defined_symbols (as the special tokens are called there).

What do you think about reviving #1931 as suggested in #2820 (comment)?

@teleprint-me
Contributor

teleprint-me commented Oct 15, 2023

I'm working on an experimental solution to this problem because I keep running into it, and I'm not the only one; there are plenty of other issues related to this.

I'm confident there's a way to do this without creating dependencies.

We technically do not need to rely on huggingface, and I can actually see reliance on it becoming an issue of its own.

I'm in the middle of creating some utilities to dump the necessary data to mapped data structures for reuse; think of it like a programmatic hexdump, but for models.

I already created one for safetensors. My next goal is to handle it for torch models. Then for huggingface models.

If my intuition is correct, then we shouldn't really need huggingface at all, which would actually be a really good thing.

It would also be flexible enough to build on top of and extend as needed.

It would create a gateway towards unifying and streamlining all model conversions as well, which is my end goal.

This comment is a copy-paste from PR #3633.

@ds5t5 @ggerganov @Green-Sky

I'd like to know if this is a path worth pursuing. Let me know.

@cebtenzzre
Collaborator

Is this fixed by #3538?

@ggerganov
Owner Author

I hope so. Looking for more reports on whether this works as expected.
I've posted an example based on my understanding of how ChatML is supposed to work: #3475 (comment)

@ggerganov
Owner Author

Optimistically marking this as resolved. Likely we will have to take an extra look at the proposal in #3664 in order to cover all cases, and we probably need #3585 merged to be able to convert models without errors.
