Update gpt2 preprocess and add deepseek coder preprocess #4070

DOGEwbx · 2023-11-14T09:51:12Z

Revise the initial GPT-2 preprocessing for compatibility with Falcon's behavior by utilizing Regex to ensure simplicity.
Incorporate DeepSeek coder pre-processing and enable other models based on HuggingFace tokenizer to adhere to the same procedure to endorse their individual models.
Implement tests for the modifications and update the hf-to-gguf conversion script.

I get the deepseekcoder vocab with

python convert-hf-to-gguf.py <MODEL_PATH> --outfile ./models/ggml-vocab-deepseek-coder.gguf --model-name DeepseekCoder --vocab-only

The regex in unicode.h is extract with the brilliant repo https://github.com/Genivia/RE-flex with following script

#include <reflex/matcher.h> // reflex::Matcher, reflex::Input, reflex::Pattern
#include <reflex/stdmatcher.h>
#include <locale>
#include <regex>
#include <iostream>
#include <string>
#include <cstring>
#include <vector>


int main()
{
    std::vector<std::string> regex_list = {
        "[\\p{P}\\$\\+<=>\\^~\\|]+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        "[0-9][0-9][0-9]"
        "\\s?\\p{L}+",
        "\\s?\\p{P}+",
        "\\p{N}",

    };
    for(auto s: regex_list){
        std::string regex_s = reflex::StdMatcher::convert(s, reflex::convert_flag::unicode);
        std::cout<<s<<std::endl<<regex_s<<std::endl;
        }    
}

merge master

Galunid

This was just a cursory pass, I'm not sure if C++ tokenizer changes are acceptable, since I'm not very familiar with how tokenizers work. I'll leave review of that part to someone else.

convert-hf-to-gguf.py

llama.cpp

convert-hf-to-gguf.py

DOGEwbx · 2023-11-14T13:50:30Z

current bpe_gpt2_preprocess behave different from falcon model's tokenizer(not sure if it is your target).
For example, to string "99000", falcon tokenizer will pretokenize it to "990" and "00", will current prerocess will not cut it, result in different behaviour

src: '99000'
res: '99000'
tok: 2512 1321 
main : failed test:    '99000'
main : detokenized to: '99000' instead of '99000'
main : expected tokens:  30825,    527, 
main : got tokens:        2512,   1321,

convert-hf-to-gguf.py

Galunid

Python looks alright to me, as I said I'm not familiar with tokenization stuff so I'll leave the C++ part to someone else

ggerganov

What is the tokenization time for wikitext 2 before and after this PR?
Run the perplexity tool - it is printed at the start

DOGEwbx · 2023-11-16T15:30:29Z

@ggerganov
Before the pr, the tokenization time is 1021.49 ms
After the pr, the tokenization time is 28982.1 ms
Seems like there is a 28x performance gap due to the complicated regex (I can't find a better solution to use regex with unicode in cpp without adding extra dependecies like RE-flex)
However, I think the correctness and extendability is much more important for tokenizer didn't a long time compared with inference.

DOGEwbx · 2023-11-16T15:31:07Z

I'd love to try out if you have any suggestions on imporving the performance

ggerganov · 2023-11-16T15:57:12Z

However, I think the correctness and extendability is much more important for tokenizer didn't a long time compared with inference.

The performance is also important. This performance hit of 28x is quite big, so we cannot accept it like this.
I immediately see some unnecessary string copies that should be avoided. Probably things can be optimized further.

DOGEwbx · 2023-11-21T12:38:23Z

Hi, I just finished the regex optimization by changing the utf-8 string and regex to wchar(utf32), now it can finish the in 2040ms

llama.cpp

yhyu13 · 2023-11-25T03:55:22Z

@ggerganov Would you please review this pr or assign some one to review this pr, it is required for gguf to work with models with hf tokenizer which cannot provide setencepiece tokenizer https://github.com/deepseek-ai/DeepSeek-Coder#7-qa, thanks

teleprint-me · 2023-11-29T07:38:34Z

llama.cpp

@@ -2599,7 +2605,7 @@ static void llm_load_print_meta(llama_model_loader & ml, llama_model & model) {
    // hparams
    LLAMA_LOG_INFO("%s: format           = %s\n",     __func__, llama_file_version_name(ml.fver));
    LLAMA_LOG_INFO("%s: arch             = %s\n",     __func__, LLM_ARCH_NAMES.at(model.arch).c_str());
-    LLAMA_LOG_INFO("%s: vocab type       = %s\n",     __func__, vocab.type == LLAMA_VOCAB_TYPE_SPM ? "SPM" : "BPE"); // TODO: fix
+    LLAMA_LOG_INFO("%s: vocab type       = %s\n",     __func__, vocab.type == LLAMA_VOCAB_TYPE_SPM ? "SPM" : (vocab.type == LLAMA_VOCAB_TYPE_BPE ? "BPE" : "DEEPSEEKCODER")); // TODO: fix


This is SPM?

>>> from transformers import AutoTokenizer >>> model_directory = "models/deepseek-ai/deepseek-coder-6.7b-instruct" >>> tokenizer = AutoTokenizer.from_pretrained(model_directory) >>> tokenizer.vocab_files_names {'vocab_file': 'tokenizer.model', 'tokenizer_file': 'tokenizer.json'}

ggerganov · 2023-11-29T08:05:34Z

The tokenizer tests for falcon and deepseek-coder are not passing:

make tests

We need to fix this before merging

DOGEwbx · 2023-11-29T08:19:24Z

@ggerganov I just checkout to my branch and run

make test

all the tests are passed. Could you check if your HEAD is the same as the image above?

ggerganov · 2023-11-29T08:41:49Z

It works on Linux, but it fails on Mac OS and Windows:

https://github.com/ggerganov/llama.cpp/actions/runs/6952586370/job/19127179619

DOGEwbx · 2023-11-29T09:28:51Z

It works on Linux, but it fails on Mac OS and Windows:

https://github.com/ggerganov/llama.cpp/actions/runs/6952586370/job/19127179619

It seems like wchar_t on windows is implemented with char16_t

TheBloke · 2023-12-01T11:22:44Z

There's another active llama.cpp PR which enables making DeepSeek GGUFs, and it's the one I used to make my DeepSeek GGUFs on HF: #3633

This other PR uses Hugging Face AutoTokenizer, so the GGUF is produced directly from tokenizer.json, ie it should in theory tokenise identically to the source model.

I don't know if there will be any difference in output from the PR I used, vs this PR here. Anyone got any thoughts on that?

ggerganov · 2023-12-01T17:33:29Z

@TheBloke Thanks for pointing this out. AFAIU this PR is still needed in addition to #3633 in order to fix some incorrectly tokenized strings (like certain multi-digit numbers, etc.). @DOGEwbx can you confirm this is the case?

DOGEwbx · 2023-12-02T07:56:48Z

@ggerganov @TheBloke Hi guys, I just take a brief look on #3633 , that PR mainly focus on loading the vocab from tokenizer.json which has no difference with this PR. The mismatched problem originates from the pre-tokenizer mechanism of HuggingFace Tokenizer https://huggingface.co/docs/tokenizers/api/pre-tokenizers. It will split the input string to substrings and token across substrings will not be merged.

Take the Falcon tokenizer I mentioned before as an example, the pretokenzier will split the string "99000" to "990" and "00" according to the regex pre-tokenizer

{
        "type": "Split",
        "pattern": {
          "Regex": "[0-9][0-9][0-9]"
        },
        "behavior": "Isolated",
        "invert": false
      }

and it will be tokenized to "990" -> 30825, "00"-> 527.
However, in current implementation of pre-tokenizer, this mechanisim is not supported and all the Byte-level BPE share the same preporcess method.

In this process function, 99000 will not be splitted and BPE will give result of "99" "000" since "000" has higher priority compared with "990", which leads to the unmatched tokenizer encode result.

ggerganov

This is a big change in the tokenizer and I'm a little worried it might break something.

Tested LLaMA and Falcon:

SPM (LLaMA) produces the same results
BPE (Falcon) tokenization is quite slower, but the PPL for wikitext is better, so it should be OK

ggerganov

The CI is still failing - need to fix this before merging

dragnil1 · 2024-04-17T21:19:04Z

Hi, I just finished the regex optimization by changing the utf-8 string and regex to wchar(utf32), now it can finish the in 2040ms

@DOGEwbx , How did you convert the utf-8 regexes given by the RE-flex library to wchar ones? These wchar regexes are in utf32 format. But i guess, if they can be changed to utf16 format, then they can be used in platforms having wchar_t size with 16bit.

DOGEwbx · 2024-04-18T06:48:32Z

@dragnil1 Sorry for the late reply. It has been a while since last time I run the code. I checked my workspace and found a piece of code that might help. It seem's like I copied the unicode range from unicoe.h and convert it o utf-32 to replace regex groups like \p{L}, \p{P}

cpp_ranges = """
static const std::vector<std::pair<uint32_t, uint32_t>> punctuation_ranges = {
{0x21, 0x23}, {0x25, 0x2A}, {0x2C, 0x2F}, {0x3A, 0x3B}, {0x3F, 0x40}, {0x5B, 0x5D}, {0x5F, 0x5F}, {0x7B, 0x7B}, {0x7D, 0x7D}, {0xA1, 0xA1}, {0xA7, 0xA7}, {0xAB, 0xAB}, {0xB6, 0xB7}, {0xBB, 0xBB},
{0xBF, 0xBF}, {0x37E, 0x37E}, {0x387, 0x387}, {0x55A, 0x55F}, {0x589, 0x58A}, {0x5BE, 0x5BE}, {0x5C0, 0x5C0}, {0x5C3, 0x5C3}, {0x5C6, 0x5C6}, {0x5F3, 0x5F4}, {0x609, 0x60A}, {0x60C, 0x60D},
{0x61B, 0x61B}, {0x61E, 0x61F}, {0x66A, 0x66D}, {0x6D4, 0x6D4}, {0x700, 0x70D}, {0x7F7, 0x7F9}, {0x830, 0x83E}, {0x85E, 0x85E}, {0x964, 0x965}, {0x970, 0x970}, {0x9FD, 0x9FD}, {0xA76, 0xA76},
{0xAF0, 0xAF0}, {0xC77, 0xC77}, {0xC84, 0xC84}, {0xDF4, 0xDF4}, {0xE4F, 0xE4F}, {0xE5A, 0xE5B}, {0xF04, 0xF12}, {0xF14, 0xF14}, {0xF3A, 0xF3D}, {0xF85, 0xF85}, {0xFD0, 0xFD4}, {0xFD9, 0xFDA},
{0x104A, 0x104F}, {0x10FB, 0x10FB}, {0x1360, 0x1368}, {0x1400, 0x1400}, {0x166E, 0x166E}, {0x169B, 0x169C}, {0x16EB, 0x16ED}, {0x1735, 0x1736}, {0x17D4, 0x17D6}, {0x17D8, 0x17DA}, {0x1800, 0x180A},
{0x1944, 0x1945}, {0x1A1E, 0x1A1F}, {0x1AA0, 0x1AA6}, {0x1AA8, 0x1AAD}, {0x1B5A, 0x1B60}, {0x1BFC, 0x1BFF}, {0x1C3B, 0x1C3F}, {0x1C7E, 0x1C7F}, {0x1CC0, 0x1CC7}, {0x1CD3, 0x1CD3}, {0x2010, 0x2027},
{0x2030, 0x2043}, {0x2045, 0x2051}, {0x2053, 0x205E}, {0x207D, 0x207E}, {0x208D, 0x208E}, {0x2308, 0x230B}, {0x2329, 0x232A}, {0x2768, 0x2775}, {0x27C5, 0x27C6}, {0x27E6, 0x27EF}, {0x2983, 0x2998},
{0x29D8, 0x29DB}, {0x29FC, 0x29FD}, {0x2CF9, 0x2CFC}, {0x2CFE, 0x2CFF}, {0x2D70, 0x2D70}, {0x2E00, 0x2E2E}, {0x2E30, 0x2E4F}, {0x2E52, 0x2E52}, {0x3001, 0x3003}, {0x3008, 0x3011}, {0x3014, 0x301F},
{0x3030, 0x3030}, {0x303D, 0x303D}, {0x30A0, 0x30A0}, {0x30FB, 0x30FB}, {0xA4FE, 0xA4FF}, {0xA60D, 0xA60F}, {0xA673, 0xA673}, {0xA67E, 0xA67E}, {0xA6F2, 0xA6F7}, {0xA874, 0xA877}, {0xA8CE, 0xA8CF},
{0xA8F8, 0xA8FA}, {0xA8FC, 0xA8FC}, {0xA92E, 0xA92F}, {0xA95F, 0xA95F}, {0xA9C1, 0xA9CD}, {0xA9DE, 0xA9DF}, {0xAA5C, 0xAA5F}, {0xAADE, 0xAADF}, {0xAAF0, 0xAAF1}, {0xABEB, 0xABEB}, {0xFD3E, 0xFD3F},
{0xFE10, 0xFE19}, {0xFE30, 0xFE52}, {0xFE54, 0xFE61}, {0xFE63, 0xFE63}, {0xFE68, 0xFE68}, {0xFE6A, 0xFE6B}, {0xFF01, 0xFF03}, {0xFF05, 0xFF0A}, {0xFF0C, 0xFF0F}, {0xFF1A, 0xFF1B}, {0xFF1F, 0xFF20},
{0xFF3B, 0xFF3D}, {0xFF3F, 0xFF3F}, {0xFF5B, 0xFF5B}, {0xFF5D, 0xFF5D}, {0xFF5F, 0xFF65}, {0x10100, 0x10102}, {0x1039F, 0x1039F}, {0x103D0, 0x103D0}, {0x1056F, 0x1056F}, {0x10857, 0x10857},
{0x1091F, 0x1091F}, {0x1093F, 0x1093F}, {0x10A50, 0x10A58}, {0x10A7F, 0x10A7F}, {0x10AF0, 0x10AF6}, {0x10B39, 0x10B3F}, {0x10B99, 0x10B9C}, {0x10EAD, 0x10EAD}, {0x10F55, 0x10F59}, {0x11047, 0x1104D},
{0x110BB, 0x110BC}, {0x110BE, 0x110C1}, {0x11140, 0x11143}, {0x11174, 0x11175}, {0x111C5, 0x111C8}, {0x111CD, 0x111CD}, {0x111DB, 0x111DB}, {0x111DD, 0x111DF}, {0x11238, 0x1123D}, {0x112A9, 0x112A9},
{0x1144B, 0x1144F}, {0x1145A, 0x1145B}, {0x1145D, 0x1145D}, {0x114C6, 0x114C6}, {0x115C1, 0x115D7}, {0x11641, 0x11643}, {0x11660, 0x1166C}, {0x1173C, 0x1173E}, {0x1183B, 0x1183B}, {0x11944, 0x11946},
{0x119E2, 0x119E2}, {0x11A3F, 0x11A46}, {0x11A9A, 0x11A9C}, {0x11A9E, 0x11AA2}, {0x11C41, 0x11C45}, {0x11C70, 0x11C71}, {0x11EF7, 0x11EF8}, {0x11FFF, 0x11FFF}, {0x12470, 0x12474}, {0x16A6E, 0x16A6F},
{0x16AF5, 0x16AF5}, {0x16B37, 0x16B3B}, {0x16B44, 0x16B44}, {0x16E97, 0x16E9A}, {0x16FE2, 0x16FE2}, {0x1BC9F, 0x1BC9F}, {0x1DA87, 0x1DA8B}, {0x1E95E, 0x1E95F},
};
"""

# convert to python
digit_ranges = []
for line in cpp_ranges.splitlines():
    if not line.startswith("{"):
        continue
    line = line.strip("{}")
    line = line.replace(" ", "")
    line = line.replace("0x", "")
    line = line.split(",")
    for first, second in zip(line[::2],line[1::2]):
        digit_ranges.append((int(first.strip("{}"), 16), int(second.strip("{}"), 16)))

# print the symbols as UTF-32 regex
print(("".join(["\\U%08x-\\U%08x" % (first, second) for first, second in digit_ranges])).upper())

DOGEwbx · 2024-04-18T06:52:45Z

@dragnil1 There are also some code piece for the failed re-flex tries, it will outoput utf-8 format but it's too slow, hope it'll help.

#include <reflex/matcher.h> // reflex::Matcher, reflex::Input, reflex::Pattern
#include <reflex/stdmatcher.h>
#include <locale>
#include <regex>
#include <iostream>
#include <string>
#include <cstring>
#include <vector>


int main()
{
    std::vector<std::string> regex_list = {
        "[\\p{P}\\$\\+<=>\\^~\\|]+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        "[0-9][0-9][0-9]"
        "\\s?\\p{L}+",
        "\\s?\\p{P}+",
        "\\p{N}",

    };
    for(auto s: regex_list){
        std::string regex_s = reflex::StdMatcher::convert(s, reflex::convert_flag::unicode);
        std::cout<<s<<std::endl<<regex_s<<std::endl;
        }    
}

BingxuanWang and others added 4 commits November 13, 2023 18:23

add new gpt2

5600bd8

fix falcon preprocess and add deepseek coder

c31263e

Merge pull request #1 from DOGEwbx/master

9fa2627

merge master

upload model and add git ignore

030886e

Galunid reviewed Nov 14, 2023

View reviewed changes

BingxuanWang added 3 commits November 15, 2023 12:16

merge changes

4e67165

Merge branch 'master' into regex_gpt2_preprocess

7d971ee

update and refactor

3f4185b

Galunid reviewed Nov 16, 2023

View reviewed changes

convert-hf-to-gguf.py Outdated Show resolved Hide resolved

remove unused code

9cecb7a

DOGEwbx requested a review from Galunid November 16, 2023 07:55

Galunid reviewed Nov 16, 2023

View reviewed changes

convert-hf-to-gguf.py Show resolved Hide resolved

Galunid reviewed Nov 16, 2023

View reviewed changes

fix

4494a9f

ggerganov reviewed Nov 16, 2023

View reviewed changes

optimized performance

3a7f0c4

DOGEwbx requested a review from ggerganov November 21, 2023 12:38

Merge branch 'master' into regex_gpt2_preprocess

8c45f1c

cebtenzzre reviewed Nov 22, 2023

View reviewed changes

llama.cpp Outdated Show resolved Hide resolved

BingxuanWang added 2 commits November 22, 2023 11:30

remove printf

76f6831

fix format

87fe183

update for deepseek-llm

cb33629

teleprint-me reviewed Nov 29, 2023

View reviewed changes

clean-up : warnings, names

fecb61b

BingxuanWang added 3 commits November 30, 2023 09:53

Merge branch 'deepseek-llm' into regex_gpt2_preprocess

277c64d

update name format

68db4d5

update vocab

5596f78

DOGEwbx mentioned this pull request Dec 11, 2023

docs(README): update README.md deepseek-ai/DeepSeek-LLM#22

Open

BingxuanWang and others added 4 commits December 28, 2023 12:25

Merge branch 'master' into regex_gpt2_preprocess

417884e

Merge branch 'master' into HEAD

d24da31

llama : minor stuff

128c213

gitignore : fix

12ab343

ggerganov approved these changes Dec 29, 2023

View reviewed changes

tests : fix trailing whitespace

f81d205

ggerganov requested changes Dec 29, 2023

View reviewed changes

jaggzh mentioned this pull request Feb 12, 2024

Deepseek coder merge #5464

Closed

BingxuanWang mentioned this pull request Feb 18, 2024

Update README.md deepseek-ai/DeepSeek-Coder#101

Open

ggerganov mentioned this pull request Mar 10, 2024

llama : add Deepseek support #5981

Closed

christopherthompson81 mentioned this pull request Mar 13, 2024

Custom fine-tuned DeepSeek coder model unable to be quantized to Fp16 #5234

Closed

ggerganov mentioned this pull request Mar 16, 2024

Command-R (CohereForAI model) tokenization disagrees with HF implementation #6104

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update gpt2 preprocess and add deepseek coder preprocess #4070

Update gpt2 preprocess and add deepseek coder preprocess #4070

DOGEwbx commented Nov 14, 2023

Galunid left a comment

DOGEwbx commented Nov 14, 2023

Galunid left a comment

ggerganov left a comment

DOGEwbx commented Nov 16, 2023

DOGEwbx commented Nov 16, 2023

ggerganov commented Nov 16, 2023

DOGEwbx commented Nov 21, 2023

yhyu13 commented Nov 25, 2023

teleprint-me Nov 29, 2023 •

edited

Loading

ggerganov commented Nov 29, 2023

DOGEwbx commented Nov 29, 2023 •

edited

Loading

ggerganov commented Nov 29, 2023

DOGEwbx commented Nov 29, 2023

TheBloke commented Dec 1, 2023

ggerganov commented Dec 1, 2023

DOGEwbx commented Dec 2, 2023

ggerganov left a comment

ggerganov left a comment

dragnil1 commented Apr 17, 2024

DOGEwbx commented Apr 18, 2024 •

edited

Loading

DOGEwbx commented Apr 18, 2024

Update gpt2 preprocess and add deepseek coder preprocess #4070

Are you sure you want to change the base?

Update gpt2 preprocess and add deepseek coder preprocess #4070

Conversation

DOGEwbx commented Nov 14, 2023

Galunid left a comment

Choose a reason for hiding this comment

DOGEwbx commented Nov 14, 2023

Galunid left a comment

Choose a reason for hiding this comment

ggerganov left a comment

Choose a reason for hiding this comment

DOGEwbx commented Nov 16, 2023

DOGEwbx commented Nov 16, 2023

ggerganov commented Nov 16, 2023

DOGEwbx commented Nov 21, 2023

yhyu13 commented Nov 25, 2023

teleprint-me Nov 29, 2023 • edited Loading

Choose a reason for hiding this comment

ggerganov commented Nov 29, 2023

DOGEwbx commented Nov 29, 2023 • edited Loading

ggerganov commented Nov 29, 2023

DOGEwbx commented Nov 29, 2023

TheBloke commented Dec 1, 2023

ggerganov commented Dec 1, 2023

DOGEwbx commented Dec 2, 2023

ggerganov left a comment

Choose a reason for hiding this comment

ggerganov left a comment

Choose a reason for hiding this comment

dragnil1 commented Apr 17, 2024

DOGEwbx commented Apr 18, 2024 • edited Loading

DOGEwbx commented Apr 18, 2024

teleprint-me Nov 29, 2023 •

edited

Loading

DOGEwbx commented Nov 29, 2023 •

edited

Loading

DOGEwbx commented Apr 18, 2024 •

edited

Loading