llama : add Deepseek support #5981 #6252
Conversation
unicode.h
Outdated
```cpp
std::vector<std::wstring> get_gpt2_regex();
std::vector<std::wstring> get_deepseek_coder_regex();
std::vector<std::wstring> get_deepseek_llm_regex();
```
I'm thinking the interface here should be:

```cpp
std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regexes);
```

The implementation should be something like what `regex_bpe_preprocess` currently is: it loops through the regex strings, and if we have a known unicode representation (e.g. `"\s?\p{L}+"` -> `std::wstring`) we apply it with `std::wregex`. Else, if we have a custom implementation (for example, see the GPT2 preprocess function), we apply that.

The `unicode` module should not have any notion of GPT2, Deepseek or other model-related stuff. This information should be in `llama.cpp`.
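As a rough sketch of that split of responsibilities (the caller below and its regex list are illustrative assumptions, not the actual patch):

```cpp
// llama.cpp side (sketch): model-specific knowledge is reduced to a list of
// raw regex strings; all unicode/regex machinery lives behind this interface.
#include <string>
#include <vector>

// implemented in unicode.cpp
std::vector<std::string> unicode_regex_split(
        const std::string & text,
        const std::vector<std::string> & regexes);

static std::vector<std::string> bpe_gpt2_preprocess(const std::string & text) {
    return unicode_regex_split(text, {
        // original GPT-2 pattern, exactly as specified by the model creators
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
    });
}
```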
llama.cpp
Outdated
```cpp
std::vector<std::string> bpe_deepseek_coder_preprocess(const std::string & text) {
    return regex_bpe_preprocess(text, get_deepseek_coder_regex());
}
```
Following my previous comment, this should eventually become:
```cpp
std::vector<std::string> bpe_deepseek_coder_preprocess(const std::string & text) {
    return unicode_regex_split(text, {
        "[\\p{P}\\$\\+<=>\\^~\\|]+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        "[0-9][0-9][0-9]",
        "\\s?\\p{L}+",
        "\\s?\\p{P}+",
        "\\p{N}",
    });
}
```
llama.cpp
Outdated
```cpp
    const llama_vocab & vocab;

    std::vector<llm_symbol> symbols;
    std::vector<llm_symbol> symbols_final;

    llm_bigram_bpe::queue work_queue;

    const std::vector<std::wstring> gpt2_regex = {
```
I had a different idea about this - let me try to explain again:

In `llama.cpp`, we want to keep the original regex strings as they have been specified by the model creators. For example:

- `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)`
- `\s?\p{L}+`
- etc.

Now, my understanding is that in C++ we cannot simply perform some of this regex matching, due to missing support for some of the regex patterns in the standard library. To solve this, we create the `unicode` module, which takes the regex strings from above as they are and applies a few different strategies to split the target string:

- If we have a known unicode representation generated in some way, we apply that using `std::wregex`. I.e. we check a constant `std::map<std::string, std::wstring>` for the presence of the regex.
- If not, we then check if we have a custom implementation of the regex via a function call (see `bpe_gpt2_preprocess()` on master, which is a custom implementation of a regex split with `'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`).
- Else, we just apply `std::regex` and hope for the best, or throw an error.
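A minimal sketch of that three-way dispatch, assuming placeholder names and map contents (not the actual implementation):

```cpp
// unicode.cpp side (sketch): try known wregex equivalents, then custom
// implementations, then reject (or fall back to std::regex).
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// 1) regexes with a known std::wstring equivalent, generated offline somehow
static const std::map<std::string, std::wstring> k_known_wregex = {
    // { "\\s?\\p{L}+", L"<precomputed equivalent>" },
};

// 2) hand-written splitter for the GPT-2 pattern (bpe_gpt2_preprocess on master)
static std::vector<std::string> unicode_custom_preprocess(const std::string & text) {
    return { text }; // placeholder
}

std::vector<std::string> unicode_regex_split(
        const std::string & text,
        const std::vector<std::string> & regexes) {
    std::vector<std::string> fragments = { text };
    for (const auto & re : regexes) {
        const auto it = k_known_wregex.find(re);
        if (it != k_known_wregex.end()) {
            // strategy 1: convert fragments to wide strings and split them
            // with std::wregex(it->second)
        } else if (re == "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+") {
            // strategy 2: this specific regex has a custom implementation
            fragments = unicode_custom_preprocess(text);
        } else {
            // strategy 3: try std::regex and hope for the best (omitted), or:
            throw std::runtime_error("no implementation for regex: " + re);
        }
    }
    return fragments;
}
```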
Hello, in `unicode.cpp` I have implemented the `unicode_regex_split()` function, which iterates through the given regexes; if a match is found, it uses that regex. Otherwise, it uses the modified `bpe_gpt2_preprocess()` function, renamed to `unicode_custom_preprocess()`. Now, I have some questions regarding `bpe_gpt2_preprocess()`. Can it handle the input of deepseek coder and deepseek llm? If not, do I have to write custom functions for them for when a regex is not found?
> Can it handle the input of deepseek coder and deepseek llm?

No, AFAIK it is not compatible with the deepseek regex. The way I understand it, `bpe_gpt2_preprocess()` (i.e. `unicode_custom_preprocess()`) works only for the following regex (based on the comment in the code):

```
's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+
```

So the logic in `unicode_regex_split()` has to check if this is the input regex, and only apply this specific custom implementation in that case. For other regexes, we might want to add more custom implementations in other functions and use them in `unicode_regex_split()` in the future.

Note that this is my understanding of how this part of the tokenizer is supposed to work. I could be wrong, so don't take all of these suggestions for granted.

In any case, the huge unicode constants like `gpt2_regex` should not be located in `llama.cpp`, but instead in `unicode.cpp`.
@dragnil1 Thank you for the help. With LLaMA v3 now switching to a BPE tokenizer, this functionality becomes very important (see #6914). I'll focus on finalizing it ASAP and will likely use the work in this PR as a starting point. Will probably move the branch into the llama.cpp repo.

The tokenization and unicode handling is definitely not my strongest or favourite part of the codebase, so if you or anyone else have any insights, don't hesitate to share or help out. I think I have an understanding of how to implement BPE pre-processing support, but I could very well be missing something.
Thanks for letting me work on this PR. Sorry for the delay - I was working on getting it to run on Windows. The recent commit passed the tests on Ubuntu, but the tokenizer tests failed on Windows. I had the idea of using `std::wregex` when `wchar_t` is 32 bits wide and `std::regex` when `wchar_t` is 16 bits. But using `std::regex` gives a SEGFAULT in the tokenizer tests on Windows, probably because the regex pattern produced by the RE/flex library is much larger than the pattern used with `wregex`. While researching this, I found the most efficient approach would be to use the Boost library with ICU support, but that would hurt llama.cpp's minimal-dependency goal; alternatively, we could use the standalone Boost regex library. I was also thinking of converting the pattern produced by RE/flex into a UTF-32 pattern and a UTF-16 pattern, which would work on Ubuntu and (probably) on Windows, respectively.
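For reference, a sketch of that `wchar_t`-size dispatch (illustrative only; `count_matches` is a made-up helper, not code from this PR):

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// wchar_t is typically 32 bits on Linux (UTF-32, one codepoint per wchar_t)
// and 16 bits on Windows (UTF-16, where codepoints above U+FFFF become
// surrogate pairs and break std::wregex character-class ranges).
constexpr bool wchar_is_utf32 = sizeof(wchar_t) == 4;

std::size_t count_matches(const std::wstring & text, const std::wstring & pattern) {
    if (!wchar_is_utf32) {
        // on 16-bit wchar_t platforms, the pattern would first need rewriting
        // (e.g. expanding ranges above U+FFFF); otherwise results are unreliable
    }
    const std::wregex re(pattern);
    return std::distance(std::wsregex_iterator(text.begin(), text.end(), re),
                         std::wsregex_iterator());
}
```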
In the latest version #6920, I changed the order of the regexes: first look for a custom implementation and then look for a known equivalent regex. I also disabled the DeepSeek code-paths temporarily, until we get the tests running as they were on master. I don't have a Windows environment to work on, so it's going to take me some time to figure out where it goes wrong.

Edit: apparently the Windows build failures were unrelated - hopefully we have a baseline that works now.
Ok. I have found the reason for the tests failing on Windows: some regex ranges are not valid there. Here is an example which will run, and here is one which will not. Both regex ranges are taken from the gpt2 regex.
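The original example ranges aren't preserved in this excerpt; as a hedged illustration, assuming the failing case is a character-class range whose endpoints lie above U+FFFF (and thus become surrogate pairs when `wchar_t` is 16 bits):

```cpp
#include <iostream>
#include <regex>

int main() {
    // BMP range: both endpoints fit in a single wchar_t everywhere - works.
    std::wregex ok(L"[\u00C0-\u02AF]+");

    // Range above U+FFFF: with 16-bit wchar_t (Windows) each endpoint becomes
    // a surrogate pair, so the class is no longer a valid single-character
    // range and construction typically throws std::regex_error.
    try {
        std::wregex bad(L"[\U0001F300-\U0001F5FF]+");
        std::wcout << L"constructed (likely a 32-bit wchar_t platform)\n";
    } catch (const std::regex_error &) {
        std::wcout << L"regex_error (likely a 16-bit wchar_t platform)\n";
    }
    return 0;
}
```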
Yes, I just noticed the error in the CI. Any ideas how to resolve it?
We cannot use ranges which contain codepoints that require more than 2 bytes. We have to convert the 3-byte and 4-byte ranges to individual values, but this may make the regex pattern large enough to cause a SEGFAULT. I will let you know if I find a viable solution.
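A sketch of that range-expansion idea (the helpers below are made up for illustration, not code from the PR): it converts a codepoint range into an explicit alternation of UTF-8 sequences, which sidesteps the invalid wide ranges but can inflate the pattern dramatically - the SEGFAULT risk mentioned above.

```cpp
#include <cstdint>
#include <string>

// Minimal codepoint-to-UTF-8 encoder (no validation).
static std::string cpt_to_utf8(uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += (char)  cp;
    } else if (cp < 0x800) {
        out += (char) (0xC0 |  (cp >> 6));
        out += (char) (0x80 |  (cp         & 0x3F));
    } else if (cp < 0x10000) {
        out += (char) (0xE0 |  (cp >> 12));
        out += (char) (0x80 | ((cp >> 6)   & 0x3F));
        out += (char) (0x80 |  (cp         & 0x3F));
    } else {
        out += (char) (0xF0 |  (cp >> 18));
        out += (char) (0x80 | ((cp >> 12)  & 0x3F));
        out += (char) (0x80 | ((cp >> 6)   & 0x3F));
        out += (char) (0x80 |  (cp         & 0x3F));
    }
    return out;
}

// Expand [lo-hi] into (a|b|c|...): each 3- or 4-byte codepoint becomes an
// explicit alternative, so the pattern can grow by orders of magnitude.
std::string expand_range(uint32_t lo, uint32_t hi) {
    std::string out = "(";
    for (uint32_t cp = lo; cp <= hi; ++cp) {
        if (cp != lo) {
            out += '|';
        }
        out += cpt_to_utf8(cp);
    }
    return out + ")";
}
```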
ref #5981