Tokenizers questions and ... proposals? #6980
Comments
Please take a look at the description in #6920 - this will be merged soon and it will introduce a pre-tokenizer field that indicates which pre-tokenization to apply for each model.
Okay, I will follow the developments closely, thank you very much.
Hey @ggerganov, I have been checking the PR and the pre-tokenization closely, and I have some questions and doubts.
Thanks for all the help and attention I am receiving in my PRs and inquiries.
I'm not sure about NFC normalization - I need to understand what it is first. Lowercase should be easy to apply. I'm not familiar with pre-compiled char maps.
Another question. In order to decide the pre-tokenizer, we do this:

chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()

and we intend to identify the pre-tokenization behavior based on this hash. But does this actually make sense? Wdyt @ggerganov?
@ggerganov @slaren @teleprint-me I have a question. If I were to consider adding support for …
Also, perhaps a nice option to better scale the efforts to handle …
Large external dependencies like boost are out of the question. Small, self-contained libraries that can be bundled in the repository have a higher chance of being accepted by @ggerganov, but the preference is still to reduce dependency on external libraries.
Yes, the resulting hashes would be different, but if you observe that the pre-tokenizer configs are the same, you can assign the same "name" for both models in order to reuse the pre-tokenizer. But also adding a duplicate one is fine.
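A minimal sketch of that hash-to-name reuse, assuming a HuggingFace tokenizer is available; the check text, hash values, and names below are placeholders rather than the actual contents of convert-hf-to-gguf.py:

# Sketch of hash -> pre-tokenizer name reuse; all hashes, names and the check
# text are placeholders, not the real convert-hf-to-gguf.py contents.
from hashlib import sha256
from transformers import AutoTokenizer

chktxt = "Hello world!\n 3.14 … 東京 \t\t test"          # text meant to exercise the pre-tokenizer
tokenizer = AutoTokenizer.from_pretrained("some/model")  # hypothetical model id

chktok = tokenizer.encode(chktxt)
chkhsh = sha256(str(chktok).encode()).hexdigest()

# Two different hashes can point at the same pre-tokenizer name when the
# underlying pre-tokenizer configs behave identically.
known_pre = {
    "0123abcd...": "jina-v2-es",   # placeholder hash of model A
    "4567ef01...": "jina-v2-es",   # placeholder hash of model B, same behaviour
}
res = known_pre.get(chkhsh)
if res is None:
    raise NotImplementedError(f"unknown pre-tokenizer, hash: {chkhsh}")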
From a quick look, is NFC normalization similar to NFD normalization? We have the latter already implemented (lines 472 to 485 in 3275e60).
Seems simple enough to implement from scratch.
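For context, a toy Python sketch of what such a table-driven NFD pass could look like; the nfd_map here is a made-up two-entry excerpt, while a real table covers thousands of codepoints:

# Toy NFD: recursively replace each codepoint by its canonical decomposition.
# nfd_map is a made-up two-entry excerpt; a real table has thousands of entries.
nfd_map = {
    0x00E9: [0x0065, 0x0301],  # é -> e + combining acute accent
    0x00F1: [0x006E, 0x0303],  # ñ -> n + combining tilde
}

def toy_nfd(text: str) -> str:
    out = []
    for ch in text:
        cps = nfd_map.get(ord(ch))
        if cps is None:
            out.append(ch)
        else:
            # a decomposition can itself be decomposable, so recurse
            out.append(toy_nfd("".join(chr(c) for c in cps)))
    return "".join(out)

print(toy_nfd("café mañana"))  # 'cafe\u0301 man\u0303ana'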
Hey @ggerganov, to be honest, I know NFC is related to NFD, but I am not sure how easy it is to implement; I am trying to understand some implementations, but I am quite new to this (I believe it is a little more complex, as it combines symbols rather than decomposing them). I will try to dig deeper, thanks.
I know you can map to the same name from different hashes, but it would be nice if it could actually be detected in a better way (I have a small proposal in #7039 for some improvement).
NFC simply merges similar characters together to reduce redundancy, but the importance and practicality of such an implementation is debatable. It really depends on the use case and the needs of the project. I have no say here, but I agree with @ggerganov on implementing a function from scratch if it really is desired for some reason. There's a nifty online tool with a brief overview of what the concept is.
Edit: It's important to keep in mind that there's a definite potential for data loss with the use of NFC, but everything here is a form of compression, so I suppose it's more practical to think in terms of how lossy it is in comparison to other methods.
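A quick illustration with Python's unicodedata of the two forms: NFD splits a character into a base plus combining marks, while NFC composes them back into a single codepoint:

import unicodedata

s = "e\u0301"                            # 'e' followed by a combining acute accent
nfc = unicodedata.normalize("NFC", s)    # composes into the single codepoint é (U+00E9)
nfd = unicodedata.normalize("NFD", nfc)  # decomposes back into 'e' + U+0301

print(len(s), len(nfc), len(nfd))  # 2 1 2
print(nfc == "\u00e9")             # True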
Hey @teleprint-me, as for whether it should be used or not, I believe this is a discussion to be had when training the tokenizer. In the case of this library, I believe that if a model is trained with a tokenizer requiring this normalization process, we should be able to reproduce that normalization.
As for whether we should implement it ourselves, I agree, and I am working in that direction. However, I am finding it very hard to get the data needed for the mapping. Where did @ggerganov get the required data to complete the mapping? The algorithm, as I understand it, goes as follows: …
Is there any hint you could give me as to where I could look? Thanks.
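One possible source for that data, assuming Python tooling is acceptable for generating the tables: the standard unicodedata module exposes canonical decompositions directly, so a small generator script could dump them (entries starting with a "<tag>" are compatibility decompositions and are not needed for NFD/NFC):

import unicodedata

# Dump canonical decompositions for the whole codepoint range; this is the raw
# material an NFD table could be generated from. Compatibility decompositions
# (those whose entry starts with a "<tag>") are skipped.
for cp in range(0x110000):
    decomp = unicodedata.decomposition(chr(cp))
    if decomp and not decomp.startswith("<"):
        targets = [int(x, 16) for x in decomp.split()]
        print("0x%05X -> %s" % (cp, ["0x%04X" % t for t in targets]))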
@iamlemec may be able to provide more details about the current implementation.
Plus, I have a question: in theory … Thanks for all the help.
Hi @JoanFM! Very interesting work here. Having Jina embeddings would be great. I'm actually working on …
As for the provenance of the …
Hadn't noticed that little …
Hey @iamlemec, I think we should not change that; I believe the change should tell us that NFD is not being properly used (at least in the way HF tokenizers aim to do it). I am not sure; maybe llama.cpp is just not being used extensively with languages other than English?
As for the SPM library, I had already noticed it, but I could not figure out exactly what it does, how one could implement that in C++, or what actual use it has. Thanks for the help!!
Hey @iamlemec. I think that the … I feel it seems like a …
Hm, definitely not intentional. Nice find.
I am trying to do some fixes here (#7122), but I am still not sure about the implementation.
Ok, let's bring back the old …
I am not so sure about the most common cases. But I just found that it does not implement NFD as it is supposed to.
I am doing a trial in #7122, but it is not working yet.
So, as for the canonical order, do we know if this will actually affect any specific examples? We're stripping out all of the accent-mark characters right afterwards, so I'd be kind of surprised. And the reordering might be a bit expensive, which can start to impact speed on smaller embedding models.
Not sure how much it affects, but it is the definition of the algorithm, so it should be applied; otherwise they would not have made it part of the definition. I guess for non-English languages the effect will be greater. You can always check whether no match was found in the NFD map, in order to skip the sorting.
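A minimal sketch of the canonical ordering step being discussed, using Python's unicodedata.combining to obtain each mark's canonical combining class; a from-scratch implementation would look the classes up in its own table:

import unicodedata

def canonical_order(text: str) -> str:
    # After decomposition, each run of combining marks (combining class > 0)
    # must be stable-sorted by canonical combining class.
    chars = list(text)
    i = 0
    while i < len(chars):
        if unicodedata.combining(chars[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(chars) and unicodedata.combining(chars[j]) != 0:
            j += 1
        chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)  # sorted() is stable
        i = j
    return "".join(chars)

# U+0323 (dot below, class 220) must come before U+0301 (acute, class 230)
print(canonical_order("e\u0301\u0323") == "e\u0323\u0301")  # True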
Still, it would be nice to see at least a single use case. Either way, can't we simply pre-compute the reordering in the Python script rather than doing it at runtime?
I do not think that is feasible, right?
Yes, but I mean I do not think you can sort that in the map so that you can skip the sorting during normalization.
Hey @iamlemec, I have been digging a little bit, and I saw that in the case where the reordering happens, NFD and NFC would differ from each other and from the original representation, but this seems to only happen with very unusual characters that may not be relevant at all. See https://unicode.org/reports/tr15/#Multiple_Mark_Figure
I even tried this experiment in Python (on a text which I believe is complex enough) and the reordering does not seem to apply, so I think it is okay, for simplicity, to skip it for now. So, to fix NFD, only having the …

import unicodedata

a = "北京的清晨，空氣清新而寧靜，一个年轻的旅行者在长城上漫步，他从自己的故乡—서울에서 출발하여 아시아의 다양한 문화를 탐험하고자 하는 꿈을 품고 떠났다。彼は日本の古都、京都を訪れ、そこで美しい桜の花が満開の下で古典音楽のコンサートに参加しました。祭りの夜、彼は色とりどりの灯籠が空に浮かぶのを見て、その美しさに感動しました。その後、彼は印度のバラナシに到着し、गंगा की घाटों पर आध्यात्मिक शांति की खोज में जुट गया। वहाँ उसने दिवाली के उत्सव में हिस्सा लिया, जहां लाखों दीये जलाकर समृद्धि और खुशहाली की कामना की गई थी।この旅は彼にとって非常に啓発的であり、多くの異なる文化から新しいことを学び、新しい友達を作る機会を与えました。彼はこの経験を通じて、 異なる文化の間の共通点と相違点を理解するようになりました。España is your's mine's l'heure èspciâl café über naïve résumé cañón élite cañas Barça 例子 東京 こんにちは 你好 中国"
nfd = unicodedata.normalize('NFD', a)
nfc = unicodedata.normalize('NFC', a)

nfd == nfc  # False
nfd == a    # False
nfc == a    # True
Thanks for looking into it @JoanFM! I do love learning about the rich complexity of Unicode. Yeah, I think the main place this shows up is with languages that use multiple accents per base character, like Vietnamese. But at least in the WordPiece model, we strip these accents out anyway, so it shouldn't make a difference. Overall, it seems like embedding models tend to ignore accents pretty aggressively, possibly because English and Chinese are so dominant in that space right now. For instance, the original …
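A small illustration of why the mark order rarely matters for WordPiece-style models, assuming a strip-accents step like the one described: once NFD has been applied and the combining marks are dropped, their relative order no longer matters:

import unicodedata

def strip_accents(text: str) -> str:
    # Decompose, then drop every combining mark (general category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Vietnamese-style stacked marks: whatever order the marks end up in after NFD,
# stripping them yields the same base characters.
print(strip_accents("Việt Nam"))  # 'Viet Nam'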
Yes, even trying to fix NFD in #7122 I struggled to find a test failing for that case. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Hello @ggerganov,
Thanks for the great project. I have been trying to include the Jina Embedding models into llama.cpp, as you can see in #6826. I have been successful in having it run for most of the models that Jina offers (English, Spanish and German), but I cannot get it working for Chinese.
I have seen that the issue comes from the tokenization part of the model, and I have been digging into the code of llama.cpp as well as the one from tokenizers in HuggingFace. I have some questions that I will try to place here.
1st. - How to know which tokenizer needs to be used for each model? For instance, I see that the SPM and BPE tokenizers here seem to work quite similarly, but there are some discrepancies.
2nd. - I have seen that the problem with the Chinese model, when it comes to the differences in output compared to the usage of transformers, comes from the fact that the model uses some Normalizers and PreTokenizers that are very hard to configure in llama.cpp.
I wonder if there would be a need to do some refactoring in the tokenizer to enable decoupling the tokenizing logic from the surrounding normalization code, plus some options to have a richer mapping of the tokenizer options between transformers and llama.cpp.
I am not sure if my observations here make any sense, or if I am just misusing the project or misunderstanding some of the concepts.
Thank you for the great work, and happy to bring some help.