Decide pre tokenizer based on preprocessing of entry and not on tokens encoded #7039
Conversation
Can you explain why?
Hey @slaren, I am not so much convinced about the solution as about the problem itself, but let me show an example. I have a tokenizer whose main preprocessing step is to lowercase the input, among some other things. If I do this:

```python
tokens = tokenizer.encode("Hello World my love")
print(tokenizer.decode(tokens))
```

I get back the lowercased text rather than the original input. I believe this behavior would be invariant to the vocabulary. I am not sure if there are other non-working edge cases, or if there is a better API for this.
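A minimal, self-contained sketch of this behavior, using the standalone `tokenizers` library with a toy word-level vocab made up for illustration (not the actual tokenizer under discussion):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace

# Toy vocab for illustration only.
vocab = {"hello": 0, "world": 1, "my": 2, "love": 3, "[UNK]": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.normalizer = Lowercase()       # normalization step: lowercase the input
tok.pre_tokenizer = Whitespace()   # split on whitespace/punctuation

ids = tok.encode("Hello World my love").ids
print(ids)              # [0, 1, 2, 3]
print(tok.decode(ids))  # lowercased text, not the original input
```

Decoding maps the IDs back to symbols, so the result reflects the normalization and splitting but not which IDs the vocab happens to assign to each symbol.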
I think we can all agree that a way to detect the pre-tokenization logic only would be desirable. I do not think that the current solution is going to scale, but the main challenge is coming up with an implementation that can do that reliably. In your example, I can see that this may be able to detect cases where the pre-tokenizer transforms the words to lowercase, but my understanding is that for the most part pre-tokenizers are responsible for doing the initial splitting of the tokens, and I do not see how this approach can help with detecting that.
Maybe you are right. The challenges I am facing now are with the decoding step. Maybe the solution is not simply to decode the full sequence, but to map back to the symbol from the ID for each token in the token vector. This would be a step towards being more vocab independent.
@slaren I did the edit to avoid decoding the full sequence.
The tokenizer is already trained. The merges are completed, as the tokenizer is the output. The tokenizer's job is simply to convert between numerical and symbolic representations. The IDs in the tokenizer simply match the merged symbolic representations. We're simply going back and forth (translating) between these representations because the tokenizer is simply a mapping. In simple terms, tokens are just parts of words. I'm not sure I see what the benefit of this is when we're already checking and verifying that the translation works as expected. Am I missing something here?
Hey @teleprint-me, wouldn't it be the case that if I have 2 tokenizers, say tokenizer A and tokenizer B, with the vocabs A -> {"hello": 0, "world": 1} and B -> {"world": 0, "hello": 1}, then even if they do the same preprocessing and normalizer steps, they would be recognized by the current algorithm as different pre-tokenizers? However, I still think that if the pre-tokenization steps are the same, the detection should not depend on the vocabulary.
I added a new approach. I think the assumption here is that it uses a fast tokenizer.
This looks like it has a better chance of detecting the pre-tokenizer token splitting instead of only the normalization, but as you noted, it relies on that assumption.
I can try to detect if the tokenizer is slow. I will try to make progress.
There is a common property. The tokenizers inherit from a base class that defines:

```python
@property
def is_fast(self) -> bool:
    return False
```

It would be more reliable to test whether the tokenizer is a fast tokenizer. Both the slow and fast tokenizers expose this property. The con to this approach is that it forgoes using any other library in favor of a framework, e.g. choosing `transformers`.

If A and B are not equal, even if the values in theory are the same, the mappings are not equal.

```python
>>> A = {"hello": 0, "world": 1}
>>> B = {"world": 0, "hello": 1}
>>> A == B
False
```

You can validate this with a simple assertion. Following this line of thought, the hash sums will also be different. We can think of A and B as having the same properties while having different identities, but this would be a non sequitur. The logic would not be following the premise, essentially revealing a contradiction, e.g., "world" does not have a value of 0 in both A and B.
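A small self-contained sketch of that last point, reusing the toy A/B vocabs above (the check text and hashing scheme here are illustrative, mirroring the token-ID hashing being discussed):

```python
from hashlib import sha256

A = {"hello": 0, "world": 1}
B = {"world": 0, "hello": 1}

text = "hello world"                   # made-up check text
ids_a = [A[w] for w in text.split()]   # [0, 1]
ids_b = [B[w] for w in text.split()]   # [1, 0]

hash_a = sha256(str(ids_a).encode()).hexdigest()
hash_b = sha256(str(ids_b).encode()).hexdigest()

# Same text and same splitting, but different ID assignments -> different hashes.
print(hash_a == hash_b)  # False
```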
Hey @teleprint-me, I can update the PR to use the is_fast assertion. As for the second part of the response, I do not follow. I was just trying to make the point that the vocabulary being used should not affect the result of the hashing if what we aim to detect is the pre-tokenization steps (or identity) and not the full tokenization process.
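A minimal sketch of what such an assertion could look like ("some/model" is a placeholder model id, not part of the PR):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some/model")  # placeholder

# Fast (Rust-backed) tokenizers expose the backend pipeline used later on;
# bail out early if we only got a slow tokenizer.
if not tokenizer.is_fast:
    raise TypeError("pre-tokenizer detection requires a fast tokenizer")
```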
All I did was follow your example and line of thinking.
Yes, they would be recognized by the current algorithm as different pretokenizers because A and B are not the same. They are not equal. A != B. The keys are the same, but the values are different. The hashes produced for each tokenizer would be different.
The answer is no, they will not be the same. If you mean hashes, then you cannot map back. Hashes are, in theory (at least until someone discovers a proof otherwise), one-way. If you mean decoding, then the answer is still no, for the same reason. Just because the keys are the same does not mean the values are the same, and vice versa.

Any change in the key and/or value will yield a different hash. Even if a single bit is modified, it will yield a different hash. If a hash is generated for A and another hash is generated for B, then each should stay the same for its respective tokenizer, assuming the input remains the same.

If you're concerned about "exploding" vocabularies, which I don't think is really applicable to BPE because BPE compresses the mapped data, then I would set a constant to limit the input and output, which is how it's currently implemented. The naive way would be to just read the entire vocab at once, which is more expensive, especially as the vocabulary grows. Languages are finite, so there is an approximate upper limit, even as languages evolve over time.

It's possible I'm not understanding something. It happens. Feel free to correct my understanding if you feel I've misunderstood your stance.
Hey @teleprint-me, I think we are derailing from the point of the PR. My understanding is that the goal of this mechanism is to identify the pre-tokenization configuration of a model. Therefore I think it would be best that the resulting hash depend only on the pre-tokenization steps and not on the vocabulary.

I believe this is not happening with the current implementation, because the full tokenization process is being used, and I was just presenting a dummy example of a potential tokenizer logic where, with the same pre-tokenization but a different vocabulary, the current algorithm would produce different hashes. I can give concrete cases of this if needed.

I hope this clarifies.
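A rough sketch of the kind of vocab-independent hash being argued for here, assuming a fast tokenizer whose `backend_tokenizer` exposes a pre-tokenizer ("some/model" is a placeholder):

```python
from hashlib import sha256
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some/model")  # placeholder
text = "Hello, world!"

# Current approach: the hash depends on the vocabulary through the token IDs.
id_hash = sha256(str(tokenizer.encode(text)).encode()).hexdigest()

# Sketched alternative: hash only how the pre-tokenizer splits the text,
# which involves no token IDs and therefore no vocabulary.
pieces = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
split_hash = sha256(str([piece for piece, _span in pieces]).encode()).hexdigest()

print(id_hash)
print(split_hash)
```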
It's just a conversion script. The tokenization already took place when the tokenizer was created. The encodings are used to create the hashes that identify the model's vocabulary based on the encoded IDs. That's really it. If you'd like more information on this rationale, you can reference PR #5981, which has follow-up PRs going into more detail: #6252 #6920. Here is a condensed and detailed outline of what's happening in the update script.

```
21:00:28 | /mnt/valerie/forked/ggerganov/llama.cpp
(.venv) git:(master | θ) λ bpython
bpython version 0.24 on top of Python 3.12.3 /mnt/valerie/forked/ggerganov/llama.cpp/.venv/bin/python
>>> import transformers
>>> transformers.__version__
'4.40.1'
>>> from transformers import AutoTokenizer
>>> from pathlib import Path
>>> model_path = Path("/mnt/valerie/models/stabilityai/stablelm-2-1_6b-chat")
>>> model_path.exists()
True
>>> tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
>>> type(tokenizer)
<class 'transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast'>
>>> tokenizer.is_fast
True
>>> # The pre-tokenizer comes AFTER the normalization stage
>>> # There's no need to normalize as a result.
>>> # Source: https://huggingface.co/learn/nlp-course/chapter6/4
>>> tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, world!")
[('Hello', (0, 5)), (',', (5, 6)), ('Ġworld', (6, 12)), ('!', (12, 13))]
>>> # We do not need the merges. We need the encodings.
>>> tokenizer.encode("Hello, world!")
[9906, 11, 1917, 0]
>>> # The encodings are then hashed to identify the model
>>> from hashlib import sha256
>>> sha256(str(tokenizer.encode("Hello, world!")).encode()).hexdigest()
'dd4920b7ab0c639e3f71374738daf5646eb098fd397733fff8b939af0c3c3688'
>>>
```

You're gambling by attempting to normalize the already normalized merges (it was already trained, merged, etc).

```
>>> tokenizer.backend_tokenizer.normalizer.normalize_str("Hello, world!")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    tokenizer.backend_tokenizer.normalizer.normalize_str("Hello, world!")
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'normalize_str'
>>> type(tokenizer.backend_tokenizer.normalizer)
<class 'NoneType'>
>>>
```

This should never be done under any circumstance. You're relying on the underlying API here. This is all abstracted away and handled by private methods that are already baked into the class. I understand that you want to include Jina and I support your effort in that regard, but not like this. The conversion scripts should be model agnostic. Model-specific details should be handled accordingly within their respective APIs.

Perhaps looking into the embeddings implementation or some other form of support for cosine similarity would be a good idea? I don't know. @slaren is the expert in the backend and would be a better guide in that regard than I.

@slaren Perhaps renaming the metadata property to encoding could guard against this type of confusion?

@JoanFM It should be noted that the normalizer can be None, as shown above.

You're the experts here. I'm the novice. I understand that we're just people at the end of the day and I will always defer to your expertise simply because you have more experience than I. I just think this is a matter of misinterpretation and I can understand why.
Hey @teleprint-me, I agree that the way I am implementing this here is far from perfect, because it relies on a convoluted API that is not meant to be publicly available. This is exactly why I am keeping this PR as a draft, so that we can keep the discussion flowing. My idea is to start thinking about whether what is being done right now is the best way to identify pre-tokenizer configurations (IMHO it is not, for the reasons I have been describing); a separate discussion is whether living with the problems of the current system is a better alternative than fixing them with the available solutions.
Yes, this is all correct - the current approach is not optimal in the sense that the hash is a function not only of the pre-tokenization, but also of the actual vocabulary. So two different models that have the same pre-tokenization but different vocabs would lead to different hashes. In such cases, the plan for now is to assign the same pre-tokenizer names/keys (e.g. "llama-spm", "llama-bpe"), or add new names and handle them in the same way in the code.
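A schematic sketch of what that name/key assignment looks like on the conversion side (the table contents, helper name, and check text are illustrative, not the actual entries in llama.cpp's conversion script):

```python
from hashlib import sha256

# Illustrative mapping from a vocabulary-dependent hash to a pre-tokenizer name.
# Two models with the same pre-tokenization but different vocabs would need two
# entries here, both pointing at the same name.
KNOWN_PRE_TOKENIZERS: dict[str, str] = {
    # "<hash recorded for model family X>": "llama-bpe",
    # "<hash recorded for model family Y>": "llama-spm",
}

def detect_pre_tokenizer(tokenizer, chktxt: str = "Hello, world!") -> str:
    chkhsh = sha256(str(tokenizer.encode(chktxt)).encode()).hexdigest()
    try:
        return KNOWN_PRE_TOKENIZERS[chkhsh]
    except KeyError:
        raise NotImplementedError(f"unrecognized pre-tokenizer (hash {chkhsh})") from None
```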
Hello @ggerganov, thanks for the explanation. At least you have clarified that I was not missing something. I will continue with your recommendation and close this PR for the time being. I am also not an expert, just learning as I try to implement/fix things. Thank you very much.
I am opening this PR to start a discussion on whether the way we currently decide which pre-tokenization logic to apply should be based on the actual tokens obtained or on the decoding of those tokens. Since the tokens can be affected by the vocabulary, the number of pre-tokenizations would explode for different vocabularies even though the pre-tokenization logic should be the same.
If this is agreed upon, I can update the hashes already computed for the models already considered.