Decide pre tokenizer based on preprocessing of entry and not on tokens encoded #7039

JoanFM · 2024-05-02T10:04:14Z

I open this PR to start the discussion of whether the way we have currently to decide which pretokenization logic to apply should be based on the actual tokens obtained or in the decoding of those. Since the tokens can be affected by the vocabulary, the number of pretokenizations would explode for different vocabularies even though the pretokenization logic should be the same.

If this is agreed, I can edit the hashes already computed for the models already considered.

slaren · 2024-05-02T11:04:57Z

Can you explain why decode(encode(x)) is a good way to detect the pretokenization logic? I would expect the result to be x.

JoanFM · 2024-05-02T11:24:40Z

Hey @slaren ,

I am not so much convinced about the solution as about the problem itself, but let me show an example, I have a tokenizer which preprocessing main step is to do LowerCase, and some other things.

If I do this,

tokens = tokenizer.encode("Hello World my love")
print(tokenizer.decode(tokens))

I get <s> hello world my love </s>

I beliee this behavior would be invariant to the vocabulary used.

I am not sure if there is another non working edge cases, or if there is a better API in tokenizers to do so.

slaren · 2024-05-02T11:43:29Z

I think that we can all agree that a way to detect the pretokenization logic only would be desirable, I do not think that the current solution is going to scale, but the main challenge of doing is that is coming up with an implementation that can do that reliably. In your example, I can see that this may be able to detect cases where the pretokenizer transforms the words to lowercase, but my understanding is that for the most part pretokenizers are responsible for doing the initial splitting of the tokens, and I do not see how this approach can help with detecting that.

JoanFM · 2024-05-02T11:48:42Z

I think that we can all agree that a way to detect the pretokenization logic only would be desirable, I do not think that the current solution is going to scale, but the main challenge of doing is that is coming up with an implementation that can do that reliably. In your example, I can see that this may be able to detect cases where the pretokenizer transforms the words to lowercase, but my understanding is that for the most part pretokenizers are responsible for doing the initial splitting of the tokens, and I do not see how this approach can help with detecting that.

Maybe u are right, the challenges I am facing now are with what tokenizers describe as normalizers, which may need to be handled differently.

Maybe the solution is not simply to decode the full, but map back to symbol from the ID for each token in the token vector. This would be a step towarss being more vocab independent.

JoanFM · 2024-05-02T13:33:25Z

@slaren I did the edit, to avoid the decode and just map back to the tokens instead of IDs to improve the invariance w.r.t vocabulary without changing much of the logic.

teleprint-me · 2024-05-02T14:17:24Z

The tokenizer is already trained. The merges are completed as the tokenizer is the output. The tokenizers job is simply to convert between numerical and symbolic representations. The ids in the tokenizer simply match the merged symbolic representations.

We're simply going back and forth (translating) between these representations because the tokenizer is simply a mapping.

In simple terms, tokens are just parts of words. Im not sure I see what the benefit to this is when were already checking amd verifying rhat the translations works as expected.

Am I missing something here?

JoanFM · 2024-05-02T14:22:52Z

The tokenizer is already trained. The merges are completed as the tokenizer is the output. The tokenizers job is simply to convert between numerical and symbolic representations. The ids in the tokenizer simply match the merged symbolic representations.

We're simply going back and forth (translating) between these representations because the tokenizer is simply a mapping.

In simple terms, tokens are just parts of words. Im not sure I see what the benefit to this is when were already checking amd verifying rhat the translations works as expected.

Am I missing something here?

Hey @teleprint-me ,

Wouldn't it be the case that if I have 2 tokenizers, let's say tokenizer A and tokenizer B with the following vocabs.

A -> {"hello" : 0, "world": 1} and B -> {"world: 0, "hello": 1}

Even if they may do the same preprocessing and normalizer steps, they would be recognized by the current algorithm as different pretokenizers? But if I map back to words, as suggested here, they will have the same hash and therefore recognize as needing the same preprocessing? .

However, I still think that if the vocabs are so different than some contain some tokens and some do not contain the tokens, the algorithm still is invalid.

JoanFM · 2024-05-02T14:42:11Z

I added a new change trying to apply the hashing based on the assumption that I can apply normalization and pre_tokenization in an isolated way.

I think the assumption here is that it uses FastTokenizers and I know is ugly to rely in so deep nested methods.

slaren · 2024-05-02T15:14:14Z

This looks like it has a better chance of detecting the pretokenizer token splitting instead of only the normalization, but as you noted backend_tokenizer is a property of PreTrainedTokenizerFast and it is not available in every tokenizer. I tried adding these changes to convert-hf-to-gguf-update.py and it fails with several models (at least llama3, falcon, starcoder and gpt-2).

JoanFM · 2024-05-02T15:23:51Z

This looks like it has a better chance of detecting the pretokenizer token splitting instead of only the normalization, but as you noted backend_tokenizer is a property of PreTrainedTokenizerFast and it is not available in every tokenizer. I tried adding these changes to convert-hf-to-gguf-update.py and it fails with several models (at least llama3, falcon, starcoder and gpt-2).

I can try to detect if is slow, I will try to progress

teleprint-me · 2024-05-04T23:10:39Z

@slaren

There is a common property. The tokenizers inherit from PreTrainedTokenizerBase. The higher level implementation details are specific to those tokenizers.

    @property
    def is_fast(self) -> bool:
        return False

It would be more reliable to test whether the tokenizer is a fast tokenizer. Both PreTrainedTokenizer and PreTrainedTokenizerFast have this property.

The con to this approach is that it forgoes using any other library in favor of a framework, e.g. choosing transformers over sentencepiece or any other method. You'd be wholly relying on a 3rd party framework which creates a dependency; We're doing this already, but it's in isolation.

@JoanFM

If A and B are not equal, even if the values in theory are the same, the mappings are not equal.

>>> A = {"hello" : 0, "world": 1} 
>>> B = {"world": 0, "hello": 1}
>>> A == B
False

You can validate this with a simple assertion.

Following this line of thought, the hash sums will also be different.

We can think of A and B as having the same properties while having different identities, but this would be a non-sequitur. The logic would not be following the premise, essentially revealing a contradiction. e.g., "world" does not have a value of 0 in both A and B.

JoanFM · 2024-05-05T09:11:35Z

@slaren

There is a common property. The tokenizers inherit from PreTrainedTokenizerBase. The higher level implementation details are specific to those tokenizers.
    @property
    def is_fast(self) -> bool:
        return False
It would be more reliable to test whether the tokenizer is a fast tokenizer. Both PreTrainedTokenizer and PreTrainedTokenizerFast have this property.

The con to this approach is that it forgoes using any other library in favor of a framework, e.g. choosing transformers over sentencepiece or any other method. You'd be wholly relying on a 3rd party framework which creates a dependency; We're doing this already, but it's in isolation.

@JoanFM

If A and B are not equal, even if the values in theory are the same, the mappings are not equal.
>>> A = {"hello" : 0, "world": 1} 
>>> B = {"world": 0, "hello": 1}
>>> A == B
False
You can validate this with a simple assertion.

Following this line of thought, the hash sums will also be different.

We can think of A and B as having the same properties while having different identities, but this would be a non-sequitur. The logic would not be following the premise, essentially revealing a contradiction. e.g., "world" does not have a value of 0 in both A and B.

Hey @teleprint-me,

I can update the PR to use the is_fast assertion.

As for the second part of the response, I do not follow. I just tried to make the point that the vocabulary being used should not affect the result of the hashing if what we aim to detect is the pretokenization steps (or identity) and not the full tokenization process.

teleprint-me · 2024-05-06T04:40:46Z

Wouldn't it be the case that if I have 2 tokenizers, let's say tokenizer A and tokenizer B with the following vocabs.
A -> {"hello" : 0, "world": 1} and B -> {"world: 0, "hello": 1}

All I did was follow your example and line of thinking.

Even if they may do the same preprocessing and normalizer steps, they would be recognized by the current algorithm as different pretokenizers?

Yes, they would be recognized by the current algorithm as different pretokenizers because A and B are not the same. They are not equal. A != B. The keys are the same, but the values are different. The hashes produced for each tokenizer would be different.

But if I map back to words, as suggested here, they will have the same hash and therefore recognize as needing the same preprocessing? .

The answer is no, they will not be the same. If you mean hashes, then you can not map back. Hashes are -in theory (at least until someone discovers a proof)- one-way. If you mean decoding, then the answer is still no, for the same reason.

Just because the keys are the same does not mean the values are the same and vice versa. Any change in the key and/or value will yield a different hash. Even if a single bit is modified, it will yield a unique hash.

If a hash is generated for A and another hash is generated for B, then they should be the same for their respective tokenizers assuming the input remains the same.

If you're concerned about "exploding" vocabularies, which I don't think is really applicable to BPE because BPE compresses the mapped data, then I would set a constant to limit the input and output which is how it's currently implemented. The naive way would be to just read the entire vocab at once which is more expensive, especially as the vocabulary grows. Languages are finite, so there is an approximate upper limit, even as languages evolve over time.

It's possible I'm not understanding something. It happens. Feel free to correct my understanding if you feel I've misunderstood your stance.

JoanFM · 2024-05-06T07:33:20Z

Hey @teleprint-me .

I think we are derailing from the point of the PR.

My understanding is that the pre-tokenizer deals with the logic before the actual tokenization takes place and assigns IDs.

Therefore I think it would be best that the same pre-tokenization steps leads to having the same pre-tokenization-key in the most seamless fashion (I know I can assign the same key even if they have different chkhsh, but I think that it would be best if this is automatically detected and one just detects who is using the same pretokenization.

I believe this is not happening with the current implementation because in the current implementation the full tokenization process is being used, and I was just presenting a dummy example of a potential tokenizer logic that with the same pre-tokenization would lead to different chkhsh.

I can give concrete cases of BPE models with different vocabs (in this case these are models for languages other than EN)

I hope this clarifies.

…-pre-tokenizer-hashing

teleprint-me · 2024-05-07T01:31:14Z

@JoanFM

It's just a conversion script. The tokenization already took place when the tokenizer was created.

The encodings are used to create the hashes to identify the models vocabulary based on the encodings IDs. That's really it.

If you'd like more information for this rationale, then you can reference PR #5981 which has follow up PRs going into more detail: #6252 #6920

Here is a condensed and detailed outline of what's happening in the update script.

21:00:28 | /mnt/valerie/forked/ggerganov/llama.cpp
(.venv) git:(master | θ) λ bpython
bpython version 0.24 on top of Python 3.12.3 /mnt/valerie/forked/ggerganov/llama.cpp/.venv/bin/python
>>> import transformers
>>> transformers.__version__
'4.40.1'
>>> from transformers import AutoTokenizer
>>> from pathlib import Path
>>> model_path = Path("/mnt/valerie/models/stabilityai/stablelm-2-1_6b-chat")
>>> model_path.exists()
True
>>> tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
>>> type(tokenizer)
<class 'transformers.models.gpt2.tokenization_gpt2_fast.GPT2TokenizerFast'>
>>> tokenizer.is_fast
True
>>> # The pre-tokenizer comes AFTER the normalization stage
>>> # There's no need to normalize as a result.
>>> # Source: https://huggingface.co/learn/nlp-course/chapter6/4
>>> tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, world!")
[('Hello', (0, 5)), (',', (5, 6)), ('Ġworld', (6, 12)), ('!', (12, 13))]
>>> # We do not need the merges. We need the encodings.
>>> tokenizer.encode("Hello, world!")
[9906, 11, 1917, 0]
>>> # The encodings are then hashed to identify the model
>>> from hashlib import sha256
>>> sha256(str(tokenizer.encode("Hello, world!")).encode()).hexdigest()
'dd4920b7ab0c639e3f71374738daf5646eb098fd397733fff8b939af0c3c3688'
>>>

You're gambling by attempting to normalize the already normalized merges (it was already trained, merged, etc).

>>> tokenizer.backend_tokenizer.normalizer.normalize_str("Hello, world!")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    tokenizer.backend_tokenizer.normalizer.normalize_str("Hello, world!")
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'normalize_str'
>>> type(tokenizer.backend_tokenizer.normalizer)
<class 'NoneType'>
>>>

This should never been done under any circumstance. You're relying on the underlying API here. This is all abstracted away and handled by private methods that are already baked into the class.

I understand that you want to include Jina and I support your effort in that regard, but not like this. The conversion scripts should be model agnostic. Model specific details should be handled accordingly within their respective APIs. Perhaps looking into the embeddings implementation or some other form of support for cosine similarity would be a good idea? I don't know. @slaren is the expert in the backend and would be a better guide in that regard than I.

@slaren Perhaps renaming the metadata property to encoding could guard against this type of confusion?

@JoanFM It should be noted that the get_vocab_base_pre method is auto-generated in the convert-hf-to-gguf-update.py script. So this change will effectively be overwritten at some point in the future regardless. I hope this clarifies any confusion and highlights my concern.

You're the experts here. I'm the novice. I understand that we're just people at the end of day and I will always resign to your expertise simply because you have more experience than I. I just think this is a matter of misinterpretation and I can understand why.

JoanFM · 2024-05-07T06:25:25Z

Hey @teleprint-me,

I agree that the way I am implementing here is far frlm being perfect because it relies on a convoluted API that is not there to be publicly available. This is exactly why I am keeping this PR as a draft so that we can keep the discussion flowing.

My idea is to start thinking if what is being done right now is the best way to identify pre tokenizer configurations (IMHO it is not because of what I have been exposing), another discussion is if the problems this system has may be a better alternative than fixing with the available solutions.

ggerganov · 2024-05-07T07:22:04Z

@JoanFM

Therefore I think it would be best that the same pre-tokenization steps leads to having the same pre-tokenization-key in the most seamless fashion (I know I can assign the same key even if they have different chkhsh, but I think that it would be best if this is automatically detected and one just detects who is using the same pretokenization.

My idea is to start thinking if what is being done right now is the best way to identify pre tokenizer configurations (IMHO it is not because of what I have been exposing), another discussion is if the problems this system has may be a better alternative than fixing with the available solutions.

Yes, this is all correct - the current approach is not optimal in the sense that the hash is a function not only of the pre-tokenization, but also the actual vocabulary. So two different models that have the same pre-tokenization but different vocabs, would lead to different hashes.

In such cases, the plan for now is to assign the same pre-tokenizer names/keys (e.g. "llama-spm", "llama-bpe"), or add new names and handle them in the same way in the tokenize() function (as you noted earlier). We should look for ways to improve / simplify this - I'm not too deep with the tokenizer intricacies in general, learning most of this stuff as we go. The current implementation seemed like the simplest thing we can do to at least have some partial BPE support available

JoanFM · 2024-05-07T07:26:11Z

@JoanFM

Therefore I think it would be best that the same pre-tokenization steps leads to having the same pre-tokenization-key in the most seamless fashion (I know I can assign the same key even if they have different chkhsh, but I think that it would be best if this is automatically detected and one just detects who is using the same pretokenization.

My idea is to start thinking if what is being done right now is the best way to identify pre tokenizer configurations (IMHO it is not because of what I have been exposing), another discussion is if the problems this system has may be a better alternative than fixing with the available solutions.

Yes, this is all correct - the current approach is not optimal in the sense that the hash is a function not only of the pre-tokenization, but also the actual vocabulary. So two different models that have the same pre-tokenization but different vocabs, would lead to different hashes.

In such cases, the plan for now is to assign the same pre-tokenizer names/keys (e.g. "llama-spm", "llama-bpe"), or add new names and handle them in the same way in the tokenize() function (as you noted earlier). We should look for ways to improve / simplify this - I'm not too deep with the tokenizer intricacies in general, learning most of this stuff as we go. The current implementation seemed like the simplest thing we can do to at least have some partial BPE support available

Hello @ggerganov ,

Thanks for the explanation. At least you clarify that I was not missing something. I will continue with your recomendation and close this PR for the time being.

I am also not an expert as learning as I try to implement/fix things.

Thank you very much

feat: change convert hf to gguf

a1aa65e

use conver ids to tokens

0672cd8

fix: only do pre tokenization and normalization

0f94ff7

fix: protect against slow tokenizer

c7a6c32

JoanFM changed the title ~~Decide base tokenizer based on decoded tokenized entry and not on tokens encoded~~ Decide pre tokenizer based on preprocessing of entry and not on tokens encoded May 2, 2024

JoanFM mentioned this pull request May 3, 2024

Tokenizers questions and ... proposals? #6980

Closed

fix: use is_fast assertion

ef5dd36

Merge branch 'master' of https://github.com/JoanFM/llama.cpp into fix…

3a4219a

…-pre-tokenizer-hashing

JoanFM closed this May 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decide pre tokenizer based on preprocessing of entry and not on tokens encoded #7039

Decide pre tokenizer based on preprocessing of entry and not on tokens encoded #7039

JoanFM commented May 2, 2024 •

edited

Loading

slaren commented May 2, 2024

JoanFM commented May 2, 2024

slaren commented May 2, 2024

JoanFM commented May 2, 2024

JoanFM commented May 2, 2024

teleprint-me commented May 2, 2024 •

edited

Loading

JoanFM commented May 2, 2024 •

edited

Loading

JoanFM commented May 2, 2024

slaren commented May 2, 2024

JoanFM commented May 2, 2024

teleprint-me commented May 4, 2024 •

edited

Loading

JoanFM commented May 5, 2024

teleprint-me commented May 6, 2024

JoanFM commented May 6, 2024

teleprint-me commented May 7, 2024 •

edited

Loading

JoanFM commented May 7, 2024

ggerganov commented May 7, 2024

JoanFM commented May 7, 2024

Decide pre tokenizer based on preprocessing of entry and not on tokens encoded #7039

Decide pre tokenizer based on preprocessing of entry and not on tokens encoded #7039

Conversation

JoanFM commented May 2, 2024 • edited Loading

slaren commented May 2, 2024

JoanFM commented May 2, 2024

slaren commented May 2, 2024

JoanFM commented May 2, 2024

JoanFM commented May 2, 2024

teleprint-me commented May 2, 2024 • edited Loading

JoanFM commented May 2, 2024 • edited Loading

JoanFM commented May 2, 2024

slaren commented May 2, 2024

JoanFM commented May 2, 2024

teleprint-me commented May 4, 2024 • edited Loading

JoanFM commented May 5, 2024

teleprint-me commented May 6, 2024

JoanFM commented May 6, 2024

teleprint-me commented May 7, 2024 • edited Loading

JoanFM commented May 7, 2024

ggerganov commented May 7, 2024

JoanFM commented May 7, 2024

JoanFM commented May 2, 2024 •

edited

Loading

teleprint-me commented May 2, 2024 •

edited

Loading

JoanFM commented May 2, 2024 •

edited

Loading

teleprint-me commented May 4, 2024 •

edited

Loading

teleprint-me commented May 7, 2024 •

edited

Loading