
Added support for building an AddedVocabulary based on a pre-existing AddedVocabulary. #1444

Closed

Conversation

eaplatanios
Contributor

(This PR is stacked on top of #1443 and so currently shows the changes for both PRs together)

The changes introduced in this PR are necessary if one wants to maintain the token IDs of a pre-existing AddedVocabulary and, more importantly, if one wants to correctly build a fast tokenizer based on a Python tokenizer.

Consider, for example, the Yi tokenizer. If one tries to load a fast tokenizer for Yi, one ends up with token IDs 64000 and 64001 for the BOS and EOS tokens, respectively. That's because that tokenizer uses custom BOS and EOS tokens but assigns them known/pre-existing IDs via the added_tokens_decoder field. These IDs end up getting ignored when building the fast tokenizer due to the code block starting here. This is also evident from the recommendation of the Yi model authors to set use_fast=False as shown here.

That's at least partially because this library does not support building tokenizers using a pre-existing AddedVocabulary. This PR adds support for that. I plan to follow up with a PR in the transformers library after this PR is merged, updating that code block.
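A minimal way to observe the mismatch (the checkpoint id below is only an assumption for illustration; any Yi checkpoint whose added_tokens_decoder maps the custom BOS/EOS tokens to IDs 1 and 2 shows the same behavior):

    from transformers import AutoTokenizer

    # "01-ai/Yi-34B" is an assumed repo id, used only for illustration.
    slow = AutoTokenizer.from_pretrained("01-ai/Yi-34B", use_fast=False)
    fast = AutoTokenizer.from_pretrained("01-ai/Yi-34B", use_fast=True)

    print(slow.bos_token_id, slow.eos_token_id)  # 1 2, as declared in added_tokens_decoder
    print(fast.bos_token_id, fast.eos_token_id)  # 64000 64001, freshly assigned during conversion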

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator

ArthurZucker commented Jan 23, 2024

That looks nice I'll have a look! Thanks for opening it!

@ArthurZucker
Collaborator

Thanks for the PR!
I just want to make sure this is actually needed:

Consider, for example, the Yi tokenizer. If one tries to load a fast tokenizer for Yi, one ends up with token IDs 64000 and 64001 for the BOS and EOS tokens, respectively. That's because that tokenizer uses custom BOS and EOS tokens but assigns them known/pre-existing IDs via the added_tokens_decoder field. These IDs end up getting ignored when building the fast tokenizer due to the code block starting here. This is also evident from the recommendation of the Yi model authors to set use_fast=False as shown here.

If the BOS and EOS tokens are part of the vocab, they already are, and should remain, associated with those IDs. I am not sure I follow what the issue is here.

@eaplatanios
Contributor Author

@ArthurZucker what happens with the Yi tokenizer is the following. The config of the tokenizer includes this field:

"added_tokens_decoder": {
    ...
    "1": {
        "content": "<|startoftext|>",
        "special": true
    },
    "2": {
        "content": "<|endoftext|>",
        "special": true
    },
    ...
}

This means that the tokenizer uses ID 1 for the special <|startoftext|> token and 2 for the special <|endoftext|> token. Now note that the Llama tokenizer upon which it's built always has ID 1 used for token <s> and 2 used for token </s>. That is, <|startoftext|> and <|endoftext|> are not in the original/base tokenizer.

So, when this tokenizer is loaded, it hits this code block. If you read that code, you'll notice that it takes the information in the added_tokens_decoder field, drops the token IDs entirely, and then just looks up the special tokens in the underlying/base tokenizer's vocabulary. If they don't exist, it assigns new IDs to them, which in this case are 64000 and 64001. This causes code using this tokenizer to break because these IDs are invalid.

Instead, the recommendation when using the Yi model is to set use_fast=False as shown here to avoid this issue.

This PR adds functionality that will enable me to fix the underlying behavior in the linked code block from the transformers library so that we can use fast tokenizers with a wider set of models.

@ArthurZucker
Collaborator

I understand your requirements, but that is not the way to go with tokenizers! Tokenizers aims to be very restrictive in how you interact with the object, so I am not in favor of adding this. However, we can fix this in transformers. The idea is that if you are converting from a slow tokenizer, you should be properly adding the tokens in convert_slow. The way we currently convert the tokenizer does not take the actual vocab into account:

    def vocab(self, proto):
        # The first three entries are hard-coded, regardless of what the slow
        # tokenizer's added_tokens_decoder declares for IDs 0-2.
        vocab = [
            ("<unk>", 0.0),
            ("<s>", 0.0),
            ("</s>", 0.0),
        ]
        # The rest of the vocab comes straight from the sentencepiece proto.
        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
        return vocab

That is hard-coded in the LlamaConverter.
Yi chat probably manually changed the added_tokens_decoder, but that can lead to a lot of issues for the tokenizers (since the backend is pretty much stateless, you would need to re-compile the normalization regex, etc.).

We should modify def vocab(self, proto) to take the added_tokens_decoder into account, at least for the beginning token.
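A possible shape of that change (a sketch only; the attribute names and the AddedToken interface are assumed here, and this is not the actual transformers code):

    def vocab(self, proto):
        # Prefer the strings the slow tokenizer maps to IDs 0-2 via
        # added_tokens_decoder, falling back to the hard-coded Llama defaults.
        decoder = getattr(self.original_tokenizer, "added_tokens_decoder", {})
        defaults = ["<unk>", "<s>", "</s>"]
        vocab = [
            (decoder[i].content if i in decoder else defaults[i], 0.0)
            for i in range(3)
        ]
        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
        return vocab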

WDYT?

@eaplatanios
Contributor Author

Hmm, so let me take a step back, summarize the changes here, and also try to understand your suggested fix. First of all, I assume the changes introduced in #1443 are uncontroversial, right? They just add some functions that appear to be missing in the first place. If so, could you please approve that one so we can decouple the changes?

Regarding the main change introduced in this PR (i.e., set_added_tokens_decoder), I understand your point and I agree that this change is non-ideal. However, I do not understand your proposed fix as described here:

We should modify the def vocab(self, proto) to take into account the added_tokens_decoder at least for the begin token.

Could you please elaborate? Also, how does it relate to this code block that I linked earlier, which seems broken for Yi (due to the reuse of pre-existing token IDs, as I described earlier)?

@ArthurZucker
Collaborator

Of course!
The idea is that when you convert a slow to a fast tokenizer in transformers, using from_slow will always re-build the vocab from scratch. But the current conversion always sets the first token and second tokens to <s> and </s>. While if you manually overwrite the tokenizer.json you can just replace these with whichever string you want.
If we update the LlamaConverter to make sure it properly builds the vocab based on the original_tokenizer.added_tokens_decoder then you will not have a problem anymore. On of the main goals of the tokenizers is to be stateless, so the changes in this PR are against this.

@eaplatanios
Contributor Author

@ArthurZucker thanks! In that case, I can look into that fix for transformers. However, I'd still need #1443 to get merged for supporting our use case, separate from what's discussed in this PR. Could you please provide a review on that one? In the meantime, I'll go ahead and close this PR.

@ArthurZucker
Collaborator

huggingface/transformers#29797 is probably what you are looking for!

@eaplatanios
Contributor Author

@ArthurZucker that's great, thanks!
