
Added support for building an AddedVocabulary based on a pre-existing AddedVocabulary. #1444

Closed

Conversation

eaplatanios
Contributor

(This PR is stacked on top of #1443 and so currently shows the changes for both PRs together)

The changes introduced in this PR are necessary if one wants to maintain the token IDs of a pre-existing AddedVocabulary and, more importantly, if one wants to correctly build a fast tokenizer based on a Python tokenizer.

Consider, for example, the Yi tokenizer. If one tries to load a fast tokenizer for Yi, one ends up with token IDs 64000 and 64001 for the BOS and EOS tokens, respectively. That's because that tokenizer uses custom BOS and EOS tokens but assigns them known/pre-existing IDs via the added_tokens_decoder field. These IDs end up getting ignored when building the fast tokenizer due to the code block starting here. This is also evident from the recommendation of the Yi model authors to set use_fast=False as shown here.

That's at least partially because this library does not support building tokenizers using a pre-existing AddedVocabulary. This PR adds support for that. I plan to follow up with a PR in the transformers library after this PR is merged, updating that code block.
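A minimal way to observe the mismatch (the checkpoint id below is only an assumption for illustration; any Yi checkpoint whose added_tokens_decoder maps the custom BOS/EOS tokens to IDs 1 and 2 shows the same behavior):

    from transformers import AutoTokenizer

    # "01-ai/Yi-34B" is an assumed repo id, used only for illustration.
    slow = AutoTokenizer.from_pretrained("01-ai/Yi-34B", use_fast=False)
    fast = AutoTokenizer.from_pretrained("01-ai/Yi-34B", use_fast=True)

    print(slow.bos_token_id, slow.eos_token_id)  # 1 2, as declared in added_tokens_decoder
    print(fast.bos_token_id, fast.eos_token_id)  # 64000 64001, freshly assigned during conversion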

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator

ArthurZucker commented Jan 23, 2024

That looks nice I'll have a look! Thanks for opening it!

@ArthurZucker
Collaborator

Thanks for the PR!
I just want to make sure this is actually needed:

Consider, for example, the Yi tokenizer. If one tries to load a fast tokenizer for Yi, one ends up with token IDs 64000 and 64001 for the BOS and EOS tokens, respectively. That's because that tokenizer uses custom BOS and EOS tokens but assigns them known/pre-existing IDs via the added_tokens_decoder field. These IDs end up getting ignored when building the fast tokenizer due to the code block starting here. This is also evident from the recommendation of the Yi model authors to set use_fast=False as shown here.

If the BOS and EOS tokens are part of the vocab, they already are, and should remain, associated with those IDs. I am not sure I follow what the issue is here.

@eaplatanios
Contributor Author

@ArthurZucker what happens with the Yi tokenizer is the following. The config of the tokenizer includes this field:

"added_tokens_decoder": {
    ...
    "1": {
        "content": "<|startoftext|>",
        "special": true
    },
    "2": {
        "content": "<|endoftext|>",
        "special": true
    },
    ...
}

This means that the tokenizer uses ID 1 for the special <|startoftext|> token and 2 for the special <|endoftext|> token. Now note that the Llama tokenizer upon which it's built always has ID 1 used for token <s> and 2 used for token </s>. That is, <|startoftext|> and <|endoftext|> are not in the original/base tokenizer.

So, when this tokenizer is loaded, it hits this code block. If you read that code, you'll notice that it takes the information in the added_tokens_decoder field, drops the token IDs entirely, and then just looks up the special tokens in the underlying/base tokenizer's vocabulary. If they don't exist, it assigns new IDs to them, which in this case are 64000 and 64001. This causes code using this tokenizer to break because these IDs are invalid.

Instead, the recommendation when using the Yi model is to set use_fast=False as shown here to avoid this issue.

This PR adds functionality that will enable me to fix the underlying behavior in the linked code block from the transformers library so that we can use fast tokenizers with a wider set of models.

@ArthurZucker
Collaborator

I understand your requirements, but that is not the way to go with tokenizers! Tokenizers aims to be very restrictive in how you interact with the object, so I am not in favor of adding this. However, we can fix this in transformers. The idea is that if you are converting from a slow tokenizer, you should be properly adding the tokens in convert_slow. The way we currently convert the tokenizer does not take the actual vocab into account:

    def vocab(self, proto):
        # The first three entries are hard-coded, regardless of what the slow
        # tokenizer's added_tokens_decoder declares for IDs 0-2.
        vocab = [
            ("<unk>", 0.0),
            ("<s>", 0.0),
            ("</s>", 0.0),
        ]
        # The rest of the vocab comes straight from the sentencepiece proto.
        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
        return vocab

That is hard-coded in the LlamaConverter.
Yi chat probably manually changed the added_tokens_decoder, but that can lead to a lot of issues for the tokenizers (since the backend is pretty much stateless, you would need to re-compile the normalization regex, etc.).

We should modify def vocab(self, proto) to take the added_tokens_decoder into account, at least for the beginning token.
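A possible shape of that change (a sketch only; the attribute names and the AddedToken interface are assumed here, and this is not the actual transformers code):

    def vocab(self, proto):
        # Prefer the strings the slow tokenizer maps to IDs 0-2 via
        # added_tokens_decoder, falling back to the hard-coded Llama defaults.
        decoder = getattr(self.original_tokenizer, "added_tokens_decoder", {})
        defaults = ["<unk>", "<s>", "</s>"]
        vocab = [
            (decoder[i].content if i in decoder else defaults[i], 0.0)
            for i in range(3)
        ]
        vocab += [(piece.piece, piece.score) for piece in proto.pieces[3:]]
        return vocab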

WDYT?

@eaplatanios
Contributor Author

Hmm, so let me take a step back, summarize the changes here, and also try to understand your suggested fix. First of all, I assume the changes introduced in #1443 are uncontroversial, right? They just add some functions that appear to be missing in the first place. If so, could you please approve that one so we can decouple the changes?

Regarding the main change introduced in this PR (i.e., set_added_tokens_decoder), I understand your point and I agree that this change is non-ideal. However, I do not understand your proposed fix as described here:

We should modify the def vocab(self, proto) to take into account the added_tokens_decoder at least for the begin token.

Could you please elaborate? Also, how does it relate to this code block that I linked earlier, which seems broken for Yi (due to the reuse of pre-existing token IDs, as I described earlier)?

@ArthurZucker
Collaborator

Of course!
The idea is that when you convert a slow to a fast tokenizer in transformers, using from_slow will always re-build the vocab from scratch. But the current conversion always sets the first token and second tokens to <s> and </s>. While if you manually overwrite the tokenizer.json you can just replace these with whichever string you want.
If we update the LlamaConverter to make sure it properly builds the vocab based on the original_tokenizer.added_tokens_decoder then you will not have a problem anymore. On of the main goals of the tokenizers is to be stateless, so the changes in this PR are against this.

@eaplatanios
Contributor Author

@ArthurZucker thanks! In that case, I can look into that fix for transformers. However, I'd still need #1443 to get merged for supporting our use case, separate from what's discussed in this PR. Could you please provide a review on that one? In the meantime, I'll go ahead and close this PR.

@ArthurZucker
Collaborator

huggingface/transformers#29797 is probably what you are looking for!

@eaplatanios
Contributor Author

@ArthurZucker that's great, thanks!
