Certain models require token id changes in their configs #4
Does this affect mT5 models also? @IamAdiSri
@avacaondata Hi! Looking at the code on the HuggingFace repository, this does affect mT5 models. Even if it doesn't, you can run the fix above to be safe; it shouldn't cause issues. I'll add some steps to follow so you can run the models as intended:
Saving the trimmed model and starting a new instance allows you to discard the full model and free up memory, so I generally recommend doing that.
Okay, great, I will try that out, thanks! @IamAdiSri The thing is, shouldn't we respect special tokens (such as the decoder start token id) when trimming the tokenizer? I mean, we want to keep only the tokens that are present in a certain vocabulary, plus the special tokens, which are typically at the beginning of the vocabulary (idx < 150), as these are used in all cases, no matter which data you use for trimming.
@avacaondata That is exactly what the library does. We save all the special tokens; however, after the model is trimmed their index in the embedding matrix may change. So we update the model config to tell it the new indices of those same special tokens, and then it can reuse them just as before. The issue is that HuggingFace has multiple mechanisms for special tokens. A lot of the time the model has a default id for a special token, or it asks the tokenizer for the id, both of which hftrim already preserves. However, in some cases this token id is read from the config, which the library does not currently update, so you have to do it manually. I'll be fixing this in the next release so that this is also taken care of automatically. Also, as you noted, most of the special tokens are at the start of the vocabulary, but this does not seem to be the case for the decoder start token id, I think, which is why its index shifts.
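To make the index-shifting concrete, here is a toy illustration of why a late-vocabulary special token moves when trimming keeps only the tokens seen in the data plus the special tokens, then reassigns contiguous ids. All token names and ids below are illustrative, not hftrim's actual vocabulary or API.

```python
# Toy vocabulary: early special tokens, regular tokens, and a
# language-code token at the end (as in MBart-style models).
vocab = {"<pad>": 0, "<s>": 1, "</s>": 2, "hello": 3, "cat": 4,
         "world": 5, "<lang_en>": 6}
special = {"<pad>", "<s>", "</s>", "<lang_en>"}
seen_in_data = {"hello", "world"}

# Keep special tokens plus tokens seen in the data, preserving order,
# then assign new contiguous ids.
kept = [tok for tok in sorted(vocab, key=vocab.get)
        if tok in special or tok in seen_in_data]
new_ids = {tok: i for i, tok in enumerate(kept)}

# Early special tokens keep their ids, but <lang_en> (originally id 6,
# the kind of token used as decoder_start_token_id) moves to id 5.
print(new_ids["<pad>"], new_ids["<lang_en>"])  # 0 5
```

The early special tokens survive with the same indices because nothing before them is removed, while any special token sitting past the trimmed region gets a new, smaller index, which is exactly what the config has to be told about.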
The regular MBart model (not MBart-50), for example, has a config property decoder_start_token_id that needs to be updated after the model is trimmed. The model pulls this id from the config during the decoding phase. This change can be made via the following command:
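The actual command was not preserved in this copy of the thread. As a minimal sketch of the kind of update described, one can rewrite decoder_start_token_id in the config.json that save_pretrained() produces; the path and ids below are example values, and in practice the new id would come from the trimmed tokenizer (e.g. via convert_tokens_to_ids on the relevant language-code token).

```python
import json
import os
import tempfile

# Stand-in for a trimmed model directory; in practice this is the
# directory you passed to save_pretrained().
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "config.json")
with open(config_path, "w") as f:
    json.dump({"decoder_start_token_id": 250004}, f)  # pre-trim id (example)

# New index of the decoder start token in the trimmed vocabulary
# (example value; look it up from the trimmed tokenizer).
new_id = 42

with open(config_path) as f:
    config = json.load(f)
config["decoder_start_token_id"] = new_id
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

with open(config_path) as f:
    print(json.load(f)["decoder_start_token_id"])  # 42
```

Equivalently, one could set model.config.decoder_start_token_id on the loaded model before saving it; either way, the point is that the config's copy of the id must match the token's post-trim index.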
It is highly likely that other models have this problem as well, and accounting for it will require some breaking changes. Leaving this up as an issue to fix in release 4.