
Certain models require token id changes in their configs #4

Open
IamAdiSri opened this issue Dec 28, 2022 · 4 comments

IamAdiSri (Owner) commented Dec 28, 2022

The regular MBart model (as opposed to MBart-50), for example, has a config property decoder_start_token_id that needs to be updated after the model is trimmed; the model pulls this id from the config during decoding.

This change can be made with the following snippet:

mt.trimmed_model.config.update({
    # Map the old decoder_start_token_id into the trimmed vocabulary:
    # old id -> token (via the original tokenizer) -> new id (via the trimmed one).
    'decoder_start_token_id': tt.trimmed_tokenizer.convert_tokens_to_ids(
        tt.tokenizer.convert_ids_to_tokens(mt.model.config.decoder_start_token_id)
    )
})

It is highly likely that other models have this problem as well, and accounting for it will require some breaking changes. Leaving this up as an issue to fix in release 4.
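For reference, here is a toy illustration of the remapping the snippet above performs, independent of any real model. The vocabularies and ids below are made up; the dicts simply stand in for the original and trimmed tokenizers:

```python
# Hypothetical vocabularies standing in for the full and trimmed tokenizers.
old_vocab = {"<s>": 0, "</s>": 2, "hello": 31414, "en_XX": 250004}
trimmed_vocab = {"<s>": 0, "</s>": 2, "hello": 3, "en_XX": 4}  # ids shift after trimming

# Invert the old vocab so we can go from an old id back to its token.
old_id_to_token = {i: t for t, i in old_vocab.items()}

def remap_id(old_id):
    """Map a token id from the original vocab to its id in the trimmed vocab."""
    return trimmed_vocab[old_id_to_token[old_id]]

# MBart starts decoding from a language code token, e.g. en_XX.
old_decoder_start = old_vocab["en_XX"]          # 250004 in the full model
new_decoder_start = remap_id(old_decoder_start)
print(new_decoder_start)  # 4
```

The real fix does exactly this, with convert_ids_to_tokens and convert_tokens_to_ids playing the roles of the two dict lookups.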

@IamAdiSri IamAdiSri self-assigned this Dec 28, 2022
avacaondata commented Mar 1, 2023

Does this affect mt5 models also? @IamAdiSri

IamAdiSri (Owner) commented Mar 1, 2023

@avacaondata Hi! Looking at the code in the HuggingFace repository, this does affect mt5 models. Even if it didn't, you could run the fix above to be safe; it shouldn't cause issues.

Here are the steps to follow to run a trimmed model as intended:

  1. Load the model and trim it.
  2. Update the decoder_start_token_id in the config, as shown above.
  3. Save the model and tokenizer. (optional)
  4. Reload a new instance of the model and tokenizer for use. (optional)

Saving the trimmed model and starting a new instance lets you discard the full model and free up memory, so I generally recommend doing that.
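The save-and-reload part of steps 2-4 can be sketched as below. Plain dicts and json stand in for the real model and tokenizer objects (the actual calls would be hftrim's trimming utilities plus HuggingFace's save_pretrained/from_pretrained); all values are illustrative:

```python
import json
import os
import tempfile

full_config = {"decoder_start_token_id": 250004, "vocab_size": 250054}

# Step 2: update the ids to their positions in the trimmed vocab
# (the new values here are illustrative).
trimmed_config = dict(full_config, decoder_start_token_id=4, vocab_size=5000)

# Step 3: save the trimmed artifact to disk.
save_dir = tempfile.mkdtemp()
config_path = os.path.join(save_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(trimmed_config, f)

# Discard the in-memory full model so its memory can be reclaimed.
del full_config

# Step 4: reload a fresh instance from disk for actual use.
with open(config_path) as f:
    reloaded = json.load(f)
print(reloaded["decoder_start_token_id"])  # 4
```

The point of the del-then-reload pattern is that only the trimmed artifact survives in memory.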

avacaondata commented
Okay great, I will try that out, thanks! @IamAdiSri

The thing is, shouldn't we respect special tokens (such as the decoder start token id) when trimming the tokenizer? I mean, we want to keep only the tokens present in a certain vocabulary, plus the special tokens, which are typically at the beginning of the vocabulary (idx < 150); those are used in all cases, no matter which data you use for trimming.

IamAdiSri (Owner) commented Mar 1, 2023

@avacaondata That is exactly what the library does. We save all the special tokens; however, after the model is trimmed, their indices in the embedding matrix may change. So we update the model config to tell it the new indices of those same special tokens, and it can then reuse them just as before.

The issue is that HuggingFace has multiple mechanisms for special tokens. Often the model has a default id for a special token, or it asks the tokenizer for the id; hftrim already preserves both of these cases. However, in some cases the token id is read from the config, which the library does not currently update, so you have to do it manually. I'll fix this in the next release so that this case is also handled automatically.

Also, as you noted, most of the special tokens are at the start of the vocabulary, but that does not seem to be the case for the decoder start token id, I think, which is why its index shifts.
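To illustrate why tokens near the start of the vocabulary keep their ids while later ones shift: trimming keeps the surviving tokens in their original order, so each kept token's new id is simply its rank among the kept tokens. A minimal sketch, with made-up ids:

```python
# Original ids of the kept tokens, in their original order. Early special
# tokens (0-3) survive with unchanged ids; a late token such as a language
# code (e.g. 250004 in MBart) gets a much smaller id after trimming.
kept = [0, 1, 2, 3, 31414, 250004]

# New id = rank of the old id among the kept tokens.
new_ids = {old: new for new, old in enumerate(kept)}

print(new_ids[2])       # 2  -- early special token: id unchanged
print(new_ids[250004])  # 5  -- late token: id shifts down
```

This is why a decoder start token that lives near the end of the vocabulary needs the config update, while `<s>`-style tokens at the front usually don't.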

@IamAdiSri IamAdiSri pinned this issue May 16, 2023