
Certain models require token id changes in their configs #4

Open
IamAdiSri opened this issue Dec 28, 2022 · 4 comments

IamAdiSri (Owner) commented Dec 28, 2022

The regular MBart model (as opposed to MBart-50), for example, has a config property decoder_start_token_id that needs to be updated after the model is trimmed; the model pulls this id from the config during decoding.

This change can be made with the following snippet:

mt.trimmed_model.config.update({
    # Map the old decoder_start_token_id into the trimmed vocabulary:
    # old id -> token (via the original tokenizer) -> new id (via the trimmed one).
    'decoder_start_token_id': tt.trimmed_tokenizer.convert_tokens_to_ids(
        tt.tokenizer.convert_ids_to_tokens(mt.model.config.decoder_start_token_id)
    )
})

It is highly likely that other models have this problem as well, and accounting for it will require some breaking changes. Leaving this up as an issue to fix in release 4.
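For reference, here is a toy illustration of the remapping the snippet above performs, independent of any real model. The vocabularies and ids below are made up; the dicts simply stand in for the original and trimmed tokenizers:

```python
# Hypothetical vocabularies standing in for the full and trimmed tokenizers.
old_vocab = {"<s>": 0, "</s>": 2, "hello": 31414, "en_XX": 250004}
trimmed_vocab = {"<s>": 0, "</s>": 2, "hello": 3, "en_XX": 4}  # ids shift after trimming

# Invert the old vocab so we can go from an old id back to its token.
old_id_to_token = {i: t for t, i in old_vocab.items()}

def remap_id(old_id):
    """Map a token id from the original vocab to its id in the trimmed vocab."""
    return trimmed_vocab[old_id_to_token[old_id]]

# MBart starts decoding from a language code token, e.g. en_XX.
old_decoder_start = old_vocab["en_XX"]          # 250004 in the full model
new_decoder_start = remap_id(old_decoder_start)
print(new_decoder_start)  # 4
```

The real fix does exactly this, with convert_ids_to_tokens and convert_tokens_to_ids playing the roles of the two dict lookups.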

@IamAdiSri IamAdiSri self-assigned this Dec 28, 2022
avacaondata commented Mar 1, 2023

Does this affect mt5 models also? @IamAdiSri

IamAdiSri (Owner) commented Mar 1, 2023

@avacaondata Hi! Looking at the code in the HuggingFace repository, this does affect mt5 models. Even if it didn't, you could run the fix above to be safe; it shouldn't cause issues.

Here are the steps to follow to run a trimmed model as intended:

  1. Load the model and trim it.
  2. Update the decoder_start_token_id in the config, as shown above.
  3. Save the model and tokenizer. (optional)
  4. Reload a new instance of the model and tokenizer for use. (optional)

Saving the trimmed model and starting a new instance lets you discard the full model and free up memory, so I generally recommend doing that.
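The save-and-reload part of steps 2-4 can be sketched as below. Plain dicts and json stand in for the real model and tokenizer objects (the actual calls would be hftrim's trimming utilities plus HuggingFace's save_pretrained/from_pretrained); all values are illustrative:

```python
import json
import os
import tempfile

full_config = {"decoder_start_token_id": 250004, "vocab_size": 250054}

# Step 2: update the ids to their positions in the trimmed vocab
# (the new values here are illustrative).
trimmed_config = dict(full_config, decoder_start_token_id=4, vocab_size=5000)

# Step 3: save the trimmed artifact to disk.
save_dir = tempfile.mkdtemp()
config_path = os.path.join(save_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(trimmed_config, f)

# Discard the in-memory full model so its memory can be reclaimed.
del full_config

# Step 4: reload a fresh instance from disk for actual use.
with open(config_path) as f:
    reloaded = json.load(f)
print(reloaded["decoder_start_token_id"])  # 4
```

The point of the del-then-reload pattern is that only the trimmed artifact survives in memory.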

avacaondata commented
Okay great, I will try that out, thanks! @IamAdiSri

The thing is, shouldn't we respect special tokens (such as the decoder start token id) when trimming the tokenizer? I mean, we want to keep only the tokens present in a certain vocabulary, plus the special tokens, which are typically at the beginning of the vocabulary (idx < 150); those are used in all cases, no matter which data you use for trimming.

IamAdiSri (Owner) commented Mar 1, 2023

@avacaondata That is exactly what the library does. We save all the special tokens; however, after the model is trimmed, their indices in the embedding matrix may change. So we update the model config to tell it the new indices of those same special tokens, and it can then reuse them just as before.

The issue is that HuggingFace has multiple mechanisms for special tokens. Often the model has a default id for a special token, or it asks the tokenizer for the id; hftrim already preserves both of these cases. However, in some cases the token id is read from the config, which the library does not currently update, so you have to do it manually. I'll fix this in the next release so that this case is also handled automatically.

Also, as you noted, most of the special tokens are at the start of the vocabulary, but that does not seem to be the case for the decoder start token id, I think, which is why its index shifts.
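To illustrate why tokens near the start of the vocabulary keep their ids while later ones shift: trimming keeps the surviving tokens in their original order, so each kept token's new id is simply its rank among the kept tokens. A minimal sketch, with made-up ids:

```python
# Original ids of the kept tokens, in their original order. Early special
# tokens (0-3) survive with unchanged ids; a late token such as a language
# code (e.g. 250004 in MBart) gets a much smaller id after trimming.
kept = [0, 1, 2, 3, 31414, 250004]

# New id = rank of the old id among the kept tokens.
new_ids = {old: new for new, old in enumerate(kept)}

print(new_ids[2])       # 2  -- early special token: id unchanged
print(new_ids[250004])  # 5  -- late token: id shifts down
```

This is why a decoder start token that lives near the end of the vocabulary needs the config update, while `<s>`-style tokens at the front usually don't.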

@IamAdiSri IamAdiSri pinned this issue May 16, 2023