Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Zero-shot image classification pipeline] Remove tokenizer_kwargs #33174

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

NielsRogge
Copy link
Contributor

@NielsRogge NielsRogge commented Aug 28, 2024

What does this PR do?

This PR is a follow-up of #29261, namely the tokenizer_kwargs argument is unnecessary, one can just update the model_input_names attribute of the tokenizer.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NielsRogge NielsRogge requested a review from amyeroberts August 28, 2024 19:39
Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this PR.

We can't nor shouldn't remove tokenizer_kwargs from the pipeline. It is both a breaking change in terms of the pipeline's functionality and will break for all existing Blip2 checkpoints saved with a bert tokenizer

Comment on lines -100 to -102
tokenizer_kwargs (`dict`, *optional*):
Additional dictionary of keyword arguments passed along to the tokenizer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be removed. tokenizer_kwargs is a fairly standard input to _sanitize_parameters as a way to control tokenizer behaviour. It would also be breaking for anyone using this in their pipelines.

We might be able to remove in the tests but it should stay here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm could you clarify? tokenizer_kwargs was added in #29261 which is not yet in a stable release. Hence removing this argument wouldn't break anything

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK, in this case we can remove!

@@ -155,6 +155,8 @@ def convert_blip2_checkpoint(
else:
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")

tokenizer.model_input_names = ["input_ids", "attention_mask"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty hacky. It should at the very least only be done in the BertTokenizer branch to make explicit why this is needed (including an explanatory comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see what's hacky about this, model_input_names is a genuine attribute of PretrainedTokenizer that has been there since BERT/GPT-2 came out (it's not used a lot). As can be seen here, a tokenizer will return token_type_ids if it's present in the model_input_names.

A less "hacky" way would be to pass model_input_names directly in the from_pretrained method, which the Transformers library also supports.

Copy link
Contributor Author

@NielsRogge NielsRogge Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm trying to say is that one can avoid lines like these by appropriately setting the model_input_names attribute of the tokenizer, which is something we can still do for the Blip2ImageTextRetrieval models on the hub as they are just added and not yet in a stable release.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, I see. Although both modifying model_input_names and return_token_ids are somewhat hacky: they modify a class attribute which might be depended upon for other behaviours / assumed to have certain properties. Passing in model_input_names to the init is better, as this is more likely to correctly propagate any changes to any other dependant attributes / variables.

@NielsRogge
Copy link
Contributor Author

NielsRogge commented Aug 29, 2024

It is both a breaking change in terms of the pipeline's functionality and will break for all existing Blip2 checkpoints saved with a bert tokenizer

The Blip2 checkpoints compatible with this pipeline are the following:

The pipeline was updated in a just-merged PR (which also pushed those 2 checkpoints). My point would be to just set the model_input_names of the tokenizer of those 2 repos appropriately which would ensure the tokenizer_kwargs isn't needed - I also don't see it present in any other pipeline

@amyeroberts
Copy link
Collaborator

OK, as tokenizer_kwargs were just added for a model on main then we can remove. Could you open PRs on the BLIP2 checkpoints to update to make sure they are compatible?

@NielsRogge
Copy link
Contributor Author

It seems the model_input_names attribute doesn't get serialized in tokenizer_config.json (as it's a class attribute), cc @ArthurZucker. So I'm not sure it's possible to remove this.

@amyeroberts
Copy link
Collaborator

@NielsRogge In this case, we should just keep the tokenizer_kwargs in the pipeline, as they're a fairly standard input anyways. The issue regarding having to pass {return_token_ids: "False"} for BLIP will hopefully be something we can resolve once we can load processors in pipelines - #32514, as tokenizer.return_token_ids will be automatically set.

What should be done is document that we need to pass in {return_token_ids: "False"} to use these blip2 checkpoints in the pipeline

@ArthurZucker
Copy link
Collaborator

I think model_input_names should be serialized! (some model like mamba use GPTNeoXTokenizerFast, but don't use token_type_ids for example)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants