-
Notifications
You must be signed in to change notification settings - Fork 27.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Zero-shot image classification pipeline] Remove tokenizer_kwargs #33174
base: main
Are you sure you want to change the base?
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for opening this PR.
We can't nor shouldn't remove tokenizer_kwargs
from the pipeline. It is both a breaking change in terms of the pipeline's functionality and will break for all existing Blip2 checkpoints saved with a bert tokenizer
tokenizer_kwargs (`dict`, *optional*): | ||
Additional dictionary of keyword arguments passed along to the tokenizer. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be removed. tokenizer_kwargs
is a fairly standard input to _sanitize_parameters
as a way to control tokenizer behaviour. It would also be breaking for anyone using this in their pipelines.
We might be able to remove in the tests but it should stay here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm could you clarify? tokenizer_kwargs
was added in #29261 which is not yet in a stable release. Hence removing this argument wouldn't break anything
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK, in this case we can remove!
@@ -155,6 +155,8 @@ def convert_blip2_checkpoint( | |||
else: | |||
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl") | |||
|
|||
tokenizer.model_input_names = ["input_ids", "attention_mask"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty hacky. It should at the very least only be done in the BertTokenizer branch to make explicit why this is needed (including an explanatory comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see what's hacky about this, model_input_names
is a genuine attribute of PretrainedTokenizer
that has been there since BERT/GPT-2 came out (it's not used a lot). As can be seen here, a tokenizer will return token_type_ids
if it's present in the model_input_names
.
A less "hacky" way would be to pass model_input_names
directly in the from_pretrained
method, which the Transformers library also supports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm trying to say is that one can avoid lines like these by appropriately setting the model_input_names
attribute of the tokenizer, which is something we can still do for the Blip2ImageTextRetrieval
models on the hub as they are just added and not yet in a stable release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, I see. Although both modifying model_input_names
and return_token_ids
are somewhat hacky: they modify a class attribute which might be depended upon for other behaviours / assumed to have certain properties. Passing in model_input_names
to the init is better, as this is more likely to correctly propagate any changes to any other dependant attributes / variables.
The Blip2 checkpoints compatible with this pipeline are the following:
The pipeline was updated in a just-merged PR (which also pushed those 2 checkpoints). My point would be to just set the |
OK, as |
It seems the |
@NielsRogge In this case, we should just keep the What should be done is document that we need to pass in |
I think |
What does this PR do?
This PR is a follow-up of #29261, namely the
tokenizer_kwargs
argument is unnecessary, one can just update themodel_input_names
attribute of the tokenizer.