
Use CLIP model config to set some kwargs for components #16609

Merged
merged 3 commits into from
Apr 6, 2022

Conversation

ydshieh
Collaborator

@ydshieh ydshieh commented Apr 5, 2022

What does this PR do?

In CLIPModel, set output_attentions and output_hidden_states using CLIPModel.config when these values are specified in the configuration but not passed as arguments.

(currently, these operations are done separately in its vision & text components, which causes a WIP CLIP PT/TF equivalence test to fail - #16557)

Details

Currently, CLIPModel uses its two components' (vision_model and text_model) configurations to resolve these values, e.g.

(here self is CLIPVisionTransformer or CLIPTextTransformer)

output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions

If output_attentions/output_hidden_states are not passed to CLIPModel.forward at this line

output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,

but CLIPModel.config has these values set, then CLIPModel.config.output_attentions and CLIPModel.config.output_hidden_states have no effect, because only the component configs are consulted. This case happens here

# Output all for aggressive testing
config.output_hidden_states = True
if self.has_attentions:
config.output_attentions = True

Therefore, the CLIP PT/TF equivalence test won't return hidden_states/attentions for the PT model.
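The fix described above can be sketched as follows. This is a minimal, hypothetical simplification (SimpleConfig and SimpleCLIPModel are illustrative names, not the actual transformers classes): the top-level model resolves the kwargs from its own config before delegating to the vision/text components.

```python
# Hypothetical sketch of the PR's idea, not the real transformers code:
# resolve output_attentions / output_hidden_states at the CLIPModel level
# using the model-level config, then pass the resolved values down.

class SimpleConfig:
    def __init__(self, output_attentions=False, output_hidden_states=False):
        self.output_attentions = output_attentions
        self.output_hidden_states = output_hidden_states

class SimpleCLIPModel:
    def __init__(self, config):
        self.config = config

    def forward(self, output_attentions=None, output_hidden_states=None):
        # Fall back to the *model-level* config, so that setting
        # config.output_attentions = True takes effect even when the
        # caller passes no kwargs.
        output_attentions = (
            output_attentions
            if output_attentions is not None
            else self.config.output_attentions
        )
        output_hidden_states = (
            output_hidden_states
            if output_hidden_states is not None
            else self.config.output_hidden_states
        )
        # The resolved values would then be forwarded to the
        # vision and text components.
        return output_attentions, output_hidden_states

model = SimpleCLIPModel(SimpleConfig(output_attentions=True, output_hidden_states=True))
print(model.forward())  # (True, True) even though no kwargs were passed
```

With the original behavior, the component-level fallback would consult only the sub-configs, so the model-level flags set by the aggressive test would be ignored.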

In TF,

def input_processing(func, config, input_ids, **kwargs):

will use config to set the kwargs at the CLIPModel level. These kwargs are passed to the 2 components, and CLIP PT/TF equivalence test returns hidden_states/attentions for the TF model.
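The TF-side behavior can be illustrated with a small sketch. This is a simplified, hypothetical stand-in for what input_processing does (the real function handles many more inputs and a different signature): any kwarg left as None is filled in from the config at the model level.

```python
from types import SimpleNamespace

# Hypothetical simplification of the idea behind TF's input_processing:
# kwargs not supplied by the caller (i.e. left as None) fall back to the
# model-level config before being passed to the components.
def resolve_from_config(config, **kwargs):
    resolved = dict(kwargs)
    for name in ("output_attentions", "output_hidden_states"):
        if resolved.get(name) is None:
            resolved[name] = getattr(config, name)
    return resolved

config = SimpleNamespace(output_attentions=True, output_hidden_states=True)
print(resolve_from_config(config, output_attentions=None, output_hidden_states=None))
# {'output_attentions': True, 'output_hidden_states': True}
```

Because this resolution happens at the CLIPModel level in TF, the equivalence test sees hidden_states/attentions there, which is exactly the asymmetry with PT that this PR removes.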

@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Apr 5, 2022

The documentation is not available anymore as the PR was closed or merged.

@ydshieh ydshieh marked this pull request as ready for review April 5, 2022 16:56
@ydshieh ydshieh requested review from sgugger and patil-suraj April 5, 2022 17:20
@ydshieh
Collaborator Author

ydshieh commented Apr 5, 2022

cc @gante (just for information) since he has recently been working on unpack_inputs & input_processing in TF

Collaborator

@sgugger sgugger left a comment


Looks okay to me but will defer to @patil-suraj on this :-)
Thanks for your PR!

Contributor

@patil-suraj patil-suraj left a comment


LGTM, thank you for fixing this!

@ydshieh ydshieh changed the title Update vision & text components' config from CLIP model Use CLIP model config to set some kwargs for components Apr 6, 2022
@ydshieh ydshieh merged commit ae6a7a7 into huggingface:main Apr 6, 2022
@ydshieh ydshieh deleted the fix_clip_pt_tf_outputs branch April 6, 2022 10:15