
Question about the Dataset construction #34

Open
Parul-Gupta opened this issue Aug 28, 2024 · 0 comments

Hi FastComposer team,
Kudos on this insightful and amazing work and thanks for sharing the code with the community!

In the Dataset Construction part of the paper (Section 5.1), it is mentioned that:

Finally, we use a greedy matching algorithm to match noun phrases with image segments. We do this by considering the product of the image-text similarity score by the OpenCLIP model (CLIP-ViT-H-14-laion2B-s32B-b79K) and the label-text similarity score by the Sentence-Transformer model (stsb-mpnet-base-v2).

Could you please clarify this further? If I understand correctly, the OpenCLIP features of the image segments are matched with the Sentence-Transformer features of the noun phrases. Is that correct?
If so, how is the image segment given as input to the OpenCLIP model: is the part of the image outside the segment masked (with 0 pixels, i.e. black)?
It would be great if you could share the code for this process too.
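For reference, here is my current understanding of the greedy step as a minimal sketch. It operates on precomputed similarity matrices; the function name, matrix layout, and tie-breaking order are my own assumptions, not taken from your repo:

```python
import numpy as np

def greedy_match(clip_sim, label_sim):
    """Greedily match noun phrases (rows) to image segments (columns).

    clip_sim[i, j]  -- OpenCLIP image-text similarity between segment j
                       and noun phrase i (assumed precomputed)
    label_sim[i, j] -- Sentence-Transformer similarity between the label
                       of segment j and noun phrase i (assumed precomputed)

    The matching score is the elementwise product of the two matrices,
    as described in Section 5.1; each phrase and segment is used at most once.
    """
    score = clip_sim * label_sim
    matches = {}
    used_phrases, used_segments = set(), set()
    # Visit candidate pairs from highest to lowest combined score.
    for flat in np.argsort(score, axis=None)[::-1]:
        i, j = np.unravel_index(flat, score.shape)
        if i in used_phrases or j in used_segments:
            continue
        matches[i] = j
        used_phrases.add(i)
        used_segments.add(j)
    return matches

# Toy example: phrase 0 clearly belongs to segment 0, phrase 1 to segment 1.
clip_sim = np.array([[0.9, 0.2],
                     [0.3, 0.8]])
label_sim = np.array([[0.8, 0.1],
                      [0.4, 0.7]])
print(greedy_match(clip_sim, label_sim))  # {0: 0, 1: 1}
```

Is this roughly what the pipeline does, modulo how the segments are fed to OpenCLIP?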

Thanks a lot!
