
Question about the Dataset construction #34

Open
Parul-Gupta opened this issue Aug 28, 2024 · 0 comments

Hi FastComposer team,
Kudos on this insightful and amazing work and thanks for sharing the code with the community!

In the Dataset Construction part of the paper (Section 5.1), it is mentioned that:

Finally, we use a greedy matching algorithm to match noun phrases with image segments. We do this by considering the product of the image-text similarity score by the OpenCLIP model (CLIP-ViT-H-14-laion2B-s32B-b79K) and the label-text similarity score by the Sentence-Transformer model (stsb-mpnet-base-v2).

Could you please clarify this further? If I understand correctly, the OpenCLIP features of the image segments are matched with the Sentence-Transformer features of the noun phrases. Is that correct?
If so, how is the image segment given as input to the OpenCLIP model: is the part of the image outside the segment masked (with 0 pixels, i.e. black)?
It would be great if you could share the code for this process too.
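For reference, here is my current understanding of the greedy step as a minimal sketch. It operates on precomputed similarity matrices; the function name, matrix layout, and tie-breaking order are my own assumptions, not taken from your repo:

```python
import numpy as np

def greedy_match(clip_sim, label_sim):
    """Greedily match noun phrases (rows) to image segments (columns).

    clip_sim[i, j]  -- OpenCLIP image-text similarity between segment j
                       and noun phrase i (assumed precomputed)
    label_sim[i, j] -- Sentence-Transformer similarity between the label
                       of segment j and noun phrase i (assumed precomputed)

    The matching score is the elementwise product of the two matrices,
    as described in Section 5.1; each phrase and segment is used at most once.
    """
    score = clip_sim * label_sim
    matches = {}
    used_phrases, used_segments = set(), set()
    # Visit candidate pairs from highest to lowest combined score.
    for flat in np.argsort(score, axis=None)[::-1]:
        i, j = np.unravel_index(flat, score.shape)
        if i in used_phrases or j in used_segments:
            continue
        matches[i] = j
        used_phrases.add(i)
        used_segments.add(j)
    return matches

# Toy example: phrase 0 clearly belongs to segment 0, phrase 1 to segment 1.
clip_sim = np.array([[0.9, 0.2],
                     [0.3, 0.8]])
label_sim = np.array([[0.8, 0.1],
                      [0.4, 0.7]])
print(greedy_match(clip_sim, label_sim))  # {0: 0, 1: 1}
```

Is this roughly what the pipeline does, modulo how the segments are fed to OpenCLIP?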

Thanks a lot!
