GLIP #12

Open
twangnh opened this issue Feb 8, 2024 · 3 comments
Comments

@twangnh

twangnh commented Feb 8, 2024

Thanks for sharing the wonderful work. The paper differentiates GLIP from GroundingDINO and FIBER: the former is classified as open-vocabulary object detection, while the latter are called bi-functional models (detection and referring expression comprehension). Since GLIP can also be used for DOD (e.g., in the OmniLabel paper), could you please discuss this further?

@Charles-Xie
Member

Hi,

Thanks for your interest in our work.
I think the question you raised is insightful and well worth discussing.
In our paper, we mainly discuss detection, REC (a representative grounding task), and their conjunction, DOD. We group existing methods by the tasks they are evaluated on in their own papers. As GLIP is tested on detection and Phrase Grounding (also a grounding task, but not REC), we do not list it as a bi-functional method, for the sake of academic rigor.
However, from the broader view of the conjunction between detection and grounding, GLIP is surely one of the representative and pioneering works that paved the way for the DOD task. If we look beyond strict task forms and take DOD as the union of general detection and grounding, I think methods like GLIP, MDETR, FIBER, and G-DINO aim at the same goal and all have potential for DOD.
GLIP has also been evaluated on D3, and its performance (19+ intra-full-mAP) is very close to that of more recent works like G-DINO, even with its limited model size and data resources.
We would be very happy to discuss this further in the new version of the paper.

Best,
Chi

@twangnh
Author

twangnh commented Feb 19, 2024

@Charles-Xie Hi Chi, thanks for the reply. I'm wondering what you think about the similarities and differences between D3 and the OmniLabel dataset (OmniLabel: A Challenging Benchmark for Language-Based Object Detection)?

@Charles-Xie
Member

@twangnh
Thanks for the interesting question.

OmniLabel is great work, and I'm happy to see two works with similar motivations appear within a short time, which may show that this direction is promising and possibly acknowledged by some researchers in the community.

I will try to answer this below as a discussion; the following only stands for my personal opinion.
If I'm understanding OmniLabel correctly, I think both can be regarded as datasets for Described Object Detection (or language-based object detection) in some sense. They both provide images with positive descriptions (associated with boxes in an image) and negative descriptions (associated with no boxes in an image).

The differences are also significant:
For OmniLabel, the annotators design some positive and negative descriptions based on each image. The annotation style is more similar to REC datasets, but with negative descriptions. As a result, the description categories in this dataset are more numerous than in DOD, and more diverse. However, a description may or may not appear in another image, so the annotation is not complete at the dataset level, only at the image level. This also results in fewer negative instances being annotated.
For $D^3$, we design the descriptions for the whole dataset first, then the annotators label them on all images as positive or negative. The annotation style is more similar to detection datasets. The description categories are not as many as in OmniLabel, but each category is completely labeled across the whole dataset, like a standard detection dataset. This gives a model more negative instances to distinguish, which can be challenging.
Some other differences may also exist, but I think the above covers the most important ones.
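
To make the image-level versus dataset-level distinction concrete, here is a toy sketch. It is not taken from either dataset's tooling or file format: the image names, descriptions, boxes, and helper functions are all hypothetical, and an empty box list is simply assumed to mean "labeled negative".

```python
# Toy illustration only: hypothetical data, not the D3 or OmniLabel format.
images = ["img1", "img2", "img3"]
descriptions = ["a dog not on a leash", "a person holding an umbrella"]

# Image-level complete: each image carries only the descriptions written for it.
# A description missing from an image's dict is unlabeled there (unknown).
image_level = {
    "img1": {"a dog not on a leash": [(10, 20, 50, 60)]},  # positive, one box
    "img2": {"a person holding an umbrella": []},          # labeled negative
    "img3": {"a dog not on a leash": []},                  # labeled negative
}

# Dataset-level complete: every description is labeled on every image.
dataset_level = {
    "img1": {"a dog not on a leash": [(10, 20, 50, 60)],
             "a person holding an umbrella": []},
    "img2": {"a dog not on a leash": [],
             "a person holding an umbrella": []},
    "img3": {"a dog not on a leash": [],
             "a person holding an umbrella": [(5, 5, 40, 80)]},
}


def count_negatives(annotations):
    """Count (image, description) pairs explicitly labeled with no boxes."""
    return sum(
        1
        for per_image in annotations.values()
        for boxes in per_image.values()
        if not boxes
    )


def count_unlabeled(annotations, images, descriptions):
    """Count (image, description) pairs that carry no label at all."""
    return sum(
        1
        for img in images
        for desc in descriptions
        if desc not in annotations.get(img, {})
    )


print(count_negatives(image_level), count_unlabeled(image_level, images, descriptions))      # 2 3
print(count_negatives(dataset_level), count_unlabeled(dataset_level, images, descriptions))  # 4 0
```

In this toy setting, the dataset-level scheme leaves no (image, description) pair unlabeled and yields more explicit negatives, which mirrors the point above about a model having more negative instances to distinguish.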

This is only my personal opinion. Thanks for asking. We hope to see more methods and datasets in this direction.
