Adds image-guided object detection support to OWL-ViT #18891
Conversation
Hi @alaradirik I added an initial version of the image-guided object detection. I still have to add tests and do some other cleanup; however, I have some doubts right now.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Hi Dhruv,
Thank you for your contribution! I realized that I accidentally misled you when discussing the one-shot detection implementation. I thought the query image would be embedded by the unmodified base model (OwlViTModel), but here is the correct method, as described in Appendix A1.7 of the paper:
- The query image is embedded the same way as the target image (which you already implemented).
- query_image_embeds is forward propagated through the class and box prediction heads to retrieve the class embeddings and box predictions. The goal of retrieving box predictions is to choose one class embedding / prediction such that it (1) is dissimilar from the rest of the predicted class embeddings and (2) has a corresponding box prediction that yields a high intersection over union (IoU) with the query image. This part corresponds to the embed_image_query() function at line 110 in the original repo over here. You can disregard the query_box_yxyx argument.
- The selected class_embedding / query_embedding is used to query the target image in the same way text query_embeds are used within OwlViTForObjectDetection. You can see an example in the original repo over here (the _image_conditioning_py_callback() function).
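A rough, self-contained sketch of that selection step (for illustration only: the function names and the exact scoring below are made up, and the real embed_image_query() in the original repo combines the IoU and dissimilarity criteria differently):

```python
import math


def iou(box_a, box_b):
    # Boxes are (x0, y0, x1, y1) in normalized coordinates.
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-6)


def select_query_embedding(class_embeds, pred_boxes, iou_threshold=0.8):
    """Pick the class embedding whose predicted box best covers the whole
    query image (high IoU with the full-image box) and which is most
    dissimilar from the other predicted embeddings."""
    full_box = (0.0, 0.0, 1.0, 1.0)  # the query image itself
    ious = [iou(box, full_box) for box in pred_boxes]
    candidates = [i for i, v in enumerate(ious) if v >= iou_threshold]
    if not candidates:  # fall back to the single best-overlapping box
        candidates = [max(range(len(ious)), key=ious.__getitem__)]
    # Dissimilarity criterion: lowest mean cosine similarity to all embeddings.
    mean_sim = [
        sum(cosine(e, other) for other in class_embeds) / len(class_embeds)
        for e in class_embeds
    ]
    best = min(candidates, key=mean_sim.__getitem__)
    return class_embeds[best]
```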
Hope this helps and sorry for the delay.
Hi @alaradirik, I made the changes as per the review comments, could you check if they're fine? I'm working on the test cases currently. In the file here, is it okay if I reuse [...]? So the above line would return [...] and be re-used as [...].
And apart from the test cases, are there any other changes that I need to make?
Hi @unography, thank you for the contribution once again! As for your question regarding the tests, yes, it'd make sense to return [...]. We can add a line to this function to create [...].
The changes look good to me!
There are a few issues that need to be addressed regarding a missing one-shot detection algorithm detail and stale arguments left over from previous commits (from other PRs). Let me know if you have any questions.
@@ -1167,6 +1174,7 @@ def forward(

class OwlViTForObjectDetection(OwlViTPreTrainedModel):
    config_class = OwlViTConfig
    main_input_name = "pixel_values"
I believe this line caused test errors previously, will need to double check.
By default, I think the main_input_name is input_ids. Before, when it was just text-guided detection, the first param was input_ids, and then pixel_values. Now, since input_ids can be None, I made pixel_values the first param, which is why the test case was failing. Hence this change.
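A toy sketch of what this change amounts to (illustrative only; the class body is hypothetical and the real forward() signature has many more arguments):

```python
from typing import List, Optional


class OwlViTForObjectDetectionSketch:
    # Transformers' common test utilities read `main_input_name` to decide
    # which argument carries the primary model input when building dummy
    # inputs; the inherited default is "input_ids", so it must be
    # overridden once pixel_values becomes the first parameter.
    main_input_name = "pixel_values"

    def forward(self, pixel_values, input_ids: Optional[List[int]] = None):
        # input_ids is now optional: image-guided detection supplies only
        # pixels, while text-guided detection still passes token ids.
        if input_ids is None:
            return "image-guided"
        return "text-guided"
```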
That makes sense! I'd comment out the slow flags (tests with the @slow decorator are skipped) within tests/test_modeling_common.py and run all tests to double-check this.
You can run the tests from the root of the transformers repo as follows:
# All tests
pytest tests/models/owlvit/test_modeling_owlvit.py
# Run only integration tests
pytest tests/models/owlvit/test_modeling_owlvit.py::OwlViTModelIntegrationTest
# Run a single test
pytest tests/models/owlvit/test_modeling_owlvit.py::OwlViTModelIntegrationTest::test_inference
This is too much of a breaking change, even for a newly released model.
Co-authored-by: Alara Dirik <8944735+alaradirik@users.noreply.github.com>
Thank you for the contribution once again! The code seems to be in great shape and I just left a couple of comments regarding minor style corrections and docstrings.
The only issue is the following tests fail:
- OwlViTVisionModelTest.test_model
- OwlViTVisionModelTest.test_model_outputs_equivalence
- OwlViTModelTest.test_model_outputs_equivalence
I believe this is due to making pixel_values the main argument in OwlViTForObjectDetection.forward(), but I couldn't pinpoint the exact issue. @ydshieh could you take a look at the test scripts when you have time?
Hi, I couldn't see any tests being run by CI. Could you share the error messages? @unography, could you follow the instructions below to refresh your CircleCI permissions?
Sure @alaradirik, I'll go through the review comments and make the changes. And actually, on my local machine, I'm able to get the test cases to pass when running [...]. I'll check once more.
Hi @ydshieh, I'm not able to refresh the permissions for some reason; I get an error.
@ydshieh, of course, here is the full error log.
From the file, it looks like the latest version in this PR is different from the version that produced the error you provided above. See transformers/src/transformers/models/owlvit/modeling_owlvit.py, lines 893 to 902 at bb61e30, where there is [...], but it is not in your error message.
I triggered it :)
@ydshieh great, thank you! I hadn't pulled the latest changes on this branch. @unography we can merge this PR once the remaining minor issues are addressed, thank you again for the clean implementation :)
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Force-pushed from 9ebd950 to bb61e30
Hi @alaradirik, sorry, my notifications got messed up and I was only able to go through the comments now. Do I need to change anything for merging? Upstream URL or anything else?
Hey @unography, no problem at all! I'm about to merge a clean PR with the correct upstream. Could you give me your email address so that I can add you as a co-author to my commits?
@alaradirik sure, this is my email - k4r4n.dhruv@gmail.com
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
This adds support for doing object detection with OWL-ViT using query image(s).
For #18748
cc: @alaradirik