Adds image-guided object detection support to OWL-ViT #20136
Conversation
@NielsRogge @sgugger sorry for the double PR, the upstream of the branch used in the other PR points to huggingface/transformers:img_guided_obj_det instead of main and I couldn't change the upstream. The reviews in the other PR are addressed but there are two failing tests I couldn't debug:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Could you make sure to add @unography as co-author? I'd prefer to merge the original PR, but if it's not possible, I want to make sure the authorship is properly attributed.
A couple more comments.
@@ -145,6 +223,10 @@ class OwlViTObjectDetectionOutput(ModelOutput):
vision_model_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_patches + 1, hidden_size)`)):
Last hidden states extracted from the [`OwlViTVisionModel`]. OWL-ViT represents images as a set of image
patches where the total number of patches is (image_size / patch_size)**2.
text_model_output (Tuple[`BaseModelOutputWithPooling`]):
Can we add a deprecation warning to remove `text_model_last_hidden_state` and `vision_model_last_hidden_state` in the future? Also, `image_embeds` and `class_embeds` should always return the projected last hidden states (similar to CLIP).
Yes, I'm adding a deprecation warning to `OwlViTForObjectDetection.forward()`. `class_embeds` is specific to OWL-ViT, and I think it makes more sense to return the OWL-ViT `image_embeds` instead of the unmodified CLIP embeddings.
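For readers unfamiliar with the pattern being discussed, a deprecation warning here would typically look something like the sketch below. This is only an illustration, not the exact code added in the PR; the class name and message text are stand-ins.

```python
import warnings


class OwlViTForObjectDetectionSketch:
    """Toy stand-in used only to illustrate the deprecation-warning pattern."""

    def forward(self, **kwargs):
        # Point users at the nested output objects before the flat fields are dropped.
        warnings.warn(
            "`text_model_last_hidden_state` and `vision_model_last_hidden_state` are "
            "deprecated and will be removed in a future version. Use "
            "`text_model_output` and `vision_model_output` instead.",
            FutureWarning,
        )
        # ... the actual detection forward pass would follow here ...
        ...
```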
Hi there! Maybe this is not the place to mention it, but the original implementation uses stochastic depth (https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/clip/layers.py#L235). They set it to 0.2 and 0.1 for the vision and text encoders, respectively (https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/configs/clip_b16.py#L132). I guess that's not really important if you don't plan to implement the training losses for detection, but if you do, it may be something to keep in mind :)
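For context on what that would involve: stochastic depth (drop path) randomly skips a residual branch for a subset of samples during training. The sketch below is not part of this PR; it is a generic PyTorch illustration, with the 0.2/0.1 rates coming from the scenic config linked above.

```python
import torch


def drop_path(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    """Randomly zero the residual branch for whole samples (stochastic depth)."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1.0 - drop_prob
    # One Bernoulli draw per sample, broadcast over the remaining dimensions.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = torch.empty(shape, dtype=x.dtype, device=x.device).bernoulli_(keep_prob)
    # Rescale so the expected value of the branch is unchanged.
    return x / keep_prob * mask


# Inside a transformer block the residual update would then become, for example:
# hidden_states = residual + drop_path(attn_output, drop_prob=0.2, training=self.training)
```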
a0a0d05 to 2308f3d (compare)
@@ -165,18 +284,15 @@ def __call__(
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W) or (H, W, C),
where C is a number of channels, H and W are image height and width.
Any reason those spaces are removed?
I think they are removed by `make style`, but it seemed like there were extra blank lines to begin with.
@sgugger @NielsRogge could you do a final review when you're available? All tests are passing and I think all issues are addressed.
LGTM, some final comments.
Can you make sure your dependencies for styling are the ones pinned by Transformers and revert all the changes you made to remove blank lines in examples? It makes them less readable in the documentation, and it is not caused by `make fixup` by itself, since the CI on main is green.
It seems that running the example for image-guided object detection is still buggy: upon plotting the boxes, they are very off. This target/query pair should work, since it works in the scenic repo. Edit: I tried both the patch-16 and patch-32 models, with the same results (bad box predictions on the target image).
What's your Pillow version? We've seen that using Pillow==7.1.2 is essential for getting the expected results (and cc @alaradirik, we should make sure the model works on any Pillow version).
@NielsRogge, I ran it; the outputs of the models are as follows:
@alaradirik did you manage to run the example and get an appropriate prediction? Edit: You can see that y1 is 0 in this case, which is already wrong if you look at the image; the image shape is (480, 640), so the bbox is just covering the entire image.
Hey @timothylimyl, thanks for bringing this up. I was able to replicate the issue locally and confirmed that it's not OpenCV or Pillow related; it stems from the post-processing method. I think it's due to changed default behaviour between PyTorch versions; I'll open a fix PR once I confirm this. CC @NielsRogge
@timothylimyl sorry for the mixup, I thought this was a Pillow versioning issue we previously encountered and didn't realize the query image you are using is different. The post-process method returns coordinates in (x0, y0, x1, y1) format; the correct command to print the boxes is: Note that this still returns a bounding box that covers the entire image. This is because OWL-ViT is a text-conditioned model that uses CLIP as its backbone; the image-guided object detection method repurposes the trained text-conditioned model with the assumption that the query image contains a single object. In this case, you are just getting results for an image that could be described with more general terms ("a photo of a cat sitting on top of a ...."). Here are the results for a cropped version of the query image you are using:
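The exact command was not preserved in this thread; below is an illustrative sketch of that post-processing step. It assumes `processor`, `outputs`, and the target PIL `image` have already been created as in the OWL-ViT docs example, and the threshold values are illustrative rather than the exact ones used here.

```python
import torch

# `image` is the target PIL image; `outputs` comes from model.image_guided_detection(**inputs).
target_sizes = torch.tensor([image.size[::-1]])  # (height, width) of the target image
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)
# Boxes come back in (x0, y0, x1, y1) corner format, scaled to the target image size.
for score, box in zip(results[0]["scores"], results[0]["boxes"]):
    x0, y0, x1, y1 = (round(v, 1) for v in box.tolist())
    print(f"score={score:.3f}  box=(x0={x0}, y0={y0}, x1={x1}, y1={y1})")
```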
Hey @alaradirik, in the other (old) PR I've uploaded an image and query (+ results) used in the official implementation. Maybe it's worth trying them as well, since you can (subjectively) evaluate the resulting bboxes against the original results. I hope it helps :)
Hi @FrancescoSaverioZuppichini, I'm not sure what you mean by subjectively evaluating the bounding boxes, or which PR you are referring to?
Hi @alaradirik, can you share the code you used to generate the example? I tried cropping, and I still just received one big bounding box:
Also, I was confused by the comment about the COCO API box format in the docs.
@timothylimyl, you are right about the COCO API comment, we will update the docs shortly to reflect the correct returned data format. Here is the code I used and the resulting image, but keep in mind that different crops can lead to different results, and both text-guided and image-guided object detection require experimentation. There is no need for the
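The code and image referenced above are not reproduced in this thread; the following is a self-contained sketch of the kind of image-guided detection run being discussed. The checkpoint name, crop coordinates, and thresholds are illustrative assumptions, not the values from the original comment.

```python
import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16")
model.eval()

# Target image to search in (the standard COCO cats example) and a query image
# cropped so that it contains a single object.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
query_image = image.crop((340, 20, 640, 370))  # illustrative crop, not the crop from the thread

inputs = processor(images=image, query_images=query_image, return_tensors="pt")
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)

# Boxes come back in (x0, y0, x1, y1) corner format, scaled to the target image size.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)
print(results[0]["boxes"])
```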
Oh wow, that is very unexpected. It seems like the model is not very well trained/robust. The difference between your crop and mine is visually minimal, yet the results differ by so much: [image] versus [image]. If you crop slightly further up to [image], it does not work at all. Do you reckon there could be something buggy with the code, or is the model fundamentally not robust and requires pretty exact crops for matching? It does not make much sense to me that the crops have to be so exact, as the feature embedding matching shouldn't be that poor.
@alaradirik I was referring to the "original" one, #18891
Any updates?
Hi @timothylimyl, feel free to open an issue with a reproducible code sample so we can discuss it there.
Hi @NielsRogge @timothylimyl @alaradirik @sgugger, I have found the issue that causes image conditioning to be so sensitive. There was a small bug in the query selection; please see my PR: #23157. Best,
What does this PR do?
Adds an image-guided object detection method to the `OwlViTForObjectDetection` class. This enables users to use a query image to search for similar objects in the input image.

Co-Authored-By: Dhruv Karan k4r4n.dhruv@gmail.com
Fixes #18748
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Linked issue: Add image-guided object detection support to OWL-ViT #18748
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.