
Adds image-guided object detection support to OWL-ViT #20136

Merged
merged 11 commits into huggingface:main on Nov 16, 2022

Conversation

alaradirik
Contributor

@alaradirik alaradirik commented Nov 9, 2022

What does this PR do?

Adds an image-guided object detection method to the OwlViTForObjectDetection class. This enables users to use a query image to search for similar objects in the input image.

Co-Authored-By: Dhruv Karan k4r4n.dhruv@gmail.com

Fixes #18748
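A minimal usage sketch of the new API (mirroring the examples discussed later in this thread; the checkpoint name, image URLs, and thresholds below are illustrative):

import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

# Target image to search in and query image containing the object of interest
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
query_image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000001675.jpg", stream=True).raw)

# The processor accepts query_images in addition to images
inputs = processor(images=image, query_images=query_image, return_tensors="pt")

with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)

# Rescale boxes to the target image size (height, width), filter by score and apply NMS
target_sizes = torch.Tensor([image.size[::-1]])
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)
print(results[0]["boxes"], results[0]["scores"])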


@alaradirik
Contributor Author

alaradirik commented Nov 9, 2022

@NielsRogge @sgugger sorry for the double PR; the upstream of the branch used in the other PR points to huggingface/transformers:img_guided_obj_det instead of main, and I couldn't change the upstream.

The reviews in the other PR are addressed, but there are two failing tests I couldn't debug:

FAILED tests/pipelines/test_pipelines_zero_shot_object_detection.py::ZeroShotObjectDetectionPipelineTests::test_pt_OwlViTConfig_OwlViTForObjectDetection_CLIPTokenizerFast_OwlViTFeatureExtractor - IndexError: tuple index out of range
FAILED tests/pipelines/test_pipelines_zero_shot_object_detection.py::ZeroShotObjectDetectionPipelineTests::test_pt_OwlViTConfig_OwlViTForObjectDetection_CLIPTokenizer_OwlViTFeatureExtractor - IndexError: tuple index out of range

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@sgugger
Collaborator

sgugger commented Nov 9, 2022

Could you make sure to add @unography as co-author? I'd prefer to merge the original PR, but if it's not possible, I want to make sure the authorship is properly attributed.

Collaborator

@sgugger sgugger left a comment


A couple more comments.

Review comments on the following files are outdated and resolved:
src/transformers/models/owlvit/modeling_owlvit.py (2 comments)
src/transformers/models/owlvit/processing_owlvit.py (3 comments)
@@ -145,6 +223,10 @@ class OwlViTObjectDetectionOutput(ModelOutput):
vision_model_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_patches + 1, hidden_size)`)):
Last hidden states extracted from the [`OwlViTVisionModel`]. OWL-ViT represents images as a set of image
patches where the total number of patches is (image_size / patch_size)**2.
text_model_output (Tuple[`BaseModelOutputWithPooling`]):
Contributor


Can we add a deprecation warning to remove text_model_last_hidden_state and vision_model_last_hidden_state in the future?

Also, image_embeds and class_embeds should always return the projected last hidden states (similar to CLIP).

Contributor Author


Yes, I'm adding a deprecation warning to OwlViTForObjectDetection.forward(). class_embeds is specific to OWL-ViT, and I think it makes more sense to return the OWL-ViT image_embeds instead of the unmodified CLIP embeddings.
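For illustration only, a deprecation warning along these lines could be emitted from the forward pass (the message wording and the suggested replacement attributes are assumptions, not necessarily what the merged PR does):

import warnings

def warn_deprecated_hidden_state_outputs():
    # Hypothetical helper showing the deprecation pattern discussed above;
    # the actual PR may phrase the message differently.
    warnings.warn(
        "`text_model_last_hidden_state` and `vision_model_last_hidden_state` are deprecated "
        "and will be removed in a future version. Use the `text_model_output` and "
        "`vision_model_output` attributes instead.",
        FutureWarning,
    )

warn_deprecated_hidden_state_outputs()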

@ronhag

ronhag commented Nov 10, 2022

Hi there! Maybe this is not the right place to mention it, but the original implementation uses stochastic depth (https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/clip/layers.py#L235). They set it to 0.2 and 0.1 for the vision and text encoders, respectively (https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/configs/clip_b16.py#L132).

I guess that's not really important if you don't plan to implement the training losses for detection, but if you do, it may be something to keep in mind :)
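For context, stochastic depth (also called drop path) randomly skips a residual branch per sample during training and rescales the kept activations. A minimal PyTorch sketch of the idea (not the scenic or transformers implementation; the drop probability is arbitrary):

import torch
from torch import nn

class DropPath(nn.Module):
    # Randomly drops the residual branch per sample during training (stochastic depth).
    def __init__(self, drop_prob: float = 0.1):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return x
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, broadcast over the remaining dimensions
        mask_shape = (x.shape[0],) + (1,) * (x.dim() - 1)
        mask = torch.empty(mask_shape, dtype=x.dtype, device=x.device).bernoulli_(keep_prob)
        return x * mask / keep_prob  # rescale so the expected output is unchanged

# Typical usage inside a residual block: hidden_states = hidden_states + drop_path(block_output)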

@alaradirik alaradirik closed this Nov 14, 2022
@alaradirik alaradirik force-pushed the image_guided_detection branch from a0a0d05 to 2308f3d on November 14, 2022 08:30
@alaradirik alaradirik reopened this Nov 14, 2022

@@ -165,18 +284,15 @@ def __call__(
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W) or (H, W, C),
where C is a number of channels, H and W are image height and width.

Contributor


Any reason those spaces are removed?

Contributor Author


I think they are removed by make style but it seemed like there were extra blank lines to begin with.


@alaradirik
Contributor Author

@sgugger @NielsRogge could you do a final review when you're available? All tests are passing and I think all issues are addressed.

Contributor

@NielsRogge NielsRogge left a comment


LGTM, some final comments.


Collaborator

@sgugger sgugger left a comment


Can you make sure your dependencies for styling are the ones pinned by Transformers and revert all the changes you made to remove blank lines in examples? It makes them less readable in the documentation, and it is not caused by make fixup by itself since the CI on main is green.

tests/test_modeling_common.py (outdated review comment, resolved)

@alaradirik alaradirik merged commit a00b7e8 into huggingface:main Nov 16, 2022
@timothylimyl

timothylimyl commented Nov 30, 2022

It seems that running the example for image-guided object detection is still buggy:

import requests
from PIL import Image
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection
import numpy as np
import cv2 

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
query_url = "http://images.cocodataset.org/val2017/000000001675.jpg"
query_image = Image.open(requests.get(query_url, stream=True).raw)
inputs = processor(images=image, query_images=query_image, return_tensors="pt")
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)

i = 0  # Retrieve predictions for the first image
plot_image = np.array(image)
boxes, scores = results[i]["boxes"], results[i]["scores"]
score_threshold = 0.2
for box, score in zip(boxes, scores):
    if score < score_threshold:
        continue

    box = [int(i) for i in box.tolist()]
    plot_image = cv2.rectangle(plot_image, (box[0],box[1]), (box[0]+box[2], box[1]+box[3]), (0, 255, 0), 2)

cv2.imshow("", plot_image)
q = cv2.waitKey(0)

Upon plotting the boxes, they are very off. This target-query pair should work, as it does in the scenic repo.

Edit: I tried both the patch-16 and patch-32 models, with the same results (bad box predictions on the target image).

@NielsRogge
Contributor

Upon plotting the boxes, they are very off. This target-query pair should work, as it does in the scenic repo.

What's your Pillow version? We've seen that using Pillow==7.1.2 is essential for getting the expected results (and cc @alaradirik, we should make sure the model works on any Pillow version).

@timothylimyl

timothylimyl commented Dec 1, 2022

@NielsRogge, I ran pip install Pillow==7.1.2 and got the same outputs for this example.

The model output is as follows:

boxes: tensor([[  7.6539,  -0.9177, 646.1529, 474.4720]])
scores: tensor([1.0000])

@alaradirik did you manage to run the example and get an appropriate prediction?

Edit: You can see that y1 is roughly 0 in this case, which is already wrong if you look at the image. The image shape is (480, 640), so the bbox is just covering the entire image.

@alaradirik
Contributor Author

Hey @timothylimyl, thanks for bringing this up. I was able to replicate the issue locally and confirmed that it's not OpenCV- or Pillow-related; it stems from the post-processing method. I think it's due to default behaviour that changed between PyTorch versions; I'll open a fix PR once I confirm this.

CC @NielsRogge

@alaradirik
Contributor Author

alaradirik commented Dec 1, 2022

@timothylimyl sorry for the mix-up, I thought this was a Pillow versioning issue we previously encountered and didn't realize the query image you are using is different.

The post-processing method returns coordinates in (x0, y0, x1, y1) format, so the correct call to draw the boxes is:
plot_image = cv2.rectangle(plot_image, box[:2], box[2:], (0, 255, 0), 2)

Note that this still returns a bounding box that covers the entire image. This is because OWL-ViT is a text-conditioned model that uses CLIP as its backbone; the image-guided object detection method repurposes the trained text-conditioned model under the assumption that the query image contains a single object. In this case, you are just getting results for a query image that could be described in more general terms ("a photo of a cat sitting on top of a ...").

Here are the results for a cropped version of the query image you are using:
[images: cropped query image and new detection results]

@FrancescoSaverioZuppichini
Contributor

FrancescoSaverioZuppichini commented Dec 1, 2022

Hey @alaradirik, in the other (old) PR I uploaded an image and query (plus results) used in the official implementation. Maybe it's worth trying them as well, since you can (subjectively) evaluate the resulting bboxes against the original results. I hope it helps :)

@alaradirik
Contributor Author

Hi @FrancescoSaverioZuppichini, I'm not sure what you mean by subjectively evaluating the bounding boxes, or which PR you are referring to.

@timothylimyl

Hi @alaradirik, can you share the code you used to generate the example?

I tried cropping, and I basically still just received one big bounding box:

import requests
from PIL import Image
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection
import numpy as np
import cv2 

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
query_url = "http://images.cocodataset.org/val2017/000000001675.jpg"
query_image = Image.open(requests.get(query_url, stream=True).raw)
query_image = np.array(query_image)[:280, :]
query_image = Image.fromarray(query_image)


inputs = processor(images=image, query_images=query_image, return_tensors="pt")
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)

i = 0  # Retrieve predictions for the first image
plot_image = np.array(image)
boxes, scores = results[i]["boxes"], results[i]["scores"]
score_threshold = 0.2
for box, score in zip(boxes, scores):
    if score < score_threshold:
        continue

    box = [int(i) for i in box.tolist()]
    plot_image = cv2.rectangle(plot_image, box[:2], box[2:], (0, 255, 0), 2)

cv2.imshow("", plot_image)
q = cv2.waitKey(0)

@timothylimyl

Also, I was confused by the "COCO API" comment, as I believe COCO bboxes are in (x, y, w, h) format, while Pascal VOC XML uses (x1, y1, x2, y2), which is what we are expecting here.
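For reference, the two conventions differ only in how the last two values are interpreted; a small illustrative helper (plain Python, not part of the library):

def xywh_to_xyxy(box):
    # Convert a COCO-style (x, y, width, height) box to (x1, y1, x2, y2) corners.
    x, y, w, h = box
    return [x, y, x + w, y + h]

def xyxy_to_xywh(box):
    # Convert (x1, y1, x2, y2) corners to a COCO-style (x, y, width, height) box.
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

print(xywh_to_xyxy([10, 20, 30, 40]))  # [10, 20, 40, 60]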

@alaradirik
Contributor Author

@timothylimyl, you are right about the COCO API comment; we will update the docs shortly to reflect the correct returned data format.

Here is the code I used and the resulting image, but keep in mind that different crops can lead to different results, and both text-guided and image-guided object detection require experimentation. There is no need for the score_threshold variable; you can directly use the threshold argument of the post-processing method to filter out low-probability bounding boxes.

import requests

import cv2 
import torch
import numpy as np
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection


processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
query_url = "http://images.cocodataset.org/val2017/000000001675.jpg"
query_image = Image.open(requests.get(query_url, stream=True).raw)
query_image = np.array(query_image)[:340]
query_image = Image.fromarray(query_image)

inputs = processor(images=image, query_images=query_image, return_tensors="pt")

with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)


img = cv2.cvtColor(np.array(image), cv2.COLOR_BGR2RGB)
boxes, scores = results[0]["boxes"], results[0]["scores"]

for box, score in zip(boxes, scores):
    box = [int(i) for i in box.tolist()]
    img = cv2.rectangle(img, box[:2], box[2:], (255, 0, 0), 5)

cv2.imshow("", img)
q = cv2.waitKey(0)

[image: detection result on the target image]

@timothylimyl

Oh wow, that is very unexpected. It seems like the model is not very well trained/robust. The difference between your crop and mine is visually minimal, yet the results differ by so much:

[image: crop 1] (does not work)

versus

[image: crop 2] (works)

If you crop slightly further, up to :360, then there are no bounding boxes again (only the one covering the whole image).

[image: crop to :360] (does not work!!!)

Do you reckon there could be something buggy in the code, or is the model fundamentally not robust and requires fairly exact crops for matching? It doesn't make much sense to me that the crops have to be so exact, since the feature-embedding matching shouldn't be that poor.

@FrancescoSaverioZuppichini
Contributor

@alaradirik, I was referring to the "original" one: #18891

mpierrau pushed a commit to mpierrau/transformers that referenced this pull request Dec 15, 2022

Adds image-guided object detection method to OwlViTForObjectDetection class as described in the original paper. One-shot/image-guided object detection enables users to use a query image to search for similar objects in the input image.

Co-Authored-By: Dhruv Karan k4r4n.dhruv@gmail.com
@timothylimyl

Any updates?

@NielsRogge
Contributor

Hi @timothylimyl, feel free to open an issue with a reproducible code sample so we can discuss it there.

@orrzohar
Contributor

orrzohar commented May 5, 2023

Hi @NielsRogge @timothylimyl @alaradirik @sgugger

I have found the issue that causes image conditioning to be so sensitive. There was a small bug in the query selection; please see my PR: #23157

Best,
Orr
