Adds image-guided object detection support to OWL-ViT #20136
Conversation
@NielsRogge @sgugger sorry for the double PR, the upstream of the branch used in the other PR points to huggingface/transformers:img_guided_obj_det instead of main and I couldn't change the upstream. The reviews in the other PR are addressed but there are two failing tests I couldn't debug:
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Could you make sure to add @unography as co-author? I'd prefer to merge the original PR, but if it's not possible, I want to make sure the authorship is properly attributed.
A couple more comments.
@@ -145,6 +223,10 @@ class OwlViTObjectDetectionOutput(ModelOutput):
vision_model_last_hidden_state (`torch.FloatTensor` of shape `(batch_size, num_patches + 1, hidden_size)`)):
Last hidden states extracted from the [`OwlViTVisionModel`]. OWL-ViT represents images as a set of image
patches where the total number of patches is (image_size / patch_size)**2.
text_model_output (Tuple[`BaseModelOutputWithPooling`]):
Can we add a deprecation warning to remove `text_model_last_hidden_state` and `vision_model_last_hidden_state` in the future? Also, `image_embeds` and `class_embeds` should always return the projected last hidden states (similar to CLIP).
Yes, I'm adding a deprecation warning to `OwlViTForObjectDetection.forward()`. `class_embeds` is specific to OWL-ViT, and I think it makes more sense to return the OWL-ViT `image_embeds` instead of the unmodified CLIP embeddings.
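For readers unfamiliar with the pattern being discussed, a deprecation warning here would typically look something like the sketch below. This is only an illustration, not the exact code added in the PR; the class name and message text are stand-ins.

```python
import warnings


class OwlViTForObjectDetectionSketch:
    """Toy stand-in used only to illustrate the deprecation-warning pattern."""

    def forward(self, **kwargs):
        # Point users at the nested output objects before the flat fields are dropped.
        warnings.warn(
            "`text_model_last_hidden_state` and `vision_model_last_hidden_state` are "
            "deprecated and will be removed in a future version. Use "
            "`text_model_output` and `vision_model_output` instead.",
            FutureWarning,
        )
        # ... the actual detection forward pass would follow here ...
        ...
```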
Hi there! Maybe this is not the place to mention it, but the original implementation uses stochastic depth (https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/clip/layers.py#L235). They set it to 0.2 and 0.1 for the vision and text encoders, respectively (https://github.com/google-research/scenic/blob/main/scenic/projects/owl_vit/configs/clip_b16.py#L132). I guess that's not really important if you don't plan to implement the training losses for detection, but if you do, it may be something to keep in mind :)
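For context on what that would involve: stochastic depth (drop path) randomly skips a residual branch for a subset of samples during training. The sketch below is not part of this PR; it is a generic PyTorch illustration, with the 0.2/0.1 rates coming from the scenic config linked above.

```python
import torch


def drop_path(x: torch.Tensor, drop_prob: float, training: bool) -> torch.Tensor:
    """Randomly zero the residual branch for whole samples (stochastic depth)."""
    if drop_prob == 0.0 or not training:
        return x
    keep_prob = 1.0 - drop_prob
    # One Bernoulli draw per sample, broadcast over the remaining dimensions.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = torch.empty(shape, dtype=x.dtype, device=x.device).bernoulli_(keep_prob)
    # Rescale so the expected value of the branch is unchanged.
    return x / keep_prob * mask


# Inside a transformer block the residual update would then become, for example:
# hidden_states = residual + drop_path(attn_output, drop_prob=0.2, training=self.training)
```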
a0a0d05 to 2308f3d (compare)
@@ -165,18 +284,15 @@ def __call__(
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. In case of a NumPy array/PyTorch tensor, each image should be of shape (C, H, W) or (H, W, C),
where C is a number of channels, H and W are image height and width.
Any reason those spaces are removed?
I think they are removed by `make style`, but it seemed like there were extra blank lines to begin with.
@sgugger @NielsRogge could you do a final review when you're available? All tests are passing and I think all issues are addressed.
LGTM, some final comments.
Can you make sure your dependencies for styling are the ones pinned by Transformers and revert all the changes you made to remove blank lines in examples? It makes them less readable in the documentation, and it is not caused by `make fixup` by itself, since the CI on main is green.
It seems that running the example for image-guided object detection is still buggy: upon plotting the boxes, they are very off. This target/query pair should work, since it works in the scenic repo. Edit: I tried both the patch-16 and patch-32 models, with the same results (bad box predictions on the target image).
What's your Pillow version? We've seen that using Pillow==7.1.2 is essential for getting the expected results (and cc @alaradirik, we should make sure the model works on any Pillow version).
@NielsRogge, I ran it; the outputs of the models are as follows:
@alaradirik did you manage to run the example and get an appropriate prediction? Edit: You can see that y1 is 0 in this case, which is already wrong if you look at the image; the image shape is (480, 640), so the bbox is just covering the entire image.
Hey @timothylimyl, thanks for bringing this up. I was able to replicate the issue locally and confirmed that it's not OpenCV or Pillow related; it stems from the post-processing method. I think it's due to changed default behaviour between PyTorch versions; I'll open a fix PR once I confirm this. CC @NielsRogge
@timothylimyl sorry for the mixup, I thought this was a Pillow versioning issue we previously encountered and didn't realize the query image you are using is different. The post-process method returns coordinates in (x0, y0, x1, y1) format; the correct command to print the boxes is: Note that this still returns a bounding box that covers the entire image. This is because OWL-ViT is a text-conditioned model that uses CLIP as its backbone; the image-guided object detection method repurposes the trained text-conditioned model with the assumption that the query image contains a single object. In this case, you are just getting results for an image that could be described with more general terms ("a photo of a cat sitting on top of a ...."). Here are the results for a cropped version of the query image you are using:
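The exact command was not preserved in this thread; below is an illustrative sketch of that post-processing step. It assumes `processor`, `outputs`, and the target PIL `image` have already been created as in the OWL-ViT docs example, and the threshold values are illustrative rather than the exact ones used here.

```python
import torch

# `image` is the target PIL image; `outputs` comes from model.image_guided_detection(**inputs).
target_sizes = torch.tensor([image.size[::-1]])  # (height, width) of the target image
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)
# Boxes come back in (x0, y0, x1, y1) corner format, scaled to the target image size.
for score, box in zip(results[0]["scores"], results[0]["boxes"]):
    x0, y0, x1, y1 = (round(v, 1) for v in box.tolist())
    print(f"score={score:.3f}  box=(x0={x0}, y0={y0}, x1={x1}, y1={y1})")
```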
Hey @alaradirik, in the other (old) PR I've uploaded an image and query (+ results) used in the official implementation. Maybe it's worth trying them as well, since you can (subjectively) evaluate the resulting bboxes against the original results. I hope it helps :)
Hi @FrancescoSaverioZuppichini, I'm not sure what you mean by subjectively evaluating the bounding boxes, or which PR you are referring to?
Hi @alaradirik, can you share the code you used to generate the example? I tried cropping, and I still just received one big bounding box:
Also, I was confused by the comment about the COCO API box format in the docs.
@timothylimyl, you are right about the COCO API comment, we will update the docs shortly to reflect the correct returned data format. Here is the code I used and the resulting image, but keep in mind that different crops can lead to different results, and both text-guided and image-guided object detection require experimentation. There is no need for the
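The code and image referenced above are not reproduced in this thread; the following is a self-contained sketch of the kind of image-guided detection run being discussed. The checkpoint name, crop coordinates, and thresholds are illustrative assumptions, not the values from the original comment.

```python
import requests
import torch
from PIL import Image
from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16")
model.eval()

# Target image to search in (the standard COCO cats example) and a query image
# cropped so that it contains a single object.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
query_image = image.crop((340, 20, 640, 370))  # illustrative crop, not the crop from the thread

inputs = processor(images=image, query_images=query_image, return_tensors="pt")
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)

# Boxes come back in (x0, y0, x1, y1) corner format, scaled to the target image size.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)
print(results[0]["boxes"])
```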
Oh wow, that is very unexpected. It seems like the model is not very well trained/robust. The difference between your crop and mine is visually minimal, yet the results differ by so much: [image] versus [image]. If you crop slightly further up to [image], it does not work at all. Do you reckon there could be something buggy with the code, or is the model fundamentally not robust and requires pretty exact crops for matching? It does not make much sense to me that the crops have to be so exact, as the feature embedding matching shouldn't be that poor.
@alaradirik I was referring to the "original" one, #18891
Any updates?
Hi @timothylimyl, feel free to open an issue with a reproducible code sample so we can discuss it there.
Hi @NielsRogge @timothylimyl @alaradirik @sgugger, I have found the issue that causes image conditioning to be so sensitive. There was a small bug in the query selection; please see my PR: #23157. Best,
What does this PR do?
Adds an image-guided object detection method to the `OwlViTForObjectDetection` class. This enables users to use a query image to search for similar objects in the input image.

Co-Authored-By: Dhruv Karan k4r4n.dhruv@gmail.com
Fixes #18748
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case. Linked issue: Add image-guided object detection support to OWL-ViT #18748
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.