
OwlVit gives different results compared to original colab version #21206

Closed
darwinharianto opened this issue Jan 20, 2023 · 32 comments

@darwinharianto commented Jan 20, 2023

System Info

Using the Hugging Face Space and Google Colab

Who can help?

@adirik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

cat picture from http://images.cocodataset.org/val2017/000000039769.jpg
remote control image from https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSRUGcH7a3DO5Iz1sknxU5oauEq9T_q4hyU3nuTFHiO0NMSg37x

Expected behavior

Excited by OwlViT's results, I tried feeding it some random images to see what it would produce.
Having no experience with JAX, my first option was to look for a Hugging Face Space.

Given a query image of a remote control and the cat picture, I wanted to get boxes around the remote controls.
https://huggingface.co/spaces/adirik/image-guided-owlvit
[screenshot of the Space output]
The result is not what I expected (no boxes on the remotes).

Then I checked whether the Colab version behaves the same way.
https://colab.research.google.com/github/google-research/scenic/blob/main/scenic/projects/owl_vit/notebooks/OWL_ViT_inference_playground.ipynb#scrollTo=AQGAM16fReow
[screenshot of the Colab output]
It correctly draws boxes on the remotes.

I am not sure what is happening. Which part should I look at to determine what causes this difference?

@darwinharianto changed the title from "OwlVit gives different results then the one in colab" to "OwlVit gives different results compared to original colab version" on Jan 20, 2023
@NielsRogge (Contributor) commented Jan 20, 2023

Yes, we had a hard time making the Space output the same bounding boxes as in the Colab (eventually it worked on the cats image). It had to do with the Pillow version.

So I'm guessing there might be a difference in Pillow versions here as well

Cc @alaradirik
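A minimal sketch of how one could check whether two environments produce the same preprocessed inputs (assuming only the public OwlViTProcessor API; the cat image URL is the one from the issue description):

import PIL
import requests
from PIL import Image
from transformers import OwlViTProcessor

print("Pillow version:", PIL.__version__)

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Preprocess once and summarize the pixel values. If these statistics differ
# between the Space and the Colab environment, the discrepancy comes from
# image preprocessing (e.g. Pillow's resize), not from the model weights.
inputs = processor(text=[["a photo of a cat"]], images=image, return_tensors="pt")
pv = inputs["pixel_values"]
print(pv.shape, pv.mean().item(), pv.std().item())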

@darwinharianto (Author) commented Jan 23, 2023

Do you mean Pillow changes the input values?
I tried another image: the Space model can't detect the cat inside this image, but the Colab version can detect it.
[screenshots of the Space and Colab outputs]

@alaradirik (Contributor) commented:
@darwinharianto thanks for bringing the issue up, I'm looking into it!

@github-actions bot commented:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@darwinharianto (Author) commented:
Kindly bumping

@MaslikovEgor commented:
Kind reminder


@sgugger (Collaborator) commented Apr 10, 2023

cc @alaradirik and @amyeroberts

@RRoundTable commented:
I got the same issue.
These are the original repo's results:
[screenshot of original repo results]

And this is the Hugging Face demo:
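(In the snippet below, processor, model, device, img and text_queries are assumed to be defined as in the demo app.)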

text_queries = text_queries.split(",")
target_sizes = torch.Tensor([img.shape[:2]])
inputs = processor(text=text_queries, images=img, return_tensors="pt").to(device)  
with torch.no_grad():
    outputs = model(**inputs)

outputs.logits = outputs.logits.cpu()
outputs.pred_boxes = outputs.pred_boxes.cpu()
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

[screenshot of Hugging Face demo results]

The rocket bounding box score is different (0.15 vs. more than 0.21).

With lvis-api, the reported performance is not reproduced (mAP = 0.095).

@MaslikovEgor commented:
It seems the problem still exists. I mentioned the problem here:

#23157 (comment)

Maybe the best way is to cover the model predictions with end-to-end tests on a batch of images. This approach would help us be confident about changes.
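A minimal sketch of such an end-to-end regression test (the expected_slice values below are placeholders, not real reference outputs; in practice they would be copied from the original scenic implementation run on the same image and queries):

import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

def test_owlvit_matches_reference():
    processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
    model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
    image = Image.open(requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
    inputs = processor(text=[["a photo of a cat", "a photo of a remote control"]],
                       images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Placeholder reference values: a real test would store a slice of the
    # logits produced by the original implementation on the same inputs.
    expected_slice = torch.tensor([-10.0, -8.5, -7.2])
    assert torch.allclose(outputs.logits[0, :3, 0], expected_slice, atol=1e-4)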

@RRoundTable commented:
@MaslikovEgor I agree with you. I have an end-to-end test with lvis-api (for both the Hugging Face OwlViT and the google/scenic OWL-ViT), but the Hugging Face OwlViT results are not reproduced (mAP = 0.095).

@RRoundTable commented:
I want to fix this problem, but it would be more efficient if I knew where to start. Can you give me a suggestion? @alaradirik

@orrzohar (Contributor) commented May 9, 2023

Hi @MaslikovEgor,

The demo didn't work before this fix either (see #20136). Try running COCO evaluation with image conditioning before/after this fix: mAP@0.5 increases from 6 to 37. This is still below the expected 44, but closer to the reported/expected performance. I am still trying to figure out why.
Best,
Orr

@orrzohar (Contributor) commented May 9, 2023

@RRoundTable, the issues you are reporting seem to be related to the text-conditioned evaluation. This means they probably stem from the forward pass or the post-processing.

In your LVIS eval, did you make sure to implement a new post-processor that incorporates all the changes needed for evaluation? If helpful, I can add my function to the processor or something; please note there are a few changes compared with normal inference.
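For illustration only, a sketch of what such an evaluation-oriented post-processor might look like (this is an assumption based on the discussion, not Orr's actual function: it keeps a fixed budget of top-scoring predictions per image instead of applying a confidence threshold, which is what mAP-style evaluation typically expects):

import torch

def post_process_for_eval(outputs, target_sizes, pred_per_im=300):
    # outputs: an OwlViT object detection output; target_sizes: (batch, 2) tensor of (height, width)
    probs = torch.sigmoid(outputs.logits)            # (batch, num_queries, num_classes)
    scores, labels = probs.max(dim=-1)

    # convert normalized cxcywh boxes to absolute xyxy pixel coordinates
    cx, cy, w, h = outputs.pred_boxes.unbind(-1)
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)
    img_h, img_w = target_sizes.unbind(1)
    scale = torch.stack([img_w, img_h, img_w, img_h], dim=1).unsqueeze(1)
    boxes = boxes * scale

    results = []
    for s, l, b in zip(scores, labels, boxes):
        # keep the top-k scoring predictions for this image
        topk = s.topk(min(pred_per_im, s.numel())).indices
        results.append({"scores": s[topk], "labels": l[topk], "boxes": b[topk]})
    return results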

@RRoundTable commented:
@orrzohar, yes, I tested with the text-conditioned evaluation.

In my LVIS eval, I just used Hugging Face's post-processor and pre-processor. It would be helpful if you contributed your functions.

transformers[torch] == 4.28.1
# example script
import requests
from PIL import Image
import torch
import glob
import os
import argparse
import json
from tqdm import tqdm

from transformers import OwlViTProcessor, OwlViTForObjectDetection

parser = argparse.ArgumentParser()
parser.add_argument("--dataset-path", type=str, required=True)
parser.add_argument("--text-query-path", type=str, required=True)
parser.add_argument("--save-path", default="owl-vit-result.json", type=str)
parser.add_argument("--batch-size", default=64, type=int)
args = parser.parse_args()

model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)


with open(args.text_query_path, "r") as f:
    text_query = f.read()

images = glob.glob(os.path.join(args.dataset_path, "*"))

instances = []
N = len(images)

with torch.no_grad():
    for i in tqdm(range(N // args.batch_size + 1)):
        image_ids = []
        batch_images = []
        target_sizes = []
        for img_path in images[i * args.batch_size: (i+1) * args.batch_size]:
            image_ids.append(int(img_path.split("/")[-1].split(".")[0]))
            image = Image.open(img_path).convert("RGB")
            batch_images.append(image)
            target_sizes.append((image.size[1], image.size[0]))
        target_sizes = torch.Tensor(target_sizes)
        target_sizes = target_sizes.to(device)
        texts = [text_query.split(",")] * len(batch_images)
        inputs = processor(text=texts, images=batch_images, return_tensors="pt")
        inputs = inputs.to(device)
        outputs = model(**inputs)
        # Target image sizes (height, width) to rescale box predictions [batch_size, 2]

        # Convert outputs (bounding boxes and class logits) to COCO API
        results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
        for image_id, res in zip(image_ids, results):
            for bbox, score, label in zip(res["boxes"], res["scores"], res["labels"]):
                # tensor to numpy
                bbox = bbox.cpu().detach().numpy()
                score = score.cpu().detach().numpy()
                label = label.cpu().detach().numpy()
                # bbox format: xyxy -> xywh
                x1, y1, x2, y2 = bbox
                bbox = [int(x1), int(y1), int(x2-x1), int(y2-y1)]
                instance = {}
                instance["image_id"] = image_id
                instance["bbox"] = bbox # TODO
                instance["score"] = float(score)
                instance["category_id"] = int(label) + 1 # TODO
                instances.append(instance)

# write the collected detections to a COCO-style results JSON for LVIS evaluation
with open(args.save_path, "w") as f:
    json.dump(instances, f)
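For completeness, a sketch of scoring the saved JSON with the lvis package (assuming the path-based LVISEval interface from the lvis-api README; the annotation path is a hypothetical local file):

from lvis import LVISEval

# "lvis_v1_val.json" is a placeholder for your local LVIS validation annotations
lvis_eval = LVISEval("lvis_v1_val.json", "owl-vit-result.json", "bbox")
lvis_eval.run()
lvis_eval.print_results()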


@orrzohar (Contributor) commented Jun 4, 2023

Hi @RRoundTable ,

I added a PR with the appropriate evaluation protocol

#23982

Best,
Orr

@haizadtarik commented:
Hi @alaradirik,
I'm using transformers==4.30.2 but still encounter the same issue. Any thoughts on this?

Query image: [image]

Result from Colab: [image]

Result from Hugging Face: [image]


@amyeroberts (Collaborator) commented:
cc @rafaelpadilla

@NielsRogge (Contributor) commented Sep 25, 2023

Hi folks, I've investigated the difference; it will be solved in the PR below. TL;DR: image preprocessing is done differently in the original Colab (it involves padding the image to a square), whereas the HF implementation used center cropping. The model itself is fine: the logits are exactly the same as the original implementation's on the same inputs.
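For illustration, a minimal sketch of the "pad to a square" step described above (a simplified approximation, not the exact scenic or Owlv2ImageProcessor code; the gray fill value is an assumption):

from PIL import Image

def pad_to_square(image: Image.Image, fill=(128, 128, 128)) -> Image.Image:
    # Pad the right/bottom so the image becomes square (preserving the aspect
    # ratio of the content), instead of center-cropping as OwlViTImageProcessor does.
    width, height = image.size
    side = max(width, height)
    padded = Image.new("RGB", (side, side), fill)
    padded.paste(image, (0, 0))
    return padded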

@NielsRogge mentioned this issue Sep 25, 2023
@NielsRogge (Contributor) commented Oct 14, 2023

Hi folks, since OWLv2 has now been added in #26668, you will see that the results match one-to-one with the original Google Colab notebook provided by the authors.

If you also want to get one-to-one matching results for OWL-ViT v1, you will need to use Owlv2Processor (which internally uses Owlv2ImageProcessor) instead of OwlViTProcessor, as it uses the exact same image preprocessing settings as the Colab notebook. We cannot change this for v1 due to backwards compatibility.
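A minimal sketch of that combination, following the pattern used later in this thread (pairing Owlv2Processor, with its size overridden to 768×768, with the OWL-ViT v1 checkpoint; the text queries and threshold are arbitrary examples):

import requests
import torch
from PIL import Image
from transformers import Owlv2Processor, OwlViTForObjectDetection

# v2 processor (pads to a square like the Colab) paired with the v1 model
processor = Owlv2Processor.from_pretrained(
    "google/owlvit-base-patch32", size={"height": 768, "width": 768})
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = [["a photo of a cat", "a photo of a remote control"]]

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# target sizes are (height, width); PIL's .size is (width, height)
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1)
print(results[0])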


@rishabh-akridata commented Mar 1, 2024

@RRoundTable I have been trying to reproduce the AP results on the LVIS dataset using the example script that you provided. Did you manage to reproduce the results?

@rishabh-akridata commented:
@NielsRogge I am using the Owlv2 processor, but am still not able to get the same results.

@NielsRogge (Contributor) commented:
@rishabh-akridata please provide a script that reproduces your issue

@rishabh-akridata commented Mar 6, 2024

@NielsRogge Please find the script below.


import skimage
import os
import matplotlib.pyplot as plt
from copy import deepcopy
import numpy as np
import cv2
import requests
from PIL import Image
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection, Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlvit-base-patch32", size={"height":768, "width":768})
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

filename = os.path.join(skimage.data_dir, 'astronaut.png')
image = Image.open(filename)
texts = ['face', 'rocket', 'nasa badge', 'star-spangled banner']
inputs = processor(text=texts, images=image, return_tensors="pt")
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
outputs = model(**inputs)
# Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.2)
# results = post_process_object_detection_evaluation(outputs, target_sizes=target_sizes, pred_per_im=10)
# font
font = cv2.FONT_HERSHEY_SIMPLEX
# fontScale
fontScale = 0.5
# Green color in BGR
color = (0, 255, 0)
# Line thickness of 2 px
thickness = 2

boxes, scores, labels = results[0]["boxes"], results[0]["scores"], results[0]["labels"]
image_to_plot = deepcopy(np.array(image))
image_to_plot = image_to_plot.astype(np.uint8)
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    xmin, ymin, xmax, ymax = int(box[0]), int(box[1]), int(box[2]), int(box[3])
    cv2.rectangle(image_to_plot, (xmin, ymin), (xmax, ymax), (255, 0, 0), 2)
    rounded_score = round(float(score), 2)
    # Using cv2.putText() method
    cv2.putText(image_to_plot, f"{texts[label]}:{rounded_score}", (xmin, ymax), font, fontScale,
                    color, thickness, cv2.LINE_AA, False)

plt.imshow(image_to_plot)
plt.show()

[output image]

@rishabh-akridata commented:
@NielsRogge I also tried using the processor below, but I'm facing the same issue.
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble", size={"height":768, "width":768})

@rishabh-akridata commented:
@NielsRogge When I reduce the confidence threshold to 0.1, I get some detections, but with very low confidence, and the boxes are not the same as in the official Colab notebook.
[output image with low-confidence detections]

@rishabh-akridata commented:
@NielsRogge Please ignore this one, I was looking at the results of a different model variant. I am able to get the same results as in the Colab notebook. Sorry for the inconvenience caused.

Thanks.

@iMayuqi commented Aug 2, 2024

@RRoundTable
Hello, I also ran into this reproduction problem when calling the OWL-ViT model through the Transformers package. Using your code as a reference, the overall mAP I reproduced on the LVIS dataset is 14.9, and 6.4 on the rare categories. I suspect the problem is that the original scenic model pads the image when loading it, and that the category-name prompts and the bounding boxes may need to be restored by offsetting the image padding before evaluation.

May I ask whether you succeeded in replicating the results in the end? Could you share your code? It would be of great help to me. Thank you!
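A hedged sketch of the box rescaling being described (assuming the image was padded on the right/bottom to a square before resizing, so the normalized predicted boxes refer to the padded square rather than the original image; this is an illustration, not the scenic code):

import torch

def rescale_padded_boxes(pred_boxes_cxcywh, orig_h, orig_w):
    # Boxes are normalized w.r.t. the padded square input, so rescale them by
    # the padded side length max(orig_h, orig_w) rather than by (orig_h, orig_w).
    side = max(orig_h, orig_w)
    cx, cy, w, h = pred_boxes_cxcywh.unbind(-1)
    boxes = torch.stack(
        [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1) * side
    # Clip to the original image extent to discard the padded region.
    boxes[..., 0::2] = boxes[..., 0::2].clamp(0, orig_w)
    boxes[..., 1::2] = boxes[..., 1::2].clamp(0, orig_h)
    return boxes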

@NielsRogge (Contributor) commented:
Hi @iMayuqi, to reproduce the results I would recommend using Owlv2Processor instead of OwlViTProcessor, as it includes the padding.
