
OwlVit gives different results compared to original colab version #21206

Closed
darwinharianto opened this issue Jan 20, 2023 · 32 comments

@darwinharianto commented Jan 20, 2023

System Info

Using the Hugging Face Space and Google Colab

Who can help?

@adirik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

cat picture from http://images.cocodataset.org/val2017/000000039769.jpg
remote control image from https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSRUGcH7a3DO5Iz1sknxU5oauEq9T_q4hyU3nuTFHiO0NMSg37x

Expected behavior

Excited by OwlViT's results, I tried feeding it some random images to see what it would produce.
Having no experience with JAX, my first option was to look for a Hugging Face Space.

Given a query image of a remote control and the cat picture, I wanted to get boxes around the remote controls.
https://huggingface.co/spaces/adirik/image-guided-owlvit
[screenshot of the Space output]
The result is not what I expected (no boxes on the remotes).

Then I checked whether the Colab version behaves the same way.
https://colab.research.google.com/github/google-research/scenic/blob/main/scenic/projects/owl_vit/notebooks/OWL_ViT_inference_playground.ipynb#scrollTo=AQGAM16fReow
[screenshot of the Colab output]
It correctly draws boxes on the remotes.

I am not sure what is happening. Which part should I look at to determine what causes this difference?

@darwinharianto changed the title from "OwlVit gives different results then the one in colab" to "OwlVit gives different results compared to original colab version" on Jan 20, 2023
@NielsRogge (Contributor) commented Jan 20, 2023

Yes, we had a hard time making the Space output the same bounding boxes as in the Colab (eventually it worked on the cats image). It had to do with the Pillow version.

So I'm guessing there might be a difference in Pillow versions here as well

Cc @alaradirik
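A minimal sketch of how one could check whether two environments produce the same preprocessed inputs (assuming only the public OwlViTProcessor API; the cat image URL is the one from the issue description):

import PIL
import requests
from PIL import Image
from transformers import OwlViTProcessor

print("Pillow version:", PIL.__version__)

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Preprocess once and summarize the pixel values. If these statistics differ
# between the Space and the Colab environment, the discrepancy comes from
# image preprocessing (e.g. Pillow's resize), not from the model weights.
inputs = processor(text=[["a photo of a cat"]], images=image, return_tensors="pt")
pv = inputs["pixel_values"]
print(pv.shape, pv.mean().item(), pv.std().item())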

@darwinharianto (Author) commented Jan 23, 2023

Do you mean Pillow changes the input values?
I tried another image: the Space model can't detect the cat inside this image, but the Colab version can detect it.
[screenshots of the Space and Colab outputs]

@alaradirik (Contributor) commented:
@darwinharianto thanks for bringing the issue up, I'm looking into it!

@github-actions bot commented:
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@darwinharianto (Author) commented:
Kindly bumping

@MaslikovEgor commented:
Kind reminder


@sgugger (Collaborator) commented Apr 10, 2023

cc @alaradirik and @amyeroberts

@RRoundTable commented:
I got the same issue.
These are the original repo's results:
[screenshot of original repo results]

And this is the Hugging Face demo:
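(In the snippet below, processor, model, device, img and text_queries are assumed to be defined as in the demo app.)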

text_queries = text_queries.split(",")
target_sizes = torch.Tensor([img.shape[:2]])
inputs = processor(text=text_queries, images=img, return_tensors="pt").to(device)  
with torch.no_grad():
    outputs = model(**inputs)

outputs.logits = outputs.logits.cpu()
outputs.pred_boxes = outputs.pred_boxes.cpu()
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

[screenshot of Hugging Face demo results]

The rocket bounding box score is different (0.15 vs. more than 0.21).

With lvis-api, the reported performance is not reproduced (mAP = 0.095).

@MaslikovEgor commented:
It seems the problem still exists. I mentioned the problem here:

#23157 (comment)

Maybe the best way is to cover the model predictions with end-to-end tests on a batch of images. This approach would help us be confident about changes.
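A minimal sketch of such an end-to-end regression test (the expected_slice values below are placeholders, not real reference outputs; in practice they would be copied from the original scenic implementation run on the same image and queries):

import requests
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

def test_owlvit_matches_reference():
    processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
    model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
    image = Image.open(requests.get(
        "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
    inputs = processor(text=[["a photo of a cat", "a photo of a remote control"]],
                       images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Placeholder reference values: a real test would store a slice of the
    # logits produced by the original implementation on the same inputs.
    expected_slice = torch.tensor([-10.0, -8.5, -7.2])
    assert torch.allclose(outputs.logits[0, :3, 0], expected_slice, atol=1e-4)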

@RRoundTable commented:
@MaslikovEgor I agree with you. I have an end-to-end test with lvis-api (for both the Hugging Face OwlViT and the google/scenic OWL-ViT), but the Hugging Face OwlViT results are not reproduced (mAP = 0.095).

@RRoundTable commented:
I want to fix this problem, but it would be more efficient if I knew where to start. Can you give me a suggestion? @alaradirik

@orrzohar (Contributor) commented May 9, 2023

Hi @MaslikovEgor,

The demo didn't work before this fix either (see #20136). Try running COCO evaluation with image conditioning before/after this fix: mAP@0.5 increases from 6 to 37. This is still below the expected 44, but closer to the reported/expected performance. I am still trying to figure out why.
Best,
Orr

@orrzohar (Contributor) commented May 9, 2023

@RRoundTable, the issues you are reporting seem to be related to the text-conditioned evaluation. This means they probably stem from the forward pass or the post-processing.

In your LVIS eval, did you make sure to implement a new post-processor that incorporates all the changes needed for evaluation? If helpful, I can add my function to the processor or something; please note there are a few changes compared with normal inference.
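For illustration only, a sketch of what such an evaluation-oriented post-processor might look like (this is an assumption based on the discussion, not Orr's actual function: it keeps a fixed budget of top-scoring predictions per image instead of applying a confidence threshold, which is what mAP-style evaluation typically expects):

import torch

def post_process_for_eval(outputs, target_sizes, pred_per_im=300):
    # outputs: an OwlViT object detection output; target_sizes: (batch, 2) tensor of (height, width)
    probs = torch.sigmoid(outputs.logits)            # (batch, num_queries, num_classes)
    scores, labels = probs.max(dim=-1)

    # convert normalized cxcywh boxes to absolute xyxy pixel coordinates
    cx, cy, w, h = outputs.pred_boxes.unbind(-1)
    boxes = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)
    img_h, img_w = target_sizes.unbind(1)
    scale = torch.stack([img_w, img_h, img_w, img_h], dim=1).unsqueeze(1)
    boxes = boxes * scale

    results = []
    for s, l, b in zip(scores, labels, boxes):
        # keep the top-k scoring predictions for this image
        topk = s.topk(min(pred_per_im, s.numel())).indices
        results.append({"scores": s[topk], "labels": l[topk], "boxes": b[topk]})
    return results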

@RRoundTable commented:
@orrzohar, yes, I tested with the text-conditioned evaluation.

In my LVIS eval, I just used Hugging Face's post-processor and pre-processor. It would be helpful if you contributed your functions.

transformers[torch] == 4.28.1
# example script
import requests
from PIL import Image
import torch
import glob
import os
import argparse
import json
from tqdm import tqdm

from transformers import OwlViTProcessor, OwlViTForObjectDetection

parser = argparse.ArgumentParser()
parser.add_argument("--dataset-path", type=str, required=True)
parser.add_argument("--text-query-path", type=str, required=True)
parser.add_argument("--save-path", default="owl-vit-result.json", type=str)
parser.add_argument("--batch-size", default=64, type=int)
args = parser.parse_args()

model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)


with open(args.text_query_path, "r") as f:
    text_query = f.read()

images = glob.glob(os.path.join(args.dataset_path, "*"))

instances = []
N = len(images)

with torch.no_grad():
    for i in tqdm(range(N // args.batch_size + 1)):
        image_ids = []
        batch_images = []
        target_sizes = []
        for img_path in images[i * args.batch_size: (i+1) * args.batch_size]:
            image_ids.append(int(img_path.split("/")[-1].split(".")[0]))
            image = Image.open(img_path).convert("RGB")
            batch_images.append(image)
            target_sizes.append((image.size[1], image.size[0]))
        target_sizes = torch.Tensor(target_sizes)
        target_sizes = target_sizes.to(device)
        texts = [text_query.split(",")] * len(batch_images)
        inputs = processor(text=texts, images=batch_images, return_tensors="pt")
        inputs = inputs.to(device)
        outputs = model(**inputs)
        # Target image sizes (height, width) to rescale box predictions [batch_size, 2]

        # Convert outputs (bounding boxes and class logits) to COCO API
        results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
        for image_id, res in zip(image_ids, results):
            for bbox, score, label in zip(res["boxes"], res["scores"], res["labels"]):
                # tensor to numpy
                bbox = bbox.cpu().detach().numpy()
                score = score.cpu().detach().numpy()
                label = label.cpu().detach().numpy()
                # bbox format: xyxy -> xywh
                x1, y1, x2, y2 = bbox
                bbox = [int(x1), int(y1), int(x2-x1), int(y2-y1)]
                instance = {}
                instance["image_id"] = image_id
                instance["bbox"] = bbox # TODO
                instance["score"] = float(score)
                instance["category_id"] = int(label) + 1 # TODO
                instances.append(instance)

# write the collected detections to a COCO-style results JSON for LVIS evaluation
with open(args.save_path, "w") as f:
    json.dump(instances, f)
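For completeness, a sketch of scoring the saved JSON with the lvis package (assuming the path-based LVISEval interface from the lvis-api README; the annotation path is a hypothetical local file):

from lvis import LVISEval

# "lvis_v1_val.json" is a placeholder for your local LVIS validation annotations
lvis_eval = LVISEval("lvis_v1_val.json", "owl-vit-result.json", "bbox")
lvis_eval.run()
lvis_eval.print_results()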


@orrzohar (Contributor) commented Jun 4, 2023

Hi @RRoundTable ,

I added a PR with the appropriate evaluation protocol

#23982

Best,
Orr

@haizadtarik commented:
Hi @alaradirik,
I'm using transformers==4.30.2 but still encounter the same issue. Any thoughts on this?

Query image: [image]

Result from Colab: [image]

Result from Hugging Face: [image]


@amyeroberts (Collaborator) commented:
cc @rafaelpadilla

@NielsRogge (Contributor) commented Sep 25, 2023

Hi folks, I've investigated the difference; it will be solved in the PR below. TL;DR: image preprocessing is done differently in the original Colab (it involves padding the image to a square), whereas the HF implementation used center cropping. The model itself is fine: the logits are exactly the same as the original implementation's on the same inputs.
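For illustration, a minimal sketch of the "pad to a square" step described above (a simplified approximation, not the exact scenic or Owlv2ImageProcessor code; the gray fill value is an assumption):

from PIL import Image

def pad_to_square(image: Image.Image, fill=(128, 128, 128)) -> Image.Image:
    # Pad the right/bottom so the image becomes square (preserving the aspect
    # ratio of the content), instead of center-cropping as OwlViTImageProcessor does.
    width, height = image.size
    side = max(width, height)
    padded = Image.new("RGB", (side, side), fill)
    padded.paste(image, (0, 0))
    return padded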

@NielsRogge mentioned this issue Sep 25, 2023
@NielsRogge (Contributor) commented Oct 14, 2023

Hi folks, since OWLv2 has now been added in #26668, you will see that the results match one-to-one with the original Google Colab notebook provided by the authors.

If you also want to get one-to-one matching results for OWL-ViT v1, you will need to use Owlv2Processor (which internally uses Owlv2ImageProcessor) instead of OwlViTProcessor, as it uses the exact same image preprocessing settings as the Colab notebook. We cannot change this for v1 due to backwards compatibility.
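A minimal sketch of that combination, following the pattern used later in this thread (pairing Owlv2Processor, with its size overridden to 768×768, with the OWL-ViT v1 checkpoint; the text queries and threshold are arbitrary examples):

import requests
import torch
from PIL import Image
from transformers import Owlv2Processor, OwlViTForObjectDetection

# v2 processor (pads to a square like the Colab) paired with the v1 model
processor = Owlv2Processor.from_pretrained(
    "google/owlvit-base-patch32", size={"height": 768, "width": 768})
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = [["a photo of a cat", "a photo of a remote control"]]

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# target sizes are (height, width); PIL's .size is (width, height)
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1)
print(results[0])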


@rishabh-akridata commented Mar 1, 2024

@RRoundTable I have been trying to reproduce the AP results on the LVIS dataset using the example script that you provided. Did you manage to reproduce the results?

@rishabh-akridata commented:
@NielsRogge I am using the Owlv2 processor, but am still not able to get the same results.

@NielsRogge (Contributor) commented:
@rishabh-akridata please provide a script that reproduces your issue

@rishabh-akridata commented Mar 6, 2024

@NielsRogge Please find the script below.


import skimage
import os
import matplotlib.pyplot as plt
from copy import deepcopy
import numpy as np
import cv2
import requests
from PIL import Image
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection, Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlvit-base-patch32", size={"height":768, "width":768})
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

filename = os.path.join(skimage.data_dir, 'astronaut.png')
image = Image.open(filename)
texts = ['face', 'rocket', 'nasa badge', 'star-spangled banner']
inputs = processor(text=texts, images=image, return_tensors="pt")
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
outputs = model(**inputs)
# Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.2)
# results = post_process_object_detection_evaluation(outputs, target_sizes=target_sizes, pred_per_im=10)
# font
font = cv2.FONT_HERSHEY_SIMPLEX
# fontScale
fontScale = 0.5
# Green color in BGR
color = (0, 255, 0)
# Line thickness of 2 px
thickness = 2

boxes, scores, labels = results[0]["boxes"], results[0]["scores"], results[0]["labels"]
image_to_plot = deepcopy(np.array(image))
image_to_plot = image_to_plot.astype(np.uint8)
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    xmin, ymin, xmax, ymax = int(box[0]), int(box[1]), int(box[2]), int(box[3])
    cv2.rectangle(image_to_plot, (xmin, ymin), (xmax, ymax), (255, 0, 0), 2)
    rounded_score = round(float(score), 2)
    # Using cv2.putText() method
    cv2.putText(image_to_plot, f"{texts[label]}:{rounded_score}", (xmin, ymax), font, fontScale,
                    color, thickness, cv2.LINE_AA, False)

plt.imshow(image_to_plot)
plt.show()

[output image]

@rishabh-akridata commented:
@NielsRogge I also tried using the processor below, but I'm facing the same issue.
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble", size={"height":768, "width":768})

@rishabh-akridata commented:
@NielsRogge When I reduce the confidence threshold to 0.1, I get some detections, but with very low confidence, and the boxes are not the same as in the official Colab notebook.
[output image with low-confidence detections]

@rishabh-akridata commented:
@NielsRogge Please ignore this one, I was looking at the results of a different model variant. I am able to get the same results as in the Colab notebook. Sorry for the inconvenience caused.

Thanks.

@iMayuqi commented Aug 2, 2024

@RRoundTable
Hello, I also ran into this reproduction problem when calling the OWL-ViT model through the Transformers package. Using your code as a reference, the overall mAP I reproduced on the LVIS dataset is 14.9, and 6.4 on the rare categories. I suspect the problem is that the original scenic model pads the image when loading it, and that the category-name prompts and the bounding boxes may need to be restored by offsetting the image padding before evaluation.

May I ask whether you succeeded in replicating the results in the end? Could you share your code? It would be of great help to me. Thank you!
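A hedged sketch of the box rescaling being described (assuming the image was padded on the right/bottom to a square before resizing, so the normalized predicted boxes refer to the padded square rather than the original image; this is an illustration, not the scenic code):

import torch

def rescale_padded_boxes(pred_boxes_cxcywh, orig_h, orig_w):
    # Boxes are normalized w.r.t. the padded square input, so rescale them by
    # the padded side length max(orig_h, orig_w) rather than by (orig_h, orig_w).
    side = max(orig_h, orig_w)
    cx, cy, w, h = pred_boxes_cxcywh.unbind(-1)
    boxes = torch.stack(
        [cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1) * side
    # Clip to the original image extent to discard the padded region.
    boxes[..., 0::2] = boxes[..., 0::2].clamp(0, orig_w)
    boxes[..., 1::2] = boxes[..., 1::2].clamp(0, orig_h)
    return boxes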

@NielsRogge (Contributor) commented:
Hi @iMayuqi, to reproduce the results I would recommend using Owlv2Processor instead of OwlViTProcessor, as it includes the padding.
