Owlvit docs test #18257

Merged · 95 commits · Jul 26, 2022
Commits
bd08fd0
add owlvit model skeleton
alaradirik Jun 16, 2022
cff1597
add class and box predictor heads
alaradirik Jun 17, 2022
3fb93b5
convert modified flax clip to pytorch
alaradirik Jun 21, 2022
6b80535
fix box and class predictors
alaradirik Jun 22, 2022
a57c8c3
add OwlViTImageTextEmbedder
alaradirik Jun 22, 2022
298acc4
convert class and box head checkpoints
alaradirik Jun 23, 2022
aa62cf3
convert image text embedder checkpoints
alaradirik Jun 23, 2022
eed0c47
add object detection head
alaradirik Jun 23, 2022
9dfae2e
fix bugs
alaradirik Jun 27, 2022
12b3554
update conversion script
alaradirik Jun 27, 2022
6e88bdc
update conversion script
alaradirik Jun 27, 2022
d342a81
fix q,v,k,out weight conversion
alaradirik Jun 27, 2022
5a15207
add owlvit object detection output
alaradirik Jun 28, 2022
6adfabd
fix bug in image embedder
alaradirik Jun 28, 2022
ef94525
fix bugs in text embedder
alaradirik Jun 28, 2022
d4315a3
fix positional embeddings
alaradirik Jun 28, 2022
e385e33
fix bug in inference mode vision pooling
alaradirik Jun 29, 2022
985025e
update docs, init tokenizer and processor files
alaradirik Jun 29, 2022
6653465
support batch processing
alaradirik Jun 30, 2022
5e6e8b4
add OwlViTProcessor
alaradirik Jun 30, 2022
2e63dde
remove merge conflicts
alaradirik Jul 1, 2022
79083c5
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 1, 2022
35f9f31
readd owlvit imports
alaradirik Jul 1, 2022
78b7837
fix bug in OwlViTProcessor imports
alaradirik Jul 1, 2022
d919422
fix bugs in processor
alaradirik Jul 1, 2022
4635688
update docs
alaradirik Jul 1, 2022
8a1c825
fix bugs in processor
alaradirik Jul 1, 2022
363f4d5
update owlvit docs
alaradirik Jul 1, 2022
161cb2a
add OwlViTFeatureExtractor
alaradirik Jul 1, 2022
58aa6ce
style changes, add postprocess method to feature extractor
alaradirik Jul 4, 2022
37e3281
add feature extractor and processor tests
alaradirik Jul 4, 2022
261ed39
add object detection tests
alaradirik Jul 4, 2022
cf0591c
update conversion script
alaradirik Jul 5, 2022
02f3a00
update config paths
alaradirik Jul 5, 2022
ab0be98
update config paths
alaradirik Jul 5, 2022
2b215f5
fix configuration paths and bugs
alaradirik Jul 5, 2022
f97d3de
fix bugs in OwlViT tests
alaradirik Jul 5, 2022
1949b63
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 5, 2022
8680f13
add import checks to processor
alaradirik Jul 5, 2022
e6f51de
fix docs and minor issues
alaradirik Jul 6, 2022
e15988d
fix docs and minor issues
alaradirik Jul 6, 2022
b73a66d
fix bugs and issues
alaradirik Jul 7, 2022
68dd41d
fix bugs and issues
alaradirik Jul 7, 2022
11d5928
fix bugs and issues
alaradirik Jul 7, 2022
cef935d
fix bugs and issues
alaradirik Jul 8, 2022
34069b0
update docs and examples
alaradirik Jul 8, 2022
c4aa766
fix bugs and issues
alaradirik Jul 8, 2022
40a6504
update conversion script, fix positional embeddings
alaradirik Jul 8, 2022
9ce1942
process 2D input ids, update tests
alaradirik Jul 11, 2022
b330dfa
fix style and quality issues
alaradirik Jul 11, 2022
051aea6
update docs
alaradirik Jul 11, 2022
bf903f9
update docs and imports
alaradirik Jul 11, 2022
3592af5
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 11, 2022
60749fe
update OWL-ViT index.md
alaradirik Jul 11, 2022
ee007d6
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 12, 2022
6f1aa2d
fix bug in OwlViT feature ext tests
alaradirik Jul 12, 2022
6af7248
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 12, 2022
ba03dbf
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 12, 2022
865510c
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 13, 2022
df9313d
fix code examples, return_dict by default
alaradirik Jul 13, 2022
57d1b68
return_dict by default
alaradirik Jul 13, 2022
253af8b
minor fixes, add tests to processor
alaradirik Jul 13, 2022
3e180da
small fixes
alaradirik Jul 13, 2022
43c04af
add output_attentions arg to main model
alaradirik Jul 13, 2022
efc1ad3
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 13, 2022
8ceea4e
fix bugs
alaradirik Jul 13, 2022
4d416fe
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 14, 2022
4099199
remove output_hidden_states arg from main model
alaradirik Jul 14, 2022
e73b129
update self.config variables
alaradirik Jul 14, 2022
0f3d56f
add option to return last_hidden_states
alaradirik Jul 14, 2022
47c55ea
fix bug in config variables
alaradirik Jul 14, 2022
db70aee
fix copied from statements
alaradirik Jul 14, 2022
ea1452b
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 20, 2022
456bbb3
fix small issues and bugs
alaradirik Jul 20, 2022
c6cd321
fix bugs
alaradirik Jul 20, 2022
57c2cb8
fix bugs, support greyscale images
alaradirik Jul 21, 2022
7ba2c41
run fixup
alaradirik Jul 21, 2022
8c560cb
update repo name
alaradirik Jul 21, 2022
ef2b4f5
merge OwlViTImageTextEmbedder with obj detection head
alaradirik Jul 21, 2022
dfbc6b5
fix merge conflict
alaradirik Jul 21, 2022
27a5ce5
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 21, 2022
405685a
fix merge conflict
alaradirik Jul 21, 2022
a66a879
make fixup
alaradirik Jul 21, 2022
32525bd
fix bugs
alaradirik Jul 22, 2022
1f931eb
fix bugs
alaradirik Jul 22, 2022
1867147
Merge branch 'huggingface:main' into owlvit
alaradirik Jul 22, 2022
75e5ccf
add additional processor test
alaradirik Jul 22, 2022
b991473
merge owlvit
alaradirik Jul 22, 2022
5203060
Merge branch 'huggingface:main' into main
alaradirik Jul 22, 2022
ce41dec
fix docs and add owlvit docs test
alaradirik Jul 22, 2022
1580745
fix minor bug in post_process, add to processor
alaradirik Jul 22, 2022
2c71be4
improve owlvit code examples
alaradirik Jul 22, 2022
19ec210
Merge branch 'huggingface:main' into owlvit-docs-test
alaradirik Jul 22, 2022
7b1f4df
fix hardcoded image size
alaradirik Jul 25, 2022
09f4ed1
Merge branch 'huggingface:main' into owlvit-docs-test
alaradirik Jul 26, 2022
Files changed
docs/source/en/model_doc/owlvit.mdx (19 additions, 12 deletions)

````diff
@@ -39,19 +39,26 @@ OWL-ViT is a zero-shot text-conditioned object detection model. OWL-ViT uses [CL
 
 >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
 >>> image = Image.open(requests.get(url, stream=True).raw)
-
->>> inputs = processor(text=[["a photo of a cat", "a photo of a dog"]], images=image, return_tensors="pt")
-
+>>> texts = [["a photo of a cat", "a photo of a dog"]]
+>>> inputs = processor(text=texts, images=image, return_tensors="pt")
 >>> outputs = model(**inputs)
->>> logits = outputs["logits"]  # Prediction logits of shape [batch_size, num_patches, num_max_text_queries]
->>> boxes = outputs["pred_boxes"]  # Object box boundaries of shape [batch_size, num_patches, 4]
-
->>> batch_size = boxes.shape[0]
->>> for i in range(batch_size):  # Loop over sets of images and text queries
-...     boxes = outputs["pred_boxes"][i]
-...     logits = torch.max(outputs["logits"][i], dim=-1)
-...     scores = torch.sigmoid(logits.values)
-...     labels = logits.indices
+
+>>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
+>>> target_sizes = torch.Tensor([image.size[::-1]])
+>>> # Convert outputs (bounding boxes and class logits) to COCO API
+>>> results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
+
+>>> i = 0  # Retrieve predictions for the first image for the corresponding text queries
+>>> text = texts[i]
+>>> boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
+
+>>> score_threshold = 0.1
+>>> for box, score, label in zip(boxes, scores, labels):
+...     box = [round(i, 2) for i in box.tolist()]
+...     if score >= score_threshold:
+...         print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
+Detected a photo of a cat with confidence 0.243 at location [1.42, 50.69, 308.58, 370.48]
+Detected a photo of a cat with confidence 0.298 at location [348.06, 20.56, 642.33, 372.61]
 ```
 
 This model was contributed by [adirik](https://huggingface.co/adirik). The original code can be found [here](https://github.com/google-research/scenic/tree/main/scenic/projects/owl_vit).
````
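Assembled end to end, the added lines form the following runnable example (a sketch: the `requests`, `PIL`, and `torch` imports are assumed from the portions of the docs page outside this hunk, and the printed detections are the expected output shown in the diff):

```python
import requests
import torch
from PIL import Image

from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# One list of text queries per image; a single image here.
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)

# Target image sizes (height, width) used to rescale the normalized box predictions.
target_sizes = torch.Tensor([image.size[::-1]])
# Convert model outputs (bounding boxes and class logits) to COCO-style detections.
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)

i = 0  # Predictions for the first image and its text queries.
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

score_threshold = 0.1
for box, score, label in zip(boxes, scores, labels):
    box = [round(coord, 2) for coord in box.tolist()]
    if score >= score_threshold:
        print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
```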
src/transformers/models/owlvit/feature_extraction_owlvit.py (6 additions, 6 deletions)

```diff
@@ -26,7 +26,6 @@
 
 if is_torch_available():
     import torch
-    from torch import nn
 
 logger = logging.get_logger(__name__)
 
@@ -109,18 +108,19 @@ def post_process(self, outputs, target_sizes):
             `List[Dict]`: A list of dictionaries, each dictionary containing the scores, labels and boxes for an image
             in the batch as predicted by the model.
         """
-        out_logits, out_bbox = outputs.logits, outputs.pred_boxes
+        logits, boxes = outputs.logits, outputs.pred_boxes
 
-        if len(out_logits) != len(target_sizes):
+        if len(logits) != len(target_sizes):
             raise ValueError("Make sure that you pass in as many target sizes as the batch dimension of the logits")
         if target_sizes.shape[1] != 2:
             raise ValueError("Each element of target_sizes must contain the size (h, w) of each image of the batch")
 
-        prob = nn.functional.softmax(out_logits, -1)
-        scores, labels = prob[..., :-1].max(-1)
+        probs = torch.max(logits, dim=-1)
+        scores = torch.sigmoid(probs.values)
+        labels = probs.indices
 
         # Convert to [x0, y0, x1, y1] format
-        boxes = center_to_corners_format(out_bbox)
+        boxes = center_to_corners_format(boxes)
 
         # Convert from relative [0, 1] to absolute [0, height] coordinates
         img_h, img_w = target_sizes.unbind(1)
```
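The substantive change in `post_process` is the scoring rule: the previous DETR-style softmax over classes (with the last slot treated as background, hence `prob[..., :-1]`) gives way to taking the best-matching text query per predicted box and passing its logit through a sigmoid, mirroring the scoring in the updated docs example. A minimal sketch of the new rule on dummy tensors (the shapes are illustrative, taken from the shape comments in the example):

```python
import torch

# Illustrative shapes: [batch_size, num_patches, num_max_text_queries]
logits = torch.randn(2, 576, 2)

# Best-matching text query per predicted box.
probs = torch.max(logits, dim=-1)
scores = torch.sigmoid(probs.values)  # per-box confidence in [0, 1]
labels = probs.indices                # index of the winning text query

print(scores.shape, labels.shape)  # torch.Size([2, 576]) torch.Size([2, 576])
```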
src/transformers/models/owlvit/modeling_owlvit.py (20 additions, 12 deletions)

````diff
@@ -1300,23 +1300,31 @@ def forward(
         >>> import torch
         >>> from transformers import OwlViTProcessor, OwlViTForObjectDetection
 
-        >>> model = OwlViTModel.from_pretrained("google/owlvit-base-patch32")
         >>> processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
+        >>> model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
 
         >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
         >>> image = Image.open(requests.get(url, stream=True).raw)
 
-        >>> inputs = processor(text=[["a photo of a cat", "a photo of a dog"]], images=image, return_tensors="pt")
+        >>> texts = [["a photo of a cat", "a photo of a dog"]]
+        >>> inputs = processor(text=texts, images=image, return_tensors="pt")
         >>> outputs = model(**inputs)
-        >>> logits = outputs["logits"]  # Prediction logits of shape [batch_size, num_patches, num_max_text_queries]
-        >>> boxes = outputs["pred_boxes"]  # Object box boundaries of shape # [batch_size, num_patches, 4]
-
-        >>> batch_size = boxes.shape[0]
-        >>> for i in range(batch_size):  # Loop over sets of images and text queries
-        ...     boxes = outputs["pred_boxes"][i]
-        ...     logits = torch.max(outputs["logits"][i], dim=-1)
-        ...     scores = torch.sigmoid(logits.values)
-        ...     labels = logits.indices
+
+        >>> # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
+        >>> target_sizes = torch.Tensor([image.size[::-1]])
+        >>> # Convert outputs (bounding boxes and class logits) to COCO API
+        >>> results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
+
+        >>> i = 0  # Retrieve predictions for the first image for the corresponding text queries
+        >>> text = texts[i]
+        >>> boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
+
+        >>> score_threshold = 0.1
+        >>> for box, score, label in zip(boxes, scores, labels):
+        ...     box = [round(i, 2) for i in box.tolist()]
+        ...     if score >= score_threshold:
+        ...         print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
+        Detected a photo of a cat with confidence 0.243 at location [1.42, 50.69, 308.58, 370.48]
+        Detected a photo of a cat with confidence 0.298 at location [348.06, 20.56, 642.33, 372.61]
         ```"""
         output_hidden_states = (
             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
````
src/transformers/models/owlvit/processing_owlvit.py (7 additions)

```diff
@@ -139,6 +139,13 @@ def __call__(self, text=None, images=None, padding="max_length", return_tensors=
         else:
             return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors)
 
+    def post_process(self, *args, **kwargs):
+        """
+        This method forwards all its arguments to [`OwlViTFeatureExtractor.post_process`]. Please refer to the
+        docstring of this method for more information.
+        """
+        return self.feature_extractor.post_process(*args, **kwargs)
+
     def batch_decode(self, *args, **kwargs):
         """
         This method forwards all its arguments to CLIPTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
```
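Since the processor now exposes `post_process` itself, callers no longer need to reach through to the feature extractor. A quick sketch of the equivalence, assuming the same checkpoint and inputs as the docs example:

```python
import requests
import torch
from PIL import Image

from transformers import OwlViTForObjectDetection, OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(text=[["a photo of a cat"]], images=image, return_tensors="pt")
outputs = model(**inputs)
target_sizes = torch.Tensor([image.size[::-1]])

# The new method simply forwards, so both calls return the same detections.
via_processor = processor.post_process(outputs=outputs, target_sizes=target_sizes)
via_extractor = processor.feature_extractor.post_process(outputs=outputs, target_sizes=target_sizes)
assert torch.equal(via_processor[0]["boxes"], via_extractor[0]["boxes"])
```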
utils/documentation_tests.txt (1 addition)

```diff
@@ -48,6 +48,7 @@ src/transformers/models/mobilevit/modeling_mobilevit.py
 src/transformers/models/opt/modeling_opt.py
 src/transformers/models/opt/modeling_tf_opt.py
 src/transformers/models/opt/modeling_flax_opt.py
+src/transformers/models/owlvit/modeling_owlvit.py
 src/transformers/models/pegasus/modeling_pegasus.py
 src/transformers/models/plbart/modeling_plbart.py
 src/transformers/models/poolformer/modeling_poolformer.py
```
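Listing a file in `utils/documentation_tests.txt` enrolls its docstring examples in the repository's doc test run, which is what this PR's title refers to. As a rough local approximation only (the CI uses its own pytest-based runner; this plain `doctest` invocation is an assumption, not the project's actual command), the new entry's examples could be exercised like so:

```python
# Rough local approximation: execute the docstring examples in the newly
# enrolled module with Python's doctest. The repository's CI uses its own
# runner instead, so treat this purely as an illustration.
import doctest

import transformers.models.owlvit.modeling_owlvit as owlvit_module

results = doctest.testmod(owlvit_module, optionflags=doctest.ELLIPSIS)
print(f"{results.attempted} examples attempted, {results.failed} failed")
```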