CLIP #11445
Conversation
All green!!
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)
inputs = processor(text=..., images=..., some_other_kwargs)
outputs = model(**inputs)
Ready for second review @LysandreJik @sgugger @patrickvonplaten
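For reference, a minimal runnable sketch of this processor API; the checkpoint name (in the openai namespace discussed below) and the example image URL are assumptions for illustration.

import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # assumed final checkpoint name
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

# Load an example image and prepare both modalities with the single processor.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # arbitrary example image
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-to-text similarity as probabilities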
Great work! A few last loose ends to tie up (in particular don't forget to replace all checkpoint names in the docstrings by ones in the openai namespace!) and it should be good to merge.
logger = logging.get_logger(__name__)

CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "valhalla/clip-vit-base-patch32": "https://huggingface.co/valhalla/clip-vit-base-patch32/resolve/main/config.json",
Still standing :-)
Great, LGTM! Only need to update from the valhalla namespace to the openai namespace and it looks good to me.
* begin second draft
* fix import, style
* add loss
* fix embeds, logits_scale, and projection
* fix imports
* add conversion script
* add feature_extractor and processor
* style
* add tests for tokenizer, extractor and processor
* add vision model tests
* add weight init
* add more tests
* fix save_load test
* model output, dosstrings, causal mask
* config doc
* add clip model tests
* return dict
* bigin integration test
* add integration tests
* fix-copies
* fix init
* Clip => CLIP
* fix module name
* docs
* fix doc
* output_dim => projection_dim
* fix checkpoint names
* remoe fast tokenizer file
* fix conversion script
* fix tests, quality
* put causal mask on device
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* fix attribute test
* style
* address sylvains comments
* style
* fix docstrings
* add qucik_gelu in activations, docstrings
* clean-up attention test
* fix act fun
* fix config
* fix torchscript tests
* even batch_size
* remove comment
* fix ouput tu_tuple
* fix save load tests
* fix add tokens test
* add fast tokenizer
* update copyright
* new processor API
* fix docs
* docstrings
* docs
* fix doc
* fix doc
* fix tokenizer
* fix import in doc example
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* check types of config
* valhalla => openai
* load image using url
* fix test
* typo

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
How do I use the processor in __getitem__()? I got the error "RuntimeError: stack expects each tensor to be equal size, but got [1, 11] at entry 0 and [1, 13] at entry 1", as follows:
Hi @lycfight, could you please open an issue with a minimal code snippet so we can take a look? Thanks :)
Of course
What does this PR do?
This PR adds the CLIP model.
CLIP is a multi-modal vision+language model which uses a transformer model for encoding both the images and text.
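As an illustrative sketch (checkpoint name assumed), the text tower can be used on its own to produce embeddings in the shared projection space; the image side works analogously via get_image_features.

import torch
from transformers import CLIPModel, CLIPTokenizer

checkpoint = "openai/clip-vit-base-patch32"  # assumed checkpoint name
model = CLIPModel.from_pretrained(checkpoint)
tokenizer = CLIPTokenizer.from_pretrained(checkpoint)

# Encode the captions and project them into the shared embedding space.
text_inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
print(text_embeds.shape)  # (2, projection_dim)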
CLIPTextModel and CLIPVisionModel can be loaded independently and composed together to get the CLIPModel. CLIPTextModel and CLIPVisionModel use the shared encoder class CLIPEncoder.
The configuration is split into CLIPTextConfig and CLIPVisionConfig. This could be kept in one config class, but then we would have to add two arguments for each config value, i.e. text_hidden_size for the text model, vision_hidden_size for the vision model, etc.
One issue here is that when we load an individual model, like CLIPTextModel, using the weights of the whole CLIPModel, the config ends up containing both the text and vision config dicts. This does not cause any issue but could be confusing to look at.
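A short sketch of loading the towers independently (checkpoint name assumed); loading a single tower from the full CLIPModel checkpoint is expected to simply skip the weights of the other tower.

from transformers import CLIPModel, CLIPTextModel, CLIPVisionModel

checkpoint = "openai/clip-vit-base-patch32"  # assumed checkpoint name

# Either tower can be loaded on its own...
text_model = CLIPTextModel.from_pretrained(checkpoint)
vision_model = CLIPVisionModel.from_pretrained(checkpoint)

# ...or both towers can be loaded together as the full dual-encoder model.
model = CLIPModel.from_pretrained(checkpoint)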
One important thing to note here is that CLIP's tokenizer does not have a pad token defined for it; 0 is used as the pad_token_id to pad the text, but the token associated with 0 is not a pad token. So here, to be able to do padding, I've added pad_token_id as a property which returns 0. I would be happy to hear if there is some other way to achieve this.
Also, I've added a processor class here, but I'm not sure if we really need it for this model. We could easily use the extractor for the vision model and the tokenizer for the text model.
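A small sketch of the padding behaviour described above (checkpoint name assumed; the pad id printed is whatever the tokenizer exposes, expected here to be the property returning 0):

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint name

# Shorter captions in the batch are padded so the resulting tensors can be stacked.
batch = tokenizer(
    ["a photo of a cat", "a photo of a dog sitting on a couch"],
    padding=True,
    return_tensors="pt",
)
print(tokenizer.pad_token_id)    # per this PR, exposed as a property returning 0
print(batch["input_ids"].shape)  # both sequences padded to the same length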
Would love your review of the design @LysandreJik, @patrickvonplaten, @sgugger.