CLIP #11445

Merged
merged 64 commits into from
May 12, 2021

Conversation

@patil-suraj (Contributor) commented Apr 26, 2021

What does this PR do?

This PR adds the CLIP model.

CLIP is a multi-modal vision-and-language model that uses transformer encoders for both the images and the text.

  • The model here is designed such that both CLIPTextModel and CLIPVisionModel can be loaded independently and composed together to get the CLIPModel (see the loading sketch after this list).
  • Both CLIPTextModel and CLIPVisionModel use the shared encoder class CLIPEncoder.
  • The config classes are also kept separate, i.e. CLIPTextConfig and CLIPVisionConfig. They could be merged into a single config class, but then we would need two arguments for each config value, e.g. text_hidden_size for the text model and vision_hidden_size for the vision model.
    One issue here is that when we load an individual model, like CLIPTextModel, from the weights of the whole CLIPModel, the config ends up containing both the text and vision config dicts. This does not cause any issue, but it could be confusing to look at.
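For illustration, a minimal loading sketch of that split (a sketch only; it assumes the final openai/clip-vit-base-patch32 checkpoint name, whereas the PR initially used the valhalla namespace):

    from transformers import (
        CLIPTextConfig,
        CLIPVisionConfig,
        CLIPTextModel,
        CLIPVisionModel,
        CLIPModel,
    )

    # Separate config classes for the two towers (example values, shown only to
    # illustrate the split; they are not passed further in this sketch).
    text_config = CLIPTextConfig(hidden_size=512)
    vision_config = CLIPVisionConfig(hidden_size=768)

    # Each tower can be loaded on its own from the full CLIP checkpoint...
    text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
    vision_model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

    # ...or the composed dual-encoder model can be loaded directly.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")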

One important thing to note here is that CLIP's tokenizer does not have a pad token defined for it: 0 is used as the pad_token_id to pad the text, but the token associated with id 0 is not a pad token. So here, to be able to do padding, I've added pad_token_id as a property which returns 0. I would be happy to hear if there is some other way to achieve this.
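To make the intent concrete, a toy sketch of that idea (illustrative only, not the actual tokenizer code in this PR):

    class ToyCLIPTokenizer:
        """Toy illustration: expose pad_token_id as a property that always returns 0,
        even though the token behind id 0 is not a dedicated pad token."""

        @property
        def pad_token_id(self) -> int:
            return 0

    tok = ToyCLIPTokenizer()
    assert tok.pad_token_id == 0  # padding utilities can now rely on this id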

Also, I've added a processor class here, but I'm not sure if we really need it for this model. We could easily use the feature extractor for the vision model and the tokenizer for the text model.

Would love your review of the design @LysandreJik, @patrickvonplaten, @sgugger.

@patil-suraj added the PR for Model Addition and WIP labels Apr 26, 2021
@patil-suraj changed the title from CLIP to [WIP] CLIP Apr 26, 2021
@patil-suraj mentioned this pull request Apr 26, 2021
@patil-suraj (Contributor, Author)

All green!!
I've addressed most of the suggestions, notably

  • new processor API => as discussed with @patrickvonplaten and @LysandreJik, the processor's __call__ now accepts text and/or images and returns a single encoding dict. as_target_processor is now removed. The API is as follows (see the end-to-end sketch at the end of this comment):
model = CLIPModel.from_pretrained(checkpoint)
inputs = CLIPProcessor(texts=..., images=..., some_other_kwargs)
outputs = model(**inputs)
  • the encode_text and encode_image methods are renamed to get_text_features and get_image_features
  • Added fast tokenizer.

Ready for second review @LysandreJik @sgugger @patrickvonplaten
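For context, an end-to-end sketch of the API described above (the checkpoint name, captions, and image URL are placeholders I picked, not taken from this PR; keyword names follow the merged processor as released, where the argument is text= rather than texts=):

    import requests
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    checkpoint = "openai/clip-vit-base-patch32"
    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)

    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)

    # One call handles both modalities and returns a single encoding dict.
    inputs = processor(
        text=["a photo of a cat", "a photo of a dog"],
        images=image,
        return_tensors="pt",
        padding=True,
    )
    outputs = model(**inputs)

    # Per-modality embeddings via the renamed methods.
    text_features = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_features = model.get_image_features(pixel_values=inputs["pixel_values"])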

@sgugger (Collaborator) left a comment

Great work! A few last loose ends to tie up (in particular don't forget to replace all checkpoint names in the docstrings by ones in the openai namespace!) and it should be good to merge.

docs/source/model_doc/clip.rst
src/transformers/models/clip/__init__.py
logger = logging.get_logger(__name__)

CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "valhalla/clip-vit-base-patch32": "https://huggingface.co/valhalla/clip-vit-base-patch32/resolve/main/config.json",
}
Collaborator
Still standing :-)
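For reference, a sketch of what this entry would look like after the namespace update requested in the reviews (assuming the final openai checkpoint name):

    CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP = {
        "openai/clip-vit-base-patch32": "https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/config.json",
    }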

src/transformers/models/clip/modeling_clip.py
src/transformers/models/clip/processing_clip.py (5 resolved review comments)
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@patil-suraj changed the title from [WIP] CLIP to CLIP May 11, 2021
@patil-suraj removed the WIP label May 11, 2021
@LysandreJik (Member) left a comment

Great, LGTM! Only need to update from valhalla namespace to openai namespace and it looks good to me.

@patil-suraj merged commit 8719afa into huggingface:master May 12, 2021
@patil-suraj deleted the add-clip branch May 12, 2021 08:18
Iwontbecreative pushed a commit to Iwontbecreative/transformers that referenced this pull request Jul 15, 2021
* begin second draft

* fix import, style

* add loss

* fix embeds, logits_scale, and projection

* fix imports

* add conversion script

* add feature_extractor and processor

* style

* add tests for tokenizer, extractor and processor

* add vision model tests

* add weight init

* add more tests

* fix save_load  test

* model output, docstrings, causal mask

* config doc

* add clip model tests

* return dict

* begin integration test

* add integration tests

* fix-copies

* fix init

* Clip => CLIP

* fix module name

* docs

* fix doc

* output_dim => projection_dim

* fix checkpoint names

* remove fast tokenizer file

* fix conversion script

* fix tests, quality

* put causal mask on device

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix attribute test

* style

* address sylvains comments

* style

* fix docstrings

* add quick_gelu in activations, docstrings

* clean-up attention test

* fix act fun

* fix config

* fix torchscript tests

* even batch_size

* remove comment

* fix output to_tuple

* fix save load tests

* fix add tokens test

* add fast tokenizer

* update copyright

* new processor API

* fix docs

* docstrings

* docs

* fix doc

* fix doc

* fix tokenizer

* fix import in doc example

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* check types of config

* valhalla => openai

* load image using url

* fix test

* typo

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@lycfight commented Aug 27, 2021


How to use the processor in __getitem__()? I got the error "RuntimeError: stack expects each tensor to be equal size, but got [1, 11] at entry 0 and [1, 13] at entry 1" with the following:

    def __getitem__(self, idx):
        img_id = self.img_ids[idx]
        # randomly pick one caption from the image captions
        text = random.choice(self.img_id_to_captions[img_id])
        img_filename = self.img_id_to_filename[img_id]
        img_path = op.join(self.img_dir, img_filename)
        img = Image.open(img_path)
        input = self.processor(text=text, images=img, return_tensors="pt", padding=True)
        return input

I thought the processor might need other args inherited from PreTrainedTokenizerBase, such as padding, but I couldn't find them in the processor's __call__ documentation.
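For readers hitting the same RuntimeError: the stack failure happens because each __getitem__ call pads its own sample independently, so texts in a batch end up with different lengths. A sketch of one possible workaround, not an answer given in this thread (the checkpoint name and sample layout are assumptions): return the raw caption and image per sample and pad whole batches in a DataLoader collate_fn.

    from transformers import CLIPProcessor

    # Assumed checkpoint; any CLIP checkpoint with a processor would work the same way.
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def collate_fn(samples):
        # samples: list of (caption, PIL.Image) pairs returned by __getitem__
        texts = [caption for caption, _ in samples]
        images = [image for _, image in samples]
        # Padding the whole batch in one call gives every text the same length,
        # so the resulting tensors can be stacked.
        return processor(text=texts, images=images, return_tensors="pt", padding=True)

    # Pass it to torch.utils.data.DataLoader(dataset, batch_size=..., collate_fn=collate_fn)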

@patil-suraj (Contributor, Author)

Hi @lycfight could you please open an issue with a minimal code snippet so we could take a look. Thanks :)

@lycfight commented Sep 5, 2021


of course
