CLIP #11445
Conversation
All green!!
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)
inputs = processor(text=..., images=..., some_other_kwargs)
outputs = model(**inputs)
Ready for second review @LysandreJik @sgugger @patrickvonplaten
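For reference, a minimal runnable sketch of this processor API; the checkpoint name (in the openai namespace discussed below) and the example image URL are assumptions for illustration.

import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # assumed final checkpoint name
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

# Load an example image and prepare both modalities with the single processor.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # arbitrary example image
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-to-text similarity as probabilities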
Great work! A few last loose ends to tie up (in particular don't forget to replace all checkpoint names in the docstrings by ones in the openai namespace!) and it should be good to merge.
logger = logging.get_logger(__name__)

CLIP_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "valhalla/clip-vit-base-patch32": "https://huggingface.co/valhalla/clip-vit-base-patch32/resolve/main/config.json",
Still standing :-)
Great, LGTM! Only need to update from the valhalla namespace to the openai namespace and it looks good to me.
* begin second draft
* fix import, style
* add loss
* fix embeds, logits_scale, and projection
* fix imports
* add conversion script
* add feature_extractor and processor
* style
* add tests for tokenizer, extractor and processor
* add vision model tests
* add weight init
* add more tests
* fix save_load test
* model output, dosstrings, causal mask
* config doc
* add clip model tests
* return dict
* bigin integration test
* add integration tests
* fix-copies
* fix init
* Clip => CLIP
* fix module name
* docs
* fix doc
* output_dim => projection_dim
* fix checkpoint names
* remoe fast tokenizer file
* fix conversion script
* fix tests, quality
* put causal mask on device
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* fix attribute test
* style
* address sylvains comments
* style
* fix docstrings
* add qucik_gelu in activations, docstrings
* clean-up attention test
* fix act fun
* fix config
* fix torchscript tests
* even batch_size
* remove comment
* fix ouput tu_tuple
* fix save load tests
* fix add tokens test
* add fast tokenizer
* update copyright
* new processor API
* fix docs
* docstrings
* docs
* fix doc
* fix doc
* fix tokenizer
* fix import in doc example
* Apply suggestions from code review
  Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* check types of config
* valhalla => openai
* load image using url
* fix test
* typo

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
How do I use the processor in __getitem__()? I got the error "RuntimeError: stack expects each tensor to be equal size, but got [1, 11] at entry 0 and [1, 13] at entry 1", as follows:
Hi @lycfight, could you please open an issue with a minimal code snippet so we can take a look? Thanks :)
Of course
What does this PR do?
This PR adds the CLIP model.
CLIP is a multi-modal vision+language model which uses a transformer model for encoding both the images and text.
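As an illustrative sketch (checkpoint name assumed), the text tower can be used on its own to produce embeddings in the shared projection space; the image side works analogously via get_image_features.

import torch
from transformers import CLIPModel, CLIPTokenizer

checkpoint = "openai/clip-vit-base-patch32"  # assumed checkpoint name
model = CLIPModel.from_pretrained(checkpoint)
tokenizer = CLIPTokenizer.from_pretrained(checkpoint)

# Encode the captions and project them into the shared embedding space.
text_inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
with torch.no_grad():
    text_embeds = model.get_text_features(**text_inputs)
print(text_embeds.shape)  # (2, projection_dim)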
CLIPTextModel and CLIPVisionModel can be loaded independently and composed together to get the CLIPModel. CLIPTextModel and CLIPVisionModel use the shared encoder class CLIPEncoder.
The configuration is split into CLIPTextConfig and CLIPVisionConfig. This could be kept in one config class, but then we would have to add two arguments for each config value, i.e. text_hidden_size for the text model, vision_hidden_size for the vision model, etc.
One issue here is that when we load an individual model, like CLIPTextModel, using the weights of the whole CLIPModel, the config ends up containing both the text and vision config dicts. This does not cause any issue but could be confusing to look at.
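A short sketch of loading the towers independently (checkpoint name assumed); loading a single tower from the full CLIPModel checkpoint is expected to simply skip the weights of the other tower.

from transformers import CLIPModel, CLIPTextModel, CLIPVisionModel

checkpoint = "openai/clip-vit-base-patch32"  # assumed checkpoint name

# Either tower can be loaded on its own...
text_model = CLIPTextModel.from_pretrained(checkpoint)
vision_model = CLIPVisionModel.from_pretrained(checkpoint)

# ...or both towers can be loaded together as the full dual-encoder model.
model = CLIPModel.from_pretrained(checkpoint)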
One important thing to note here is that CLIP's tokenizer does not have a pad token defined for it; 0 is used as the pad_token_id to pad the text, but the token associated with 0 is not a pad token. So here, to be able to do padding, I've added pad_token_id as a property which returns 0. I would be happy to hear if there is some other way to achieve this.
Also, I've added a processor class here, but I'm not sure if we really need it for this model. We could easily use the extractor for the vision model and the tokenizer for the text model.
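A small sketch of the padding behaviour described above (checkpoint name assumed; the pad id printed is whatever the tokenizer exposes, expected here to be the property returning 0):

from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")  # assumed checkpoint name

# Shorter captions in the batch are padded so the resulting tensors can be stacked.
batch = tokenizer(
    ["a photo of a cat", "a photo of a dog sitting on a couch"],
    padding=True,
    return_tensors="pt",
)
print(tokenizer.pad_token_id)    # per this PR, exposed as a property returning 0
print(batch["input_ids"].shape)  # both sequences padded to the same length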
Would love your review of the design @LysandreJik, @patrickvonplaten, @sgugger.