Add Blip2ForImageTextRetrieval for multimodal feature extraction #25612
Status: Closed

jpizarrom wants to merge 26 commits into huggingface:main from jpizarrom:add_blip2_image_text_retrieval_model

Commits:
- 4023732: Add Blip2ForImageTextRetrieval
- 8eb718a: add Blip2ModelWithProjection
- 786da89: use gpu on Blip2ForImageTextRetrieval.forward doctest
- 188e3a7: use gpu on Blip2ModelWithProjection.forward doctest
- d1cc037: use float16 on Blip2ForImageTextRetrieval.forward doctest
- 5f72231: add _tied_weights_keys to Blip2ForImageTextRetrieval
- e099caa: add temp param to Blip2ForImageTextRetrieval
- a1ab97f: add Blip2TextModelWithProjection and Blip2VisionModelWithProjection
- 18d5340: use cuda and float16 in doctest Blip2VisionModelWithProjection
- 0a227d0: rename Blip2ModelWithProjection to Blip2ModelWithoutLM
- 43fb263: add image_text_hidden_size to docstring
- f8b0ed5: remove image_text_hidden_size from BlipConfig
- 401b8b8: use Blip2ModelWithoutLMConfig in convert script
- a0f7142: remove not used text_model_tester
- 46adfd5: restore image_text_hidden_size in BlipConfig
- a2c098e: rename Blip2ModelWithoutLMConfig.from_vision_qformer_configs
- 532f5ae: remove Blip2ModelWithoutLMConfig
- ce86d4c: remove Blip2ModelWithProjection
- 253e067: remove _tied_weights_keys in Blip2ForImageTextRetrieval
- 81aea68: remove unused code: blip2_loss
- 04e2668: remove unused Blip2Output
- 3d2dfbd: remove Blip2ModelWithoutLM from check_repo
- b9343ba: add qformer_text_input line in the docstring
- 6b65330: add tests for Blip2ForImageTextRetrieval and Blip2VisionModelWithProj…
- afd66ca: Merge branch 'main' into add_blip2_image_text_retrieval_model
- 47acd93: add skip on test_training_gradient_checkpointing_use_reentrant
The diff touches the BLIP-2 configuration module. First, the new `qformer_text_input` argument is documented in the `Blip2QFormerConfig` docstring:

```diff
@@ -176,6 +176,8 @@ class Blip2QFormerConfig(PretrainedConfig):
             The frequency of adding cross-attention to the Transformer layers.
         encoder_hidden_size (`int`, *optional*, defaults to 1408):
             The hidden size of the hidden states for cross-attention.
+        qformer_text_input (`bool`, *optional*, defaults to `False`):
+            Whether to use BERT-style embeddings.

     Examples:
```
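The flag toggles an optional BERT-style text-embedding path inside the Q-Former. As a rough sketch of that pattern, with tiny stand-in classes rather than the actual transformers implementation:

```python
# Stand-in classes (NOT the transformers implementation) showing how a
# boolean config flag like `qformer_text_input` can gate an optional
# BERT-style embedding path. Sizes are kept tiny for illustration.

class QFormerConfigSketch:
    def __init__(self, hidden_size=8, vocab_size=16, qformer_text_input=False):
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        # When True, the module also builds word embeddings so it can
        # consume token ids alongside the learned query tokens.
        self.qformer_text_input = qformer_text_input


class QFormerSketch:
    def __init__(self, config):
        self.config = config
        # The embedding table only exists when the flag is set.
        self.word_embeddings = (
            [[0.0] * config.hidden_size for _ in range(config.vocab_size)]
            if config.qformer_text_input
            else None
        )

    def forward(self, query_tokens, input_ids=None):
        if input_ids is not None and self.word_embeddings is None:
            raise ValueError("set qformer_text_input=True to pass input_ids")
        # ... attention layers would run here; this sketch passes through ...
        return query_tokens


text_model = QFormerSketch(QFormerConfigSketch(qformer_text_input=True))
plain_model = QFormerSketch(QFormerConfigSketch())
```

With the flag off, the text-embedding machinery is simply never constructed, which keeps the default (captioning/VQA) configuration unchanged.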
The flag is then accepted by `Blip2QFormerConfig.__init__`:

```diff
@@ -209,6 +211,7 @@ def __init__(
         position_embedding_type="absolute",
         cross_attention_frequency=2,
         encoder_hidden_size=1408,
+        qformer_text_input=False,
         **kwargs,
     ):
         super().__init__(pad_token_id=pad_token_id, **kwargs)
```

Reviewer: Can you add a line in the docstring above to explain the aim of this arg? 🙏
Author: done
```diff
@@ -227,6 +230,7 @@ def __init__(
         self.position_embedding_type = position_embedding_type
         self.cross_attention_frequency = cross_attention_frequency
         self.encoder_hidden_size = encoder_hidden_size
+        self.qformer_text_input = qformer_text_input

     @classmethod
     def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
```
In `Blip2Config`, the new `image_text_hidden_size` argument is documented:

```diff
@@ -266,7 +270,8 @@ class Blip2Config(PretrainedConfig):
             Dictionary of configuration options used to initialize any [`PretrainedConfig`].
         num_query_tokens (`int`, *optional*, defaults to 32):
             The number of query tokens passed through the Transformer.
-
+        image_text_hidden_size (`int`, *optional*, defaults to 256):
+            Dimensionality of the hidden state of the image-text fusion layer.
         kwargs (*optional*):
             Dictionary of keyword arguments.
```
```diff
@@ -302,7 +307,15 @@ class Blip2Config(PretrainedConfig):

     model_type = "blip-2"

-    def __init__(self, vision_config=None, qformer_config=None, text_config=None, num_query_tokens=32, **kwargs):
+    def __init__(
+        self,
+        vision_config=None,
+        qformer_config=None,
+        text_config=None,
+        num_query_tokens=32,
+        image_text_hidden_size=256,
+        **kwargs,
+    ):
         super().__init__(**kwargs)

         if vision_config is None:
```
```diff
@@ -326,6 +339,7 @@ def __init__(self, vision_config=None, qformer_config=None, text_config=None, nu
         self.is_encoder_decoder = self.text_config.is_encoder_decoder

         self.num_query_tokens = num_query_tokens
+        self.image_text_hidden_size = image_text_hidden_size
         self.qformer_config.encoder_hidden_size = self.vision_config.hidden_size
         self.use_decoder_only_language_model = self.text_config.model_type in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
         self.initializer_factor = 1.0
```
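`image_text_hidden_size` is the width of the shared space into which image and text features are projected before a retrieval similarity is computed (the diff's default is 256). A toy, dependency-free sketch of that idea; the pooling "projection" and the numbers below are purely illustrative, not the model's learned projections:

```python
import math

IMAGE_TEXT_HIDDEN_SIZE = 4  # stand-in for the 256-dim default in the diff


def project(features, out_dim):
    """Toy 'projection': average-pool the input into out_dim slots."""
    chunk = len(features) // out_dim
    return [
        sum(features[i * chunk:(i + 1) * chunk]) / chunk
        for i in range(out_dim)
    ]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm


# Both modalities land in the same IMAGE_TEXT_HIDDEN_SIZE-dim space,
# so a single similarity score can rank image-text pairs.
image_feat = project([0.1, 0.4, 0.3, 0.8, 0.5, 0.2, 0.9, 0.6], IMAGE_TEXT_HIDDEN_SIZE)
text_feat = project([0.2, 0.3, 0.4, 0.7, 0.6, 0.1, 0.8, 0.5], IMAGE_TEXT_HIDDEN_SIZE)
score = cosine(image_feat, text_feat)
```

Storing the dimension on the config (rather than hard-coding it) is what lets the projection heads of the retrieval model be sized at construction time.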
Finally, a `from_vision_qformer_configs` helper is added alongside the existing `from_vision_qformer_text_configs`:

```diff
@@ -353,3 +367,23 @@ def from_vision_qformer_text_configs(
             text_config=text_config.to_dict(),
             **kwargs,
         )
+
+    @classmethod
+    def from_vision_qformer_configs(
+        cls,
+        vision_config: Blip2VisionConfig,
+        qformer_config: Blip2QFormerConfig,
+        **kwargs,
+    ):
+        r"""
+        Instantiate a [`Blip2Config`] (or a derived class) from BLIP-2 vision and Q-Former model configurations.
+
+        Returns:
+            [`Blip2Config`]: An instance of a configuration object
+        """
+
+        return cls(
+            vision_config=vision_config.to_dict(),
+            qformer_config=qformer_config.to_dict(),
+            **kwargs,
+        )
```
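The helper mirrors the composite-config pattern used throughout transformers: serialize each sub-config with `to_dict()` and hand the dicts to the combined config's constructor. A self-contained sketch of that pattern, with stand-in classes in place of the real `Blip2VisionConfig`, `Blip2QFormerConfig`, and `Blip2Config`:

```python
# Stand-in classes illustrating the composite-config pattern from the diff;
# the real transformers configs carry many more fields and defaults.

class VisionCfg:
    def __init__(self, hidden_size=1408):
        self.hidden_size = hidden_size

    def to_dict(self):
        return {"hidden_size": self.hidden_size}


class QFormerCfg:
    def __init__(self, hidden_size=768):
        self.hidden_size = hidden_size

    def to_dict(self):
        return {"hidden_size": self.hidden_size}


class Blip2Cfg:
    def __init__(self, vision_config=None, qformer_config=None,
                 image_text_hidden_size=256, **kwargs):
        self.vision_config = vision_config or {}
        self.qformer_config = qformer_config or {}
        self.image_text_hidden_size = image_text_hidden_size

    @classmethod
    def from_vision_qformer_configs(cls, vision_config, qformer_config, **kwargs):
        # Serialize the sub-configs so the composite owns plain dicts,
        # the same shape the diff's helper produces via to_dict().
        return cls(
            vision_config=vision_config.to_dict(),
            qformer_config=qformer_config.to_dict(),
            **kwargs,
        )


config = Blip2Cfg.from_vision_qformer_configs(
    VisionCfg(), QFormerCfg(), image_text_hidden_size=128
)
```

Extra keyword arguments (such as `image_text_hidden_size` above) pass straight through `**kwargs` to the combined config, so callers can override defaults without a text config being involved at all.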
Reviewer: Could we change `qformer_text_input` to a name that indicates it's a flag, e.g. `use_qformer_text_input` or `use_text_embeddings`? The current name suggests the value would be a text input itself.