text_feature: [batch, 77, 768]. In cross-attention, we project the text_feature to the same channel size as the image feature.
But in that case, do the k and v matrices generated from the text features have the same shape as the q matrix generated from the image features? Can you still do the matrix multiplication?
The q, k, v matrices do not need to have the same shape; q and k only need to share the same channel dimension. It is fine for k and v to have the same shape.
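To make the shape argument concrete, here is a minimal single-head cross-attention sketch in NumPy. The sequence lengths (196 image tokens, 77 text tokens) and the shared channel size of 768 are illustrative assumptions, not taken from the repository's code:

```python
import numpy as np

def cross_attention(q, k, v):
    """Minimal single-head cross-attention.

    q: [batch, n_img, d]  -- queries from image features
    k: [batch, n_txt, d]  -- keys from text features
    v: [batch, n_txt, d]  -- values from text features

    q and k only need to share the channel dim d; the sequence
    lengths n_img and n_txt may differ.
    """
    d = q.shape[-1]
    # [batch, n_img, d] @ [batch, d, n_txt] -> [batch, n_img, n_txt]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # softmax over the text-token axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # [batch, n_img, n_txt] @ [batch, n_txt, d] -> [batch, n_img, d]
    return weights @ v

# Hypothetical sizes: 196 image tokens, 77 text tokens, channel 768.
q = np.random.randn(2, 196, 768)
k = np.random.randn(2, 77, 768)
v = np.random.randn(2, 77, 768)
out = cross_attention(q, k, v)
print(out.shape)  # (2, 196, 768)
```

The output keeps the query's sequence length, so the mismatch between 196 image tokens and 77 text tokens is never a problem; only the channel dimension of q and k must match.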
Hello, I would like to ask what shape text_features has, and what scale transformations you used to generate K and V.