text_feature: [batch, 77, 768]. In cross-attention, we project the text_feature to the same channel size as the image feature.
But in that case, do the k and v matrices generated from the text features have the same shape as the q matrix generated from the image features? Can you still do the matrix multiplication?
The q, k, v matrices do not need to have the same shape; q and k only need to share the same channel dimension. It is fine for k and v to have the same shape.
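To make the shape argument concrete, here is a minimal single-head cross-attention sketch in NumPy. The sequence lengths (196 image tokens, 77 text tokens) and the shared channel size of 768 are illustrative assumptions, not taken from the repository's code:

```python
import numpy as np

def cross_attention(q, k, v):
    """Minimal single-head cross-attention.

    q: [batch, n_img, d]  -- queries from image features
    k: [batch, n_txt, d]  -- keys from text features
    v: [batch, n_txt, d]  -- values from text features

    q and k only need to share the channel dim d; the sequence
    lengths n_img and n_txt may differ.
    """
    d = q.shape[-1]
    # [batch, n_img, d] @ [batch, d, n_txt] -> [batch, n_img, n_txt]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # softmax over the text-token axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # [batch, n_img, n_txt] @ [batch, n_txt, d] -> [batch, n_img, d]
    return weights @ v

# Hypothetical sizes: 196 image tokens, 77 text tokens, channel 768.
q = np.random.randn(2, 196, 768)
k = np.random.randn(2, 77, 768)
v = np.random.randn(2, 77, 768)
out = cross_attention(q, k, v)
print(out.shape)  # (2, 196, 768)
```

The output keeps the query's sequence length, so the mismatch between 196 image tokens and 77 text tokens is never a problem; only the channel dimension of q and k must match.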
Hello, I would like to ask what shape text_features has, and what scale transformations you used to generate K and V.