
transformations in MiniViT paper #224

Open
gudrb opened this issue Feb 22, 2024 · 9 comments

gudrb commented Feb 22, 2024

Hello, I have a question about the transformations in the MiniViT paper.

I could find the first transformation (implemented in the MiniAttention class) in the code:

attn = self.conv_l(attn)

However, I couldn't find the second transformation in the code (it should be before or inside the MLP in the MiniBlock class):

class MiniBlock(nn.Module):

Could you please let me know where the second transformation is?

wkcn (Contributor) commented Feb 23, 2024

Hi @gudrb, thanks for your attention to our work!

In Mini-DeiT, the transformation for the MLP is the relative position encoding:

out += self.rpe_v(attn)

In Mini-Swin, the transformation for the MLP is the depth-wise convolution layer:

self.local_conv_list = nn.ModuleList()
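
For context, a depth-wise convolution placed before the MLP of a block could look roughly like the sketch below. This is not the repository code; the module name, the (B, N, C) token layout, and the kernel size are illustrative assumptions.

import torch
import torch.nn as nn

class LocalConvTransform(nn.Module):
    """Hypothetical depth-wise conv transformation applied to the tokens before the MLP."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # groups=dim gives one filter per channel, i.e. a depth-wise convolution
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x, H, W):
        # x: (B, N, C) tokens; fold back into a (B, C, H, W) map, convolve, unfold again
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(x)
        return x.flatten(2).transpose(1, 2)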

gudrb (Author) commented Feb 23, 2024

In the MiniViT paper:

We make several modifications on DeiT: First, we remove the [class] token. The model is attached with a global average pooling layer and a fully-connected layer for image classification. We also utilize relative position encoding to introduce inductive bias to boost the model convergence [52,59]. Finally, based on our observation that transformation for FFN only brings limited performance gains in DeiT, we remove the block to speed up both training and inference.

-> Does this mean that in the Mini-DeiT model, iRPE is utilized (for the value), and the MLP transformation is removed, leaving only the attention transformation?

wkcn (Contributor) commented Feb 23, 2024

In the MiniViT paper:

We make several modifi�cations on DeiT: First, we remove the [class] token. The model is attached with a global average pooling layer and a fully-connected layer for image classification. We also utilize relative position encoding to introduce inductive bias to boost the model convergence [52,59]. Finally, based on our observation that transformation for FFN only brings limited performance gains in DeiT, we remove the block to speed up both training and inference.

-> Does this mean that in the Mini-DeiT model, iRPE is utilized (for the value), and the MLP transformation is removed, leaving only the attention transformation?

Yes. Let me correct my earlier statement: there is no transformation for the FFN in Mini-DeiT, and iRPE is utilized only for the key.
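
For illustration, here is a minimal sketch (not the repository code) of where a key-side relative position term would enter the attention logits. The function name and the rpe_k argument are stand-ins for the iRPE module, and q, k, v are assumed to have shape (B, num_heads, N, head_dim).

import torch

def attention_with_key_rpe(q, k, v, rpe_k=None):
    # rpe_k (hypothetical) maps q -> (B, num_heads, N, N) relative position bias
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale   # content term, shape (B, num_heads, N, N)
    if rpe_k is not None:
        attn = attn + rpe_k(q)                 # key-side relative position bias
    attn = attn.softmax(dim=-1)
    return attn @ v                            # note: no value-side RPE in this variant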

gudrb (Author) commented Jul 1, 2024

Hello,

I have a question regarding the implementation of layer normalization in the MiniViT paper and the corresponding code. Specifically, I am referring to how layer normalization is applied between transformer blocks.

In the MiniViT paper, it is mentioned that layer normalization between transformer blocks is not shared, and I believe the code reflects this. However, I am confused about how the RepeatedModuleList applies layer normalization multiple times and how it ensures that the normalizations are not shared.

Here is the relevant code snippet for the MiniBlock class:

if repeated_times > 1:

Thank you.

wkcn (Contributor) commented Jul 2, 2024

Hi @gudrb ,

The following code creates a list of LayerNorm modules, where the number of instances is repeated_times:

self.norm1 = RepeatedModuleList([norm_layer(dim) for _ in range(repeated_times)], repeated_times)
self.norm2 = RepeatedModuleList([norm_layer(dim) for _ in range(repeated_times)], repeated_times)

RepeatedModuleList selects the self._repeated_id-th LayerNorm to run in its forward pass:

r = self._repeated_id
return self.instances[r](*args, **kwargs)

In RepeatedMiniBlock, _repeated_id is updated on each repetition. Therefore, each LayerNorm, conv, and RPE instance is executed once, while the other (weight-shared) modules are executed multiple times:

def forward(self, x):
    for i, t in enumerate(range(self.repeated_times)):
        def set_repeated_id(m):
            m._repeated_id = i
        self.block.apply(set_repeated_id)
        x = self.block(x)
    return x
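
To make the mechanism self-contained, here is a simplified sketch. The names mirror the repository, but the signatures are trimmed and slightly rewritten for illustration (for example, the real RepeatedModuleList also takes repeated_times as an argument).

import torch.nn as nn

class RepeatedModuleList(nn.Module):
    """Holds one unshared instance per repetition and dispatches on _repeated_id."""
    def __init__(self, instances):
        super().__init__()
        self.instances = nn.ModuleList(instances)
        self._repeated_id = 0

    def forward(self, *args, **kwargs):
        return self.instances[self._repeated_id](*args, **kwargs)

class RepeatedBlock(nn.Module):
    """Runs a weight-shared block several times, switching the unshared parts each pass."""
    def __init__(self, block, repeated_times):
        super().__init__()
        self.block = block
        self.repeated_times = repeated_times

    def forward(self, x):
        for i in range(self.repeated_times):
            def set_repeated_id(m):
                m._repeated_id = i
            # every RepeatedModuleList inside the block now picks its i-th instance
            self.block.apply(set_repeated_id)
            x = self.block(x)
        return x

So a block whose norm1/norm2 are RepeatedModuleLists of LayerNorms shares its attention and MLP weights across repetitions, while each repetition gets its own LayerNorm instance.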

gudrb (Author) commented Jul 2, 2024

Hello,

Thank you for your kind reply.

I noticed that relative position encoding (RPE) is applied only to the key. In the MiniViT paper, I couldn't see it applied explicitly in the equations.

[Screenshot of the attention equations from the MiniViT paper]
Does this mean that K_m^T already incorporates the relative position information (using the piecewise function, product method, contextual mode, and unshared embeddings)?

Thank you!

wkcn (Contributor) commented Jul 3, 2024

Hi @gudrb, here is where the weight transformation is applied:

if self.conv_l is not None:
    attn = self.conv_l(attn)
attn = attn.softmax(dim=-1)
if self.conv_w is not None:
    attn = self.conv_w(attn)
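
One way to read conv_l and conv_w, assuming attn has shape (B, num_heads, N, N), is as small convolutions that mix information across heads before and after the softmax. The sketch below is only an assumption about their shape-level role, not the repository's exact definition.

import torch
import torch.nn as nn

B, num_heads, N = 2, 8, 197
attn = torch.randn(B, num_heads, N, N)       # raw attention logits

conv_l = nn.Conv2d(num_heads, num_heads, 1)  # hypothetical head-mixing before the softmax
conv_w = nn.Conv2d(num_heads, num_heads, 1)  # hypothetical head-mixing after the softmax

attn = conv_l(attn)
attn = attn.softmax(dim=-1)
attn = conv_w(attn)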

gudrb (Author) commented Jul 3, 2024

[Screenshot of the attention equations from the MiniViT paper]
In the equations provided in the MiniViT paper, does K_m^T actually represent (K'_m + r_m)^T, where the r are trainable relative position embeddings? In the code, iRPE is used, but this is not explicitly shown in the equations in the paper. Could you confirm whether this interpretation is correct?

wkcn (Contributor) commented Jul 3, 2024

In Equation 7, we ignore the relative position encoding. iRPE is only applied in Mini-DeiT.
