transformations in MiniViT paper #224
Comments
Hi @gudrb , thanks for your attention to our work! In Mini-DeiT, the transformation for the MLP is the relative position encoding.
In Mini-Swin, the transformation for the MLP is the depth-wise convolution layer.
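For illustration, here is a minimal sketch of what a depth-wise convolution transformation before the MLP could look like, assuming tokens are laid out on an H×W grid; the module and variable names are illustrative and not the actual Mini-Swin code:

```python
import torch
import torch.nn as nn

class DWConvTransform(nn.Module):
    """Illustrative depth-wise 3x3 convolution applied to the token sequence
    before the MLP; each channel is convolved independently (groups=dim)."""
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N == H * W
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)  # back to a spatial grid
        x = self.dwconv(x)                         # per-channel spatial mixing
        x = x.flatten(2).transpose(1, 2)           # restore (B, N, C)
        return x

# Hypothetical wiring between weight-shared blocks, so each repeated block
# gets its own small, layer-specific transformation:
# x = x + dwconv_transform(x, H, W)
```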
From the MiniViT paper: "We make several modifications on DeiT: First, we remove the [class] token. The ..." -> Does this mean that in the Mini-DeiT model, iRPE is utilized (for the value) and the MLP transformation is removed, leaving only the attention transformation?
Yes. I correct my statement: there is no transformation for the FFN in Mini-DeiT. iRPE is utilized only for the key.
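As a rough illustration of applying relative position encoding only on the key side, here is a minimal sketch (a simplified additive relative bias, not the bucketed iRPE formulation from the paper; all names are illustrative):

```python
import torch

def attention_with_key_rpe(q, k, v, rel_bias):
    """q, k, v: (B, heads, N, head_dim); rel_bias: (heads, N, N) bias derived
    from the relative positions of the keys. Simplified stand-in for iRPE."""
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale  # content-content scores
    attn = attn + rel_bias                    # add the key-side positional term
    attn = attn.softmax(dim=-1)
    return attn @ v
```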
Hello, I have a question regarding the implementation of layer normalization in the MiniViT paper and the corresponding code, specifically how layer normalization is applied between transformer blocks. In the MiniViT paper, it is mentioned that layer normalization between transformer blocks is not shared, and I believe the code reflects this. However, I am confused about how RepeatedModuleList applies layer normalization multiple times and how it ensures that the normalizations are not shared; the relevant code is in the MiniBlock class.
Thank you.
Hi @gudrb , the following code creates the list of LayerNorm modules, whose length is set here:
Cream/MiniViT/Mini-DeiT/mini_vision_transformer.py, lines 145 to 146 (commit 4a13c40)
RepeatedModuleList then selects the LayerNorm belonging to the current block index:
Cream/MiniViT/Mini-DeiT/mini_vision_transformer.py, lines 28 to 29 (commit 4a13c40)
The blocks are assembled with these per-block LayerNorms in:
Cream/MiniViT/Mini-DeiT/mini_vision_transformer.py, lines 174 to 180 (commit 4a13c40)
Because each repeated block indexes its own LayerNorm in the list, the normalizations are not shared.
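A minimal sketch of the idea (not the repository's exact code): the heavy weights are shared across the repeated blocks, while a list holds one LayerNorm per repetition and the block picks the entry matching its repetition index, so the normalizations stay unshared. Class and variable names below are illustrative.

```python
import torch.nn as nn

class RepeatedModuleList(nn.ModuleList):
    """Illustrative version: holds `repeats` independent copies of a module
    and returns the copy belonging to the given repetition index."""
    def __init__(self, repeats, module_cls, *args, **kwargs):
        super().__init__([module_cls(*args, **kwargs) for _ in range(repeats)])

    def forward(self, x, idx):
        return self[idx](x)  # idx selects the non-shared instance

# Inside a weight-shared block (sketch):
# self.norm1 = RepeatedModuleList(repeats, nn.LayerNorm, dim)
# def forward(self, x, idx):
#     x = x + self.attn(self.norm1(x, idx))  # shared attention, private LayerNorm
```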
Hi @gudrb , here is the application of the weight transformation:
Cream/MiniViT/Mini-DeiT/mini_vision_transformer.py, lines 103 to 109 (commit 4a13c40)
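For readers without the repository open, a hedged sketch of what applying a per-layer weight transformation on top of shared attention weights could look like, assuming the transformation is a small linear mixing of the attention maps across heads; the class and variable names are illustrative and the referenced lines may differ in detail:

```python
import torch
import torch.nn as nn

class MiniAttentionSketch(nn.Module):
    """Shared QKV/projection weights plus a tiny per-layer transformation
    that linearly mixes the attention maps across heads (illustrative)."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)               # shared across repeated blocks
        self.proj = nn.Linear(dim, dim)                  # shared across repeated blocks
        self.head_mix = nn.Linear(num_heads, num_heads)  # per-layer transformation

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)             # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        # apply the layer-specific transformation across the head dimension
        attn = self.head_mix(attn.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(x)
```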
In Equation 7, we ignore the relative position encoding.
Hello, I have a question about the transformations in the MiniViT paper.
I could find the first transformation (implemented in the MiniAttention class) in the code:
Cream/MiniViT/Mini-DeiT/mini_vision_transformer.py, line 104 (commit 4a13c40)
However, I couldn't find the second transformation in the code (which should be before or inside the MLP in the MiniBlock class):
Cream/MiniViT/Mini-DeiT/mini_vision_transformer.py, line 137 (commit 4a13c40)
Could you please let me know where the second transformation is?