
Question about why the add & norm structure of the transformer network differs from the typical transformer one #24

Open
Liaoqing-up opened this issue Feb 19, 2023 · 3 comments

Comments

@Liaoqing-up

if pos_embedding is not None:
    x_att = self_attn(x + center_pos_embedding)
    x = x_att + x
    x_att = cross_attn(
        x + center_pos_embedding, y + neighbor_pos_embedding
    )
else:
    x_att = self_attn(x)
    x = x_att + x
    x_att = cross_attn(x, y)
x = x_att + x
x = ff(x) + x

In this code, the residual that gets added back is the raw input, which has not passed through a norm layer, so Add and Norm are not treated as a single unit. This differs from the typical transformer structure, where the output of Add & Norm in series becomes the input to the next sublayer. Is there any special consideration behind this design?
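
For reference, the two orderings under discussion can be sketched as follows (illustrative pseudocode only, not code from this repository; sublayer and norm are placeholder names):

def post_norm_block(x, sublayer, norm):
    # "Typical" Transformer: Add & Norm act as one unit, and the
    # normalized sum is what the next sublayer receives.
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # Pre-norm variant: normalize before the sublayer and add the
    # un-normalized input back as the residual. In the loop above the
    # norm lives inside self_attn / cross_attn / ff, as the reply
    # below explains.
    return x + sublayer(norm(x))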

@edwardzhou130
Collaborator

I used prenorm inside each layer.

PreNorm(
    dim,
    Attention(
        dim,
        heads=heads,
        dim_head=dim_head,
        dropout=dropout,
        out_attention=self.out_attention,
    ),
),
PreNorm_CA(
    dim,
    Cross_attention(
        dim,
        heads=heads,
        dim_head=dim_head,
        dropout=dropout,
        out_attention=self.out_attention,
    ),
),
PreNorm(dim, FeedForward(dim, mlp_dim, dropout=dropout)),
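
For anyone reading along, here is a minimal sketch of what such PreNorm / PreNorm_CA wrappers typically look like. This is an assumption about their implementation, following the common vit-pytorch-style pattern, not code copied from this repository:

import torch.nn as nn

class PreNorm(nn.Module):
    # Applies LayerNorm to the input *before* the wrapped sublayer;
    # the residual add happens outside, in the layer loop shown earlier.
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

class PreNorm_CA(nn.Module):
    # Cross-attention variant (assumed): normalizes both the query input
    # and the key/value input before the wrapped cross-attention module.
    def __init__(self, dim, fn):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, y, **kwargs):
        return self.fn(self.norm_q(x), self.norm_kv(y), **kwargs)

With wrappers like these, a line such as x = self_attn(x) + x in the layer loop effectively computes x = x + Attention(LayerNorm(x)), i.e. a pre-norm residual block.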

@Liaoqing-up
Author

Liaoqing-up commented Feb 20, 2023


I see, but I wonder whether you have tried Add & Norm after each layer, i.e. where the residual skip connection carries features that have already passed through the Norm. Is it possible that the results of the two structures do not differ much?
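
Concretely, applying that ordering to the layer loop above would look something like the sketch below. This is hypothetical, not code from this repository; norm1/norm2/norm3 are illustrative names, and in this variant the attention and feed-forward modules would no longer be wrapped in PreNorm:

def post_norm_layer(x, y, self_attn, cross_attn, ff, norm1, norm2, norm3):
    # Hypothetical Add & Norm ordering of the same three sublayers;
    # norm1/norm2/norm3 would be nn.LayerNorm instances.
    x = norm1(x + self_attn(x))      # Add & Norm as one unit
    x = norm2(x + cross_attn(x, y))
    x = norm3(x + ff(x))
    return x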

@edwardzhou130
Collaborator

Sorry, I haven't tried Add & Norm after each layer. Have you tried this before, and would the results be better with that implementation?
