
Question about why the add & norm structure of the transformer network differs from the typical transformer one #24

Open
Liaoqing-up opened this issue Feb 19, 2023 · 3 comments

Comments

@Liaoqing-up

if pos_embedding is not None:
    x_att = self_attn(x + center_pos_embedding)
    x = x_att + x
    x_att = cross_attn(
        x + center_pos_embedding, y + neighbor_pos_embedding
    )
else:
    x_att = self_attn(x)
    x = x_att + x
    x_att = cross_attn(x, y)
x = x_att + x
x = ff(x) + x

In this code, the residual that gets added back is the raw input, which has not passed through a norm layer, so Add and Norm are not treated as a single unit. This differs from the typical transformer structure, where the output of Add & Norm in series becomes the input to the next sublayer. Is there any special consideration behind this design?
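
For reference, the two orderings under discussion can be sketched as follows (illustrative pseudocode only, not code from this repository; sublayer and norm are placeholder names):

def post_norm_block(x, sublayer, norm):
    # "Typical" Transformer: Add & Norm act as one unit, and the
    # normalized sum is what the next sublayer receives.
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # Pre-norm variant: normalize before the sublayer and add the
    # un-normalized input back as the residual. In the loop above the
    # norm lives inside self_attn / cross_attn / ff, as the reply
    # below explains.
    return x + sublayer(norm(x))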

@edwardzhou130
Collaborator

I used prenorm inside each layer.

PreNorm(
    dim,
    Attention(
        dim,
        heads=heads,
        dim_head=dim_head,
        dropout=dropout,
        out_attention=self.out_attention,
    ),
),
PreNorm_CA(
    dim,
    Cross_attention(
        dim,
        heads=heads,
        dim_head=dim_head,
        dropout=dropout,
        out_attention=self.out_attention,
    ),
),
PreNorm(dim, FeedForward(dim, mlp_dim, dropout=dropout)),
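
For anyone reading along, here is a minimal sketch of what such PreNorm / PreNorm_CA wrappers typically look like. This is an assumption about their implementation, following the common vit-pytorch-style pattern, not code copied from this repository:

import torch.nn as nn

class PreNorm(nn.Module):
    # Applies LayerNorm to the input *before* the wrapped sublayer;
    # the residual add happens outside, in the layer loop shown earlier.
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

class PreNorm_CA(nn.Module):
    # Cross-attention variant (assumed): normalizes both the query input
    # and the key/value input before the wrapped cross-attention module.
    def __init__(self, dim, fn):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, y, **kwargs):
        return self.fn(self.norm_q(x), self.norm_kv(y), **kwargs)

With wrappers like these, a line such as x = self_attn(x) + x in the layer loop effectively computes x = x + Attention(LayerNorm(x)), i.e. a pre-norm residual block.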

@Liaoqing-up
Author

Liaoqing-up commented Feb 20, 2023


I see, but I wonder whether you have tried Add & Norm after each layer, i.e. where the residual skip connection carries features that have already passed through the Norm. Is it possible that the results of the two structures do not differ much?
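
Concretely, applying that ordering to the layer loop above would look something like the sketch below. This is hypothetical, not code from this repository; norm1/norm2/norm3 are illustrative names, and in this variant the attention and feed-forward modules would no longer be wrapped in PreNorm:

def post_norm_layer(x, y, self_attn, cross_attn, ff, norm1, norm2, norm3):
    # Hypothetical Add & Norm ordering of the same three sublayers;
    # norm1/norm2/norm3 would be nn.LayerNorm instances.
    x = norm1(x + self_attn(x))      # Add & Norm as one unit
    x = norm2(x + cross_attn(x, y))
    x = norm3(x + ff(x))
    return x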

@edwardzhou130
Collaborator

Sorry, I haven't tried Add & Norm after each layer. Have you tried this before, and would the results be better with that implementation?
