Why use Hadamard product in Spatial Aggregation? #18

Open
jing-zhao9 opened this issue May 27, 2024 · 3 comments
Labels
question Further information is requested

Comments

@jing-zhao9

May I ask why concatenation is not used for feature aggregation in the Spatial Aggregation block?

@Lupin1998
Member

Hi @jing-zhao9, thanks for your question! The element-wise (Hadamard) product of the two branches is one of the efficient designs first proposed in MogaNet, and it is also used in Mamba and its recently proposed variants. We call it gating, following GLU, and found it more powerful than additive aggregation or the concatenation you mentioned. You can find an intuitive explanation of why gating operations are effective and efficient in StarNet. Feel free to discuss if you have more questions.
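
For intuition, here is a minimal PyTorch sketch of this kind of gated aggregation. The layer choices are illustrative assumptions, not the exact Moga module: the point is that the two branches are fused by an element-wise product, so the output keeps the input channel width, whereas concatenation would double it and need an extra projection.

```python
import torch
import torch.nn as nn

class GatedAggregation(nn.Module):
    """Sketch of GLU-style gating: two branch outputs are fused by an
    element-wise (Hadamard) product instead of concatenation."""

    def __init__(self, dim):
        super().__init__()
        # Gate branch: 1x1 conv + SiLU produces per-position, per-channel weights.
        self.gate = nn.Sequential(nn.Conv2d(dim, dim, kernel_size=1), nn.SiLU())
        # Value branch: depth-wise conv extracts the spatial features to be gated.
        self.value = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim), nn.SiLU()
        )
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        # Hadamard product: each value feature is re-weighted by its gate,
        # keeping the channel count at `dim` (concatenation would give 2 * dim).
        return self.proj(self.gate(x) * self.value(x))


x = torch.randn(2, 64, 56, 56)
print(GatedAggregation(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```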

@Lupin1998 Lupin1998 added the question Further information is requested label May 29, 2024
@Lupin1998 Lupin1998 self-assigned this May 29, 2024
@jing-zhao9
Author

Thank you for your careful explanation! I have another question: why do I encounter gradient explosion when I apply the dot product proposed in MogaNet to my baseline model during training?

@Lupin1998
Member

Sorry for the late reply. Gradient explosion can sometimes occur in MogaNet because of the gating branch in the Moga module. There are two possible workarounds: (1) check for NaN or Inf values during training, and if gradient explosion occurs, resume training from the previous checkpoint; (2) remove the SiLU activation in the branch with multiple DWConv layers. The two SiLU activations provide strong non-linearity with few extra parameters, but they increase the risk of instability, so you may need to trade off performance against training stability.
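
A minimal PyTorch sketch of option (1), with a toy model standing in for the actual baseline; the `grads_are_finite` helper and the in-memory checkpoint handling are illustrative assumptions, not part of the MogaNet codebase:

```python
import copy
import torch
import torch.nn as nn

def grads_are_finite(model: nn.Module) -> bool:
    """Return False if any gradient contains NaN or Inf."""
    return all(
        torch.isfinite(p.grad).all()
        for p in model.parameters()
        if p.grad is not None
    )

# Toy stand-in for the baseline model that shows the instability.
model = nn.Sequential(nn.Linear(16, 16), nn.SiLU(), nn.Linear(16, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Keep a copy of the last known-good weights (a real run would save/load files).
last_good = copy.deepcopy(model.state_dict())

x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

if torch.isfinite(loss) and grads_are_finite(model):
    optimizer.step()
    last_good = copy.deepcopy(model.state_dict())  # refresh the checkpoint
else:
    # Gradient explosion detected: skip this update and roll back.
    model.load_state_dict(last_good)
optimizer.zero_grad()
```

Option (2) would amount to dropping the `nn.SiLU()` from the value branch in the gating sketch above, keeping only one activation in the gated pair.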
