initialization of qkv #68
RetNet uses DeepNet's derivation methods to obtain its initialization for better training stability, rather than directly re-using DeepNet's derived initialization (which was derived for Post-LN Transformers), because according to the theory in DeepNet the initialization depends on the model architecture.
Thanks for the quick reply! Regarding "because the initialization depends on the model architecture according to the theory in DeepNet": could you elaborate on the derivation methods? How do you get the number 2 ** -2.5 here? Thanks
I am also interested in this initialisation scheme. It seems that recurrent models such as S4 and S5 use different schemes. Do you have any particular explanation or heuristic for this scale?
In the paper, the authors mention that the initialization follows DeepNet, but the code does something somewhat different. Why is there a mismatch?
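For context on what a gain-scaled initialization of the q/k/v projections might look like, here is a minimal PyTorch sketch. This is an illustration only, not the actual RetNet/torchscale code: the function name `init_qkv` and the choice to apply the 2 ** -2.5 factor as a Xavier gain are assumptions made for the example.

```python
import torch
import torch.nn as nn

def init_qkv(linear: nn.Linear, gain: float = 2 ** -2.5) -> None:
    # Hypothetical sketch: scale a standard Xavier-uniform init by a
    # small gain (the 2 ** -2.5 factor discussed above), shrinking the
    # weight magnitude of the projection at initialization.
    nn.init.xavier_uniform_(linear.weight, gain=gain)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

proj = nn.Linear(512, 512)
init_qkv(proj)
# Xavier-uniform samples from U(-b, b) with b = gain * sqrt(6 / (fan_in + fan_out)),
# so every entry of proj.weight now lies within that shrunken bound.
```

The intuition behind such a scale factor (in the DeepNet line of work) is to bound the update magnitude of residual branches at initialization so that deep stacks train stably; the exact exponent depends on the architecture's depth and branch structure, which is presumably why RetNet re-derives it rather than copying DeepNet's Post-LN constant.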