Question about the parameter dt (delta) and its initialization #5
Comments
For the first question: the low rank can just be thought of as parameterizing delta as a low-rank matrix projection, rather than a full linear layer. The dimension of delta in Mamba2 will essentially be a scalar for each head, meaning that delta has shape (b, l, nheads) and each head shares one step size across its channels. Having it be low rank is just parameter efficient: rather than having one matrix of shape d_inner x d_inner, you have two much smaller projections, d_inner -> dt_rank and dt_rank -> d_inner.

The initialization of delta in the way above is just convenient for training dynamics. As the exp of a uniform sample between log(dt_min) and log(dt_max), the initial step sizes end up spread log-uniformly between dt_min and dt_max, which gives the model a mix of fast- and slow-decaying states to start from.

Finally, if you are wanting to understand what discretization actually is, think back to calculus when you were learning about the derivative. We can use Euler's approximation for discretization as an example: for the continuous system x'(t) = A x(t) + B u(t), a step of size delta gives x_{t+1} ≈ x_t + delta * (A x_t + B u_t), i.e. A_bar = I + delta * A and B_bar = delta * B. This isn't the discretization method used in Mamba (which uses a zero-order hold, A_bar = exp(delta * A)); however, it is along the same lines and it paints a picture of what discretization is doing. At the end of the day these things are just our weight matrices. I hope this makes sense. There are a lot of fascinating little tricks in Mamba's parameterization, which IMO is the actual reason why these things work so well. If you want to get deeper into the math of the parameterization, I would recommend reading Albert Gu's paper called S4D, which is solely about the parameterization of SSMs.
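To make the Euler picture concrete, here is a tiny standalone sketch (toy values and names of my own, not Mamba's actual ZOH discretization or its code):

```python
import torch

# Toy 1-D state-space system x'(t) = A x(t) + B u(t); the sizes and values
# are made up purely to illustrate the Euler step.
A = torch.tensor([[-1.0]])   # stable continuous-time dynamics
B = torch.tensor([[1.0]])
delta = 0.01                 # the step size ("dt") this thread is about

# Euler step: x_{t+1} ≈ x_t + delta * (A x_t + B u_t)
#                     = (I + delta * A) x_t + (delta * B) u_t
A_bar = torch.eye(1) + delta * A
B_bar = delta * B

x = torch.zeros(1)
for t in range(5):
    u_t = torch.ones(1)          # constant input, just for illustration
    x = A_bar @ x + B_bar @ u_t  # discrete recurrence built from the continuous system
    print(t, x.item())
```

In Mamba the step size is not a fixed constant like this; it is a learned, input-dependent positive value, which is exactly what the delta projection and initialization are setting up.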
Thanks for your reply. I will try to understand it and read S4D.
Hey, yes, it's on my bucket list! I'm currently swamped with grad apps, classes, and my research, but I was planning on fixing the bugs during winter break. Sorry for being slow on this :( I might also try to play around with ThunderKittens and create some optimized causal and bidirectional kernels!
Oh, it's really great to hear that!! Hope to see your update soon @Hprairie
@oggyfaker Starting to take a look back into this; I have been trying to reproduce the error and struggling. I wrote a small training pipeline and haven't had any NaNs appear. Do you have a small reproducible script for the NaNs? I will keep trying to recreate them, but I would greatly appreciate it if you do. Thanks!
@Hprairie Actually, I intend to use your lib to replace the original Hydra block. But I saw some previous comments in another issue about applying it to a ViT, so I want to make sure your repo will work before I start my project. So if you are working on it, I will implement the project and give you feedback as soon as possible, maybe before the end of this year :D
Thanks for your awesome work.
Looking through the code of Mamba and Mamba2, I'm really confused about the dimension of the parameter dt. I understand that delta is used to discretize A and B in the SSM. However, I don't understand why dt is first projected into (b, l, dt_rank) and then projected into (b, l, d_inner), as in Algorithm 2 of the Mamba paper. As in the code:
```python
self.x_proj = nn.Linear(
    self.d_inner, self.dt_rank + self.d_state * 2, bias=False, **factory_kwargs
)
```
What is the purpose of 'dt_rank'?
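To make my question concrete, here is my own minimal sketch of what I think Algorithm 2 is doing with dt (made-up sizes and names, not the actual Mamba source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up sizes, just for illustration.
b, l, d_inner, d_state, dt_rank = 2, 16, 256, 16, 8

x = torch.randn(b, l, d_inner)

# x_proj produces the low-rank dt input plus B and C in a single projection.
x_proj = nn.Linear(d_inner, dt_rank + d_state * 2, bias=False)
dt_low, B, C = torch.split(x_proj(x), [dt_rank, d_state, d_state], dim=-1)

# dt_proj then lifts dt from (b, l, dt_rank) up to (b, l, d_inner),
# giving one positive step size per channel after softplus.
dt_proj = nn.Linear(dt_rank, d_inner, bias=True)
delta = F.softplus(dt_proj(dt_low))   # (b, l, d_inner)
```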
As for the initialization of dt, I found the code in your project:
```python
# Initialize log dt bias
dt = torch.exp(
    torch.rand(self.nheads, **factory_kwargs) * (math.log(dt_max) - math.log(dt_min))
    + math.log(dt_min)
)
```
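To see what that line actually produces, I ran a quick standalone version of it (with stand-in values for nheads, dt_min, and dt_max; the real ones come from the module's configuration):

```python
import math

import torch

# Stand-in values, just for this experiment.
nheads, dt_min, dt_max = 8, 0.001, 0.1

dt = torch.exp(
    torch.rand(nheads) * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min)
)
print(dt)  # values between dt_min and dt_max, spread evenly on a log scale
```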
I can see what values come out, but I don't understand the reasoning behind this initialization. I found some explanations in the Mamba paper, and I found a similar problem discussed here, but I still don't understand the time step of the discretization process. Could you help me with this? Could you point me to papers or tutorials to check? I'm looking forward to your reply.