Replies: 2 comments
-
That's actually a really, really good question. I think you mean one can rewrite the product `W_q * W_k^T` as a single matrix, correct? I think this would work, but then the transformation becomes the same for keys and queries, and it would not be possible to distinguish them. So while the separate form `X * W_q * (X * W_k)^T = X * W_q * W_k^T * X^T` and the merged form `X * S * X^T` come out as the same end result, the training dynamics would be different: in the first case the two weight matrices are updated separately, and in the second case you lose that distinction and lose degrees of freedom. But you are welcome to try this in Chapter 5, for example, and compare the training losses with and without the merging.
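To make this concrete, here is a minimal PyTorch sketch (all names and dimensions are made up for illustration and are not taken from the book's code) that checks the separate and merged forms yield the same score matrix while exposing different sets of trainable parameters:

```python
import torch

torch.manual_seed(123)

num_tokens, d_in, d_out = 6, 8, 4
X = torch.randn(num_tokens, d_in)

# Separate parameterization: two trainable matrices W_q and W_k.
W_q = torch.randn(d_in, d_out, requires_grad=True)
W_k = torch.randn(d_in, d_out, requires_grad=True)
scores_separate = (X @ W_q) @ (X @ W_k).T        # X W_q W_k^T X^T

# Merged parameterization: a single trainable matrix S,
# initialized here to W_q W_k^T so the forward pass matches exactly.
S = (W_q @ W_k.T).detach().clone().requires_grad_()
scores_merged = X @ S @ X.T                      # X S X^T

print(torch.allclose(scores_separate, scores_merged, atol=1e-5))  # True

# Same forward result, but the trainable parameters differ:
# 2 * d_in * d_out values in the separate case vs. d_in * d_in in the merged case,
# which is why the training dynamics (and loss curves) can diverge.
```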
-
Thanks for the reply. Yes, this is exactly what I mean. The training dynamics would be different, as there would be no keys K or queries Q any more. The thing is, do we really need them? Optimizing two matrices (W_q and W_k) separately seems unnecessary if they only ever appear through their product `W_q * W_k^T`.
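As a rough, purely illustrative comparison of how many values end up being optimized (the dimensions below are assumptions, not the book's defaults): two separate `d_in x d_out` matrices hold `2 * d_in * d_out` parameters, while an unconstrained `S` is `d_in x d_in`:

```python
# Illustrative shapes only; the actual values depend on the model configuration.
d_in, d_out = 768, 64

params_separate = 2 * d_in * d_out   # W_q and W_k
params_merged = d_in * d_in          # one unconstrained matrix S

print(params_separate, params_merged)  # 98304 589824
```

So whether the merge actually reduces the number of trainable parameters depends on how `d_out` compares to `d_in`; when `d_out` is much smaller than `d_in`, the two separate matrices act as a compact, low-rank factorization of `S`.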
-
When computing the context vector in the attention algorithm, three weight matrices are introduced. It was discussed in #454 that the value matrix `W_V` is not necessary. For the remaining two, the query matrix and the key matrix, keeping both seems unnecessary as well. The context vector can be expressed as

`X * W_q * W_k^T * X^T * X * W_V`

where `*` stands for matrix multiplication. Is it possible to merge the part `W_q * W_k^T` into a single covariance matrix `S`, so that the context vector becomes `X * S * X^T * X * W_V`? This merge could potentially reduce nuisance parameters and improve computational performance.
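For what it's worth, a quick PyTorch check of exactly this rewrite (softmax and scaling are left out to match the expression above, and all shapes are arbitrary illustrations):

```python
import torch

torch.manual_seed(0)

num_tokens, d_in, d_out = 5, 8, 8
X = torch.randn(num_tokens, d_in)
W_q = torch.randn(d_in, d_out)
W_k = torch.randn(d_in, d_out)
W_v = torch.randn(d_in, d_out)

# Original form: X * W_q * W_k^T * X^T * X * W_v
ctx_original = X @ W_q @ W_k.T @ X.T @ X @ W_v

# Merged form: S = W_q * W_k^T, then X * S * X^T * X * W_v
S = W_q @ W_k.T
ctx_merged = X @ S @ X.T @ X @ W_v

print(torch.allclose(ctx_original, ctx_merged, atol=1e-4))  # True
```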