Replies: 2 comments
-
That's actually a really, really good question. I think you mean one can rewrite the product `W_q * W_k^T` as a single matrix, correct? I think this would work, but then the transformation becomes the same for keys and queries, and it would not be possible to distinguish them. So while the separate form `X * W_q * (X * W_k)^T = X * W_q * W_k^T * X^T` and the merged form `X * S * X^T` come out as the same end result, the training dynamics would be different: in the first case the two weight matrices are updated separately, and in the second case you lose that distinction and lose degrees of freedom. But you are welcome to try this in Chapter 5, for example, and compare the training losses with and without the merging.
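To make this concrete, here is a minimal PyTorch sketch (all names and dimensions are made up for illustration and are not taken from the book's code) that checks the separate and merged forms yield the same score matrix while exposing different sets of trainable parameters:

```python
import torch

torch.manual_seed(123)

num_tokens, d_in, d_out = 6, 8, 4
X = torch.randn(num_tokens, d_in)

# Separate parameterization: two trainable matrices W_q and W_k.
W_q = torch.randn(d_in, d_out, requires_grad=True)
W_k = torch.randn(d_in, d_out, requires_grad=True)
scores_separate = (X @ W_q) @ (X @ W_k).T        # X W_q W_k^T X^T

# Merged parameterization: a single trainable matrix S,
# initialized here to W_q W_k^T so the forward pass matches exactly.
S = (W_q @ W_k.T).detach().clone().requires_grad_()
scores_merged = X @ S @ X.T                      # X S X^T

print(torch.allclose(scores_separate, scores_merged, atol=1e-5))  # True

# Same forward result, but the trainable parameters differ:
# 2 * d_in * d_out values in the separate case vs. d_in * d_in in the merged case,
# which is why the training dynamics (and loss curves) can diverge.
```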
-
Thanks for the reply. Yes, this is exactly what I mean. The training dynamics would be different, as there would be no keys K or queries Q any more. The thing is, do we really need them? Optimizing two matrices (W_q and W_k) separately seems unnecessary if they only ever appear through their product `W_q * W_k^T`.
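As a rough, purely illustrative comparison of how many values end up being optimized (the dimensions below are assumptions, not the book's defaults): two separate `d_in x d_out` matrices hold `2 * d_in * d_out` parameters, while an unconstrained `S` is `d_in x d_in`:

```python
# Illustrative shapes only; the actual values depend on the model configuration.
d_in, d_out = 768, 64

params_separate = 2 * d_in * d_out   # W_q and W_k
params_merged = d_in * d_in          # one unconstrained matrix S

print(params_separate, params_merged)  # 98304 589824
```

So whether the merge actually reduces the number of trainable parameters depends on how `d_out` compares to `d_in`; when `d_out` is much smaller than `d_in`, the two separate matrices act as a compact, low-rank factorization of `S`.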
-
When computing the context vector in the attention algorithm, three weight matrices are introduced. It was discussed in #454 that the value matrix `W_V` is not necessary. For the remaining two, the query matrix and the key matrix, keeping both seems unnecessary as well. The context vector can be expressed as

`X * W_q * W_k^T * X^T * X * W_V`

where `*` stands for matrix multiplication. Is it possible to merge the part `W_q * W_k^T` into a single covariance matrix `S`, so that the context vector becomes `X * S * X^T * X * W_V`? This merge could potentially reduce nuisance parameters and improve computational performance.
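For what it's worth, a quick PyTorch check of exactly this rewrite (softmax and scaling are left out to match the expression above, and all shapes are arbitrary illustrations):

```python
import torch

torch.manual_seed(0)

num_tokens, d_in, d_out = 5, 8, 8
X = torch.randn(num_tokens, d_in)
W_q = torch.randn(d_in, d_out)
W_k = torch.randn(d_in, d_out)
W_v = torch.randn(d_in, d_out)

# Original form: X * W_q * W_k^T * X^T * X * W_v
ctx_original = X @ W_q @ W_k.T @ X.T @ X @ W_v

# Merged form: S = W_q * W_k^T, then X * S * X^T * X * W_v
S = W_q @ W_k.T
ctx_merged = X @ S @ X.T @ X @ W_v

print(torch.allclose(ctx_original, ctx_merged, atol=1e-4))  # True
```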