
MHA vs MQA vs GQA vs MLA, Zain ul Abideen, 2024.07 #1621

AkihikoWatanabe opened this issue Dec 28, 2024 · 2 comments
@AkihikoWatanabe
https://medium.com/@zaiinn440/mha-vs-mqa-vs-gqa-vs-mla-c6cf8285bbec

@AkihikoWatanabe commented Dec 28, 2024

I read this because I wondered what the Multi-Head Latent Attention (MLA) used in DeepSeek actually is. In short, GQA and MQA reduce the KV cache by simply reducing the number of KV heads, whereas MLA compresses the KV into a low-rank vector, caches that compressed representation, and reconstructs the full keys and values when they are needed. This reportedly cuts the memory used by the KV cache substantially without degrading MHA-level performance (it may even improve it?).
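The cache-size difference described above can be sketched numerically. The following is a minimal, hypothetical illustration (not DeepSeek's actual implementation): all dimensions, matrix names (`W_dkv`, `W_uk`, `W_uv`), and the random initialization are assumptions chosen to show that caching only the low-rank latent shrinks the stored floats per token from `2 * d_model` down to `d_latent`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only
d_model = 1024        # hidden size
n_heads = 8
d_head = 128          # per-head dim; n_heads * d_head == d_model
d_latent = 64         # low-rank latent dim, much smaller than 2 * d_model
seq_len = 16

# Randomly initialized projections (stand-ins for learned weights)
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02          # down-projection
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-projection for K
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # up-projection for V

h = rng.standard_normal((seq_len, d_model))  # hidden states for cached tokens

# MHA-style cache: store full K and V -> 2 * seq_len * d_model floats
mha_cache_floats = 2 * seq_len * n_heads * d_head

# MLA-style cache: store only the compressed latent c_kv
c_kv = h @ W_dkv                  # compress once per token, cache this
mla_cache_floats = c_kv.size      # seq_len * d_latent floats

# At attention time, reconstruct per-head K and V from the cached latent
k = (c_kv @ W_uk).reshape(seq_len, n_heads, d_head)
v = (c_kv @ W_uv).reshape(seq_len, n_heads, d_head)

print(mha_cache_floats, mla_cache_floats)  # 32768 vs 1024: a 32x reduction here
```

With these (made-up) numbers the cache shrinks by `2 * d_model / d_latent = 32x`, while the attention computation still sees full-size K and V after the up-projection.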
