graph : make FA compatible with MLA + add initial Metal kernels #12953
Conversation
Just linking this old attempt at doing this: based on @fairydreaming's CPU test and my CUDA test, it seemed the tile size was just too large to be useful. @JohannesGaessler explained in #12227 (comment) why this likely failed for CUDA, and for CPU I guess it was simply such a massive quadratic increase over the previous maximum tile size (256^2 --> 512^2) that it no longer fits in cache. One other very MLA-specific thing to think about: if the V-cache doesn't need transposing, the last 512 elements of each K-cache row hold the same values, so there would be no need to store them separately; a 2D view starting at element 64 with a row stride of 576 would yield the same data untransposed.
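For illustration, here is a minimal sketch of such a view using `ggml_view_2d`, assuming each cached K row is 576 F16 elements laid out as 64 RoPE dims followed by the 512 latent dims; the helper name and layout are assumptions, not the actual cache code:

```cpp
#include "ggml.h"

// Hypothetical sketch: reuse the latent part of the K cache as the V data
// without storing a separate transposed copy. Assumes each cached K row is
// 576 F16 elements: [64 RoPE dims | 512 latent dims].
static struct ggml_tensor * view_v_from_k_cache(
        struct ggml_context * ctx,
        struct ggml_tensor  * k_cache,   // assumed shape: [576, n_tokens]
        int64_t               n_tokens) {
    const size_t row_bytes = 576 * ggml_type_size(k_cache->type); // stride between tokens
    const size_t offset    =  64 * ggml_type_size(k_cache->type); // skip the RoPE part

    // 512 latent elements per token, one row per token, same underlying K data
    return ggml_view_2d(ctx, k_cache, 512, n_tokens, row_bytes, offset);
}
```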
…-org#12953)

* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models ggml-ci
* llama : minor naming updates ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes ggml-ci
Hi there, sorry to bother. I was testing DeepSeek V3 0324 on CPU + GPU, but when using FA, I get this issue
I'm running the model like this
I did build from source with
When not using `-fa`, it works correctly. Did I do the setup incorrectly? I raised an issue with more info here: #13252
cont #12801
For backends that support FA with different K and V head sizes, the FA path can be used. To support that, we decompress the FA result using the `v_mla` tensor.
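As a rough sketch of that decompression step, a per-head matrix multiply with `v_mla` applied to the FA output; the helper, tensor names and shapes below are assumptions for illustration, not the exact graph code:

```cpp
#include "ggml.h"

// Hypothetical sketch, assuming:
//  - fa_out : FA result in the compressed/latent space, shape [kv_lora_rank, n_head, n_tokens]
//  - v_mla  : per-head decompression matrix,            shape [kv_lora_rank, n_embd_head_v, n_head]
static struct ggml_tensor * decompress_fa_result(
        struct ggml_context * ctx,
        struct ggml_tensor  * fa_out,
        struct ggml_tensor  * v_mla,
        int64_t               n_head,
        int64_t               n_tokens) {
    // treat each head's latent vector as a single column so mul_mat broadcasts per head
    struct ggml_tensor * cur = ggml_reshape_4d(ctx, fa_out, fa_out->ne[0], 1, n_head, n_tokens);

    // [kv_lora_rank, n_embd_head_v, n_head] x [kv_lora_rank, 1, n_head, n_tokens]
    //   -> [n_embd_head_v, 1, n_head, n_tokens]
    cur = ggml_mul_mat(ctx, v_mla, cur);

    // flatten back to [n_embd_head_v * n_head, n_tokens]
    return ggml_reshape_2d(ctx, cur, cur->ne[0] * n_head, n_tokens);
}
```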