graph : make FA compatible with MLA + add initial Metal kernels #12953


Merged

ggerganov merged 5 commits into master from gg/mla on Apr 17, 2025

Conversation

ggerganov
Member

cont #12801

For backends that support FA with different K and V head sizes, the FA path can now also be used with MLA. To support that, we decompress the FA result using the v_mla tensor.
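
A minimal sketch of the idea in ggml terms (illustrative only — the function name, argument list, and exact shapes here are assumptions, not the code in this PR):

#include "ggml.h"

// Illustrative sketch: run flash attention on the compressed (latent) K/V,
// then "decompress" the per-head result with v_mla.
// Shapes assume DeepSeek-style MLA: kv_lora_rank = 512, n_embd_head_v = 128.
static struct ggml_tensor * build_attn_fa_mla(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,      // query, already permuted as ggml_flash_attn_ext expects
        struct ggml_tensor  * k,      // latent K cache
        struct ggml_tensor  * v,      // latent V (same head size as the latent part of K)
        struct ggml_tensor  * mask,
        struct ggml_tensor  * v_mla,  // [kv_lora_rank, n_embd_head_v, n_head]
        float                 kq_scale,
        int64_t               n_head,
        int64_t               n_tokens) {
    // the FA result per head has the latent size (kv_lora_rank), not the final head size
    struct ggml_tensor * cur = ggml_flash_attn_ext(ctx, q, k, v, mask, kq_scale, 0.0f, 0.0f);

    // decompress: multiply each head's latent vector by its v_mla matrix
    cur = ggml_reshape_4d(ctx, cur, v_mla->ne[0], 1, n_head, n_tokens);
    cur = ggml_mul_mat   (ctx, v_mla, cur);   // -> [n_embd_head_v, 1, n_head, n_tokens]
    cur = ggml_reshape_2d(ctx, cur, v_mla->ne[1]*n_head, n_tokens);

    return cur;
}

Here v_mla holds one kv_lora_rank × n_embd_head_v matrix per head, and ggml_mul_mat broadcasts it over the token dimension.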

@github-actions github-actions bot added the testing (Everything test related), Nvidia GPU (Issues specific to Nvidia GPUs), Vulkan (Issues specific to the Vulkan backend), ggml (changes relating to the ggml tensor library for machine learning), and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) labels on Apr 15, 2025
@jukofyork
Collaborator

jukofyork commented Apr 15, 2025

Just linking this old attempt at doing this:

#12227 (comment)

as @fairydreaming's CPU test and my CUDA test both suggested the tile size was simply too large to be useful.

@JohannesGaessler explained in #12227 (comment) why this likely failed for CUDA, and for the CPU I guess it was just such a massive quadratic increase over the previous maximum tile size (256^2 → 512^2, roughly 4× the working set) that it no longer fit in cache.

One other very MLA-specific thing to think about: if the V-cache doesn't need transposing, the last 512 elements of each K-cache row hold the same values as the V-cache, so there would be no need to store them at all. A 2D view into the K-cache starting at element 64, with a row stride of 576 elements, would yield the same data untransposed.
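
For illustration, such a view could look roughly like this in ggml (hypothetical sketch — the tensor names, the non-quantized cache, and the 64 + 512 row layout are assumptions on my part, not code from this PR):

#include "ggml.h"

// Hypothetical sketch: if each MLA K-cache row is 576 elements (64 RoPE dims
// followed by 512 latent dims), the latent part is exactly what the
// untransposed V-cache would hold, so a strided view can stand in for V.
// Assumes a non-quantized (e.g. F16) cache so per-element offsets are valid.
static struct ggml_tensor * mla_v_view_from_k_cache(
        struct ggml_context * ctx,
        struct ggml_tensor  * k_cache,   // assumed shape: [576, n_kv]
        int64_t               n_kv) {
    const size_t es = ggml_element_size(k_cache); // bytes per element
    return ggml_view_2d(ctx, k_cache,
            512, n_kv,     // 512 latent dims per cached token
            576 * es,      // row stride: the full 576-element K row
            64  * es);     // offset: skip the 64 RoPE dims at the start of each row
}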

@ggerganov ggerganov merged commit 2f74c35 into master Apr 17, 2025
55 of 58 checks passed
@ggerganov ggerganov deleted the gg/mla branch April 17, 2025 15:16
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
…-org#12953)

* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
@Panchovix

Panchovix commented May 2, 2025

Hi there, sorry to bother. I was testing DeepSeek V3 0324 on CPU + GPU, but when using FA, I get this issue:

slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 2048, progress = 0.632294
srv  update_slots: decoding batch, n_tokens = 2048
set_embeddings: value = 0
clear_adapter_lora: call
/run/media/pancho/6AE20D1AE20CEBDF/ChatIAs/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:75: ggml_cuda_compute_forward: MUL_MAT failed
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_compute_forward at /run/media/pancho/6AE20D1AE20CEBDF/ChatIAs/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2344
  err
CUDA error
[New LWP 64005]
[New LWP 64004]
[New LWP 64003]
[New LWP 64002]
[New LWP 64001]
[New LWP 64000]
[New LWP 63999]
[New LWP 63605]
[New LWP 63604]
[New LWP 63603]
[New LWP 63602]
[New LWP 63601]
[New LWP 63600]
[New LWP 63599]
[New LWP 63598]
[New LWP 63597]
[New LWP 63596]
[New LWP 63595]
[New LWP 63594]
[New LWP 63593]
[New LWP 63592]
[New LWP 63591]
[New LWP 63590]
[New LWP 63589]
[New LWP 63588]
[New LWP 63587]
[New LWP 63586]
[New LWP 63585]
[New LWP 63584]
[New LWP 63583]
[New LWP 63582]
[New LWP 63581]
[New LWP 63580]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.fedoraproject.org/>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f47c40876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#0  0x00007f47c40876c2 in __syscall_cancel_arch () from /lib64/libc.so.6
#1  0x00007f47c407b9da in __internal_syscall_cancel () from /lib64/libc.so.6
#2  0x00007f47c407ba24 in __syscall_cancel () from /lib64/libc.so.6
#3  0x00007f47c40eb5af in wait4 () from /lib64/libc.so.6
#4  0x00007f47c8b35fb6 in ggml_abort () from libggml-base.so
#5  0x00007f47c8c93963 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from libggml-cuda.so
#6  0x00007f47c8c9edbe in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from libggml-cuda.so
#7  0x00007f47c8b4b344 in ggml_backend_sched_graph_compute_async () from libggml-base.so
#8  0x00007f47d5b9d371 in llama_context::graph_compute(ggml_cgraph*, bool) () from libllama.so
#9  0x00007f47d5ba0ef8 in llama_context::decode(llama_batch&) () from libllama.so
#10 0x00007f47d5ba219b in llama_decode () from libllama.so
#11 0x000000000048b040 in server_context::update_slots() ()
#12 0x000000000045b25c in server_queue::start_loop() ()
#13 0x0000000000426020 in main ()
[Inferior 1 (process 63579) detached]

I'm running the model like this:

./llama-server -m '/run/media/pancho/DE1652041651DDD9/HuggingFaceModelDownloader/Storage/GGUFs/DeepSeek-V3-0324-UD-Q2_K_XL-merged.gguf' -c 16384 --no-mmap --no-warmup -v -ngl 99 --override-tensor 'blk\.(2[5-9]|[3-6][0-9])\..*_exps\.=CPU' --override-tensor 'blk\.([1-6])\..*_exps\.=CUDA0' --override-tensor 'blk\.([7-9]|1[0])\..*_exps\.=CUDA1' --override-tensor 'blk\.(1[1-5])\..*_exps\.=CUDA2' --override-tensor 'blk\.(1[6-9]|2[0-4])\..*_exps\.=CUDA3' -fa

I built from source with:

cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_BLAS=OFF \
  -DCMAKE_CUDA_ARCHITECTURES="86;89;120"

When not using -fa, it works correctly.

Did I do the setup incorrectly? I raised an issue with more info here: #13252
