-
https://developer.nvidia.com/cuda-legacy-gpus
-
CUDA code is forward-compatible, so the same code should work correctly with both CUDA versions. In practice, however, the code is not actually 100% the same because llama.cpp/ggml uses some features that are not available in older CUDA releases, so workarounds may be necessary. In principle there is also this recent fix #13852 that could be relevant (but CC 7.5 should only be affected when using models without GQA).
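For illustration, this is the general shape such a version-dependent workaround takes; the "fast path"/"fallback" here are placeholders I made up, not actual ggml code, and only CUDART_VERSION and __CUDA_ARCH__ are real CUDA macros:

```cpp
// Sketch of a compile-time workaround for a feature missing in older CUDA
// releases. CUDART_VERSION and __CUDA_ARCH__ are standard CUDA macros; the
// two branches below are placeholders, not actual ggml kernels.
#include <cuda_runtime.h>

#if CUDART_VERSION >= 11070 // toolkit 11.7 or newer
#define HAVE_NEW_FEATURE 1
#else
#define HAVE_NEW_FEATURE 0  // older toolkit (e.g. 11.4): use the workaround
#endif

__global__ void copy_kernel(float * dst, const float * src, int n) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n) return;
#if HAVE_NEW_FEATURE && defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 750
    dst[i] = src[i]; // placeholder for the path that needs the newer toolkit
#else
    dst[i] = src[i]; // placeholder for the fallback path
#endif
}
```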
-
@JohannesGaessler I think you are probably right that it's some kind of race condition. Swapping between various CUDA toolkit versions seems to trigger it randomly and I can't pinpoint any consistent way to reproduce it. It doesn't need long contexts: the incoherence is very obvious and happens even with a very short prompt, the output is just random words.

In the end, I swapped to the old wmma flash attention kernel for Turing too (previously it was Volta-only), and it works fine now. At least for short contexts it seems equally fast or within error margins (4 T/s for a 12B model with 20 layers on an RTX 2060). If llama.cpp sticks to CUDA 11.7 I don't think this will be an issue, though it's something to keep in mind in case that ever changes. Thanks for your time.
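Roughly, the selection change looks like this (a simplified sketch to describe it, not the actual dispatch code; the enum and function names are placeholders, while ggml_cuda_flash_attn_ext_wmma_f16 is the real kernel I fall back to):

```cpp
// Simplified sketch of the change: route Turing (CC 7.5) to the legacy wmma
// flash attention path alongside Volta (CC 7.0), instead of the newer path
// that produced garbage output in my CUDA 11.4 builds. Enum/function names
// here are placeholders used only to illustrate the dispatch.
enum class fa_path { WMMA_F16, NEW_MMA };

static fa_path pick_flash_attn_path(int cc) { // cc as 700, 750, 860, ...
    if (cc == 700 || cc == 750) {
        return fa_path::WMMA_F16; // dispatch to ggml_cuda_flash_attn_ext_wmma_f16
    }
    return fa_path::NEW_MMA;      // Ampere and later keep the newer kernels
}
```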
-
Hi @JohannesGaessler, this is not a bug report on llama.cpp; I just wanted to share my findings with you and maybe bounce some ideas around.
So, as I mentioned previously, ever since the Deepseek Flash Attention PR some people have reported incoherent generation outputs with flash attention enabled (LostRuins#1563). This only seems to affect Turing: there was an RTX 2080 Ti user, and I verified the garbage output on my old RTX 2060 as well.
It turns out that this is not reproducible on llama.cpp, which works completely fine, because it does not happen when building with CUDA 11.7. I was previously using Jimver/cuda-toolkit@v0.2.15 with CUDA 11.4.4 to build; switching to 11.7.0 solved the incoherence.
That does leave me in a bit of a pickle, though. We have some users on K80, K6000 and GT740M cards; previously I built for compute capability 3.5 and it works fine for them. However, I think the last official drivers for these old cards do not support CUDA 11.7? Online it says that CUDA 11.7 theoretically supports CC 3.5, but in practice I'm not sure whether the final driver releases for those cards support anything as new as CUDA 11.7.
Another possible approach would be to fall back to ggml_cuda_flash_attn_ext_wmma_f16 for Turing in my compatibility build, or even to the vec kernels, though I think those might be kind of slow. Since I'm not very well versed in this, I was wondering if you had any suggestions or insight. Does the CUDA Toolkit version that I set when building affect which driver versions end users require? Or can I do something like build with the CUDA 11.7 toolkit and then use the CUDA 11.4 runtimes? Does that even make sense?
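For context, the kind of startup check I'm thinking of adding to surface a toolkit/driver mismatch on users' machines looks like this (only standard CUDA runtime API calls; the warning logic is just my rough understanding of the requirement):

```cpp
// Sketch of a startup check: compare the CUDA version the installed driver
// supports against the runtime version this binary was built with. Both
// calls are standard CUDA runtime API; the warning text is illustrative.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver_version  = 0; // highest CUDA version the driver supports, e.g. 11040
    int runtime_version = 0; // CUDA runtime this binary was compiled against
    cudaDriverGetVersion(&driver_version);
    cudaRuntimeGetVersion(&runtime_version);

    printf("driver supports CUDA %d.%d, binary built against CUDA %d.%d\n",
           driver_version/1000, (driver_version%1000)/10,
           runtime_version/1000, (runtime_version%1000)/10);

    if (driver_version < runtime_version) {
        printf("warning: installed driver is older than the CUDA runtime this build expects\n");
    }
    return 0;
}
```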