-
https://developer.nvidia.com/cuda-legacy-gpus
-
CUDA code is forward-compatible, so the same code should work correctly with both CUDA versions. In practice, however, the code is not actually 100% the same because llama.cpp/ggml uses some features that are not available in older CUDA releases, so workarounds may be necessary. In principle there is also this recent fix #13852 that could be relevant (but CC 7.5 should only be affected when using models without GQA).
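For illustration, this is the general shape such a version-dependent workaround takes; the "fast path"/"fallback" here are placeholders I made up, not actual ggml code, and only CUDART_VERSION and __CUDA_ARCH__ are real CUDA macros:

```cpp
// Sketch of a compile-time workaround for a feature missing in older CUDA
// releases. CUDART_VERSION and __CUDA_ARCH__ are standard CUDA macros; the
// two branches below are placeholders, not actual ggml kernels.
#include <cuda_runtime.h>

#if CUDART_VERSION >= 11070 // toolkit 11.7 or newer
#define HAVE_NEW_FEATURE 1
#else
#define HAVE_NEW_FEATURE 0  // older toolkit (e.g. 11.4): use the workaround
#endif

__global__ void copy_kernel(float * dst, const float * src, int n) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n) return;
#if HAVE_NEW_FEATURE && defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 750
    dst[i] = src[i]; // placeholder for the path that needs the newer toolkit
#else
    dst[i] = src[i]; // placeholder for the fallback path
#endif
}
```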
-
@JohannesGaessler I think you are probably right that it's some kind of race condition. Swapping between various CUDA toolkit versions seems to trigger it randomly and I can't pinpoint any consistent way to reproduce it. It doesn't need long contexts: the incoherence is very obvious and happens even with a very short prompt, the output is just random words.

In the end, I swapped to the old wmma flash attention kernel for Turing too (previously it was Volta-only), and it works fine now. At least for short contexts it seems equally fast or within error margins (4 T/s for a 12B model with 20 layers on an RTX 2060). If llama.cpp sticks to CUDA 11.7 I don't think this will be an issue, though it's something to keep in mind in case that ever changes. Thanks for your time.
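Roughly, the selection change looks like this (a simplified sketch to describe it, not the actual dispatch code; the enum and function names are placeholders, while ggml_cuda_flash_attn_ext_wmma_f16 is the real kernel I fall back to):

```cpp
// Simplified sketch of the change: route Turing (CC 7.5) to the legacy wmma
// flash attention path alongside Volta (CC 7.0), instead of the newer path
// that produced garbage output in my CUDA 11.4 builds. Enum/function names
// here are placeholders used only to illustrate the dispatch.
enum class fa_path { WMMA_F16, NEW_MMA };

static fa_path pick_flash_attn_path(int cc) { // cc as 700, 750, 860, ...
    if (cc == 700 || cc == 750) {
        return fa_path::WMMA_F16; // dispatch to ggml_cuda_flash_attn_ext_wmma_f16
    }
    return fa_path::NEW_MMA;      // Ampere and later keep the newer kernels
}
```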
-
Hi @JohannesGaessler, this is not a bug report on llama.cpp; I just wanted to share my findings with you and maybe bounce some ideas around.
So, as I mentioned previously, ever since the Deepseek Flash Attention PR some people have reported incoherent generation outputs with flash attention enabled (LostRuins#1563). This only seems to affect Turing: there was an RTX 2080 Ti user, and I verified the garbage output on my old RTX 2060 as well.
It turns out that this is not reproducible on llama.cpp, which works completely fine, because it does not happen when building with CUDA 11.7. I was previously using Jimver/cuda-toolkit@v0.2.15 with CUDA 11.4.4 to build; switching to 11.7.0 solved the incoherence.
That does leave me in a bit of a pickle, though. We have some users on K80, K6000 and GT740M cards; previously I built for compute capability 3.5 and it works fine for them. However, I think the last official drivers for these old cards do not support CUDA 11.7? Online it says that CUDA 11.7 theoretically supports CC 3.5, but in practice I'm not sure whether the final driver releases for those cards support anything as new as CUDA 11.7.
Another possible approach would be to fall back to ggml_cuda_flash_attn_ext_wmma_f16 for Turing in my compatibility build, or even to the vec kernels, though I think those might be kind of slow. Since I'm not very well versed in this, I was wondering if you had any suggestions or insight. Does the CUDA Toolkit version that I set when building affect which driver versions end users require? Or can I do something like build with the CUDA 11.7 toolkit and then use the CUDA 11.4 runtimes? Does that even make sense?
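For context, the kind of startup check I'm thinking of adding to surface a toolkit/driver mismatch on users' machines looks like this (only standard CUDA runtime API calls; the warning logic is just my rough understanding of the requirement):

```cpp
// Sketch of a startup check: compare the CUDA version the installed driver
// supports against the runtime version this binary was built with. Both
// calls are standard CUDA runtime API; the warning text is illustrative.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driver_version  = 0; // highest CUDA version the driver supports, e.g. 11040
    int runtime_version = 0; // CUDA runtime this binary was compiled against
    cudaDriverGetVersion(&driver_version);
    cudaRuntimeGetVersion(&runtime_version);

    printf("driver supports CUDA %d.%d, binary built against CUDA %d.%d\n",
           driver_version/1000, (driver_version%1000)/10,
           runtime_version/1000, (runtime_version%1000)/10);

    if (driver_version < runtime_version) {
        printf("warning: installed driver is older than the CUDA runtime this build expects\n");
    }
    return 0;
}
```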