CUDA: fix numerical issue in tile FA kernel #16540
Conversation
The code changes themselves look fine and I think it's a good idea to do this, even if the underlying precision issue lies elsewhere, which I did not verify.
I definitely did observe issues with the numerical range when I debugged this; there seems to be an additional bug that specifically occurs
ba62ea9
to
7dcee04
Compare
* origin/master: (32 commits)
  metal : FA support F32 K and V and head size = 32 (ggml-org#16531)
  graph : support cacheless embeddings with FA and iSWA (ggml-org#16528)
  opencl: fix build targeting CL 2 (ggml-org#16554)
  CUDA: fix numerical issues in tile FA kernel (ggml-org#16540)
  ggml : fix build broken with -march=armv9-a on MacOS (ggml-org#16520)
  CANN: fix CPU memory leak in CANN backend (ggml-org#16549)
  fix: add remark plugin to render raw HTML as literal text (ggml-org#16505)
  metal: add support for opt_step_sgd (ggml-org#16539)
  ggml : fix scalar path for computing norm (ggml-org#16558)
  CANN: Update several operators to support FP16 data format (ggml-org#16251)
  metal : add opt_step_adamw and op_sum (ggml-org#16529)
  webui: remove client-side context pre-check and rely on backend for limits (ggml-org#16506)
  [SYCL] fix UT fault cases: count-equal, argsort, pad OPs (ggml-org#16521)
  ci : add Vulkan on Ubuntu with default packages build (ggml-org#16532)
  common : handle unicode during partial json parsing (ggml-org#16526)
  common : update presets (ggml-org#16504)
  ggml : Fix FP16 ELU positive branch (ggml-org#16519)
  hparams : add check for layer index in is_recurrent (ggml-org#16511)
  ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (ggml-org#16518)
  CUDA: faster tile FA, add oob checks, more HSs (ggml-org#16492)
  ...
This reverts commit 7049736.
Fixes the issue described in #16528 (comment).
The problem, as far as I can tell, is numerical issues in the rescaling of the VKQ accumulators with the inverse of the KQ sum at the end of the kernel. The input values in `test-backend-ops` and the models I tested did not provoke this issue, so I did not detect it in #16492. The fix is simply to use FP32 arithmetic for this step; the impact on performance is negligible since it is only done once per CUDA block.
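A minimal sketch of why the FP16 rescaling can blow up, using NumPy half-precision as a stand-in for the CUDA kernel (the variable names and values here are hypothetical, chosen only to illustrate the failure mode): when the KQ (softmax) sum is small, its inverse exceeds the FP16 maximum of ~65504, so the rescaled accumulator becomes inf/nan. Doing the same inversion in FP32 stays comfortably in range.

```python
import numpy as np

# Hypothetical values, not taken from the actual kernel:
kq_sum = np.float16(1e-5)   # small softmax denominator (KQ sum)
vkq    = np.float16(0.25)   # a VKQ accumulator value to be rescaled

# FP16 arithmetic: 1 / 1e-5 = 1e5 exceeds the FP16 maximum (~65504),
# so the inverse overflows to infinity
inv_f16 = np.float16(1.0) / kq_sum
print(np.isinf(inv_f16))            # True -> rescaled output is inf/nan

# FP32 arithmetic (the approach taken by the fix): the inverse is well
# within range, and the extra cost is paid only once per CUDA block
inv_f32 = np.float32(1.0) / np.float32(kq_sum)
print(np.isfinite(np.float32(vkq) * inv_f32))  # True
```

The same reasoning explains why `test-backend-ops` missed it: its inputs keep the KQ sum large enough that the FP16 inverse never overflows.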