CUDA: fix q_nope_absorbed precision for Deepseek 2 Lite f16 #13137
I noticed that Deepseek 2 Lite, when calculating perplexity on Wikitext, was returning worse results for FP16 weights than for q4_0 weights: over 10 chunks of 512 tokens, FP16 resulted in a perplexity of 27.6215 while q4_0 resulted in 8.1775. The problem seems to be numerical issues in the calculation of `q_nope_absorbed`, specifically with CUDA and batch sizes > 1. If FP32 precision is used for that matrix multiplication, the perplexity becomes 7.9094.
On master, `ggml_cuda_mul_mat` does not always respect the precision set via `ggml_mul_mat_set_prec`: `ggml_cuda_mul_mat_batched_cublas` only supports FP16 × FP16 -> FP16 GEMM but is used regardless of the requested precision. This PR makes it so that `ggml_cuda_op_mul_mat_cublas` (which supports FP32 precision) is used instead when higher precision is requested; a sketch of the dispatch idea is below.
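This is a minimal sketch of that dispatch logic only, not the actual ggml-cuda code: the two stub functions stand in for `ggml_cuda_mul_mat_batched_cublas` and `ggml_cuda_op_mul_mat_cublas`, and it assumes the precision requested via `ggml_mul_mat_set_prec` is stored in `dst->op_params[0]`:

```c
#include "ggml.h"

// Stand-in for the FP16 × FP16 -> FP16 batched cuBLAS path.
static void batched_f16_gemm_stub(const struct ggml_tensor * src0,
        const struct ggml_tensor * src1, struct ggml_tensor * dst) {
    (void) src0; (void) src1; (void) dst;
}

// Stand-in for the cuBLAS path that can run the GEMM in FP32.
static void f32_capable_gemm_stub(const struct ggml_tensor * src0,
        const struct ggml_tensor * src1, struct ggml_tensor * dst) {
    (void) src0; (void) src1; (void) dst;
}

// Precision-aware dispatch: only take the FP16-only batched path when the
// default precision is acceptable for this op.
static void mul_mat_dispatch_sketch(const struct ggml_tensor * src0,
        const struct ggml_tensor * src1, struct ggml_tensor * dst) {
    // Assumption: ggml_mul_mat_set_prec stores the requested precision in op_params[0].
    const enum ggml_prec prec = (enum ggml_prec) dst->op_params[0];

    if (prec == GGML_PREC_DEFAULT && src0->type == GGML_TYPE_F16) {
        batched_f16_gemm_stub(src0, src1, dst);   // fast FP16 path
    } else {
        f32_capable_gemm_stub(src0, src1, dst);   // FP32-capable fallback
    }
}
```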
Long-term I think we should aim to remove `ggml_cuda_op_mul_mat` and refactor the cuBLAS code. I'm currently working towards the former; what I think is specifically needed is MMQ support for batched and non-contiguous inputs and backend-agnostic support for tensor parallelism.