CUDA: fix q_nope_absorbed precision for Deepseek 2 Lite f16 #13137

Merged: 1 commit into ggml-org:master on Apr 28, 2025

Conversation

JohannesGaessler (Collaborator)

I noticed that Deepseek 2 Lite was returning worse Wikitext perplexity results for FP16 weights than for q4_0 weights: over 10 chunks of 512 tokens, FP16 gave a perplexity of 27.6215 vs. 8.1775 for q4_0. The problem seems to be numerical issues in the calculation of q_nope_absorbed, specifically with CUDA and batch sizes > 1. If FP32 precision is used, the perplexity becomes 7.9094.
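For context, ggml exposes per-op precision control through ggml_mul_mat_set_prec. A minimal sketch of how a caller requests FP32 accumulation for this matmul; the function name and the tensors wk_b/q_nope are illustrative placeholders, not the exact llama.cpp graph code:

```cpp
// Minimal sketch: request FP32 accumulation for a single matmul via the
// existing ggml_mul_mat_set_prec API. Tensor names are illustrative.
#include "ggml.h"

struct ggml_tensor * build_q_nope_absorbed(
        struct ggml_context * ctx,
        struct ggml_tensor  * wk_b,     // placeholder weight tensor
        struct ggml_tensor  * q_nope) { // placeholder activation tensor
    struct ggml_tensor * q_nope_absorbed = ggml_mul_mat(ctx, wk_b, q_nope);
    ggml_mul_mat_set_prec(q_nope_absorbed, GGML_PREC_F32); // request FP32
    return q_nope_absorbed;
}
```

The backend is then responsible for honoring GGML_PREC_F32, which is what this PR fixes on the CUDA side.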

On master, ggml_cuda_mul_mat does not always respect the precision set via ggml_mul_mat_set_prec: ggml_cuda_mul_mat_batched_cublas only supports FP16 x FP16 -> FP16 GEMM but is used regardless of the requested precision. This PR makes it so that if higher precision is requested, ggml_cuda_op_mul_mat_cublas (which supports FP32 precision) is used instead; see the sketch below. Long-term I think we should aim to remove ggml_cuda_op_mul_mat and refactor the cuBLAS code. I'm currently working towards the former, i.e. removing ggml_cuda_op_mul_mat; what I think is specifically needed for that is MMQ support for batched and non-contiguous inputs, and backend-agnostic support for tensor parallelism.
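A self-contained sketch of the dispatch rule described above. The two "path" functions are stand-ins for ggml_cuda_mul_mat_batched_cublas and ggml_cuda_op_mul_mat_cublas (not real ggml symbols), and the shape checks are collapsed into a single flag; only the added precision check is the point being illustrated:

```cpp
// Sketch of the dispatch rule this PR introduces, with placeholder paths.
#include <cstdio>

enum prec_t { PREC_DEFAULT, PREC_F32 }; // mirrors ggml_prec

static void batched_cublas_path(void)    { std::puts("FP16 x FP16 -> FP16 GEMM"); }
static void op_mul_mat_cublas_path(void) { std::puts("GEMM with FP32 accumulation"); }

static void mul_mat_dispatch(bool batched_shapes_ok, prec_t prec) {
    // On master the first branch was taken whenever the shapes allowed it;
    // with this PR, a GGML_PREC_F32 request forces the FP32-capable fallback.
    if (batched_shapes_ok && prec == PREC_DEFAULT) {
        batched_cublas_path();
    } else {
        op_mul_mat_cublas_path();
    }
}

int main() {
    mul_mat_dispatch(true, PREC_DEFAULT); // fast FP16 batched path
    mul_mat_dispatch(true, PREC_F32);     // now respects the requested precision
    return 0;
}
```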

The github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Apr 27, 2025.
jukofyork (Collaborator)

Yeah, q_nope_absorbed is basically doing the KQ multiplication, so it will likely suffer from the same overflow problems; I had similar problems when I tried to use FP16 for it too.
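To illustrate the failure mode, a standalone CUDA sketch (not ggml code; the values are chosen for demonstration): FP16's maximum finite value is 65504, so a long dot product accumulated in FP16 overflows to infinity where an FP32 accumulator stays exact.

```cpp
// Standalone CUDA illustration: accumulating in FP16 overflows past the
// FP16 max finite value of 65504, while an FP32 accumulator is exact.
#include <cstdio>
#include <cuda_fp16.h>

__global__ void accumulate(float * out_f32, float * out_f16) {
    __half acc_h = __float2half(0.0f);
    float  acc_f = 0.0f;
    for (int i = 0; i < 1024; ++i) {
        // 1024 terms of 128.0 sum to 131072, well beyond the FP16 range
        acc_h = __hadd(acc_h, __float2half(128.0f));
        acc_f += 128.0f;
    }
    *out_f16 = __half2float(acc_h); // inf: overflows around the 512th term
    *out_f32 = acc_f;               // 131072, exact
}

int main() {
    float *d_f32, *d_f16, h_f32, h_f16;
    cudaMalloc((void **) &d_f32, sizeof(float));
    cudaMalloc((void **) &d_f16, sizeof(float));
    accumulate<<<1, 1>>>(d_f32, d_f16);
    cudaMemcpy(&h_f32, d_f32, sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_f16, d_f16, sizeof(float), cudaMemcpyDeviceToHost);
    printf("FP32 accumulator: %f\nFP16 accumulator: %f\n", h_f32, h_f16);
    cudaFree(d_f32);
    cudaFree(d_f16);
    return 0;
}
```

Attention scores are exactly this kind of long sum of unnormalized products, which is why both KQ and the q_nope_absorbed path can hit the FP16 range limit.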

@JohannesGaessler merged commit 69699be into ggml-org:master on Apr 28, 2025 (48 checks passed).