The following 2 matrix multiplication calls still remain in FP16 precision:

Was wondering: if we quantize those on-the-fly, would there be any benefit? The quantization can be done with an extra ggml_cpy() call, before the ggml_mul_mat() call.

See if this speeds up the computation and how it affects perplexity.
Performance stats on 7B indicate that the F16 matrix multiplications account for just 2% of the total processing time.
The PoC in #1103 does not show any significant performance gains, so closing this for now.