
Add AVX2 implementation of dequantize_row_q4_1 #505


Merged: 1 commit merged into ggml-org:master on Mar 25, 2023

Conversation

@slaren (Member) commented on Mar 25, 2023

Initial tests are promising, similar gains with BLAS as with q4_0.
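For context (not part of the original PR text): below is a minimal sketch of what an AVX2 dequantizer for q4_1 could look like, assuming the block layout of the time (a per-block scale `d`, min `m`, and 32 packed 4-bit quants, dequantized as y = d*q + m). The function and struct names are illustrative, and the actual kernel in this PR may be structured differently.

```c
#include <immintrin.h>
#include <stdint.h>

#define QK 32

// Assumed q4_1 block layout: scale, min, and 16 bytes holding 32 4-bit quants.
typedef struct {
    float   d;            // delta (scale)
    float   m;            // min
    uint8_t qs[QK / 2];   // 4-bit quants, two per byte
} block_q4_1;

// Sketch of an AVX2 dequantizer: y[i] = d * q[i] + m for each 4-bit quant q.
// FMA is assumed to be available alongside AVX2.
static void dequantize_row_q4_1_avx2(const block_q4_1 * x, float * y, int k) {
    const int nb = k / QK;

    for (int i = 0; i < nb; i++) {
        const __m256 d = _mm256_set1_ps(x[i].d);
        const __m256 m = _mm256_set1_ps(x[i].m);

        // Load 16 bytes = 32 nibbles.
        const __m128i bytes = _mm_loadu_si128((const __m128i *) x[i].qs);

        // Split into low and high nibbles (the low nibble is the even-indexed quant).
        const __m128i lo = _mm_and_si128(bytes, _mm_set1_epi8(0x0F));
        const __m128i hi = _mm_and_si128(_mm_srli_epi16(bytes, 4), _mm_set1_epi8(0x0F));

        // Interleave so the quants come out in their original order: q0, q1, q2, ...
        const __m128i q01 = _mm_unpacklo_epi8(lo, hi);
        const __m128i q23 = _mm_unpackhi_epi8(lo, hi);

        // Widen 8 quants at a time to 32-bit ints, convert to float, then y = d*q + m.
        const __m128i qs[4] = { q01, _mm_srli_si128(q01, 8), q23, _mm_srli_si128(q23, 8) };
        for (int j = 0; j < 4; j++) {
            const __m256i vi = _mm256_cvtepu8_epi32(qs[j]);
            const __m256  vf = _mm256_cvtepi32_ps(vi);
            _mm256_storeu_ps(y + i*QK + j*8, _mm256_fmadd_ps(d, vf, m));
        }
    }
}
```

The scalar reference path simply loops over the nibbles and applies the same affine transform; the SIMD version processes a full 32-element block per iteration, which is where the speedup in the numbers below comes from.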

@slaren (Member, Author) commented on Mar 25, 2023

Base (no BLAS):
    60.97 seconds per pass - ETA 11.09 hours
    [1]4.4948,[2]4.9721,[3]5.8697,

BLAS:
    34.60 seconds per pass - ETA 6.29 hours
    [1]4.4305,[2]4.8844,[3]5.7737,

BLAS + AVX:
    32.06 seconds per pass - ETA 5.83 hours
    [1]4.4305,[2]4.8844,[3]5.7737,

Most of the improvement comes from BLAS, but it is still a gain.

Not sure if the lower perplexity from using BLAS is just a fluke in the first chunks, but interesting regardless.

@ggerganov (Member) commented on Mar 25, 2023

Not sure if the lower perplexity from using BLAS is just a fluke in the first chunks, but interesting regardless.

I also observed it, and it is a bit worrying because it means the non-BLAS SIMD matrix multiplication is significantly less accurate than BLAS. The problem is that during normal inference for text generation, after processing the prompt, we switch from BLAS to non-BLAS, because using BLAS for single-token inference is terribly slow.

So I think there is a risk that we will measure a perplexity that looks too good thanks to BLAS, while in reality it will be worse due to the SIMD implementation. But I don't see a better solution than disabling BLAS for perplexity computations.

@ggerganov merged commit 459e93c into ggml-org:master on Mar 25, 2023
@slaren deleted the avx-dequantize-q4_1 branch on March 25, 2023 at 18:43
@gjmulder added the enhancement (New feature or request) and performance (Speed related topics) labels on Mar 26, 2023
@slaren restored the avx-dequantize-q4_1 branch on March 26, 2023 at 18:50
@slaren deleted the avx-dequantize-q4_1 branch on March 26, 2023 at 18:53