
Add AVX2 implementation of dequantize_row_q4_0 #467

Merged: 1 commit merged into ggml-org:master from avx-dequantize on Mar 25, 2023

Conversation

@slaren (Member) commented Mar 24, 2023

I couldn't notice a big performance improvement; more testing is necessary.
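
For context, the general idea is to expand each block's 32 packed 4-bit values to bytes with a few AVX2 instructions, recentre them around zero, widen them to 32-bit, and scale by the block's fp32 factor. Below is a minimal sketch of that approach, not the exact code in this PR; the struct name and the interleaved low/high nibble layout are assumptions based on the q4_0 format of the time.

#include <immintrin.h>
#include <cstdint>

// Assumed block layout: one fp32 scale followed by 16 bytes holding
// 32 packed 4-bit weights.
struct block_q4_0_sketch {
    float   d;
    uint8_t qs[16];
};

// Expand 16 packed bytes into 32 bytes: output byte 2k gets the low nibble
// of input byte k, output byte 2k+1 gets the high nibble.
static inline __m256i nibbles_to_bytes(const uint8_t * p) {
    const __m128i packed = _mm_loadu_si128((const __m128i *) p);
    const __m256i spread = _mm256_cvtepu8_epi16(packed);           // b_k -> {b_k, 0}
    const __m256i mask   = _mm256_set1_epi8(0x0F);
    const __m256i lo     = _mm256_and_si256(spread, mask);          // low nibble in the even byte
    const __m256i hi     = _mm256_slli_epi16(_mm256_andnot_si256(mask, spread), 4); // high nibble into the odd byte
    return _mm256_or_si256(lo, hi);
}

static void dequantize_row_q4_0_avx2_sketch(const block_q4_0_sketch * x, float * y, int k) {
    const int nb = k / 32;                                           // 32 values per block
    for (int i = 0; i < nb; i++) {
        const __m256  d = _mm256_broadcast_ss(&x[i].d);              // broadcast the block scale
        const __m256i v = _mm256_sub_epi8(nibbles_to_bytes(x[i].qs),
                                          _mm256_set1_epi8(8));      // [0,15] -> [-8,7]
        // widen 8-bit -> 16-bit -> 32-bit -> float, 8 values at a time
        const __m256i w0 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(v, 0));
        const __m256i w1 = _mm256_cvtepi8_epi16(_mm256_extracti128_si256(v, 1));
        const __m256i q[4] = {
            _mm256_cvtepi16_epi32(_mm256_extracti128_si256(w0, 0)),
            _mm256_cvtepi16_epi32(_mm256_extracti128_si256(w0, 1)),
            _mm256_cvtepi16_epi32(_mm256_extracti128_si256(w1, 0)),
            _mm256_cvtepi16_epi32(_mm256_extracti128_si256(w1, 1)),
        };
        for (int j = 0; j < 4; j++) {
            _mm256_storeu_ps(y + i*32 + j*8,
                             _mm256_mul_ps(_mm256_cvtepi32_ps(q[j]), d));
        }
    }
}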

@slaren (Member, Author) commented Mar 24, 2023

A quick performance test shows significant improvement in the function itself (with k=4096):

Running ./test-dq
Run on (16 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 1.56, 1.26, 1.46
----------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations
----------------------------------------------------------------------
BM_dequantize_row_q4_0           10351 ns        10351 ns        66698
BM_dequantize_row_q4_0_avx2       1384 ns         1384 ns       509491

@slaren (Member, Author) commented Mar 24, 2023

The first chunks of the perplexity computation show the same values. I didn't run the full test but I have no reason to believe that it would produce different values.
[1]4.5690,[2]5.2058,[3]6.0526,

@slaren marked this pull request as ready for review on March 24, 2023 17:14
@Green-Sky (Collaborator)

@ggerganov we need some sort of benchmarking suite for ggml.

@slaren How complex is ./test-dq? Can you provide the code? Does it require the model files, or is it standalone? (It should be easy to create synthetic data.)

@slaren (Member, Author) commented Mar 24, 2023

It's a standalone test using the Google Benchmark library. Here is the code: https://gist.github.com/slaren/ba732ed08abd0ba148129eab3335dfb7
To do that, I split the AVX2 and scalar implementations into dequantize_row_q4_0_avx2 and dequantize_row_q4_0 beforehand.
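
For anyone who wants to reproduce something similar without the gist, a skeleton of such a standalone Google Benchmark harness could look like the following. The function signatures, block size, and synthetic (zeroed) data here are assumptions; the actual code is in the gist above.

#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

// Assumed signatures for the split functions described above; the real
// declarations live in ggml.c.
extern "C" void dequantize_row_q4_0(const void * x, float * y, int k);
extern "C" void dequantize_row_q4_0_avx2(const void * x, float * y, int k);

static void BM_dequantize_row_q4_0(benchmark::State & state) {
    const int k = 4096;
    // 20 bytes per 32-value block: fp32 scale + 16 bytes of packed nibbles (assumed layout)
    std::vector<uint8_t> x(k / 32 * 20, 0);
    std::vector<float>   y(k);
    for (auto _ : state) {
        dequantize_row_q4_0(x.data(), y.data(), k);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_row_q4_0);

static void BM_dequantize_row_q4_0_avx2(benchmark::State & state) {
    const int k = 4096;
    std::vector<uint8_t> x(k / 32 * 20, 0);
    std::vector<float>   y(k);
    for (auto _ : state) {
        dequantize_row_q4_0_avx2(x.data(), y.data(), k);
        benchmark::DoNotOptimize(y.data());
    }
}
BENCHMARK(BM_dequantize_row_q4_0_avx2);

BENCHMARK_MAIN();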

@ggerganov (Member)

> The first chunks of the perplexity computation show the same values. I didn't run the full test but I have no reason to believe that it would produce different values. [1]4.5690,[2]5.2058,[3]6.0526,

The dequantize functions are only used if you link against BLAS and use -b 32 or bigger:

make clean
LLAMA_OPENBLAS=1 make

Otherwise, they will never be called.
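
For example, after the build above, a run with a batch size large enough to hit the BLAS path would look something like this (the model path and prompt are placeholders):

./main -m ./models/7B/ggml-model-q4_0.bin -p "a sufficiently long prompt ..." -b 64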

@slaren (Member, Author) commented Mar 24, 2023

@ggerganov That's not what I am seeing; here is a stack trace, for example:

#2  0x00005555555660d4 in dequantize_row_q4_0 (x=0x7ffedc43a0d0, y=0x7ffe585b81a0, k=k@entry=4096) at ggml.c:767
#3  0x000055555556b1e7 in ggml_compute_forward_get_rows_q4_0 (params=<optimized out>, params=<optimized out>,
    dst=0x7ffe585b8100, src1=0x7ffe585b8030, src0=0x7ffedc43a030) at ggml.c:7249
#4  ggml_compute_forward_get_rows (dst=0x7ffe585b8100, src1=0x7ffe585b8030, src0=0x7ffedc43a030, params=<optimized out>)
    at ggml.c:7345
#5  ggml_compute_forward (params=<optimized out>, tensor=0x7ffe585b8100) at ggml.c:9027
#6  0x0000555555571435 in ggml_graph_compute (ctx=<optimized out>, cgraph=0x7ffffffe4d90) at ggml.c:9911
#7  0x00005555555793f5 in llama_eval_internal (lctx=..., tokens=<optimized out>, n_tokens=4, n_past=0, n_threads=<optimized out>)
    at llama.cpp:822
#8  0x000055555557976d in llama_eval (ctx=<optimized out>, tokens=<optimized out>, n_tokens=<optimized out>,
    n_past=<optimized out>, n_threads=<optimized out>) at llama.cpp:1493
#9  0x000055555555c396 in main (argc=<optimized out>, argv=<optimized out>) at main.cpp:224

@ggerganov (Member)

Ah yes, there is one exception: the ggml_get_rows call at the start of the inference. It is a very lightweight call, so I don't expect it to take a measurable amount of time.
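
In other words, when the embedding tensor is q4_0, the row lookup at the start of a batch dequantizes one row per token before anything else runs. A rough sketch of that path (simplified, not a verbatim excerpt from ggml.c; names and signature are approximate):

#include <cstddef>
#include <cstdint>

extern "C" void dequantize_row_q4_0(const void * x, float * y, int k);  // assumed signature

// For each looked-up token id, expand one row of packed 4-bit weights to fp32.
// row_size_bytes is the byte size of one quantized row; row_len is its element count.
static void get_rows_q4_0_sketch(const void * src, const int32_t * ids, int n_ids,
                                 int row_len, size_t row_size_bytes, float * dst) {
    for (int i = 0; i < n_ids; i++) {
        const char * row = (const char *) src + (size_t) ids[i] * row_size_bytes;
        dequantize_row_q4_0(row, dst + (size_t) i * row_len, row_len);  // one row per token
    }
}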

@slaren (Member, Author) commented Mar 24, 2023

Ah, I see. I am running some tests with BLAS now and will report back when I have some results. Unfortunately it seems to be much slower; I probably need to find a better BLAS library than the libopenblas-dev package from Ubuntu.

@slaren (Member, Author) commented Mar 24, 2023

@ggerganov When building with BLAS and using -b 32 with a long enough prompt, I only get garbage generation (not just bad, but random tokens). This happens on master too. Is it possible that BLAS support is broken at the moment?

@ggerganov (Member)

Yes, it is broken. Weird...

@ggerganov (Member)

OK, BLAS has been fixed, and for large prompts and batch sizes (> 256) there is a significant benefit to enabling BLAS.
Tested on M1 so far, but I expect the same results on x86.

@ggerganov merged commit 09aecbf into ggml-org:master on Mar 25, 2023
@slaren deleted the avx-dequantize branch on March 25, 2023 15:43
@slaren (Member, Author) commented Mar 25, 2023

I am seeing a very significant improvement on x86 as well; for instance, the perplexity computation went from ~8 hours to ~5 hours.
