Attempt to SIMD-ify dequantize_row_q4_0() for ARM_NEON #502

ggerganov · 2023-03-25T16:42:31Z

Any ideas what's wrong in this computation?

The printed numbers look quite OK, but the results are garbage.
It works with the scalar code below and the AVX SIMD code above.

Test command:

make clean && make -j && ./main -m ./models/7B/ggml-model-q4_0.bin -p "I believe the meaning of life is" -c 512 -n 64 -s 12 -b 10 -t 8

* Retire the ggml_mul_mat() for transposed src0
  - It can always be made contiguous with ggml_cpy()
  - The code is now simplified
  - The results are deterministic in respect to num threads
* SIMD-ify dequantize_row_q4_0() for ARM_NEON (#502)
* Attempt to SIMD-ify dequantize_row_q4_0() for ARM_NEON
* Fix dequantization - forgot to interleave the quants

ggerganov added 2 commits March 25, 2023 18:40

Attempt to SIMD-ify dequantize_row_q4_0() for ARM_NEON

04be5b0

Fix dequantization - forgot to interleave the quants

b83ddbd
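
The second commit title points at the likely cause: a SIMD dequantizer typically extracts all low nibbles into one vector and all high nibbles into another, and the two must be interleaved again before being stored. The actual diff is not reproduced here, so the following is only a portable C sketch of that idea, not the code from the PR. It assumes a block of 32 quants stored as one float scale plus 16 bytes, with the low nibble of each byte holding the even-indexed quant and the high nibble the odd-indexed one. The names `block_q4_0_sketch`, `dequantize_scalar` and `dequantize_split` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32 /* quants per block (assumed for this sketch) */

/* Hypothetical block layout: one scale followed by QK/2 bytes of packed 4-bit quants. */
typedef struct {
    float   d;
    uint8_t qs[QK / 2];
} block_q4_0_sketch;

/* Scalar reference: each byte yields two consecutive outputs (low nibble first). */
static void dequantize_scalar(const block_q4_0_sketch *x, float *y, int k) {
    assert(k % QK == 0);
    const int nb = k / QK;
    for (int i = 0; i < nb; i++) {
        const float d = x[i].d;
        for (int l = 0; l < QK; l += 2) {
            const uint8_t vi = x[i].qs[l / 2];
            y[i * QK + l + 0] = ((int)(vi & 0x0F) - 8) * d;
            y[i * QK + l + 1] = ((int)(vi >> 4) - 8) * d;
        }
    }
}

/* SIMD-shaped variant: first split each block into a "low nibble" lane array
 * and a "high nibble" lane array, as a vector AND / shift would do.
 * interleave == 0 stores the two arrays back to back (the suspected bug);
 * interleave != 0 zips them lane by lane (the fix). */
static void dequantize_split(const block_q4_0_sketch *x, float *y, int k, int interleave) {
    assert(k % QK == 0);
    const int nb = k / QK;
    for (int i = 0; i < nb; i++) {
        const float d = x[i].d;
        int8_t lo[QK / 2];
        int8_t hi[QK / 2];
        for (int j = 0; j < QK / 2; j++) {
            lo[j] = (int8_t)((x[i].qs[j] & 0x0F) - 8);
            hi[j] = (int8_t)((x[i].qs[j] >> 4) - 8);
        }
        for (int j = 0; j < QK / 2; j++) {
            if (interleave) {
                y[i * QK + 2 * j + 0] = lo[j] * d;
                y[i * QK + 2 * j + 1] = hi[j] * d;
            } else {
                y[i * QK + j]          = lo[j] * d;
                y[i * QK + QK / 2 + j] = hi[j] * d;
            }
        }
    }
}
```

Without the interleave, every individual value is still a plausible dequantized number, only in the wrong position, which would be consistent with printed values that look fine while the model output is garbage. On NEON the zip step would presumably be done with a zip-style intrinsic rather than a scalar loop.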

ggerganov merged commit face808 into simple-mul_mat Mar 25, 2023

ggerganov deleted the arm-deq-q4_0 branch March 25, 2023 17:31

Bearsaerker mentioned this pull request Mar 12, 2025

Eval bug: Gemma 3 extremly slow prompt processing when using quantized kv cache. #12352

Open
