
Why is Q4 much faster than Q8? #1239

Closed
gotzmann opened this issue Apr 29, 2023 · 5 comments
Labels
hardware Hardware related performance Speed related topics

Comments

@gotzmann

I've tried to compare inference performance across the different quantised formats, expecting Q8_0 to be the fastest due to the smaller number of shifts, moves, and other CPU operations.

To my surprise, it lags behind Q4_0, which I expected to be slower.

So I'm curious what the main reason is: is it just that Q8 is not well optimized yet, or is Q4 faster for a more fundamental reason, like fewer bytes moving between RAM and the CPU?

Is Q4 expected to stay faster in future releases too?

@ggerganov
Member

ggerganov commented Apr 29, 2023

The explanation that I have is that the computation becomes memory-bound on modern CPUs.
I.e. running inference with 8 threads is constrained by the speed of the RAM and not by the actual computation.
Therefore, by using quantized data we reduce the required memory traffic and gain performance.

If we had infinite memory throughput, then you would probably be right: the Q8_0 method would be faster.
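The memory-bound argument above can be sketched numerically: if every weight has to be streamed from RAM once per generated token, the token rate is bounded by bandwidth divided by model size, so halving the bits per weight roughly doubles the ceiling. The parameter count, sustained bandwidth, and bits-per-weight figures below (including per-block scales) are illustrative assumptions, not measurements:

```python
def tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on token rate when every weight is read from RAM once per token."""
    return bandwidth_bytes_per_s / model_bytes

PARAMS = 7e9                 # assumed 7B-parameter model
BANDWIDTH = 40e9             # assumed ~40 GB/s sustained CPU memory bandwidth

q4_bytes = PARAMS * 4.5 / 8  # Q4_0: roughly 4.5 bits/weight incl. block scales
q8_bytes = PARAMS * 8.5 / 8  # Q8_0: roughly 8.5 bits/weight incl. block scales

print(f"Q4_0 ceiling: {tokens_per_second(q4_bytes, BANDWIDTH):.1f} tok/s")
print(f"Q8_0 ceiling: {tokens_per_second(q8_bytes, BANDWIDTH):.1f} tok/s")
```

Under these assumptions Q4_0's ceiling is about 8.5/4.5 ≈ 1.9x higher, purely from moving fewer bytes.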

@gotzmann
Author

Great, I came to the same conclusion. I'm going to re-test on an AMD platform with much slower RAM throughput and see whether things are even worse there :) My M1 Pro laptop should have a 200 GB/s memory interface, while my home PC has only about 20% of that.

@ggerganov
Member

ggerganov commented Apr 29, 2023

On the M1 Pro I haven't seen more than 40 GB/s when using just the CPU (i.e. memcpy).
The memory is indeed fast, but the "200 GB/s" figure is mostly a marketing trick IMHO.
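A rough way to check the sustained memcpy-style bandwidth of one's own machine is to time large buffer copies. This is a minimal single-threaded sketch (buffer size and repeat count are arbitrary choices, and it will typically read lower than a tuned multi-threaded C benchmark):

```python
import time

def measure_copy_bandwidth(size_mb: int = 256, repeats: int = 10) -> float:
    """Estimate single-thread memory bandwidth in GB/s via large buffer copies.

    Counts read + write traffic, i.e. 2x the buffer size per copy, and keeps
    the fastest of several runs to reduce timing noise."""
    src = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = bytes(src)  # forces a full copy of the buffer
        best = min(best, time.perf_counter() - t0)
        del dst
    return 2 * len(src) / best / 1e9

print(f"~{measure_copy_bandwidth():.1f} GB/s (single-thread copy)")
```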

@gotzmann
Author

gotzmann commented May 1, 2023

Interesting numbers with PassMark command-line for M1 Pro:

  Memory Read Cached               24194 MB/s
  Memory Read Uncached             24291 MB/s
  Memory Write                     22829 MB/s
  Memory Latency                   25 Nanoseconds
  Memory Threaded                  119809 MB/s

Looks like memory throughput for multi-threaded apps is much higher than on your average dual-channel memory PC.
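For scale, the threaded PassMark figure quoted above can be compared against the single-thread number and against the theoretical peak of a typical dual-channel desktop; the DDR4-3200 figure below is a textbook calculation used as an assumption for illustration, not a measurement:

```python
# PassMark figures quoted above, in MB/s
single_thread_read = 24194   # "Memory Read Cached"
threaded = 119809            # "Memory Threaded"

# Assumed theoretical peak of a dual-channel DDR4-3200 desktop:
# 2 channels x 3200 MT/s x 8 bytes = 51,200 MB/s
ddr4_dual_channel = 51200

print(f"threaded vs. single-thread: {threaded / single_thread_read:.1f}x")
print(f"threaded vs. dual-channel DDR4 peak: {threaded / ddr4_dual_channel:.1f}x")
```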

@gjmulder gjmulder added performance Speed related topics hardware Hardware related labels May 2, 2023
@sw
Contributor

sw commented May 12, 2023

As with #1243, reducing the amount of memory that needs to be loaded is more important for performance than matching the cache line size or reducing instructions per block. Closing this as there isn't really anything concrete we can do here, in my opinion.
