
Why is Q4 much faster than Q8? #1239

Closed
gotzmann opened this issue Apr 29, 2023 · 5 comments
Labels
hardware Hardware related performance Speed related topics

Comments

@gotzmann

I've tried to compare inference performance across the different quantised formats, expecting Q8_0 to be the fastest due to the smaller number of shifts, moves, and other CPU operations.

To my surprise, it lags behind Q4_0, which I expected to be slower.

So I'm curious what the main reason is: is it just that Q8 is not well optimized yet, or is Q4 faster for a more fundamental reason, like fewer bytes moving between RAM and the CPU?

Is Q4 expected to stay faster in future releases too?

@ggerganov
Member

ggerganov commented Apr 29, 2023

The explanation that I have is that the computation becomes memory-bound on modern CPUs.
I.e. running inference with 8 threads is constrained by the speed of the RAM and not by the actual computation.
Therefore, by using quantized data we reduce the required memory traffic and gain performance.

If we had infinite memory throughput, then you would probably be right: the Q8_0 method would be faster.
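The memory-bound argument above can be sketched numerically: if every weight has to be streamed from RAM once per generated token, the token rate is bounded by bandwidth divided by model size, so halving the bits per weight roughly doubles the ceiling. The parameter count, sustained bandwidth, and bits-per-weight figures below (including per-block scales) are illustrative assumptions, not measurements:

```python
def tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on token rate when every weight is read from RAM once per token."""
    return bandwidth_bytes_per_s / model_bytes

PARAMS = 7e9                 # assumed 7B-parameter model
BANDWIDTH = 40e9             # assumed ~40 GB/s sustained CPU memory bandwidth

q4_bytes = PARAMS * 4.5 / 8  # Q4_0: roughly 4.5 bits/weight incl. block scales
q8_bytes = PARAMS * 8.5 / 8  # Q8_0: roughly 8.5 bits/weight incl. block scales

print(f"Q4_0 ceiling: {tokens_per_second(q4_bytes, BANDWIDTH):.1f} tok/s")
print(f"Q8_0 ceiling: {tokens_per_second(q8_bytes, BANDWIDTH):.1f} tok/s")
```

Under these assumptions Q4_0's ceiling is about 8.5/4.5 ≈ 1.9x higher, purely from moving fewer bytes.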

@gotzmann
Author

Great, I came to the same conclusion. I'm going to re-test on an AMD platform with much slower RAM throughput and see whether things are even worse there :) My M1 Pro laptop should have a 200 GB/s memory interface, while my home PC has only about 20% of that.

@ggerganov
Member

ggerganov commented Apr 29, 2023

On the M1 Pro I haven't seen more than 40 GB/s when using just the CPU (i.e. memcpy).
The memory is indeed fast, but the "200 GB/s" figure is mostly a marketing trick IMHO.
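A rough way to check the sustained memcpy-style bandwidth of one's own machine is to time large buffer copies. This is a minimal single-threaded sketch (buffer size and repeat count are arbitrary choices, and it will typically read lower than a tuned multi-threaded C benchmark):

```python
import time

def measure_copy_bandwidth(size_mb: int = 256, repeats: int = 10) -> float:
    """Estimate single-thread memory bandwidth in GB/s via large buffer copies.

    Counts read + write traffic, i.e. 2x the buffer size per copy, and keeps
    the fastest of several runs to reduce timing noise."""
    src = bytearray(size_mb * 1024 * 1024)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        dst = bytes(src)  # forces a full copy of the buffer
        best = min(best, time.perf_counter() - t0)
        del dst
    return 2 * len(src) / best / 1e9

print(f"~{measure_copy_bandwidth():.1f} GB/s (single-thread copy)")
```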

@gotzmann
Author

gotzmann commented May 1, 2023

Interesting numbers with PassMark command-line for M1 Pro:

  Memory Read Cached               24194 MB/s
  Memory Read Uncached             24291 MB/s
  Memory Write                     22829 MB/s
  Memory Latency                   25 Nanoseconds
  Memory Threaded                  119809 MB/s

Looks like memory throughput for multi-threaded apps is much higher than on your average dual-channel memory PC.
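For scale, the threaded PassMark figure quoted above can be compared against the single-thread number and against the theoretical peak of a typical dual-channel desktop; the DDR4-3200 figure below is a textbook calculation used as an assumption for illustration, not a measurement:

```python
# PassMark figures quoted above, in MB/s
single_thread_read = 24194   # "Memory Read Cached"
threaded = 119809            # "Memory Threaded"

# Assumed theoretical peak of a dual-channel DDR4-3200 desktop:
# 2 channels x 3200 MT/s x 8 bytes = 51,200 MB/s
ddr4_dual_channel = 51200

print(f"threaded vs. single-thread: {threaded / single_thread_read:.1f}x")
print(f"threaded vs. dual-channel DDR4 peak: {threaded / ddr4_dual_channel:.1f}x")
```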

@gjmulder gjmulder added performance Speed related topics hardware Hardware related labels May 2, 2023
@sw
Contributor

sw commented May 12, 2023

As with #1243, reducing the amount of memory that needs to be loaded is more important for performance than matching the cache line size or reducing instructions per block. Closing this as there isn't really anything concrete we can do here, in my opinion.
