Why is Q4 much faster than Q8? #1239
Comments
The explanation that I have is that the computation becomes memory-bound on modern CPUs. If we had infinite memory throughput, then you would probably be right - the …
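For a rough sense of what "memory-bound" implies here, a back-of-envelope sketch is below. The numbers are assumptions, not measurements from this thread: a 7B-parameter model, ~5 vs ~9 bits per weight for the two formats, and the ~40 GB/s CPU bandwidth figure mentioned later in the thread.

```cpp
// Back-of-envelope estimate of the memory-bandwidth ceiling on token generation,
// assuming every weight is streamed from RAM once per generated token.
#include <cstdio>

int main() {
    const double n_params = 7e9;    // assumed 7B model
    const double bw_bytes = 40e9;   // ~40 GB/s, the M1 Pro CPU figure from this thread

    // Rough bits-per-weight, assuming 32-weight blocks with one float scale each
    // (the exact block layouts have changed between llama.cpp versions):
    const char  *names[] = { "Q4_0", "Q8_0" };
    const double bits[]  = { 5.0,    9.0 };

    for (int i = 0; i < 2; ++i) {
        const double model_bytes = n_params * bits[i] / 8.0;
        const double tok_per_s   = bw_bytes / model_bytes;   // purely bandwidth-bound
        std::printf("%s: ~%.1f GB per token -> at most ~%.1f tok/s\n",
                    names[i], model_bytes / 1e9, tok_per_s);
    }
}
```

Under these assumptions the token rate scales directly with the inverse of the model's on-disk/in-RAM size, which is why the smaller format wins even though it needs more unpacking instructions.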
Great, I came to the same conclusion. Going to re-test it on an AMD platform with much slower RAM throughput and see whether things are even worse there :) My M1 Pro laptop should have a 200 GB/s interface to RAM, and my home PC only about 20% of that.
On M1 Pro I haven't seen more than 40 GB/s when using just the CPU (i.e. …
Interesting numbers with the PassMark command-line tool on M1 Pro:

Looks like memory throughput for multi-threaded apps is much higher than on your average dual-channel memory PC.
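For anyone who wants to cross-check such numbers without PassMark, here is a minimal multi-threaded read-bandwidth sketch. It is not a calibrated benchmark; the buffer size and the thread-count fallback are arbitrary assumptions.

```cpp
// Crude multi-threaded memory read-bandwidth check: each thread streams its
// chunk of a ~2 GiB buffer once and reports the aggregate GB/s.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    size_t n_threads = std::thread::hardware_concurrency();
    if (n_threads == 0) n_threads = 4;                      // fallback assumption

    const size_t n_u64 = 256ull * 1024 * 1024;              // ~2 GiB of uint64_t
    const size_t chunk = n_u64 / n_threads;                 // elements per thread

    std::vector<uint64_t> buf(n_u64, 1);                    // filled outside the timed region
    std::vector<uint64_t> sums(n_threads, 0);

    const auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (size_t t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            uint64_t s = 0;
            for (size_t i = t * chunk; i < (t + 1) * chunk; ++i) {
                s += buf[i];                                 // stream the chunk once
            }
            sums[t] = s;                                     // keep the result observable
        });
    }
    for (auto &w : workers) w.join();
    const double secs = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();

    const double gb = chunk * n_threads * 8.0 / 1e9;
    std::printf("read ~%.2f GB in %.2f s -> ~%.1f GB/s (checksum %llu)\n",
                gb, secs, gb / secs, (unsigned long long) sums[0]);
}
```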
As with #1243, reducing the amount of memory that needs to be loaded is more important for performance than matching the cache line size or reducing instructions per block. Closing this as there isn't really anything concrete we can do here, in my opinion.
I've tried to check inference performance for different quantised formats, expecting Q8_0 to be the fastest due to the smaller number of shifts/moves and other CPU operations.
To my surprise, it lags behind Q4_0, which I expected to be slower.
So I'm curious what the main reason for that is - just the fact that Q8 is perhaps not well supported yet, or is Q4 faster due to some fundamental law, like fewer moves between RAM and CPU, etc.?
Is Q4 expected to be faster in future releases too?
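One way to make the "fewer moves between RAM and CPU" argument concrete is to compare the block sizes of the two formats. The struct layouts below are approximations (32 weights per block, one float scale each), not the exact ggml definitions, which have changed across llama.cpp versions.

```cpp
// Rough size comparison of Q4_0-like and Q8_0-like quantization blocks.
#include <cstdint>
#include <cstdio>

constexpr int QK = 32;                                         // weights per block (assumed)

struct block_q4_0_like { float d; uint8_t qs[QK / 2]; };       // 4-bit quants, packed 2 per byte
struct block_q8_0_like { float d; int8_t  qs[QK];     };       // 8-bit quants

int main() {
    std::printf("Q4_0-like: %zu bytes / %d weights = %.2f bits per weight\n",
                sizeof(block_q4_0_like), QK, 8.0 * sizeof(block_q4_0_like) / QK);
    std::printf("Q8_0-like: %zu bytes / %d weights = %.2f bits per weight\n",
                sizeof(block_q8_0_like), QK, 8.0 * sizeof(block_q8_0_like) / QK);
}
```

At roughly 5 vs 9 bits per weight, Q8_0 has to pull about 1.8x more data through the memory hierarchy for every matrix-vector product, which matches the memory-bound explanation given earlier in the thread.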