Optimize AVX2 ggml_vec_dot_q4_0 #642
Conversation
Very nice. Compiled on Windows and went from 230ms per token to 195ms per token on average, which sounds even more promising because I was getting 185ms per token as is on Linux without these optimizations.
I didn't have run times consistent enough to see an improvement, so thanks for looking at this more closely. I wonder how much of the speed-up is due to the loop unrolling, though.
It's not much at all; I couldn't even measure a difference reliably in the Google Benchmark code. However, it seemed to consistently lower perplexity times by a couple of seconds, so I decided to leave it in.
Out of curiosity, I checked the non-AVX2 binaries out on Windows now that #617 was merged. Getting 177ms per token now. Edit: Got back on Linux and the numbers comparing an AVX-only build versus this PR with AVX2 are as follows.
Additional AVX2 ggml_vec_dot_q4_0 optimization you might be interested in, @slaren. Before (with this pull): After (with this pull + extra optimization):
I'm on MSVC on Windows, so I'm not sure if those improvements would translate to Linux.
Is there any reason to unroll the loops manually rather than allowing the compiler to do it? From what I can tell,
Still getting more or less the same average of 138ms per token with my test prompt on Linux. Will try to compare on Windows later.
I don't think that we can reliably measure small performance differences just by looking at the eval time; the variance is too high. For now I found that the perplexity time per pass is a lot more stable (remember to disable BLAS when doing this, though).
This PR changes the behavior of inference on Intel. I monitor this project for deterministic responses by hard-coding the random seed parameter. After applying this change, token outputs became different and of lower quality. There's probably a small bug that needs to be addressed. Please consider rolling back.
@jart: You're right. I noticed it before merging but thought it was due to changes in floating point operation order. There are such differences between the various optimizations for different processors. But thinking about it, there shouldn't be a difference here because only integer operations were modified. Let me look into this some more. #617 might have the same problem, but on AVX instead of AVX2.
Intel support was already in a somewhat broken state compared to Apple M1 support before this change; the differences were kind of minor, like a missing comma. This change has a bug that causes completely different answers to show up that don't make sense, although the output was still real spoken language. I've seen worse: sometimes when working on changes, I get something wrong and it speaks complete gibberish! I care much more about reliability, determinism, predictability, and accuracy than I care about an 8% performance boost. Yes, differences in floating point exist between architectures, but aren't they usually minor? Stuff like NaN vs. -NaN? Whatever defined differences in behavior exist, I would ideally hope we could avoid triggering them, in the interest of better correctness.
Or rather than correctness, I should say consistency.
@rabidcopy: the change in inference causes an EOT token to be generated, so I don't think you can compare the performance. Try a longer run with a different seed. @jart: can you give an example of "answers that don't make sense"? Maybe with the seed and model file checksums? I agree that the inference has changed; it just sounds like we're not seeing the same kind of degradation. After some tests I'm quite confident that the integer operations haven't changed, so I'm back to my initial suspicion that it's just the order in which the floats are added. Edit: here's what I mean, just to be clear...
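(For context on that point, a minimal illustrative sketch, not the snippet from the original comment: float addition is not associative, so a kernel that accumulates the same partial products in a different order can produce slightly different results.)

```c
#include <stdio.h>

// Float addition is not associative: summing the same values in a different
// order, as a reordered SIMD reduction does, can give a different total.
int main(void) {
    const float v[4] = { 1e8f, 1.0f, -1e8f, 1.0f };

    // Left-to-right accumulation, like a simple scalar loop.
    float seq = ((v[0] + v[1]) + v[2]) + v[3];

    // Pairwise accumulation, closer to how vector lanes are combined.
    float pair = (v[0] + v[2]) + (v[1] + v[3]);

    printf("sequential: %f\n", seq);  // prints 1.000000
    printf("pairwise:   %f\n", pair); // prints 2.000000
    return 0;
}
```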
After more tests, I'm seeing no change in speed between cbef542 and 1d08882 (latest master), and (subjectively) no degradation in language. Of course your experience may be different; I don't want to discount that, and I'm open to reverting it. After all, different processor generations have different AVX throughput/latency. @slaren: could you repeat and confirm your measurements on master?
On current master:
That is in line with what I expected from this PR. This is on a 9900K under WSL2.
Keep in mind that even in my tests the biggest improvement is when evaluating in batches (prompt or perplexity). The performance difference in generation is within the margin of error. When BLAS is used for the prompt, this function is not used at all.
@rabidcopy, if you are on Linux and using GCC, can you try lowering the unroll number in #pragma GCC unroll 16? I wonder if there is some issue with the instruction cache.
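(For reference, a minimal sketch of how that pragma is applied; the loop body below is illustrative and not the actual kernel from ggml.c.)

```c
// #pragma GCC unroll N asks GCC to unroll the loop that follows by a factor
// of N. Lowering N (16 -> 8 -> 4) shrinks the generated code, which can help
// if the unrolled body no longer fits nicely in the instruction cache.
static float sum_partials(const float *acc, int n) {
    float sum = 0.0f;
    #pragma GCC unroll 16
    for (int i = 0; i < n; i++) {
        sum += acc[i];
    }
    return sum;
}
```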
@rabidcopy For me, there's an additional performance improvement when compiling with
Using
You may still be using GCC even if you use CMake.
Ah, duh. Well, going to mess with that then. Edit: Setting the unroll to 16, 8, and 4 didn't seem to affect performance at all.
Oh man. I have egg on my face. It appears I somehow wasn't actually on the latest master. I only now noticed that sections were missing in ggml.c, which didn't make sense.
This is a port of @perserk's and @sw's AVX implementation of ggml_vec_dot_q4_0 (#617) to AVX2.

Before:

After:
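(For readers outside the thread, here is a scalar reference sketch of the computation being optimized, assuming the q4_0 block layout of that era: one float scale plus 32 quants packed as 4-bit nibbles with an implicit offset of 8. The actual AVX and AVX2 kernels from #617 and this PR are not reproduced here.)

```c
#include <stddef.h>
#include <stdint.h>

#define QK 32  // weights per q4_0 block (assumed layout of the time)

// Assumed block layout: one float scale and 32 4-bit quants, two per byte.
typedef struct {
    float   d;           // scale
    uint8_t qs[QK / 2];  // nibbles; each stores (q + 8) with q in [-8, 7]
} block_q4_0;

// Scalar reference: sum over blocks of d_x * d_y * sum_i q_x[i] * q_y[i].
// The AVX/AVX2 kernels vectorize the inner nibble-unpacking and multiply-add.
static float vec_dot_q4_0_ref(size_t n, const block_q4_0 *x, const block_q4_0 *y) {
    float sum = 0.0f;
    const size_t nb = n / QK;
    for (size_t b = 0; b < nb; ++b) {
        int isum = 0;
        for (size_t i = 0; i < QK / 2; ++i) {
            const int x0 = (x[b].qs[i] & 0x0F) - 8;
            const int x1 = (x[b].qs[i] >> 4)   - 8;
            const int y0 = (y[b].qs[i] & 0x0F) - 8;
            const int y1 = (y[b].qs[i] >> 4)   - 8;
            isum += x0 * y0 + x1 * y1;
        }
        sum += x[b].d * y[b].d * (float) isum;
    }
    return sum;
}
```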