Optimize non-SIMD Q4 vector dot product #703
Conversation
I realize this merits checking the perplexity, but I get an ETA of 130 hours for the slowest case listed in the tables above. So roughly a month for a full run with all combinations :-(
I'm not very versed in LLMs and neural networks, but what exactly is the point of checking the perplexity? Is it to check that the model hasn't regressed? Does it hold that the lower the perplexity, the better the model output?
That's basically it, though I wouldn't call myself an expert. If you look at some of the graphs in #406 that people have made, you'll see that the first few chunks have a large variation, so you'd ideally run the full set for it to give a good indication of quality.
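For reference, a minimal sketch of what the perplexity number represents: the exponential of the mean negative log-likelihood of the evaluated tokens, so lower means the model predicts the text better. The function name and the idea of passing precomputed per-token log-probabilities are illustrative assumptions, not llama.cpp's actual API:

```c
#include <math.h>
#include <stddef.h>

// Perplexity = exp(mean negative log-likelihood) over the evaluated tokens.
// Lower perplexity means the model assigns higher probability to the text.
static double perplexity(const double * token_logprobs, size_t n) {
    double nll = 0.0;
    for (size_t i = 0; i < n; ++i) {
        nll -= token_logprobs[i];
    }
    return exp(nll / (double) n);
}
```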
ggml.c
Outdated
const float m0 = x[i].m;
const float m1 = y[i].m;
const float m0 = x[i].m / d0;
const float m1 = y[i].m / d1;
This change needs more testing, as the smaller scale may lose some bits of m0/m1. I suggest you only change Q4_0, which is what we are currently using. The Q4_0 change is safe, as it can be proven that no bits are lost.
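To illustrate the kind of precision concern raised here, a minimal sketch (with arbitrary example values, not taken from a real model) showing that dividing m by d and later multiplying by d again does not necessarily reproduce m exactly in single precision:

```c
#include <stdio.h>

// Round-trip check: (m / d) * d is generally not exactly m in float,
// because the intermediate quotient is rounded. The exact behaviour
// depends on the values; these are arbitrary illustrative numbers.
int main(void) {
    const float d = 0.0123f;
    const float m = 0.4567f;
    const float roundtrip = (m / d) * d;
    printf("m = %.9g, (m/d)*d = %.9g, diff = %.3g\n",
           m, roundtrip, (double)(roundtrip - m));
    return 0;
}
```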
You are right that this is more questionable than the changes to Q4_0, and it also has less of a performance gain than Q4_0. I'll leave it for now so other people can try it and maybe comment on it. In the end we might revert that before merging.
Yes, let's keep only the Q4_0 change and merge.
This should help the poor souls running llama.cpp without a supported SIMD optimization.
First, the good stuff: per-token eval times in milliseconds:
- With `-march=native -mtune=native`, but SIMD optimization code in `ggml_vec_dot_q` disabled: 2023
- Without `-march=native -mtune=native`, which causes the scalar code to be used without requiring modification: 2083

Of course it would be interesting to see measurements from machines that actually do not have any SIMD circuits.
I would like to hear if this causes a regression in speed anywhere.
I can only really imagine this if floating point operations were faster than integer operations, because that's what we're trading here.
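To make that trade concrete, here is a minimal sketch of the idea, assuming a Q4_0 block layout with one float scale and 32 packed 4-bit values (offset by 8) per block. This is an illustration of the approach, not the PR's exact code:

```c
#include <stdint.h>
#include <stddef.h>

#define QK 32

// Assumed block layout for illustration: one float scale and QK packed
// 4-bit values per block, two values per byte, each offset by 8.
typedef struct {
    float   d;          // block scale
    uint8_t qs[QK / 2]; // packed nibbles
} block_q4_0;

// Sketch of a scalar Q4_0 dot product that accumulates the nibble products
// in an int and applies the two float scales once per block, trading
// per-element float operations for integer operations.
static float vec_dot_q4_0_scalar(size_t n, const block_q4_0 * x, const block_q4_0 * y) {
    const size_t nb = n / QK;
    float sumf = 0.0f;
    for (size_t i = 0; i < nb; ++i) {
        int sumi = 0;
        for (size_t j = 0; j < QK / 2; ++j) {
            const int x0 = (x[i].qs[j] & 0x0F) - 8;
            const int x1 = (x[i].qs[j] >> 4)   - 8;
            const int y0 = (y[i].qs[j] & 0x0F) - 8;
            const int y1 = (y[i].qs[j] >> 4)   - 8;
            sumi += x0*y0 + x1*y1;
        }
        sumf += x[i].d * y[i].d * (float) sumi;
    }
    return sumf;
}
```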
Beware of the recent change to the `Makefile` (c4f89d8), and possible differences to `CMakeLists.txt`, if you run your own tests.

The disadvantage is that this would once again cause a change in output due to floating point non-associativity / rounding.
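As a reminder of why reordering the operations changes the output, a small standalone example (values chosen only to make the effect obvious) of floating point addition not being associative:

```c
#include <stdio.h>

// Float addition is not associative: grouping changes the rounded result.
int main(void) {
    const float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a + b) + c = %g\n", (a + b) + c); // prints 1
    printf("a + (b + c) = %g\n", a + (b + c)); // prints 0
    return 0;
}
```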
I'll repost my snippet here, though admittedly this uses `double`s:

We had changes like this before, and I understand that not having reproducible results makes some people unhappy.
I don't see us achieving this across the processor-specific optimizations without a severe regression in performance.