
Optimize non-SIMD Q4 vector dot product #703

Merged 1 commit into ggml-org:master from scalar-opt on Apr 13, 2023
Conversation

@sw (Contributor) commented on Apr 2, 2023

This should help the poor souls running llama.cpp without a supported SIMD optimization.

First, the good stuff: per-token eval times in milliseconds:

With -march=native -mtune=native, but SIMD optimization code in ggml_vec_dot_q disabled:

        recent master (6e7801d)   this PR
Q4_0    2488                      1320
Q4_1    2315                      2023

Without -march=native -mtune=native, which causes the scalar code to be used without requiring modification:

        recent master (6e7801d)   this PR
Q4_0    2840                      1330
Q4_1    2388                      2083

Of course it would be interesting to see measurements from machines that actually do not have any SIMD units.
I would like to hear if this causes a speed regression anywhere.
I can only really imagine that happening if floating-point operations were somehow faster than integer operations, because that is what we are trading here (see the sketch below).
Beware of the recent change to the Makefile (c4f89d8), and of possible differences to CMakeLists.txt, if you run your own tests.
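To make the trade concrete, here is a minimal sketch of an integer-accumulating scalar Q4_0 dot product. It is an illustration of the idea, not the exact code in this PR, and it assumes the Q4_0 block layout of the time (one float scale `d` plus QK/2 bytes of packed nibbles stored with a +8 offset); treat the names and layout details as assumptions:

```c
#include <stdint.h>

#define QK 32  // elements per quantization block (assumed, matching ggml at the time)

// Assumed Q4_0 block layout: one float scale followed by 32 quants packed two per byte,
// each nibble storing a value in [0, 15] that represents q - 8 in [-8, 7].
typedef struct {
    float   d;
    uint8_t qs[QK / 2];
} block_q4_0;

// Scalar dot product sketch: accumulate nibble products in an int per block,
// then apply the two float scales once per block instead of once per element.
static float vec_dot_q4_0_scalar(int n, const block_q4_0 *x, const block_q4_0 *y) {
    const int nb = n / QK;
    float sumf = 0.0f;

    for (int i = 0; i < nb; i++) {
        int sumi = 0;  // integer accumulator for this block

        for (int j = 0; j < QK / 2; j++) {
            const uint8_t b0 = x[i].qs[j];
            const uint8_t b1 = y[i].qs[j];

            // unpack both nibbles of each byte and remove the +8 offset
            const int x0 = (b0 & 0x0F) - 8, x1 = (b0 >> 4) - 8;
            const int y0 = (b1 & 0x0F) - 8, y1 = (b1 >> 4) - 8;

            sumi += x0 * y0 + x1 * y1;  // all-integer multiply-accumulate
        }

        sumf += x[i].d * y[i].d * (float) sumi;  // one float step per block
    }

    return sumf;
}
```

The inner loop touches only bytes and ints, and the float multiplies are hoisted to once per 32-element block, which is presumably where the gain over a float-per-element loop comes from.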

The disadvantage is that this would once again cause a change in output due to floating point non-associativity / rounding.

I'll repost my snippet here, though admittedly this uses doubles:

$ echo 'main(){printf("%.16f\n%.16f\n",.7+.2+.1,.7+.1+.2);}' | gcc -x c - && ./a.out
0.9999999999999999
1.0000000000000000

We had changes like this before, and I understand that not having reproducible results makes some people unhappy.
I don't see us achieving bit-exact reproducibility across the processor-specific optimizations without a severe regression in performance.

@sw (Contributor, Author) commented on Apr 2, 2023

I realize this merits checking the perplexity, but I get an ETA of 130 hours for the slowest case listed in the tables above. So roughly a month for a full run with all combinations :-(

@thement (Contributor) commented on Apr 2, 2023

> I realize this merits checking the perplexity, but I get an ETA of 130 hours for the slowest case listed in the tables above. So roughly a month for a full run with all combinations :-(

I'm not very versed in LLMs and neural networks, but what exactly is the point of checking the perplexity? Is it to verify that the model hasn't regressed? Does it hold that the lower the perplexity, the better the model output?

@sw (Contributor, Author) commented on Apr 3, 2023

> I'm not very versed in LLMs and neural networks, but what exactly is the point of checking the perplexity? Is it to verify that the model hasn't regressed? Does it hold that the lower the perplexity, the better the model output?

That's basically it, though I wouldn't call myself an expert. If you look at some of the graphs in #406 that people have made, you'll see that the first few chunks have a large variation, so you'd ideally run the full set for it to give a good indication of quality.
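For reference, perplexity is the exponential of the model's average negative log-likelihood over the evaluation text, so a lower value means the model predicts the text better:

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\bigl(x_i \mid x_{<i}\bigr)\right)$$

llama.cpp evaluates this in fixed-size chunks of the test set, which is why, as noted above, the first few chunks alone give only a noisy estimate.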

ggml.c (outdated hunk):
-            const float m0 = x[i].m;
-            const float m1 = y[i].m;
+            const float m0 = x[i].m / d0;
+            const float m1 = y[i].m / d1;
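For context, the hunk above rests on pulling the block scales out of the Q4_1 dot product. Assuming Q4_1 dequantizes element j of a block as d·q_j + m (scale d, minimum m, 4-bit quant q_j), the rewrite uses the identity

$$\sum_{j}\bigl(d_0 q_{0j} + m_0\bigr)\bigl(d_1 q_{1j} + m_1\bigr) \;=\; d_0\, d_1 \sum_{j}\Bigl(q_{0j} + \tfrac{m_0}{d_0}\Bigr)\Bigl(q_{1j} + \tfrac{m_1}{d_1}\Bigr)$$

so m/d is computed once per block; the concern raised below is that this division, and the additions it feeds, can discard low-order bits of m when d is small.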
@howard0su (Collaborator) commented on Apr 4, 2023
This change needs more testing, since a smaller scale may lose some bits of m0/m1. I suggest you only change Q4_0, which is what we are currently using. The Q4_0 change is safe, since it can be proven that no bits are lost.

@sw (Contributor, Author) replied:
You are right that this is more questionable than the change to Q4_0, and it also gives less of a performance gain. I'll leave it in for now so other people can try it and comment on it; we might revert it before merging.

A Member replied:
Yes, let's keep only the Q4_0 change and merge

@sw requested review from howard0su and ggerganov on April 13, 2023 at 14:51
@ggerganov merged commit 6232f2d into ggml-org:master on April 13, 2023
@sw deleted the scalar-opt branch on April 13, 2023 at 15:23
Deadsg pushed a commit to Deadsg/llama.cpp that referenced this pull request Dec 19, 2023