Optimize non-SIMD Q4 vector dot product #703
Conversation
I realize this merits checking the perplexity, but I get an ETA of 130 hours for the slowest case listed in the tables above. So roughly a month for a full run with all combinations :-(
I'm not very versed in LLMs and neural networks, but what exactly is the point of checking the perplexity? Is it to check that the model hasn't regressed? Does it hold that the lower the perplexity, the better the model output?
That's basically it, though I wouldn't call myself an expert. If you look at some of the graphs in #406 that people have made, you'll see that the first few chunks have a large variation, so you'd ideally run the full set for it to give a good indication of quality.
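For reference, a minimal sketch of what the perplexity number represents: the exponential of the mean negative log-likelihood of the evaluated tokens, so lower means the model predicts the text better. The function name and the idea of passing precomputed per-token log-probabilities are illustrative assumptions, not llama.cpp's actual API:

```c
#include <math.h>
#include <stddef.h>

// Perplexity = exp(mean negative log-likelihood) over the evaluated tokens.
// Lower perplexity means the model assigns higher probability to the text.
static double perplexity(const double * token_logprobs, size_t n) {
    double nll = 0.0;
    for (size_t i = 0; i < n; ++i) {
        nll -= token_logprobs[i];
    }
    return exp(nll / (double) n);
}
```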
ggml.c
Outdated
const float m0 = x[i].m;
const float m1 = y[i].m;
const float m0 = x[i].m / d0;
const float m1 = y[i].m / d1;
This change needs more testing, as the smaller scale may lose some bits of m0/m1. I suggest you only change Q4_0, which is what we are currently using. The Q4_0 change is safe, as it can be proven that no bits are lost.
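To illustrate the kind of precision concern raised here, a minimal sketch (with arbitrary example values, not taken from a real model) showing that dividing m by d and later multiplying by d again does not necessarily reproduce m exactly in single precision:

```c
#include <stdio.h>

// Round-trip check: (m / d) * d is generally not exactly m in float,
// because the intermediate quotient is rounded. The exact behaviour
// depends on the values; these are arbitrary illustrative numbers.
int main(void) {
    const float d = 0.0123f;
    const float m = 0.4567f;
    const float roundtrip = (m / d) * d;
    printf("m = %.9g, (m/d)*d = %.9g, diff = %.3g\n",
           m, roundtrip, (double)(roundtrip - m));
    return 0;
}
```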
You are right that this is more questionable than the changes to Q4_0, and it also has less of a performance gain than Q4_0. I'll leave it for now so other people can try it and maybe comment on it. In the end we might revert that before merging.
Yes, let's keep only the Q4_0 change and merge.
This should help the poor souls running llama.cpp without a supported SIMD optimization.
First, the good stuff: per-token eval times in milliseconds:
- With `-march=native -mtune=native`, but SIMD optimization code in `ggml_vec_dot_q` disabled: 2023
- Without `-march=native -mtune=native`, which causes the scalar code to be used without requiring modification: 2083

Of course it would be interesting to see measurements from machines that actually do not have any SIMD circuits.
I would like to hear if this causes a regression in speed anywhere.
I can only really imagine this if floating point operations were faster than integer operations, because that's what we're trading here.
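To make that trade concrete, here is a minimal sketch of the idea, assuming a Q4_0 block layout with one float scale and 32 packed 4-bit values (offset by 8) per block. This is an illustration of the approach, not the PR's exact code:

```c
#include <stdint.h>
#include <stddef.h>

#define QK 32

// Assumed block layout for illustration: one float scale and QK packed
// 4-bit values per block, two values per byte, each offset by 8.
typedef struct {
    float   d;          // block scale
    uint8_t qs[QK / 2]; // packed nibbles
} block_q4_0;

// Sketch of a scalar Q4_0 dot product that accumulates the nibble products
// in an int and applies the two float scales once per block, trading
// per-element float operations for integer operations.
static float vec_dot_q4_0_scalar(size_t n, const block_q4_0 * x, const block_q4_0 * y) {
    const size_t nb = n / QK;
    float sumf = 0.0f;
    for (size_t i = 0; i < nb; ++i) {
        int sumi = 0;
        for (size_t j = 0; j < QK / 2; ++j) {
            const int x0 = (x[i].qs[j] & 0x0F) - 8;
            const int x1 = (x[i].qs[j] >> 4)   - 8;
            const int y0 = (y[i].qs[j] & 0x0F) - 8;
            const int y1 = (y[i].qs[j] >> 4)   - 8;
            sumi += x0*y0 + x1*y1;
        }
        sumf += x[i].d * y[i].d * (float) sumi;
    }
    return sumf;
}
```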
Beware of the recent change to the `Makefile` (c4f89d8), and possible differences to `CMakeLists.txt`, if you run your own tests.

The disadvantage is that this would once again cause a change in output due to floating point non-associativity / rounding.
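As a reminder of why reordering the operations changes the output, a small standalone example (values chosen only to make the effect obvious) of floating point addition not being associative:

```c
#include <stdio.h>

// Float addition is not associative: grouping changes the rounded result.
int main(void) {
    const float a = 1e8f, b = -1e8f, c = 1.0f;
    printf("(a + b) + c = %g\n", (a + b) + c); // prints 1
    printf("a + (b + c) = %g\n", a + (b + c)); // prints 0
    return 0;
}
```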
I'll repost my snippet here, though admittedly this uses `double`s:

We had changes like this before, and I understand that not having reproducible results makes some people unhappy.
I don't see us achieving this across the processor-specific optimizations without a severe regression in performance.