
Add Q4_1_O quantization format that preserves outliers in weights and does dot in FP32 #825

Closed

Conversation

@saharNooby commented on Apr 7, 2023

THIS PR WAS OPENED BY MISTAKE; IT WAS MEANT FOR rwkv.cpp

Q4_1_O is like Q4_1, but with two important differences:

  • for each block, a single outlier (the absmax value) is selected and stored separately, as-is; the remaining values are quantized as if there were no outlier at all
  • during inference, the dot product in matmul is done in FP32 after weight dequantization, in contrast to Q4_1, which quantizes activations and does a quantized dot (see the sketch below)

This format greatly improves perplexity compared to Q4_1, but the cost is inference that is about as slow as FP32.
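
A minimal sketch of the idea in C, assuming a block size of 32 to match Q4_1; the struct layout and function names below are illustrative only, not the exact rwkv.cpp implementation:

```c
#include <stdint.h>
#include <stddef.h>
#include <math.h>
#include <float.h>

#define QK 32  // values per block; assumed to match Q4_1's block size

// Illustrative block layout (hypothetical field names and types):
// Q4_1's min/delta plus the stored outlier.
typedef struct {
    float    min;            // minimum of the non-outlier values
    float    delta;          // (max - min) / 15 over the non-outlier values
    float    outlier_value;  // absmax value of the block, kept as-is
    uint16_t outlier_index;  // position of the outlier inside the block
    uint8_t  qs[QK / 2];     // 4-bit quants, two per byte
} block_q4_1_o;

// Quantize one block: pull out the absmax outlier, then quantize the
// remaining values exactly as Q4_1 would, as if the outlier were absent.
static void quantize_block_q4_1_o(const float * x, block_q4_1_o * b) {
    // 1. Find the outlier (largest absolute value).
    size_t oi = 0;
    for (size_t i = 1; i < QK; i++)
        if (fabsf(x[i]) > fabsf(x[oi])) oi = i;
    b->outlier_index = (uint16_t) oi;
    b->outlier_value = x[oi];

    // 2. Compute min/max over the non-outlier values only.
    float vmin = FLT_MAX, vmax = -FLT_MAX;
    for (size_t i = 0; i < QK; i++) {
        if (i == oi) continue;
        if (x[i] < vmin) vmin = x[i];
        if (x[i] > vmax) vmax = x[i];
    }
    b->min   = vmin;
    b->delta = (vmax - vmin) / 15.0f;

    // 3. Quantize to 4 bits; the outlier slot gets a dummy quant, since it
    //    is replaced with outlier_value at dequantization time.
    const float id = b->delta != 0.0f ? 1.0f / b->delta : 0.0f;
    for (size_t i = 0; i < QK; i += 2) {
        const float v0 = (i     == oi) ? vmin : x[i];
        const float v1 = (i + 1 == oi) ? vmin : x[i + 1];
        const uint8_t q0 = (uint8_t) fminf(15.0f, roundf((v0 - vmin) * id));
        const uint8_t q1 = (uint8_t) fminf(15.0f, roundf((v1 - vmin) * id));
        b->qs[i / 2] = q0 | (q1 << 4);
    }
}

// Dot product for one block: dequantize the weights to FP32 and multiply
// with FP32 activations directly, instead of quantizing the activations
// as the Q4_1 kernel does.
static float dot_block_q4_1_o(const block_q4_1_o * b, const float * y) {
    float sum = 0.0f;
    for (size_t i = 0; i < QK; i += 2) {
        const uint8_t q = b->qs[i / 2];
        float w0 = b->min + b->delta * (float) (q & 0x0F);
        float w1 = b->min + b->delta * (float) (q >> 4);
        if (i     == b->outlier_index) w0 = b->outlier_value;
        if (i + 1 == b->outlier_index) w1 = b->outlier_value;
        sum += w0 * y[i] + w1 * y[i + 1];
    }
    return sum;
}
```

The key difference from the Q4_1 path is in dot_block_q4_1_o: activations stay in FP32 and weights are dequantized on the fly, which is why per-token latency ends up close to FP32 rather than Q4_1.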

Perplexity comparison on a private dataset (lower is better):

1B5-20220929-ctx4096-Q4_0.bin,   loss [3.079], perplexity  21.745
1B5-20220929-ctx4096-Q4_1.bin,   loss [2.655], perplexity  14.231
1B5-20220929-ctx4096-Q4_1_O.bin, loss [2.204], perplexity   9.060
1B5-20220929-ctx4096-FP16.bin,   loss [2.060], perplexity   7.847

3B-20221110-ctx4096-Q4_0.bin,    loss [4.689], perplexity 108.724
3B-20221110-ctx4096-Q4_1.bin,    loss [2.916], perplexity  18.475
3B-20221110-ctx4096-Q4_1_O.bin,  loss [2.406], perplexity  11.093
3B-20221110-ctx4096-FP16.bin,    loss [2.067], perplexity   7.901
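
(For reference, the perplexity column appears to be exp(loss): e.g. exp(2.204) ≈ 9.06 and exp(2.060) ≈ 7.85, matching the rows above.)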

Performance comparison (per-token latency, lower is better):

1B5 FP32:   213 ms per token
1B5 FP16:   115 ms per token
1B5 Q4_0:   159 ms per token
1B5 Q4_1:   110 ms per token
1B5 Q4_1_O: 207 ms per token

@saharNooby closed this on Apr 7, 2023
@saharNooby (Author) commented:

Oops, big mistake -- this PR was meant for rwkv.cpp, sorry!
