
Add Q4_1_O quantization format that preserves outliers in weights and does dot in FP32 #825

Closed

Conversation

@saharNooby commented on Apr 7, 2023

THIS PR WAS OPENED BY MISTAKE; IT WAS MEANT FOR rwkv.cpp

Q4_1_O is like Q4_1, but with two important differences:

  • for each block, a single outlier (the absmax value) is selected and stored separately, as-is; the remaining values are quantized as if there were no outlier at all
  • during inference, the dot product in matmul is done in FP32 after weight dequantization, in contrast to Q4_1, which quantizes activations and does a quantized dot (see the sketch below)

This format greatly improves perplexity compared to Q4_1, but the cost is inference that is about as slow as FP32.
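
A minimal sketch of the idea in C, assuming a block size of 32 to match Q4_1; the struct layout and function names below are illustrative only, not the exact rwkv.cpp implementation:

```c
#include <stdint.h>
#include <stddef.h>
#include <math.h>
#include <float.h>

#define QK 32  // values per block; assumed to match Q4_1's block size

// Illustrative block layout (hypothetical field names and types):
// Q4_1's min/delta plus the stored outlier.
typedef struct {
    float    min;            // minimum of the non-outlier values
    float    delta;          // (max - min) / 15 over the non-outlier values
    float    outlier_value;  // absmax value of the block, kept as-is
    uint16_t outlier_index;  // position of the outlier inside the block
    uint8_t  qs[QK / 2];     // 4-bit quants, two per byte
} block_q4_1_o;

// Quantize one block: pull out the absmax outlier, then quantize the
// remaining values exactly as Q4_1 would, as if the outlier were absent.
static void quantize_block_q4_1_o(const float * x, block_q4_1_o * b) {
    // 1. Find the outlier (largest absolute value).
    size_t oi = 0;
    for (size_t i = 1; i < QK; i++)
        if (fabsf(x[i]) > fabsf(x[oi])) oi = i;
    b->outlier_index = (uint16_t) oi;
    b->outlier_value = x[oi];

    // 2. Compute min/max over the non-outlier values only.
    float vmin = FLT_MAX, vmax = -FLT_MAX;
    for (size_t i = 0; i < QK; i++) {
        if (i == oi) continue;
        if (x[i] < vmin) vmin = x[i];
        if (x[i] > vmax) vmax = x[i];
    }
    b->min   = vmin;
    b->delta = (vmax - vmin) / 15.0f;

    // 3. Quantize to 4 bits; the outlier slot gets a dummy quant, since it
    //    is replaced with outlier_value at dequantization time.
    const float id = b->delta != 0.0f ? 1.0f / b->delta : 0.0f;
    for (size_t i = 0; i < QK; i += 2) {
        const float v0 = (i     == oi) ? vmin : x[i];
        const float v1 = (i + 1 == oi) ? vmin : x[i + 1];
        const uint8_t q0 = (uint8_t) fminf(15.0f, roundf((v0 - vmin) * id));
        const uint8_t q1 = (uint8_t) fminf(15.0f, roundf((v1 - vmin) * id));
        b->qs[i / 2] = q0 | (q1 << 4);
    }
}

// Dot product for one block: dequantize the weights to FP32 and multiply
// with FP32 activations directly, instead of quantizing the activations
// as the Q4_1 kernel does.
static float dot_block_q4_1_o(const block_q4_1_o * b, const float * y) {
    float sum = 0.0f;
    for (size_t i = 0; i < QK; i += 2) {
        const uint8_t q = b->qs[i / 2];
        float w0 = b->min + b->delta * (float) (q & 0x0F);
        float w1 = b->min + b->delta * (float) (q >> 4);
        if (i     == b->outlier_index) w0 = b->outlier_value;
        if (i + 1 == b->outlier_index) w1 = b->outlier_value;
        sum += w0 * y[i] + w1 * y[i + 1];
    }
    return sum;
}
```

The key difference from the Q4_1 path is in dot_block_q4_1_o: activations stay in FP32 and weights are dequantized on the fly, which is why per-token latency ends up close to FP32 rather than Q4_1.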

Perplexity comparison on a private dataset (lower is better):

1B5-20220929-ctx4096-Q4_0.bin,   loss [3.079], perplexity  21.745
1B5-20220929-ctx4096-Q4_1.bin,   loss [2.655], perplexity  14.231
1B5-20220929-ctx4096-Q4_1_O.bin, loss [2.204], perplexity   9.060
1B5-20220929-ctx4096-FP16.bin,   loss [2.060], perplexity   7.847

3B-20221110-ctx4096-Q4_0.bin,    loss [4.689], perplexity 108.724
3B-20221110-ctx4096-Q4_1.bin,    loss [2.916], perplexity  18.475
3B-20221110-ctx4096-Q4_1_O.bin,  loss [2.406], perplexity  11.093
3B-20221110-ctx4096-FP16.bin,    loss [2.067], perplexity   7.901
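
(For reference, the perplexity column appears to be exp(loss): e.g. exp(2.204) ≈ 9.06 and exp(2.060) ≈ 7.85, matching the rows above.)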

Performance comparison (per-token latency, lower is better):

1B5 FP32:   213 ms per token
1B5 FP16:   115 ms per token
1B5 Q4_0:   159 ms per token
1B5 Q4_1:   110 ms per token
1B5 Q4_1_O: 207 ms per token

@saharNooby closed this on Apr 7, 2023
@saharNooby (Author) commented:

Oops, big mistake -- this PR was meant for rwkv.cpp, sorry!
