
8-bit Quantization #298

Open · wants to merge 31 commits into master
Conversation

kroggen
Contributor

kroggen commented Aug 15, 2023

This PR also includes work from Aniket.

It implements very basic but understandable 8-bit quantization (via quantize.c), plus on-the-fly dequantization in matmul, rmsnorm and dequantize_token.

The RoPE weights are intentionally left unquantized, since quantizing them may cause some accuracy loss (not tested).

Example Usage

gcc quantize.c -o quantize
./quantize stories110M.bin
gcc -Ofast -march=native runq.c -o runq
./runq data.bin

@karpathy
Owner

Nice, this will be a helpful reference. This is the Q8_1 scheme. A few things that are on my mind for quantization:

  • I think I will change the Python export script to emit int8 directly, instead of using a separate quantize.c
  • I think I'll go for Q8_0, which is simpler and just as good
  • I think we have to quantize the activation vector x (dynamically), instead of keeping it float; otherwise we don't realize all the gains we'd want

Roughly some of the things that come to mind.
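
For reference, a minimal sketch of what group-wise Q8_0 quantization could look like (group size, function name, and layout here are illustrative assumptions, not the final design):

#include <math.h>
#include <stdint.h>

#define GROUP_SIZE 64   /* illustrative choice; llama.cpp uses small fixed-size blocks */

/* Q8_0-style: one float scale per group, no zero point.
   Assumes n is a multiple of GROUP_SIZE. */
void quantize_q8_0(const float* x, int8_t* q, float* scales, int n) {
    for (int g = 0; g < n / GROUP_SIZE; g++) {
        const float* xg = x + g * GROUP_SIZE;
        float amax = 0.0f;                       /* max absolute value in the group */
        for (int i = 0; i < GROUP_SIZE; i++) {
            float a = fabsf(xg[i]);
            if (a > amax) amax = a;
        }
        float scale = amax / 127.0f;
        scales[g] = scale;
        for (int i = 0; i < GROUP_SIZE; i++) {
            q[g * GROUP_SIZE + i] = (int8_t) roundf(scale != 0.0f ? xg[i] / scale : 0.0f);
        }
    }
}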

@byte-6174
Contributor

Activation quantization would most likely need fine-tuning though, wouldn't it?
Much more work is needed to get that into good shape, but there are huge potential runtime gains once we have it.

@karpathy
Owner

@byte-6174 Not to my knowledge? It's possible to do quantization-aware finetuning to improve a model for quantization, but you can quantize it anyway.

@byte-6174
Contributor

@karpathy I have been poring through the literature on this for a few days now and most of it points to a need for fine-tuning; see e.g. I-BERT. And there are many more papers pointing to the need for fine-tuning.
However, I'm not saying it won't work for sure, just going off of what I am seeing out there :)
We should try it anyhow :D

@byte-6174
Contributor

Btw, this PR also works for quantizing the llama2 7B model. Compression from 25GB to 6.2GB. 🎆

@byte-6174
Contributor

byte-6174 commented Aug 16, 2023

Following up on the little pseudo-code above, I checked in a small experiment that quantizes weights/activations on the fly. It needs more experiments, since blindly making all matmuls run in int8 will run everything into the ground :)

The code of interest is this:

void get_quants_and_max(float *ptr, int size, int8_t *out_ptr, float* pmax, char* label){
    // single-scale quantization: the scale is derived from the maximum absolute value
    float max = 0.0f;
    for (int i = 0; i < size; i++){
        float a = fabsf(ptr[i]);          // use |x| so negative values don't overflow the int8 range
        if (a > max) max = a;
    }
    *pmax = max;
    for (int i = 0; i < size; i++){
        out_ptr[i] = (int8_t) roundf(127.0f / max * ptr[i]);
    }
}

void matmulint(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    // by far the most amount of time is spent inside this little function

    // calculate instantaneous max and quantize both operands on the fly
    float maxx, maxw;
    int8_t *intx = calloc(n, sizeof(int8_t));
    int8_t *intw = calloc(n*d, sizeof(int8_t));
    get_quants_and_max(x, n, intx, &maxx, "x");
    get_quants_and_max(w, d * n, intw, &maxw, "w");

    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        int32_t vali = 0;                 // 32-bit accumulator: int16 would overflow for large n
        for (int j = 0; j < n; j++) {
            // int8 x int8 multiplies, accumulated in int32
            vali += intw[i*n + j] * intx[j];
        }
        xout[i] = (vali * (maxx * maxw)) / (127.0f * 127.0f);
    }

    free(intx);
    free(intw);
}

@mgrabban

gcc -Ofast -march=native runq.c -o runq

btw, this PR also works for quantizing the llama2 7B model as well. compression from 25GB to 6.2GB. 🎆

Hello @byte-6174 ,

When I try to run the quantized llama2 7B chat version, I get gibberish. I did get a coherent response from the quantized stories42M.

using run

llama/llama2.c $ make run
gcc -O3 -o run run.c -lm
llama/llama2.c $ ./run bin/llama2_7b_chat.bin -n 16 -i "Why is sky blue?"
Why is sky blue?
How does the sky appear blue?
What is
achieved tok/s: 0.167125

using runq

llama/llama2.c $ gcc -Ofast -march=native runq.c -o runq -lm
llama/llama2.c $ ./runq bin/data.bin -n 16 -i "Why is sky blue?"
Why is sky blue?dj aj grandsls swo refuge花роз Louisiana Alb Alb
achieved tok/s: 1.536885

Not sure what could be wrong.

@kroggen
Contributor Author

kroggen commented Aug 16, 2023

@mgrabban How did you quantize the 7B model?

Can you show the output of the quantization? (it can be a link)

@kroggen
Contributor Author

kroggen commented Aug 17, 2023

For the CUDA implementation, check #310

@mgrabban

@mgrabban How did you quantize the 7B model?

Can you show the output of the quantization? (it can be a link)

You can find the output here
I followed a two-step process:

  1. convert the original llama2_7b_chat *.pth file (from Meta) into a llama2.c *.bin file
  2. quantize the llama2.c *.bin file from step 1 into the data.bin file

@kroggen
Contributor Author

kroggen commented Aug 17, 2023

I suspect it is related to the shared_weights flag.

With stories110M.bin it is equal to 1:

$ ./quantize stories110M.bin
vocab size = 32000  shared_weights=1

And with that chat model it is 0.

But the quantize step is not processing the additional wcls tensor.

Can you check with the last commit I sent?
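
For illustration, a minimal sketch of handling the unshared classifier weights, assuming the legacy llama2.c export where a negative vocab_size signals shared_weights = 0 (quantize_and_write is a hypothetical helper, not the actual quantize.c code):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: quantize `size` floats with a single scale and
   append the scale followed by the int8 data to the output file. */
void quantize_and_write(FILE* out, const float* w, size_t size) {
    float amax = 0.0f;
    for (size_t i = 0; i < size; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    float scale = amax / 127.0f;
    fwrite(&scale, sizeof(float), 1, out);
    for (size_t i = 0; i < size; i++) {
        int8_t q = (int8_t) roundf(scale != 0.0f ? w[i] / scale : 0.0f);
        fwrite(&q, sizeof(int8_t), 1, out);
    }
}

/* In the legacy export a negative vocab_size marks unshared classifier
   weights, so wcls is an extra tensor at the end of the file:

   int shared_weights = config.vocab_size > 0 ? 1 : 0;
   config.vocab_size = abs(config.vocab_size);
   ...
   if (!shared_weights) {
       quantize_and_write(out, wcls, (size_t)config.vocab_size * config.dim);
   }
*/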

@byte-6174
Contributor

Yes, that change was needed for the llama2 7B model. Thanks @kroggen!

./runq data.bin -n 16 -i "why is sky blue?"

why is sky blue? Here's a theory
 everyone's been looking
achieved tok/s: 0.107060

@karpathy
Owner

Quantization here is per layer instead of groups. That feels risky? I'd expect llama.cpp does groups?

@byte-6174
Contributor

Yes. llama.cpp has groups of 64 etc.
Why risky?
Why risky ?

@karpathy
Owner

One outlier nukes the whole tensor. I'm starting a branch for int8 quantization now. I'll do groups.

@byte-6174
Contributor

Hmm, trying to understand this. So how do groups of 64 avoid this?
You mean an outlier in the magnitude sense, I'm presuming?

@karpathy
Owner

If there is a bad outlier somewhere, only e.g. up to 63 elements get "messed up" with high error, not the entire tensor. So breaking things up into groups makes things more robust to outliers.
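
For a rough sense of the numbers (an illustrative example, not measured on this model): if the weights in a tensor mostly lie in [-0.1, 0.1] but a single outlier is 10.0, a per-tensor scale of 10/127 ≈ 0.079 maps all of the typical weights to just 0 or ±1, wiping out almost all of their resolution. With groups of 64, only the outlier's group pays that price; every other group keeps a scale of about 0.1/127 ≈ 0.0008 and the full 8-bit range.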

@byte-6174
Contributor

byte-6174 commented Aug 17, 2023

Btw, as a side note: there is experimental evidence, in llama.cpp and also in places like LLM.int8(), of needing mixed precision to tackle outliers. Thought we might want to / have to consider it?!

@karpathy
Owner

That's a good point... The outliers are really annoying because they necessitate ugly special casing. Hmmm

@byte-6174
Contributor

Yeah, this is where weight-statistics analysis and all that business is needed; not as clean 🤢

@byte-6174
Contributor

Right, and to solve that to the extreme we can (or need to) scale rows and columns, convert to int8, and send that to matmul.

@byte-6174
Contributor

byte-6174 commented Aug 17, 2023

There is another cool approach I remember from Song Han, using k-means clustering. But that's another added complexity. 😜

@kroggen
Contributor Author

kroggen commented Aug 17, 2023

Quantization here is per layer instead of groups. That feels risky? I'd expect llama.cpp does groups?

The reason for this choice is that this repo is mainly for educational purposes, and using one scale factor per layer is the simplest way to show on-the-fly dequantization.

The code is easier to understand.

It is not the most precise. If we want to go to the edge, we can just use llama.cpp.

It would be good to check the perplexity (ppl) values to compare how much it decreases with this approach.
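
As a cheap stand-in until a proper perplexity eval is wired up, one could at least measure the round-trip reconstruction error of each quantized tensor. A minimal sketch (function and variable names are illustrative, not part of this PR):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Round-trip a tensor through int8 with a single scale and report the
   RMS and max reconstruction error. */
void report_quant_error(const char* name, const float* w, int size) {
    float amax = 0.0f;
    for (int i = 0; i < size; i++) {
        float a = fabsf(w[i]);
        if (a > amax) amax = a;
    }
    float scale = amax / 127.0f;
    double sum_sq = 0.0, max_err = 0.0;
    for (int i = 0; i < size; i++) {
        int8_t q = (int8_t) roundf(scale != 0.0f ? w[i] / scale : 0.0f);
        double err = fabs(w[i] - q * scale);
        sum_sq += err * err;
        if (err > max_err) max_err = err;
    }
    printf("%s: rms err %.6f, max err %.6f\n", name, sqrt(sum_sq / size), max_err);
}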

@karpathy
Owner

Btw this PR doesn't actually perform integer dot products, it dequantizes the weight to float in the inner loop and uses that. Are people still obtaining speed improvements from this "weak" version of quantization?

@mgrabban

Btw this PR doesn't actually perform integer dot products, it dequantizes the weight to float in the inner loop and uses that. Are people still obtaining speed improvements from this "weak" version of quantization?

I tested stories110M.bin. On a normal CPU (e.g. i7-8700K), runq is faster than runomp (which is faster than runfast and run). On a high-end CPU, runq is faster than runfast and run, but runomp is the fastest (since there are many cores available).

@kroggen
Contributor Author

kroggen commented Aug 18, 2023

These are my results

When compiled with these commands:

gcc -Ofast -march=native runq.c -o runq -lm
gcc -Ofast -march=native run.c -o run -lm

On MacBook Pro Intel:

llama2.c bernardo$ ./run stories110M.bin -t 0 | grep tok
achieved tok/s: 36.963495
llama2.c bernardo$ ./runq data.bin -t 0 | grep tok
achieved tok/s: 78.974013

On Linux with AMD EPYC-Rome:

root@tests-br:~/llama2.c# ./run stories110M.bin -t 0 | grep tok
achieved tok/s: 39.503754
root@tests-br:~/llama2.c# ./runq data.bin -t 0 | grep tok
achieved tok/s: 73.079325

On Apple M1/M2 chips there is no difference in performance, probably because they have unified on-package memory while the other chips use external RAM.

When compiling both with OpenMP on Linux/AMD using these commands:

gcc -Ofast -march=native -fopenmp run.c -o run -lm
gcc -Ofast -march=native -fopenmp runq.c -o runq -lm

I got these results:

root@tests-br:~/llama2.c# ./run stories110M.bin -t 0 | grep tok
achieved tok/s: 44.485294
root@tests-br:~/llama2.c# ./runq data.bin -t 0 | grep tok
achieved tok/s: 72.897196

@kroggen
Contributor Author

kroggen commented Aug 18, 2023

Btw this PR doesn't actually perform integer dot products, it dequantizes the weight to float in the inner loop and uses that

It is intentional. Doing the multiplication in int8 requires a lot of additional computation, which makes it slower than using float32 multiplication.

With dequantization on-the-fly, the code is simpler and faster at the same time.

The downside of the current implementation is the use of only one scale factor per layer. This can be changed.
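
For illustration, roughly the shape of that on-the-fly dequantization with a single per-tensor scale (a sketch with an illustrative name; the actual code in runq.c may differ):

// Weights stay int8 in memory and are converted back to float inside the
// inner loop, using one scale per tensor.
void matmul_dequant(float* xout, const float* x, const int8_t* w, float w_scale,
                    int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += (w[i * n + j] * w_scale) * x[j];   // dequantize the weight on the fly
        }
        xout[i] = val;
    }
}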

@byte-6174
Contributor

Yes, and moreover even that downside is not that significant, as shown by our results for models up to at least 7B. The real deal breaker is the outliers, and the solution is mixed precision of some sort.
Whether any of that belongs in this repo or not is a different question.
