I had the idea of smoothing outliers before quantization, apparently it's a VERY BAD idea #1707
KerfuffleV2 started this conversation in Ideas · 1 comment
-
Another interesting thing is that the outliers are mainly clustered in the same general area of the tensor.
-
Outliers are something quantization struggles with, so why not just clamp values more than, I don't know, 3 standard deviations above/below the mean to that cutoff? Let's just say that's a fun way to see perplexity values over 12,000.
Turns out you have to increase that to 20+ standard deviations before it doesn't just absolutely destroy the model.
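To be explicit about what I mean by clamping, with $\mu$ and $\sigma$ the tensor's mean and standard deviation and $k$ the cutoff in standard deviations:

$$
\tilde{x}_i = \min\bigl(\max(x_i,\ \mu - k\sigma),\ \mu + k\sigma\bigr)
$$

so anything more than $k$ standard deviations from the mean just gets replaced by the cutoff value itself.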
The base there is requantizing a 7B q8_0 LLaMA to q6_K with a hacked version of #1691 and no clamping of values; same thing for the others, just giving it a bit of the old clamps.

Here is some output from running quantization while trying to clamp to 18 standard deviations. Some tensors don't get changed at all. At most a couple hundred outlier values get changed out of 10+ million, but it has a huge effect.
Perhaps there's something wrong with my calculations? Relative to the pull I listed, right after joining the workers in `llama_convert_tensor_internal`, I added the clamping pass (roughly along the lines of the sketch below). Is this idea just a complete dead end?
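For reference, a minimal sketch of the kind of clamping pass I mean. This is not the actual patch: the function name, the `data`/`n`/`k` parameters, and the toy `main` are illustrative, and in the real code it would operate on the f32 buffer right after the conversion workers are joined.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>

// Sketch of an outlier-clamping pass over a tensor converted to f32.
// Clamps anything more than k standard deviations from the mean back to the
// cutoff and returns how many values were touched.
static size_t clamp_outliers(float * data, size_t n, float k) {
    if (n == 0) {
        return 0;
    }

    // Pass 1: mean over the whole tensor.
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    const double mean = sum / (double) n;

    // Pass 2: (population) standard deviation, outliers included.
    double sqsum = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double d = data[i] - mean;
        sqsum += d * d;
    }
    const double sd = std::sqrt(sqsum / (double) n);

    const float lo = (float) (mean - k * sd);
    const float hi = (float) (mean + k * sd);

    // Pass 3: clamp everything outside [mean - k*sd, mean + k*sd] to the cutoff.
    size_t clamped = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] < lo) {
            data[i] = lo;
            clamped++;
        } else if (data[i] > hi) {
            data[i] = hi;
            clamped++;
        }
    }
    return clamped;
}

int main() {
    // Toy demo: one huge outlier in an otherwise small-valued "tensor".
    float demo[8] = { 0.1f, -0.2f, 0.05f, 0.3f, -0.1f, 0.2f, 12.0f, -0.15f };
    const size_t n = clamp_outliers(demo, 8, 2.0f);
    std::printf("clamped %zu value(s), outlier is now %f\n", n, (double) demo[6]);
    return 0;
}
```

The statistics here are taken over the whole tensor, outliers included, which matches the "N standard deviations from the mean" description above; per-row or per-block statistics would be a different experiment.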