---
Hi Compilade, your observation is spot-on; the current rounding and thresholding methods indeed have a lot of room for improvement. Simple alternative schemes based on the median rather than the mean can marginally enhance BitNet b1.58. A better rounding/thresholding function (like the one you proposed) could help more, IMO, in a post-training quantization setting, where we want to preserve as much of the weights as possible in a low-bitwidth representation (while cosine similarity is an apt measure, others like sensitivity, as in GPTQ, could be even better). However, as you suspect, for training quantized models, unlike post-training quantization, it's crucial to consider the training dynamics. In my experience experimenting with low-bitwidth models, I found that if the scale/zero-point (or threshold) shifts too rapidly, it can make optimization difficult or even cause unstable training.
Another perspective is to consider whether the latent FP16 weights represent model parameters or primarily facilitate the optimization process; the latter role appears more accurate. I'd be glad to explore potential hurdles in what you've proposed and share thoughts on strategies to overcome them.
---
Hi Compilade, I've been revisiting your ternarization method and am working to replicate it. I had a few small questions:
By the way, if you prefer a more direct medium for clarifying such minor details, please do let me know; I'd be happy to connect in whatever way works best for you. Looking forward to your response! Best regards,
---
I've noticed there is room for improvement regarding rounding full-precision vectors to ternary.
Here, I've calculated the cosine similarity for different rounding thresholds with random vectors (each colored curve corresponds to a single vector). I've tried normal and Laplace random distributions (which can somewhat be seen in how the curves group themselves).
The `x` axis is the value of the rounding threshold relative to the absmean of the vector components. The `y` axis is the cosine of the angle between the original full-precision vector and the ternarized vector; closer to `1.00` is better since it means a smaller angle. Notice how the max of the curves is not proportional to the absmean.
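For anyone wanting to reproduce these curves, here is a rough numpy/matplotlib sketch of the sweep (the vector length, number of vectors, and threshold range are assumptions, not the exact settings behind the plot):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 1024
rels = np.linspace(0.0, 2.0, 201)  # threshold as a fraction of the absmean

# One group of curves per distribution: normal and Laplace.
for sample in (rng.standard_normal, lambda size: rng.laplace(size=size)):
    for _ in range(8):  # a handful of random vectors per distribution
        v = sample(size=n)
        absmean = np.mean(np.abs(v))
        norm = np.linalg.norm(v)
        cos = []
        for rel in rels:
            # Components above the threshold keep their sign, the rest become 0.
            q = np.where(np.abs(v) > rel * absmean, np.sign(v), 0.0)
            k = np.count_nonzero(q)
            # cos(angle) = dot(v, q) / (||v|| * ||q||), with ||q|| = sqrt(k).
            cos.append(np.dot(v, q) / (norm * np.sqrt(k)) if k else 0.0)
        plt.plot(rels, cos, alpha=0.5)

plt.xlabel("threshold / absmean")
plt.ylabel("cosine similarity")
plt.show()
```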
But there is an ideal rounding threshold which can still be found: it's the top of the individual curves plotted above.
That would mean the ideal rounding function would look more like:
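(A minimal numpy sketch of the idea; the function name and exact structure here are illustrative, not the original code.)

```python
import numpy as np

def best_ternary_round(v: np.ndarray):
    """Round v to scale * {-1, 0, +1}, picking the rounding threshold
    that maximizes cosine similarity with v.

    Keeping the k largest-magnitude components gives
    cos(angle) = (sum of those k magnitudes) / (||v|| * sqrt(k)),
    so the best k falls out of a single cumulative sum.
    """
    mags = np.abs(v)
    order = np.argsort(mags)[::-1]   # component indices, largest magnitude first
    csum = np.cumsum(mags[order])    # dot(v, ternary) for each candidate threshold
    k = np.arange(1, len(v) + 1)
    cos = csum / (np.linalg.norm(v) * np.sqrt(k))
    kbest = int(np.argmax(cos)) + 1  # number of components kept nonzero

    q = np.zeros_like(v)
    q[order[:kbest]] = np.sign(v[order[:kbest]])
    scale = csum[kbest - 1] / kbest  # least-squares scale for the ternary vector
    return q, scale
```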
This basically compares every relevant rounding threshold for a vector to find the one giving the smallest angle, but dot products are replaced with an equivalent cumulative sum to make it fast.
I would be curious if using the best rounding would change the training dynamics of ternary models compared to the previously-used absmean rounding.
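For reference, the absmean rounding that this would replace looks roughly like the following (a sketch of the RoundClip formulation from the BitNet b1.58 paper):

```python
import numpy as np

def absmean_round(v: np.ndarray, eps: float = 1e-8):
    """BitNet b1.58-style rounding: scale by the mean absolute value,
    then round and clip every component to {-1, 0, +1}."""
    scale = np.mean(np.abs(v)) + eps
    q = np.clip(np.round(v / scale), -1, 1)
    return q, scale
```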
Thoughts?
@Ayushk4, @tejasvaidhyadev, @Eddie-Wang1120, @shumingma, @MekkCyber, @jquesnelle