---
Hi Compilade, your observation is spot-on; the current rounding and thresholding methods indeed have a lot of room for improvement. Simple alternative schemes based on the median rather than the mean can marginally enhance BitNet b1.58. A better rounding/thresholding function (like the one you proposed) could help more, IMO, in a post-training quantization setting, where we want to preserve as much of the weights as possible in a low-bitwidth representation (while cosine similarity is an apt measure, others like sensitivity, as in GPTQ, could be even better). However, as you suspect, for training quantized models, unlike post-training quantization, it's crucial to consider the training dynamics. In my experience experimenting with low-bitwidth models, I found that if the scale/zero-point (or threshold) shifts too rapidly, it can make optimization difficult or even cause unstable training.
Another perspective is to consider whether the latent FP16 weights represent model parameters or primarily facilitate the optimization process; the latter role appears more accurate. I'd be glad to explore potential hurdles in what you've proposed and share thoughts on strategies to overcome them.
---
Hi Compilade, I've been revisiting your ternarization method and am working to replicate it. I had a few small questions:
By the way, if you prefer a more direct medium for clarifying such minor details, please do let me know; I'd be happy to connect in whatever way works best for you. Looking forward to your response! Best regards,
---
I've noticed there is room for improvement regarding rounding full-precision vectors to ternary.
Here, I've calculated the cosine similarity for different rounding thresholds with random vectors (each colored curve corresponds to a single vector). I've tried normal and Laplace random distributions (which can somewhat be seen in how the curves group themselves).
The `x` axis is the value of the rounding threshold relative to the absmean of the vector components. The `y` axis is the cosine of the angle between the original full-precision vector and the ternarized vector; closer to `1.00` is better since it means a smaller angle. Notice how the max of the curves is not proportional to the absmean.
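For anyone wanting to reproduce these curves, here is a rough numpy/matplotlib sketch of the sweep (the vector length, number of vectors, and threshold range are assumptions, not the exact settings behind the plot):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n = 1024
rels = np.linspace(0.0, 2.0, 201)  # threshold as a fraction of the absmean

# One group of curves per distribution: normal and Laplace.
for sample in (rng.standard_normal, lambda size: rng.laplace(size=size)):
    for _ in range(8):  # a handful of random vectors per distribution
        v = sample(size=n)
        absmean = np.mean(np.abs(v))
        norm = np.linalg.norm(v)
        cos = []
        for rel in rels:
            # Components above the threshold keep their sign, the rest become 0.
            q = np.where(np.abs(v) > rel * absmean, np.sign(v), 0.0)
            k = np.count_nonzero(q)
            # cos(angle) = dot(v, q) / (||v|| * ||q||), with ||q|| = sqrt(k).
            cos.append(np.dot(v, q) / (norm * np.sqrt(k)) if k else 0.0)
        plt.plot(rels, cos, alpha=0.5)

plt.xlabel("threshold / absmean")
plt.ylabel("cosine similarity")
plt.show()
```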
But there is an ideal rounding threshold which can still be found: it's the top of the individual curves plotted above.
That would mean the ideal rounding function would look more like:
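(A minimal numpy sketch of the idea; the function name and exact structure here are illustrative, not the original code.)

```python
import numpy as np

def best_ternary_round(v: np.ndarray):
    """Round v to scale * {-1, 0, +1}, picking the rounding threshold
    that maximizes cosine similarity with v.

    Keeping the k largest-magnitude components gives
    cos(angle) = (sum of those k magnitudes) / (||v|| * sqrt(k)),
    so the best k falls out of a single cumulative sum.
    """
    mags = np.abs(v)
    order = np.argsort(mags)[::-1]   # component indices, largest magnitude first
    csum = np.cumsum(mags[order])    # dot(v, ternary) for each candidate threshold
    k = np.arange(1, len(v) + 1)
    cos = csum / (np.linalg.norm(v) * np.sqrt(k))
    kbest = int(np.argmax(cos)) + 1  # number of components kept nonzero

    q = np.zeros_like(v)
    q[order[:kbest]] = np.sign(v[order[:kbest]])
    scale = csum[kbest - 1] / kbest  # least-squares scale for the ternary vector
    return q, scale
```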
This basically compares every relevant rounding threshold for a vector to find the one giving the smallest angle, but dot products are replaced with an equivalent cumulative sum to make it fast.
I would be curious if using the best rounding would change the training dynamics of ternary models compared to the previously-used absmean rounding.
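For reference, the absmean rounding that this would replace looks roughly like the following (a sketch of the RoundClip formulation from the BitNet b1.58 paper):

```python
import numpy as np

def absmean_round(v: np.ndarray, eps: float = 1e-8):
    """BitNet b1.58-style rounding: scale by the mean absolute value,
    then round and clip every component to {-1, 0, +1}."""
    scale = np.mean(np.abs(v)) + eps
    q = np.clip(np.round(v / scale), -1, 1)
    return q, scale
```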
Thoughts?
@Ayushk4, @tejasvaidhyadev, @Eddie-Wang1120, @shumingma, @MekkCyber, @jquesnelle