Question about "Bit-serial linear transformation" in paper #35
-
I have thoroughly reviewed the T-MAC code, and I find it to be exceptionally well-written. However, both the TVM part of the code and the C++ code generated by TVM are quite challenging for me to understand due to my limited expertise. Additionally, I am struggling to correlate some details with the explanations provided in the paper. I would greatly appreciate some clarification. My main point of confusion lies in the purpose of the "Bit-serial linear transformation." From what I understand, the following code loads 16 int8 weights at once, then unpacks them into two sets of int4 data. The LUT is used for multiplication, and the adder_bot and adder_top are responsible for accumulating the data (with some operations to avoid overflow). At this point, multiplying the accumulated results by the scale of the GPTQ model group should yield the result of the matrix multiplication. However, does this lead to significant precision loss? If so, is this loss introduced during the quantization of the LUT? Would it be possible to avoid this precision loss by not quantizing the LUT?
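To make sure I am reading the kernel correctly, here is a rough scalar sketch of how I currently understand that inner loop (the names `adder_bot`/`adder_top` are taken from the generated code; the function name, signature, and everything else are my own simplification and may well differ from the real implementation):

```cpp
#include <cstdint>

// lut      : 16 int8 entries, assumed to hold the precomputed (quantized)
//            partial products for every possible 4-bit index.
// packed_w : 16 bytes, each byte packing two 4-bit LUT indices.
// Returns the unscaled partial sum for this group; the caller would multiply
// by the per-group GPTQ scale (and the LUT scale) afterwards.
static inline std::int32_t lut_group(const std::int8_t lut[16],
                                     const std::uint8_t packed_w[16]) {
    std::int16_t adder_bot = 0;  // accumulates the low-nibble lookups
    std::int16_t adder_top = 0;  // accumulates the high-nibble lookups
    for (int i = 0; i < 16; ++i) {
        const std::uint8_t b = packed_w[i];
        adder_bot += lut[b & 0x0F];         // low 4 bits  -> LUT index
        adder_top += lut[(b >> 4) & 0x0F];  // high 4 bits -> LUT index
    }
    // 16 lookups of magnitude <= 127 fit comfortably in int16; widening to
    // int32 before combining is my guess at the overflow handling.
    return static_cast<std::int32_t>(adder_bot) + static_cast<std::int32_t>(adder_top);
}
```

Is this roughly the right mental model?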
I also find it difficult to comprehend some of the formulas in the "Bit-serial linear transformation" section of the arXiv version.
In particular, where exactly do B and LUT_Bias come from? Additionally, in the zero-point section, why is there a calculation involving the weights? The comment states `w = (w - default_zero - (zeros - default_zero)) * scales`. Does `w` here refer to the weights? Is the purpose of `add_zero` to fit the formula above?
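For what it's worth, if I expand that comment algebraically it seems to collapse to the usual asymmetric dequantization (this is just my own reading, so please correct me if I am misinterpreting it):

$$
\bigl(w - \text{default\_zero} - (\text{zeros} - \text{default\_zero})\bigr)\cdot \text{scales} \;=\; (w - \text{zeros})\cdot \text{scales}
$$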
My questions might be somewhat unclear, mainly because my understanding is limited, and I haven't been able to systematically formulate them. If possible, I would greatly appreciate your assistance in clarifying these points!
Replies: 2 comments 2 replies
-
Thanks for your interest in the project.
Yes, this loss is introduced during the quantization of the LUT. However, it is negligible. The baseline llama.cpp also introduces int8 activation quantization, and according to Section 5.6 of our paper, the LUT quantization achieves exactly the same results as activation quantization. Such fine-grained activation quantization does not lead to significant precision loss and is negligible compared to the loss introduced by weight quantization.
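Conceptually, the LUT quantization is just a per-table symmetric int8 quantization of the precomputed partial sums, in the same spirit as llama.cpp's per-group int8 activation quantization. A minimal sketch of the idea (standalone and simplified, not the actual TVM-generated code; the function name and signature are made up for illustration):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize one 16-entry float LUT to int8 and return the scale that recovers
// the original values as lut_i8[i] * scale. The accumulated lookup results
// are multiplied back by this scale (together with the weight scales) at the
// end, so the extra error behaves like ordinary activation-quantization error.
static float quantize_lut(const float lut_f32[16], std::int8_t lut_i8[16]) {
    float absmax = 0.0f;
    for (int i = 0; i < 16; ++i)
        absmax = std::max(absmax, std::fabs(lut_f32[i]));
    const float scale = absmax / 127.0f;
    const float inv = (scale > 0.0f) ? 1.0f / scale : 0.0f;
    for (int i = 0; i < 16; ++i) {
        const long v = std::lround(lut_f32[i] * inv);
        lut_i8[i] = static_cast<std::int8_t>(std::max(-127L, std::min(127L, v)));
    }
    return scale;
}
```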
Sure. We have already provided the kernel implementation. You can access it by setting …
-
I have converted this issue into a discussion for more open-ended communication.
B and LUT_Bias both come from the linear transformation of the bits. E.g., a uint4 value 0b1010 equals $1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 = 10$. For int4, however, things are slightly different: the common practice is to represent an int4 value as a uint4 value minus a default bias (or `default_zero`, which is 8 for 4 bits), and LUT_Bias is derived from this default bias.
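Written out for one group of activations $a_j$, with $u_j \in [0, 15]$ the stored uint4 value and $u_{j,b}$ its $b$-th bit (a simplified sketch that omits scales and the exact bookkeeping in the kernel):

$$
\sum_j w_j\,a_j \;=\; \sum_j (u_j - 8)\,a_j \;=\; \sum_{b=0}^{3} 2^{b}\Bigl(\sum_j u_{j,b}\,a_j\Bigr) \;-\; 8\sum_j a_j
$$

Each inner sum $\sum_j u_{j,b}\,a_j$ is exactly one table lookup; roughly speaking, the $2^b$ coefficients are what get collected into B, and the constant $-8\sum_j a_j$ term is where LUT_Bias comes from.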
This …