Hello, thank you for this work; it's very inspiring.
I would like to ask: in QuaRot, because of the residual connection, all $R_1$ matrices must be the same rotation matrix. However, in DuQuant the rotation matrices are obtained through a greedy search, so each $R_1$ is different, right? How can that be done while keeping the inference results identical to those before rotation?
Looking forward to your reply.
In DuQuant, each $R_1$ is indeed obtained through a greedy search, making each rotation matrix different. In the latest version of our paper, we provide additional results on inference speedup.
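For intuition on why differing rotations can still preserve exactness, here is the standard computational-invariance argument used by rotation-based quantization methods (a minimal sketch in my own notation, not taken from the paper): an orthogonal rotation applied to the activations is cancelled by folding its transpose into the adjacent weights.

```latex
% Minimal sketch (my notation): X = activations, W = weights,
% R = an orthogonal rotation, so R R^T = I.
% Rotating X and absorbing R^T into W leaves the output unchanged:
\[
  (XR)\,(R^{\top}W) \;=\; X\,(R R^{\top})\,W \;=\; XW .
\]
% Hence each block can use its own R, as long as the matching
% R^T is folded into the neighboring weight matrices.
```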
We compare the pre-filling and decoding stages against QuaRot, in particular on the LLaMA2-7B model. For pre-filling, we measure the time taken to process a single sentence of 2048 tokens; for decoding, we generate 128 steps and record the peak memory usage (a rough measurement sketch is given after the table). As shown below, DuQuant maintains a pre-filling speedup comparable to QuaRot's while achieving better performance in downstream tasks.
| INT4, BS=1 | Time (ms) | Saving Factor | Memory (GB) | Saving Factor | Wiki (PPL ↓) | QA (Acc. ↑) |
|---|---|---|---|---|---|---|
| FP16 | 568 | - | 13.638 | - | 5.47 | 63.72 |
| SmoothQuant | 248 | 2.290x | 3.890 | 3.506x | 83.12 | 44.52 |
| QLLM | 435 | 1.306x | 3.894 | 3.502x | 9.09 | 51.60 |
| QuaRot | 284 | 2.000x | 3.891 | 3.505x | 6.39 | 61.25 |
| DuQuant | 288 | 1.972x | 3.893 | 3.503x | 6.28 | 61.76 |
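If you want to reproduce this style of measurement, below is a minimal sketch of the protocol described above, assuming a Hugging Face LLaMA2-7B checkpoint and standard PyTorch timing/memory utilities; the checkpoint name and generation settings are my assumptions, not the authors' exact benchmark script.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: the checkpoint name is an assumption,
# not the authors' exact benchmark script.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda").eval()

# Pre-filling: time one forward pass over a single 2048-token sequence.
input_ids = torch.randint(0, tokenizer.vocab_size, (1, 2048), device="cuda")
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    model(input_ids)
torch.cuda.synchronize()
prefill_ms = (time.perf_counter() - start) * 1000

# Decoding: generate 128 new tokens and record peak GPU memory.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(input_ids, max_new_tokens=128, do_sample=False)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3

print(f"pre-fill: {prefill_ms:.0f} ms | peak memory: {peak_gb:.3f} GB")
```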
For more in-depth information, please refer to our paper.