Hello, thank you for this work; it's very inspiring.
I would like to ask: in QuaRot, because of the residual connection, all $R_1$ matrices must be the same rotation matrix. However, in DuQuant the rotation matrices are obtained through a greedy search, so each $R_1$ is different, right? How can that be done while keeping the inference results identical to those before rotation?
Looking forward to your reply.
In DuQuant, each $R_1$ is indeed obtained through a greedy search, making each rotation matrix different. In the latest version of our paper, we provide additional results on inference speedup.
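For intuition on why differing rotations can still preserve exactness, here is the standard computational-invariance argument used by rotation-based quantization methods (a minimal sketch in my own notation, not taken from the paper): an orthogonal rotation applied to the activations is cancelled by folding its transpose into the adjacent weights.

```latex
% Minimal sketch (my notation): X = activations, W = weights,
% R = an orthogonal rotation, so R R^T = I.
% Rotating X and absorbing R^T into W leaves the output unchanged:
\[
  (XR)\,(R^{\top}W) \;=\; X\,(R R^{\top})\,W \;=\; XW .
\]
% Hence each block can use its own R, as long as the matching
% R^T is folded into the neighboring weight matrices.
```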
We compare the pre-filling and decoding stages against QuaRot, in particular on the LLaMA2-7B model. For pre-filling, we measure the time taken to process a single sentence of 2048 tokens; for decoding, we generate 128 steps and record the peak memory usage (a rough measurement sketch is given after the table). As shown below, DuQuant maintains a pre-filling speedup comparable to QuaRot's while achieving better performance in downstream tasks.
| INT4, BS=1 | Time (ms) | Saving Factor | Memory (GB) | Saving Factor | Wiki (PPL ↓) | QA (Acc. ↑) |
|---|---|---|---|---|---|---|
| FP16 | 568 | - | 13.638 | - | 5.47 | 63.72 |
| SmoothQuant | 248 | 2.290x | 3.890 | 3.506x | 83.12 | 44.52 |
| QLLM | 435 | 1.306x | 3.894 | 3.502x | 9.09 | 51.60 |
| QuaRot | 284 | 2.000x | 3.891 | 3.505x | 6.39 | 61.25 |
| DuQuant | 288 | 1.972x | 3.893 | 3.503x | 6.28 | 61.76 |
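If you want to reproduce this style of measurement, below is a minimal sketch of the protocol described above, assuming a Hugging Face LLaMA2-7B checkpoint and standard PyTorch timing/memory utilities; the checkpoint name and generation settings are my assumptions, not the authors' exact benchmark script.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: the checkpoint name is an assumption,
# not the authors' exact benchmark script.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16
).to("cuda").eval()

# Pre-filling: time one forward pass over a single 2048-token sequence.
input_ids = torch.randint(0, tokenizer.vocab_size, (1, 2048), device="cuda")
torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    model(input_ids)
torch.cuda.synchronize()
prefill_ms = (time.perf_counter() - start) * 1000

# Decoding: generate 128 new tokens and record peak GPU memory.
torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(input_ids, max_new_tokens=128, do_sample=False)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3

print(f"pre-fill: {prefill_ms:.0f} ms | peak memory: {peak_gb:.3f} GB")
```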
For more in-depth information, please refer to our paper.