There are already INT4 GEMM, INT8 GEMM, and even FP8 GEMM kernels, but activations in LLMs are hard to compress, so we sometimes use W4A8 GEMM instead. Today that means first unpacking the two INT4 values stored in each INT8 byte and then running a plain INT8 GEMM. This is awkward and very slow because of the extra memory reads and writes needed to decompress INT4 to INT8.
[FEA]: Please add support for INT4 * INT8 GEMM with INT32/INT8 output.
Figure: W-A-KV bit-width configurations (figure cited from the LLM-QAT paper).
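To make the overhead concrete, here is a minimal CUDA sketch of the unpack pass described above, assuming two signed INT4 weights are packed per int8 byte with the low nibble first (the packing order and kernel name are assumptions, not an existing API). Expanding INT4 weights to INT8 in global memory before calling a standard INT8 GEMM costs a full extra read and write of the weight matrix, which is exactly what a fused INT4 * INT8 kernel would avoid.

```cuda
// Sketch only: separate INT4 -> INT8 unpack pass that precedes an INT8 GEMM.
// Assumption: two signed INT4 values per byte, low nibble stored first.
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__device__ __forceinline__ int8_t sign_extend_int4(uint8_t nibble) {
    // Map a 4-bit two's-complement value (0..15) to a signed int8 in [-8, 7].
    return static_cast<int8_t>(static_cast<int8_t>(nibble << 4) >> 4);
}

__global__ void unpack_int4_to_int8(const uint8_t* __restrict__ packed,
                                    int8_t* __restrict__ unpacked,
                                    int n_packed) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_packed) {
        uint8_t byte = packed[i];
        unpacked[2 * i]     = sign_extend_int4(byte & 0x0F);        // low nibble
        unpacked[2 * i + 1] = sign_extend_int4((byte >> 4) & 0x0F); // high nibble
    }
}

int main() {
    const int n_packed = 4;  // 4 bytes hold 8 INT4 weights
    uint8_t h_packed[n_packed] = {0x21, 0xF8, 0x7E, 0x0A};
    int8_t h_unpacked[2 * n_packed];

    uint8_t* d_packed;
    int8_t* d_unpacked;
    cudaMalloc(&d_packed, n_packed);
    cudaMalloc(&d_unpacked, 2 * n_packed);
    cudaMemcpy(d_packed, h_packed, n_packed, cudaMemcpyHostToDevice);

    // Extra kernel launch plus a full read/write of the weight matrix,
    // only to feed an ordinary INT8 GEMM afterwards.
    unpack_int4_to_int8<<<1, 32>>>(d_packed, d_unpacked, n_packed);
    cudaMemcpy(h_unpacked, d_unpacked, 2 * n_packed, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 2 * n_packed; ++i) printf("%d ", h_unpacked[i]);
    printf("\n");

    cudaFree(d_packed);
    cudaFree(d_unpacked);
    return 0;
}
```

A fused INT4 * INT8 GEMM would instead decode the nibbles in registers inside the main loop, so the INT4 weights are only ever read once from global memory.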