[QST] How can I do w4a8 (int4 * int8) using CUTLASS? #1370
Comments
My PR is about ...

Thanks for the reply. Since LLM w8a8 is almost solved by PTQ methods, and w4a4 is too aggressive for PTQ, w4a8 PTQ is a good option for further optimization. In my own experiments, the LLM's performance degradation is minor under w4a8 (per-channel * per-token) PTQ settings. There is also a paper on this, https://arxiv.org/abs/2311.09550, and its results look promising. TensorRT-LLM already supports w4a8 PTQ and inference on Hopper, but only for fp8. Is it possible to extend your PR to support w4a8 (int4 * int8) on Ampere? @alexsamardzic
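For readers unfamiliar with the terminology, here is a minimal sketch of what the per-channel * per-token scheme amounts to numerically, assuming symmetric quantization: weights carry one scale per output channel, activations one scale per token, and both are applied to the integer GEMM accumulator. The `dequantize` helper and its scale layout are illustrative, not part of this thread or of CUTLASS.

```cpp
#include <cstdint>

// Hypothetical dequantization step for w4a8 with per-token activation scales
// and per-channel weight scales (symmetric quantization assumed):
//   out[m][n] = acc[m][n] * act_scale[m] * weight_scale[n]
// where acc is the int32 accumulator of the int4 * int8 GEMM.
void dequantize(int32_t const* acc, float const* act_scale,
                float const* weight_scale, float* out, int M, int N) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      out[m * N + n] = float(acc[m * N + n]) * act_scale[m] * weight_scale[n];
    }
  }
}
```

In practice this rescaling is usually folded into the GEMM epilogue rather than run as a separate pass.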
It would be better to open another PR. Namely, the combination of data types that my PR intends to handle is rather tricky regarding loading data from shared memory to registers, which I believe should not be the case for the int4 * int8 combination you are asking about.

I'm now working on this feature; #1413 has been created, so this issue can be closed.

@alexsamardzic That's great, thank you!
I see that int4 * fp8 is supported on Hopper GPUs, but GPUs like the A30 and A100 have no fp8 support, so I need to use int4 * int8, with two int4 values packed into one int8. How can I accomplish this kind of mixed-input GEMM using CUTLASS? Thanks!
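At the time of this question, CUTLASS had no off-the-shelf int4 * int8 mixed-input GEMM for Ampere (that is what #1413 was later opened to add). One workaround is to unpack the packed int4 weights to int8 in a separate step and then run a regular int8 Tensor Core GEMM. The sketch below assumes that approach; the `unpack_int4_to_int8` kernel, its low-nibble-first packing convention, and the `run_s8_gemm` wrapper are illustrative and not taken from this thread, while the `cutlass::gemm::device::Gemm` instantiation follows the standard CUTLASS int8 SM80 configuration.

```cpp
#include <cstdint>
#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm.h"

// Hypothetical unpacking kernel: expand two signed 4-bit values per byte into
// int8, assuming a low-nibble-first packing convention.
__global__ void unpack_int4_to_int8(int8_t const* packed, int8_t* out, int n_packed) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n_packed) {
    int8_t b = packed[i];
    out[2 * i]     = int8_t(int8_t(b << 4) >> 4);  // low nibble, sign-extended
    out[2 * i + 1] = int8_t(b >> 4);               // high nibble, sign-extended
  }
}

// Standard int8 Tensor Core GEMM on Ampere (SM80) with int32 accumulation;
// tile shapes come from CUTLASS's default SM80 int8 configuration.
using GemmS8 = cutlass::gemm::device::Gemm<
    int8_t, cutlass::layout::RowMajor,      // A: int8 activations, M x K
    int8_t, cutlass::layout::ColumnMajor,   // B: unpacked int8 weights, K x N
    int32_t, cutlass::layout::RowMajor,     // C/D: int32 output, M x N
    int32_t,                                // accumulator
    cutlass::arch::OpClassTensorOp,         // use Tensor Cores
    cutlass::arch::Sm80>;                   // target Ampere

// Illustrative wrapper: computes D = A * B (alpha = 1, beta = 0).
cutlass::Status run_s8_gemm(int M, int N, int K,
                            int8_t const* A, int8_t const* B, int32_t* D) {
  GemmS8 gemm_op;
  GemmS8::Arguments args({M, N, K},
                         {A, K},   // lda = K (row-major A)
                         {B, K},   // ldb = K (column-major B)
                         {D, N},   // C (unused source, beta = 0)
                         {D, N},   // D (output)
                         {1, 0});  // epilogue: alpha = 1, beta = 0
  return gemm_op(args);
}
```

The obvious downside is that the weights take twice the memory and bandwidth after unpacking, which is exactly what a true mixed-input kernel avoids; for memory-bound LLM decoding this can cancel much of the benefit of 4-bit weights, so the dedicated kernel tracked in #1413 is the better long-term answer.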