[QST] How can I do w4a8 (int4 * int8) using CUTLASS? #1370
Comments
My PR is about ...

Thanks for the reply. Since LLM w8a8 is almost solved by PTQ methods, and w4a4 is too aggressive for PTQ, w4a8 PTQ is a good option for further optimization. In my own experiments, the LLM's performance degradation is minor under w4a8 (per-channel * per-token) PTQ settings. There is also a paper on this, https://arxiv.org/abs/2311.09550, and its results look promising. TensorRT-LLM already supports w4a8 PTQ and inference on Hopper, but only for fp8. Is it possible to extend your PR to support w4a8 (int4 * int8) on Ampere? @alexsamardzic
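For readers unfamiliar with the terminology, here is a minimal sketch of what the per-channel * per-token scheme amounts to numerically, assuming symmetric quantization: weights carry one scale per output channel, activations one scale per token, and both are applied to the integer GEMM accumulator. The `dequantize` helper and its scale layout are illustrative, not part of this thread or of CUTLASS.

```cpp
#include <cstdint>

// Hypothetical dequantization step for w4a8 with per-token activation scales
// and per-channel weight scales (symmetric quantization assumed):
//   out[m][n] = acc[m][n] * act_scale[m] * weight_scale[n]
// where acc is the int32 accumulator of the int4 * int8 GEMM.
void dequantize(int32_t const* acc, float const* act_scale,
                float const* weight_scale, float* out, int M, int N) {
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      out[m * N + n] = float(acc[m * N + n]) * act_scale[m] * weight_scale[n];
    }
  }
}
```

In practice this rescaling is usually folded into the GEMM epilogue rather than run as a separate pass.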
It would be better to open another PR. Namely, the combination of data types that my PR intends to handle is rather tricky regarding loading data from shared memory to registers, which I believe should not be the case for the int4 * int8 combination you are asking about.

I'm now working on this feature; #1413 has been created, so this issue can be closed.

@alexsamardzic That's great, thank you!
I see that int4 * fp8 is supported on Hopper GPUs, but GPUs like the A30 and A100 have no fp8 support, so I need to use int4 * int8, with two int4 values packed into one int8. How can I accomplish this kind of mixed-input GEMM using CUTLASS? Thanks!
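At the time of this question, CUTLASS had no off-the-shelf int4 * int8 mixed-input GEMM for Ampere (that is what #1413 was later opened to add). One workaround is to unpack the packed int4 weights to int8 in a separate step and then run a regular int8 Tensor Core GEMM. The sketch below assumes that approach; the `unpack_int4_to_int8` kernel, its low-nibble-first packing convention, and the `run_s8_gemm` wrapper are illustrative and not taken from this thread, while the `cutlass::gemm::device::Gemm` instantiation follows the standard CUTLASS int8 SM80 configuration.

```cpp
#include <cstdint>
#include "cutlass/cutlass.h"
#include "cutlass/gemm/device/gemm.h"

// Hypothetical unpacking kernel: expand two signed 4-bit values per byte into
// int8, assuming a low-nibble-first packing convention.
__global__ void unpack_int4_to_int8(int8_t const* packed, int8_t* out, int n_packed) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n_packed) {
    int8_t b = packed[i];
    out[2 * i]     = int8_t(int8_t(b << 4) >> 4);  // low nibble, sign-extended
    out[2 * i + 1] = int8_t(b >> 4);               // high nibble, sign-extended
  }
}

// Standard int8 Tensor Core GEMM on Ampere (SM80) with int32 accumulation;
// tile shapes come from CUTLASS's default SM80 int8 configuration.
using GemmS8 = cutlass::gemm::device::Gemm<
    int8_t, cutlass::layout::RowMajor,      // A: int8 activations, M x K
    int8_t, cutlass::layout::ColumnMajor,   // B: unpacked int8 weights, K x N
    int32_t, cutlass::layout::RowMajor,     // C/D: int32 output, M x N
    int32_t,                                // accumulator
    cutlass::arch::OpClassTensorOp,         // use Tensor Cores
    cutlass::arch::Sm80>;                   // target Ampere

// Illustrative wrapper: computes D = A * B (alpha = 1, beta = 0).
cutlass::Status run_s8_gemm(int M, int N, int K,
                            int8_t const* A, int8_t const* B, int32_t* D) {
  GemmS8 gemm_op;
  GemmS8::Arguments args({M, N, K},
                         {A, K},   // lda = K (row-major A)
                         {B, K},   // ldb = K (column-major B)
                         {D, N},   // C (unused source, beta = 0)
                         {D, N},   // D (output)
                         {1, 0});  // epilogue: alpha = 1, beta = 0
  return gemm_op(args);
}
```

The obvious downside is that the weights take twice the memory and bandwidth after unpacking, which is exactly what a true mixed-input kernel avoids; for memory-bound LLM decoding this can cancel much of the benefit of 4-bit weights, so the dedicated kernel tracked in #1413 is the better long-term answer.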