Add AutoGPTQ's cpu kernel. #245
Conversation
Hi, I'm just leaving some review comments to make sure this PR is in good shape, so that it can be easily extended and maintained in the future.
auto_gptq/modeling/_base.py
Outdated
```python
use_cuda_fp16=use_cuda_fp16,
desc_act=quantize_config.desc_act,
trainable=trainable,
checkpoint=checkpoint,
```
I would suggest calling make_quant_cpu first and then loading the checkpoint. Of course, this requires a further modification in make_quant_cpu(); I will describe it there.
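Roughly, the call order being suggested inside from_quantized() would look like this (a sketch only; the helper signature and variable names are assumptions based on the diff above, not this PR's code):

```python
# Build the (empty) quantized module structure first...
make_quant_cpu(
    model,
    quant_layer_names,                 # hypothetical list of layers to replace
    bits=quantize_config.bits,
    group_size=quantize_config.group_size,
)
# ...and only then load the checkpoint weights into it.
model.load_state_dict(checkpoint, strict=False)
```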
auto_gptq/modeling/_utils.py
Outdated
```python
qweights=checkpoint[name1 + '.qweight'].contiguous(),
zeros=checkpoint[name1 + '.qzeros'],
scales=checkpoint[name1 + '.scales'].float(),
bias=checkpoint[name1 + '.bias'].float() if name1 + '.bias' in checkpoint else None)
```
TBH, I'm not fond of the way quantization is set up here: the checkpoint is loaded directly when initializing QuantLinear, which makes its API differ from the other QuantLinear classes and may cause maintenance difficulties in the future.
My suggestion is to only replace modules in the make_quant functions and load the state dict after this function is called in from_quantized(), as sketched below.
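A rough sketch of that module-swap-only version, assuming QuantLinear takes the standard (bits, group_size, infeatures, outfeatures, bias) arguments suggested further down; this is not code from the PR:

```python
import torch.nn as nn
# QuantLinear: the qigen QuantLinear class added in this PR (import omitted).

def make_quant_cpu(module, names, bits, group_size):
    # Collect the targeted nn.Linear layers first, then swap them for empty
    # QuantLinear modules. No checkpoint access happens here.
    targets = [
        (name, child) for name, child in module.named_modules()
        if name in names and isinstance(child, nn.Linear)
    ]
    for name, child in targets:
        parent_name, _, attr = name.rpartition(".")
        parent = module.get_submodule(parent_name) if parent_name else module
        setattr(parent, attr, QuantLinear(
            bits=bits,
            group_size=group_size,
            infeatures=child.in_features,
            outfeatures=child.out_features,
            bias=child.bias is not None,
        ))
    # qweight/qzeros/scales are filled in later via model.load_state_dict()
    # inside from_quantized(), so QuantLinear.__init__ stays checkpoint-free.
```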
```python
class QuantLinear(nn.Module):
    QUANT_TYPE = "qigen"

    def __init__(self, bits=4, group_size=-1, N=0, M=0, qweights=None, zeros=None, scales=None, bias=None, hint=1, p=8, l1=2**18):
```
At least these five arguments' names and order should be the same as in the other QuantLinear classes: bits, group_size, infeatures, outfeatures, bias.
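For example, something along these lines would keep it interchangeable with the other QuantLinear implementations (a sketch, not the PR's code; the QIGen-specific arguments are kept but moved to the end):

```python
def __init__(self, bits, group_size, infeatures, outfeatures, bias,
             hint=1, p=8, l1=2**18):
    ...
```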
My code is complete. Please review again.
@qwopqwop200 I did not look at the code, but given that CPUs usually do not support fp16 compute, is the compute done in fp32 instead, with dequantization e.g. int4 -> fp32?
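For illustration, GPTQ-style dequantization to fp32 typically looks like this (a generic sketch; the QIGen kernel's actual bit-packing and memory layout may differ):

```python
import torch

q = torch.tensor([3, 15, 0, 7], dtype=torch.uint8)   # unpacked 4-bit codes, 0..15
scale = torch.tensor(0.05, dtype=torch.float32)       # per-group scale
zero = torch.tensor(8, dtype=torch.float32)           # per-group zero point

w_fp32 = scale * (q.float() - zero)                   # dequantized weights in fp32
# tensor([-0.2500,  0.3500, -0.4000, -0.0500])
```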
Probably so. But I didn't design this kernel, so I'm not sure how it works.
Where does it come from?
https://github.com/IST-DASLab/QIGen/tree/master
Oh sorry, I merged them by mistake. I've checked that it's working now, so there shouldn't be any problems.
Currently only 2- and 4-bit quantization is supported; desc_act is not supported.
Enabling use_qigen overrides all CUDA-related settings.
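For reference, usage would then look roughly like this (a sketch; the parameter names are taken from this conversation and not verified against the final API):

```python
from auto_gptq import AutoGPTQForCausalLM

# Hypothetical usage: run a GPTQ-quantized model on CPU via the QIGen kernel.
model = AutoGPTQForCausalLM.from_quantized(
    "path/to/quantized-model",   # placeholder path
    device="cpu",
    use_qigen=True,              # overrides the CUDA-related settings
)
```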