GPTQ quantization not working #12
Comments
One more issue is very high memory usage: it exceeds 128 GB after processing only the first 9 layers with the 13B model.
I am at the third bullet point here as well; I'm just going to follow along with the comments here.
@jamestwhedbee to get rid of those Python issues you can try this fork in the meantime: https://github.com/lopuhin/gpt-fast/ -- but I don't have a solution for the high RAM usage yet, so in the end I didn't manage to get a converted model.
That looked promising, but I unfortunately ran into another issue you probably wouldn't have. I am on AMD, so that might be the cause; I can't find anything online related to this issue. I noticed that non-GPTQ int4 quantization does not work for me either, with the same error. int8 quantization works fine, and I have run GPTQ int4-quantized models using the auto-gptq library for ROCm before, so I'm not sure what this issue is.
I got the same error when trying a conversion on another machine with more RAM but an older NVIDIA GPU.
Has anyone solved all the problems? I am getting every problem discussed in this thread.
@jamestwhedbee @lopuhin I am stuck on this too. Were you able to solve it?
@MrD005 I got this error when trying to run on a 2080 Ti but not on an L4 (both using CUDA 12.1), so I suspect it is due to this function being missing on lower compute capabilities.
@lopuhin I am running it on an A100 with Python 3.8 and a CUDA 11.8 nightly, so I don't think it is about lower compute capability.
According to the code here, both CUDA 12.x and compute capability 8.0+ are probably required.
I had the same `_convert_weight_to_int4pack_cuda` not available problem. It was due to CUDA 11.8 not supporting the operator. It works now with an RTX 4090 and CUDA 12.1.
I got this problem on my single RTX 4090 with a PyTorch nightly built against CUDA 11.8. After I switched to a PyTorch nightly on CUDA 12.1, the problem was gone.
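Putting the reports above together, a quick way to check an environment before attempting the int4 conversion is a small script along these lines. It only encodes the requirement mentioned above (a CUDA 12.x build of PyTorch plus compute capability 8.0 or higher), so treat it as a rough sketch rather than an authoritative support check.

```python
# Rough sanity check for int4 weight packing support, based on the requirement
# discussed above (CUDA 12.x build of PyTorch and compute capability >= 8.0).
# This is a sketch, not an official support matrix.
import torch

def int4_pack_likely_supported() -> bool:
    if not torch.cuda.is_available():
        return False
    cuda_version = torch.version.cuda  # CUDA version PyTorch was built against, e.g. "12.1"
    if cuda_version is None or int(cuda_version.split(".")[0]) < 12:
        return False
    major, minor = torch.cuda.get_device_capability()  # e.g. (8, 0) for A100, (8, 9) for RTX 4090
    return (major, minor) >= (8, 0)

if __name__ == "__main__":
    print("int4 packing likely supported:", int4_pack_likely_supported())
```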
@jamestwhedbee did you find a solution for ROCm?
@lufixSch no, but as of last week v0.2.7 of vLLM supports GPTQ with ROCm, and I am seeing pretty good results there. So maybe that is an option for you.
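For anyone who wants to try that route, loading a GPTQ checkpoint through vLLM looks roughly like the sketch below. The model name is just a placeholder and the exact arguments can differ between vLLM versions, so take this as illustrative rather than a verified recipe from this thread.

```python
# Illustrative sketch of running a GPTQ-quantized model with vLLM.
# The checkpoint name is a placeholder; substitute whatever GPTQ model you use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-GPTQ",  # placeholder GPTQ checkpoint
    quantization="gptq",                # tell vLLM the weights are GPTQ-quantized
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain GPTQ quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```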
I applied all the fixes mentioned, but I'm still getting this error. I am using lm_eval 0.4.0.
Support for lm_eval 0.3.0 and 0.4.0 was updated in eb1789b.
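For context, the usual way to support both harness versions in one codebase is a version-tolerant import, roughly as below. The 0.4.x module path in the fallback is an assumption on my part and should be checked against the commit referenced above and the installed version.

```python
# Version-tolerant import for lm-evaluation-harness: 0.3.x still exposes
# lm_eval.base, while 0.4.0 reorganized the package. The 0.4.x path used in the
# fallback is assumed and should be verified against the installed version.
try:
    from lm_eval.base import BaseLM as EvalLMBase   # lm_eval <= 0.3.x
except ImportError:
    from lm_eval.api.model import LM as EvalLMBase  # lm_eval >= 0.4.0 (assumed path)
```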
GPTQ should be working for ROCm at the moment (ROCm 6.2); if not, please let us know the details.
Running `quantize.py` with `--mode int4-gptq` does not seem to work:
- it requires `lm-evaluation-harness`, which is not included/documented/used
- the import in `eval.py` is incorrect: it should probably be `from model import Transformer as LLaMA` instead of `from model import LLaMA` (see the sketch below)
- `import lm_eval` should be replaced with `import lm_eval.base`

Overall, here are the fixes I had to apply to make it run: lopuhin@86d990b
Based on this, could you please check if the right version of the code was included for GPTQ quantization?
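For reference, the second and third bullets above amount to the following two-line change in `eval.py`; this is a minimal sketch of just the import substitutions, not the full patch in lopuhin@86d990b.

```python
# eval.py: the two import substitutions described above (minimal sketch only;
# the full set of fixes is in lopuhin@86d990b).

# Before:
#   from model import LLaMA
#   import lm_eval
# After:
from model import Transformer as LLaMA  # model.py defines Transformer, not LLaMA
import lm_eval.base                     # the harness helpers used here live under lm_eval.base in 0.3.x
```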