q4_1/f16 model is slow #681
Comments
Wasn't the 7B model quantized with q4_0 by any chance?
The other model, which runs normally, is gpt4all, which I first converted to ggml and then migrated to ggjt with the migrate script. The slow one is named gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g.
Okay, so the 7B model that "runs as expected" is indeed using the q4_0 quantization method. The "slow" model is quantized with q4_1, while some of the weights are also kept as f16.
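For context, a simplified sketch of how the two formats differ, based on ggml's block layout around this time (exact field names and sizes are approximate, not copied from the current code): q4_0 stores one scale per block of 32 weights, while q4_1 also stores a per-block minimum, so each block is larger and dequantization costs an extra multiply-add.

```c
/* Approximate sketch of ggml's 4-bit block formats (QK = 32 weights per block).
 * Field names and sizes are illustrative, not taken verbatim from ggml.c. */
#include <stdint.h>

#define QK 32

typedef struct {
    float   d;           /* scale                               */
    uint8_t qs[QK / 2];  /* 32 x 4-bit quants, two per byte     */
} block_q4_0;            /* ~20 bytes / 32 weights: x = d * q, q mapped to [-8, 7] */

typedef struct {
    float   d;           /* scale                               */
    float   m;           /* minimum (offset)                    */
    uint8_t qs[QK / 2];  /* 32 x 4-bit quants, two per byte     */
} block_q4_1;            /* ~24 bytes / 32 weights: x = d * q + m, q in [0, 15]    */
```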
There is a version of that model (gpt-x-alpaca) reconverted with q4_0 quantization floating around if you look back through where you found it. The q4_1 version is considerably slower.
I don't get that name: it says it's been quantized with the GPTQ method, but at the same time it's q4_1?
I guess it was first quantized to 4-bit using GPTQ, then converted to the q4_1 quantization used by llama.cpp. There is also a script which does that: https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py
The only way to convert a GPTQ .pt file into a ggml .bin file is to use this script, and the script keeps the GPTQ quantization; it does not requantize to q4_1. That was its main purpose: to let llama.cpp users enjoy the GPTQ-quantized models.
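The two descriptions are closer than they sound: GPTQ's asymmetric 4-bit groups (scale plus zero-point) and q4_1 blocks (scale plus min) express the same affine dequantization, so the GPTQ values can be carried over without requantizing. A hypothetical sketch of that mapping (illustrative only, not the actual script code, and it ignores the group-size vs. block-size mismatch):

```c
/* Hypothetical illustration of mapping an asymmetric 4-bit GPTQ group
 * (scale s, integer zero-point z) onto a q4_1 block (scale d, min m).
 *
 *   GPTQ dequant:  x = s * (q - z)
 *   q4_1 dequant:  x = d * q + m
 *
 * Choosing d = s and m = -s * z reproduces the same values exactly. */
static inline void gptq_group_to_q4_1(float s, int z, float *d, float *m) {
    *d = s;              /* q4_1 scale = GPTQ scale               */
    *m = -s * (float)z;  /* q4_1 min   = -(GPTQ scale * zero)     */
}
```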
Closing, as it is expected that q4_1/f16 is slower than q4_0.
Pulled to the latest commit.
Another 7B model still runs as expected (gpt4all-lora-ggjt).
I have 16 GB of RAM; the model file is about 9.5 GB.
4 cores, AMD, Linux.
Problem description:
Model name: gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g
The model was described as: LLaMA 13B, finetuned natively with the alpaca dataset, then finetuned on GPT4 responses (GPT4-x), then GPTQ 4b-128g quantized, then converted to ggml q4_1 format.
It loads, but takes about 30 seconds per token.
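A rough back-of-envelope (my own estimate, not from the thread): decoding one token streams roughly the whole weight file through memory, so file size divided by effective memory bandwidth gives a lower bound on seconds per token. The observed 30 s/token is far above that bound, which points at the slower q4_1/f16 compute path (or swapping) rather than raw memory throughput.

```c
/* Back-of-envelope estimate (assumed numbers, not measurements from the issue):
 * decoding one token reads roughly the whole weight file, so
 * model_size / memory_bandwidth is a lower bound on seconds per token. */
#include <stdio.h>

int main(void) {
    const double model_gb = 9.5;   /* reported size of the q4_1/f16 file     */
    const double mem_gbps = 10.0;  /* assumed effective DDR4 bandwidth       */
    const double observed = 30.0;  /* reported seconds per token             */

    double bound = model_gb / mem_gbps;  /* ~0.95 s/token */
    printf("bandwidth-bound lower limit: %.2f s/token\n", bound);
    printf("observed: %.1f s/token (about %.0fx slower)\n",
           observed, observed / bound);
    /* The large gap suggests the slower q4_1/f16 code path (or swapping),
     * not raw memory bandwidth, is the bottleneck. */
    return 0;
}
```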