q4_1/f16 model is slow #681

Closed
tekakutli opened this issue Apr 1, 2023 · 8 comments

@tekakutli

tekakutli commented Apr 1, 2023

Pulled to the latest commit.
Another 7B model (gpt4all-lora-ggjt) still runs as expected.
I have 16 GB of RAM; the model file is about 9.5 GB.
4 cores, AMD, Linux.

Problem description:

Model name: gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g
The model was described as: LLaMA 13B, finetuned natively on the Alpaca dataset, then finetuned on GPT-4 responses (GPT4-x), then GPTQ 4-bit 128g quantized, then converted to ggml q4_1 format.
It loads, but takes about 30 seconds per token.

$./main -m models/13B/ggml-model-q4_1.bin -n 128 --repeat_penalty 1.0 --color -ins
main: seed = 1680359110
llama_model_load: loading model from 'models/13B/ggml-model-q4_1.bin' - please wait ...
llama_model_load: GPTQ model detected - are you sure n_parts should be 2? we normally expect it to be 1
llama_model_load: use '--n_parts 1' if necessary
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 4
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 9702.04 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 11750.14 MB (+ 1608.00 MB per state)
llama_model_load: loading tensors from 'models/13B/ggml-model-q4_1.bin'
llama_model_load: model size =  9701.60 MB / num tensors = 363
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 2
@prusnak
Collaborator

prusnak commented Apr 1, 2023

another 7B model still runs as expected

Wasn't the 7B model quantized with q4_0 by any chance?

@tekakutli
Author

tekakutli commented Apr 1, 2023

The other model, which runs normally, is gpt4all, which I first converted to ggml and then migrated to ggjt with the migrate script.

This one is named gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g; if I try to migrate it, it says: "input ggml has already been converted to 'ggjt' magic"
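
For reference, that 'ggjt' magic is just the first four bytes of the model file. A minimal sketch of how one could inspect it (this helper is not part of llama.cpp, and the constants assume the file was written on a little-endian machine, as llama.cpp does natively):

// check_magic.c - hypothetical helper: print the 4-byte magic at the start
// of a ggml model file to see whether it is 'ggml', 'ggmf' or 'ggjt'.
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    uint32_t magic = 0;
    if (fread(&magic, sizeof(magic), 1, f) != 1) { fclose(f); return 1; }
    fclose(f);
    // Tag values as of early-2023 llama.cpp (assumption, derived from the
    // ASCII codes): 'ggml' = 0x67676d6c (unversioned), 'ggmf' = 0x67676d66,
    // 'ggjt' = 0x67676a74 (mmap-able format produced by the migrate script).
    printf("magic = 0x%08x (%s)\n", (unsigned) magic,
           magic == 0x67676a74 ? "ggjt" :
           magic == 0x67676d66 ? "ggmf" :
           magic == 0x67676d6c ? "ggml" : "unknown");
    return 0;
}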

@prusnak
Collaborator

prusnak commented Apr 1, 2023

Okay, so the 7B model that "runs as expected" is indeed using the q4_0 quantization method (f16 = 2 in the debug output).

The "slow" model is quantized with q4_1 while some of the weights are also kept as f16 (f16 = 4 in the debug output), so my guess is that it is kind of expected, that this model will be slow on your configuration.

prusnak changed the title from "extremely slow model" to "q4_1/f16 model is slow" on Apr 1, 2023
@rabidcopy
Contributor

There is a version of that model (gpt-x-alpaca) reconverted with q4_0 quantization floating around if you look back through where you found it. The q4_1 version is considerably slower.

@BadisG

BadisG commented Apr 1, 2023

gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g

I don't get that name; it says it's been quantized with the GPTQ method, but at the same time it's q4_1?
What? lmao

@prusnak
Collaborator

prusnak commented Apr 1, 2023

I don't get that name, it says it's been quantized with the GPTQ method but at the same time it's a q4_1?

I guess it was first quantized to 4-bit using GPTQ, then converted to the q4_1 quantization used by llama.cpp.

There is also a script which does that: https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py
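
Schematically (a sketch of the idea, not the script's actual code): GPTQ-for-LLaMa stores 4-bit quants q with a per-group scale s and zero-point z, dequantized roughly as x = s * (q - z). That is the same affine form as q4_1's x = d * q + m, so the quants can be re-expressed in the q4_1 layout with d = s and m = -s * z, without requantizing:

// Sketch only (not code from convert-gptq-to-ggml.py): mapping one GPTQ
// group's (scale, zero) pair to the equivalent q4_1 (d, m) pair.
#include <stdio.h>

static void gptq_group_to_q4_1(float scale, float zero, float *d, float *m) {
    *d = scale;           // q4_1 delta
    *m = -scale * zero;   // q4_1 minimum: s*(q - z) == s*q + (-s*z)
}

int main(void) {
    float d, m;
    gptq_group_to_q4_1(0.01f, 8.0f, &d, &m);   // made-up example values
    printf("d = %f, m = %f\n", d, m);
    return 0;
}

Since GPTQ's 128-weight groups are larger than q4_1's 32-weight blocks, the same (d, m) pair would simply repeat across the blocks within a group; the 4-bit quants themselves are carried over, which is presumably why the file still advertises its GPTQ origin in the name.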

@BadisG

BadisG commented Apr 1, 2023

I don't get that name, it says it's been quantized with the GPTQ method but at the same time it's a q4_1?

I guess it was first quantized to 4-bit using GPTQ, then converted to q4_1 quantization used by llama.cpp.

The only way to convert a GPTQ .pt file into a ggml .bin file is to use this script, and this script keeps the GPTQ quantization; it's not converting it into q4_1 quantization.

That was its main purpose: to let llama.cpp users enjoy the GPTQ-quantized models.

prusnak closed this as not planned on Apr 2, 2023
@prusnak
Collaborator

prusnak commented Apr 2, 2023

Closing as it is expected that q4_1/f16 is slower than q4_0
