q4_1/f16 model is slow #681

Closed
tekakutli opened this issue Apr 1, 2023 · 8 comments

@tekakutli

tekakutli commented Apr 1, 2023

Pulled to the latest commit.
Another 7B model (gpt4all-lora-ggjt) still runs as expected.
I have 16 GB of RAM; the model file is about 9.5 GB.
4 cores, AMD, Linux.

Problem description:

Model name: gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g
The model was described as: LLaMA 13B, finetuned natively on the Alpaca dataset, then finetuned on GPT-4 responses (GPT4-x), then GPTQ 4-bit 128g quantized, then converted to ggml q4_1 format.
It loads, but takes about 30 seconds per token.

$./main -m models/13B/ggml-model-q4_1.bin -n 128 --repeat_penalty 1.0 --color -ins
main: seed = 1680359110
llama_model_load: loading model from 'models/13B/ggml-model-q4_1.bin' - please wait ...
llama_model_load: GPTQ model detected - are you sure n_parts should be 2? we normally expect it to be 1
llama_model_load: use '--n_parts 1' if necessary
llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 4
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 9702.04 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 11750.14 MB (+ 1608.00 MB per state)
llama_model_load: loading tensors from 'models/13B/ggml-model-q4_1.bin'
llama_model_load: model size =  9701.60 MB / num tensors = 363
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 2
@prusnak
Collaborator

prusnak commented Apr 1, 2023

another 7B model still runs as expected

Wasn't the 7B model quantized with q4_0 by any chance?

@tekakutli
Author

tekakutli commented Apr 1, 2023

The other model, which runs normally, is gpt4all, which I first converted to ggml and then migrated to ggjt with the migrate script.

This one is named gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g; if I try to migrate it, it says: "input ggml has already been converted to 'ggjt' magic"
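
For reference, that 'ggjt' magic is just the first four bytes of the model file. A minimal sketch of how one could inspect it (this helper is not part of llama.cpp, and the constants assume the file was written on a little-endian machine, as llama.cpp does natively):

// check_magic.c - hypothetical helper: print the 4-byte magic at the start
// of a ggml model file to see whether it is 'ggml', 'ggmf' or 'ggjt'.
#include <stdio.h>
#include <stdint.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }
    uint32_t magic = 0;
    if (fread(&magic, sizeof(magic), 1, f) != 1) { fclose(f); return 1; }
    fclose(f);
    // Tag values as of early-2023 llama.cpp (assumption, derived from the
    // ASCII codes): 'ggml' = 0x67676d6c (unversioned), 'ggmf' = 0x67676d66,
    // 'ggjt' = 0x67676a74 (mmap-able format produced by the migrate script).
    printf("magic = 0x%08x (%s)\n", (unsigned) magic,
           magic == 0x67676a74 ? "ggjt" :
           magic == 0x67676d66 ? "ggmf" :
           magic == 0x67676d6c ? "ggml" : "unknown");
    return 0;
}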

@prusnak
Collaborator

prusnak commented Apr 1, 2023

Okay, so the 7B model that "runs as expected" is indeed using the q4_0 quantization method (f16 = 2 in the debug output).

The "slow" model is quantized with q4_1 while some of the weights are also kept as f16 (f16 = 4 in the debug output), so my guess is that it is kind of expected, that this model will be slow on your configuration.

prusnak changed the title from "extremely slow model" to "q4_1/f16 model is slow" on Apr 1, 2023
@rabidcopy
Contributor

There is a version of that model (gpt-x-alpaca) reconverted with q4_0 quantization floating around if you look back through where you found it. The q4_1 version is considerably slower.

@BadisG

BadisG commented Apr 1, 2023

gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g

I don't get that name; it says it's been quantized with the GPTQ method, but at the same time it's q4_1?
What? lmao

@prusnak
Collaborator

prusnak commented Apr 1, 2023

I don't get that name, it says it's been quantized with the GPTQ method but at the same time it's a q4_1?

I guess it was first quantized to 4-bit using GPTQ, then converted to the q4_1 quantization used by llama.cpp.

There is also a script which does that: https://github.com/ggerganov/llama.cpp/blob/master/convert-gptq-to-ggml.py
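
Schematically (a sketch of the idea, not the script's actual code): GPTQ-for-LLaMa stores 4-bit quants q with a per-group scale s and zero-point z, dequantized roughly as x = s * (q - z). That is the same affine form as q4_1's x = d * q + m, so the quants can be re-expressed in the q4_1 layout with d = s and m = -s * z, without requantizing:

// Sketch only (not code from convert-gptq-to-ggml.py): mapping one GPTQ
// group's (scale, zero) pair to the equivalent q4_1 (d, m) pair.
#include <stdio.h>

static void gptq_group_to_q4_1(float scale, float zero, float *d, float *m) {
    *d = scale;           // q4_1 delta
    *m = -scale * zero;   // q4_1 minimum: s*(q - z) == s*q + (-s*z)
}

int main(void) {
    float d, m;
    gptq_group_to_q4_1(0.01f, 8.0f, &d, &m);   // made-up example values
    printf("d = %f, m = %f\n", d, m);
    return 0;
}

Since GPTQ's 128-weight groups are larger than q4_1's 32-weight blocks, the same (d, m) pair would simply repeat across the blocks within a group; the 4-bit quants themselves are carried over, which is presumably why the file still advertises its GPTQ origin in the name.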

@BadisG

BadisG commented Apr 1, 2023

I don't get that name, it says it's been quantized with the GPTQ method but at the same time it's a q4_1?

I guess it was first quantized to 4-bit using GPTQ, then converted to q4_1 quantization used by llama.cpp.

The only way to convert a GPTQ .pt file into a ggml .bin file is to use this script, and this script keeps the GPTQ quantization; it's not converting it into q4_1 quantization.

That was its main purpose: to let llama.cpp users enjoy the GPTQ-quantized models.

prusnak closed this as not planned on Apr 2, 2023
@prusnak
Collaborator

prusnak commented Apr 2, 2023

Closing as it is expected that q4_1/f16 is slower than q4_0
