Add Command R Plus support #6491

RefractAI · 2024-04-04T17:29:31Z

Updated tensor mapping to add Command R Plus support for GGUF conversion.

bartowski1182 · 2024-04-04T17:55:56Z

Probably shouldn't do the model_max_length mapping and should instead force it to be in config.json, otherwise these changes worked for me as well

N8python · 2024-04-04T18:09:29Z

Do the ggufs produced w/ this work w/ the main branch once they are quantized?

bartowski1182 · 2024-04-04T18:10:50Z

@N8python i'll let you know as soon as I have a finished one, I would be slightly surprised if they worked without this PR at inference time but i'm not sure how it all works

my conversion to f16.gguf is almost done, will be making a Q2 immediately after and seeing if that runs from master

Noeda · 2024-04-04T18:21:45Z

I think bunch of people are rushing to implement this. I have a slightly more complete code here (https://github.com/Noeda/llama.cpp/tree/commandr-plus
), but something with reading tensors is off at inference time. I think I'm getting the llama.cpp equivalent of Torch being mad about badly interleaved tensors in underlying storage. I'm playing with views to see can I hack it to work or do I have to actually start understanding how tensors are laid out in memory here. The application of the new qk norms is failing.

I made Q4_K and Q8_0 quants for myself; those seem fine but inference is not.

If you want you can pull my code into yours but it doesn't work. I have a bit of limited time and might have to stop hacking until evening or later tomorrow; but I'll try to get it working. I think adding the new layernorms for query and value should be enough; didn't see other differences in the Transformers code.

(I'll comment here if I have to get off, so no one waits for me if I have to go. I'm currently hacking and trying to figure out what's going on with my assert failures. This is another of those models that gets lots of excited people out of woods including myself :D to hacking but I don't want people to wait on me because the times I can work are unpredictable and tend to come in bursts and I might have to suddenly disappear)

Link for easier reading from my diff: it's not exactly lots of lines of code: master...Noeda:llama.cpp:commandr-plus

bartowski1182 · 2024-04-04T19:00:13Z

as Noeda suspected this change was not enough to make it work, conversion to f16.gguf "worked" but going to Q2 failed with

"gguf_init_from_file: invalid magic characters ''"

candre23 · 2024-04-04T19:09:13Z

Probably shouldn't do the model_max_length mapping and should instead force it to be in config.json, otherwise these changes worked for me as well

I pinged Cohere on HF and they added model_max_length to the config.json. So no more need to compensate for that oversight in the code.

Noeda · 2024-04-04T19:36:08Z

I'm no longer crashing on the spot in my branch but it's clearly not correct (prompt was just "hello", repeats token digit '8'):

Will need to review things a bit.

For those curious GGUF sizes for this thing seem to be:

Q4_K: 58G
Q8_0: 103G
f16: 193G

I can't run the f16 at all with my machinery. I'm doing testing on the Q4 one.

N8python · 2024-04-04T19:42:48Z

Here's the mlx impl:

ml-explore/mlx-examples#650

sammcj · 2024-04-04T21:11:51Z

FYI converting to fp16 on macOS works with this PR, but quantizing segfaults.

~/git/llama.cpp/convert-hf-to-gguf.py ./CohereForAI_c4ai-command-r-plus --outtype f16 --outfile CohereForAI_c4ai-command-r-plus.fp16.bin

~/git/llama.cpp/convert-hf-to-gguf.py ./CohereForAI_c4ai-command-r-plus --outtype f16 --outfile CohereForAI_c4ai-command-r-plus.fp16.bin
Loading model: CohereForAI_c4ai-command-r-plus
gguf: This GGUF file is for Little Endian only
Set model parameters
gguf: context length = 131072
gguf: embedding length = 12288
gguf: feed forward length = 33792
gguf: head count = 96
gguf: key-value head count = 8
gguf: rope theta = 75000000.0
gguf: layer norm epsilon = 1e-05
gguf: file type = 1
Set model tokenizer
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
gguf: Adding 253333 merge(s).
gguf: Setting special token type bos to 5
gguf: Setting special token type eos to 255001
gguf: Setting special token type pad to 0
gguf: Setting add_bos_token to True
gguf: Setting add_eos_token to False
Exporting model to 'CohereForAI_c4ai-command-r-plus.fp16.bin'
gguf: loading model part 'model-00001-of-00044.safetensors'
token_embd.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00002-of-00044.safetensors'
blk.0.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.0.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.0.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.0.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.0.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.1.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.1.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00003-of-00044.safetensors'
blk.1.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.1.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.1.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.2.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.2.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.2.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.2.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00004-of-00044.safetensors'
blk.3.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.3.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.3.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.3.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.3.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.4.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.4.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00005-of-00044.safetensors'
blk.4.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.4.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.4.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.5.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.5.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.5.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.5.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00006-of-00044.safetensors'
blk.6.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.6.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.6.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.6.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.6.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.7.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.7.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00007-of-00044.safetensors'
blk.7.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.7.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.7.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.8.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.8.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.8.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.8.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00008-of-00044.safetensors'
blk.10.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.10.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.9.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.9.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.9.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.9.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00009-of-00044.safetensors'
blk.10.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.10.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.10.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.11.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.11.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.11.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.11.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00010-of-00044.safetensors'
blk.12.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.12.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.12.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.12.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.12.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.13.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.13.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00011-of-00044.safetensors'
blk.13.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.13.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.13.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.14.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.14.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.14.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.14.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00012-of-00044.safetensors'
blk.15.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.15.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.15.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.15.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.15.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.16.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.16.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00013-of-00044.safetensors'
blk.16.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.16.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.16.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.17.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.17.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.17.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.17.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00014-of-00044.safetensors'
blk.18.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.18.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.18.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.18.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.18.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.19.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.19.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00015-of-00044.safetensors'
blk.19.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.19.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.19.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.20.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.20.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.20.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.20.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00016-of-00044.safetensors'
blk.21.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.21.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.21.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.21.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.21.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.22.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.22.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00017-of-00044.safetensors'
blk.22.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.22.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.22.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.23.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.23.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.23.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.23.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00018-of-00044.safetensors'
blk.24.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.24.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.24.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.24.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.24.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.25.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.25.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00019-of-00044.safetensors'
blk.25.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.25.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.25.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.26.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.26.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.26.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.26.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00020-of-00044.safetensors'
blk.27.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.27.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.27.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.27.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.27.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.28.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.28.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00021-of-00044.safetensors'
blk.28.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.28.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.28.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.29.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.29.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.29.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.29.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00022-of-00044.safetensors'
blk.30.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.30.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.30.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.30.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.30.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.31.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.31.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00023-of-00044.safetensors'
blk.31.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.31.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.31.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.32.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.32.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.32.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.32.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00024-of-00044.safetensors'
blk.33.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.33.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.33.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.33.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.33.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.34.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.34.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00025-of-00044.safetensors'
blk.34.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.34.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.34.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.35.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.35.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.35.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.35.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00026-of-00044.safetensors'
blk.36.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.36.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.36.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.36.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.36.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.37.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.37.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00027-of-00044.safetensors'
blk.37.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.37.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.37.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.38.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.38.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.38.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.38.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00028-of-00044.safetensors'
blk.39.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.39.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.39.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.39.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.39.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.40.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.40.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00029-of-00044.safetensors'
blk.40.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.40.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.40.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.41.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.41.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.41.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.41.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00030-of-00044.safetensors'
blk.42.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.42.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.42.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.42.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.42.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.43.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.43.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00031-of-00044.safetensors'
blk.43.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.43.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.43.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.44.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.44.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.44.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.44.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00032-of-00044.safetensors'
blk.45.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.45.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.45.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.45.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.45.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.46.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.46.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00033-of-00044.safetensors'
blk.46.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.46.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.46.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.47.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.47.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.47.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.47.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00034-of-00044.safetensors'
blk.48.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.48.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.48.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.48.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.48.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.49.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.49.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00035-of-00044.safetensors'
blk.49.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.49.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.49.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.50.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.50.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.50.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.50.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00036-of-00044.safetensors'
blk.51.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.51.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.51.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.51.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.51.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.52.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.52.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00037-of-00044.safetensors'
blk.52.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.52.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.52.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.53.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.53.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.53.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.53.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00038-of-00044.safetensors'
blk.54.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.54.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.54.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.54.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.54.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.55.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.55.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00039-of-00044.safetensors'
blk.55.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.55.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.55.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.56.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.56.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.56.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.56.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00040-of-00044.safetensors'
blk.57.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.57.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.57.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.57.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.57.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.58.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.58.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00041-of-00044.safetensors'
blk.58.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.58.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.58.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.59.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.59.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.59.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.59.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00042-of-00044.safetensors'
blk.60.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.60.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.60.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.60.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.60.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.61.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.61.attn_v.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00043-of-00044.safetensors'
blk.61.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.61.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.61.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.62.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.62.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.62.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.62.attn_v.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_k_norm.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_q_norm.weight, n_dims = 2, torch.float16 --> float16
gguf: loading model part 'model-00044-of-00044.safetensors'
blk.63.attn_norm.weight, n_dims = 1, torch.float16 --> float32
blk.63.ffn_down.weight, n_dims = 2, torch.float16 --> float16
blk.63.ffn_gate.weight, n_dims = 2, torch.float16 --> float16
blk.63.ffn_up.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_k.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_output.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_q.weight, n_dims = 2, torch.float16 --> float16
blk.63.attn_v.weight, n_dims = 2, torch.float16 --> float16
output_norm.weight, n_dims = 1, torch.float16 --> float32
Model successfully exported to 'CohereForAI_c4ai-command-r-plus.fp16.bin'

quantize CohereForAI_c4ai-command-r-plus.fp16.bin CohereForAI_c4ai-command-r-plus-Q4_K_M.gguf Q4_K_M
ggml_opencl: selecting platform: 'Apple'
ggml_opencl: selecting device: 'Apple M2 Max'
main: build = 1213 (a307375c)
main: built with Apple clang version 15.0.0 (clang-1500.3.9.4) for arm64-apple-darwin23.4.0
main: quantizing 'CohereForAI_c4ai-command-r-plus.fp16.bin' to 'CohereForAI_c4ai-command-r-plus-Q4_K_M.gguf' as Q4_K_M
zsh: segmentation fault  quantize CohereForAI_c4ai-command-r-plus.fp16.bin  Q4_K_M

RefractAI · 2024-04-04T21:39:27Z

Quantizing to Q5_0 works, but the llm_build_norm() function doesn't accept a 2D layer norm for the new q_norm and k_norm parameters.

The tensor is 12288 but appears it should be evaluated as 128x96 by the layer norm.

sammcj · 2024-04-04T21:51:46Z

Your latest push seems to have fixed Q3_K_M and Q4_K_M creation:

48G Apr  4 22:01 CohereForAI_c4ai-command-r-plus-Q3_K_M.gguf
59G Apr  4 21:55 CohereForAI_c4ai-command-r-plus-Q4_K_M.gguf

Here's the Q3_K_M quantization log if it's interesting:

quantize CohereForAI_c4ai-command-r-plus.fp16.bin CohereForAI_c4ai-command-r-plus-Q3_K_M.gguf Q3_K_M

quantize CohereForAI_c4ai-command-r-plus.fp16.bin CohereForAI_c4ai-command-r-plus-Q3_K_M.gguf Q3_K_M
main: build = 2612 (e4b2e2d)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing 'CohereForAI_c4ai-command-r-plus.fp16.bin' to 'CohereForAI_c4ai-command-r-plus-Q3_K_M.gguf' as Q3_K_M
llama_model_loader: loaded meta data with 22 key-value pairs and 642 tensors from CohereForAI_c4ai-command-r-plus.fp16.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = command-r
llama_model_loader: - kv   1:                               general.name str              = CohereForAI_c4ai-command-r-plus
llama_model_loader: - kv   2:                      command-r.block_count u32              = 64
llama_model_loader: - kv   3:                   command-r.context_length u32              = 131072
llama_model_loader: - kv   4:                 command-r.embedding_length u32              = 12288
llama_model_loader: - kv   5:              command-r.feed_forward_length u32              = 33792
llama_model_loader: - kv   6:             command-r.attention.head_count u32              = 96
llama_model_loader: - kv   7:          command-r.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                   command-r.rope.freq_base f32              = 75000000.000000
llama_model_loader: - kv   9:     command-r.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                      command-r.logit_scale f32              = 0.833333
llama_model_loader: - kv  12:                command-r.rope.scaling.type str              = none
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 5
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 255001
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  577 tensors
llama_model_quantize_internal: meta size = 10904864 bytes
[   1/ 642]                    token_embd.weight - [12288, 256000,     1,     1], type =    f16, converting to q6_K .. size =  6000.00 MiB ->  2460.94 MiB
[   2/ 642]               blk.0.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[   3/ 642]                blk.0.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q5_K .. size =   792.00 MiB ->   272.25 MiB
[   4/ 642]                blk.0.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[   5/ 642]                  blk.0.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[   6/ 642]             blk.0.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[   7/ 642]                  blk.0.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[   8/ 642]             blk.0.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[   9/ 642]             blk.0.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  10/ 642]                  blk.0.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  11/ 642]                  blk.0.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q5_K .. size =    24.00 MiB ->     8.25 MiB
[  12/ 642]                blk.1.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  13/ 642]             blk.1.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  14/ 642]                  blk.1.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  15/ 642]             blk.1.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  16/ 642]             blk.1.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  17/ 642]                  blk.1.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  18/ 642]                  blk.1.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q5_K .. size =    24.00 MiB ->     8.25 MiB
[  19/ 642]               blk.1.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  20/ 642]                blk.1.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q5_K .. size =   792.00 MiB ->   272.25 MiB
[  21/ 642]                  blk.1.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  22/ 642]               blk.2.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  23/ 642]                blk.2.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q5_K .. size =   792.00 MiB ->   272.25 MiB
[  24/ 642]                blk.2.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  25/ 642]                  blk.2.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  26/ 642]             blk.2.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  27/ 642]                  blk.2.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  28/ 642]             blk.2.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  29/ 642]             blk.2.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  30/ 642]                  blk.2.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  31/ 642]                  blk.2.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  32/ 642]             blk.3.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  33/ 642]             blk.3.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  34/ 642]               blk.3.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  35/ 642]                blk.3.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q5_K .. size =   792.00 MiB ->   272.25 MiB
[  36/ 642]                blk.3.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  37/ 642]                  blk.3.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  38/ 642]                  blk.3.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  39/ 642]             blk.3.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  40/ 642]                  blk.3.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  41/ 642]                  blk.3.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  42/ 642]                blk.4.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  43/ 642]             blk.4.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  44/ 642]                  blk.4.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  45/ 642]             blk.4.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  46/ 642]             blk.4.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  47/ 642]                  blk.4.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  48/ 642]                  blk.4.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  49/ 642]               blk.4.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  50/ 642]                blk.4.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[  51/ 642]                  blk.4.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  52/ 642]               blk.5.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  53/ 642]                blk.5.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[  54/ 642]                blk.5.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  55/ 642]                  blk.5.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  56/ 642]             blk.5.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  57/ 642]                  blk.5.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  58/ 642]             blk.5.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  59/ 642]             blk.5.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  60/ 642]                  blk.5.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  61/ 642]                  blk.5.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  62/ 642]             blk.6.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  63/ 642]             blk.6.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  64/ 642]               blk.6.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  65/ 642]                blk.6.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[  66/ 642]                blk.6.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  67/ 642]                  blk.6.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  68/ 642]                  blk.6.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  69/ 642]             blk.6.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  70/ 642]                  blk.6.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  71/ 642]                  blk.6.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  72/ 642]                blk.7.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  73/ 642]             blk.7.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  74/ 642]                  blk.7.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  75/ 642]             blk.7.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  76/ 642]             blk.7.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  77/ 642]                  blk.7.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  78/ 642]                  blk.7.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  79/ 642]               blk.7.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  80/ 642]                blk.7.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[  81/ 642]                  blk.7.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  82/ 642]               blk.8.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[  83/ 642]                blk.8.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[  84/ 642]                blk.8.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  85/ 642]                  blk.8.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  86/ 642]             blk.8.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  87/ 642]                  blk.8.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  88/ 642]             blk.8.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  89/ 642]             blk.8.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  90/ 642]                  blk.8.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[  91/ 642]                  blk.8.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[  92/ 642]             blk.9.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  93/ 642]             blk.9.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  94/ 642]               blk.10.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[  95/ 642]            blk.10.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[  96/ 642]                 blk.10.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[  97/ 642]            blk.10.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[  98/ 642]            blk.10.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[  99/ 642]                 blk.10.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 100/ 642]                 blk.10.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 101/ 642]               blk.9.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 102/ 642]                blk.9.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 103/ 642]                blk.9.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 104/ 642]                  blk.9.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 105/ 642]                  blk.9.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 106/ 642]             blk.9.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 107/ 642]                  blk.9.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 108/ 642]                  blk.9.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 109/ 642]              blk.10.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 110/ 642]               blk.10.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 111/ 642]                 blk.10.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 112/ 642]              blk.11.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 113/ 642]               blk.11.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 114/ 642]               blk.11.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 115/ 642]                 blk.11.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 116/ 642]            blk.11.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 117/ 642]                 blk.11.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 118/ 642]            blk.11.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 119/ 642]            blk.11.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 120/ 642]                 blk.11.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 121/ 642]                 blk.11.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 122/ 642]            blk.12.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 123/ 642]            blk.12.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 124/ 642]              blk.12.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 125/ 642]               blk.12.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 126/ 642]               blk.12.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 127/ 642]                 blk.12.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 128/ 642]                 blk.12.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 129/ 642]            blk.12.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 130/ 642]                 blk.12.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 131/ 642]                 blk.12.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 132/ 642]               blk.13.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 133/ 642]            blk.13.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 134/ 642]                 blk.13.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 135/ 642]            blk.13.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 136/ 642]            blk.13.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 137/ 642]                 blk.13.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 138/ 642]                 blk.13.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 139/ 642]              blk.13.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 140/ 642]               blk.13.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 141/ 642]                 blk.13.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 142/ 642]              blk.14.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 143/ 642]               blk.14.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 144/ 642]               blk.14.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 145/ 642]                 blk.14.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 146/ 642]            blk.14.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 147/ 642]                 blk.14.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 148/ 642]            blk.14.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 149/ 642]            blk.14.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 150/ 642]                 blk.14.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 151/ 642]                 blk.14.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 152/ 642]            blk.15.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 153/ 642]            blk.15.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 154/ 642]              blk.15.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 155/ 642]               blk.15.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 156/ 642]               blk.15.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 157/ 642]                 blk.15.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 158/ 642]                 blk.15.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 159/ 642]            blk.15.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 160/ 642]                 blk.15.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 161/ 642]                 blk.15.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 162/ 642]               blk.16.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 163/ 642]            blk.16.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 164/ 642]                 blk.16.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 165/ 642]            blk.16.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 166/ 642]            blk.16.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 167/ 642]                 blk.16.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 168/ 642]                 blk.16.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 169/ 642]              blk.16.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 170/ 642]               blk.16.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 171/ 642]                 blk.16.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 172/ 642]              blk.17.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 173/ 642]               blk.17.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 174/ 642]               blk.17.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 175/ 642]                 blk.17.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 176/ 642]            blk.17.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 177/ 642]                 blk.17.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 178/ 642]            blk.17.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 179/ 642]            blk.17.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 180/ 642]                 blk.17.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 181/ 642]                 blk.17.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 182/ 642]            blk.18.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 183/ 642]            blk.18.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 184/ 642]              blk.18.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 185/ 642]               blk.18.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 186/ 642]               blk.18.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 187/ 642]                 blk.18.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 188/ 642]                 blk.18.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 189/ 642]            blk.18.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 190/ 642]                 blk.18.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 191/ 642]                 blk.18.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 192/ 642]               blk.19.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 193/ 642]            blk.19.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 194/ 642]                 blk.19.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 195/ 642]            blk.19.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 196/ 642]            blk.19.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 197/ 642]                 blk.19.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 198/ 642]                 blk.19.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 199/ 642]              blk.19.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 200/ 642]               blk.19.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 201/ 642]                 blk.19.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 202/ 642]              blk.20.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 203/ 642]               blk.20.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 204/ 642]               blk.20.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 205/ 642]                 blk.20.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 206/ 642]            blk.20.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 207/ 642]                 blk.20.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 208/ 642]            blk.20.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 209/ 642]            blk.20.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 210/ 642]                 blk.20.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 211/ 642]                 blk.20.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 212/ 642]            blk.21.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 213/ 642]            blk.21.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 214/ 642]              blk.21.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 215/ 642]               blk.21.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 216/ 642]               blk.21.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 217/ 642]                 blk.21.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 218/ 642]                 blk.21.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 219/ 642]            blk.21.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 220/ 642]                 blk.21.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 221/ 642]                 blk.21.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 222/ 642]               blk.22.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 223/ 642]            blk.22.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 224/ 642]                 blk.22.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 225/ 642]            blk.22.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 226/ 642]            blk.22.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 227/ 642]                 blk.22.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 228/ 642]                 blk.22.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 229/ 642]              blk.22.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 230/ 642]               blk.22.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 231/ 642]                 blk.22.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 232/ 642]              blk.23.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 233/ 642]               blk.23.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 234/ 642]               blk.23.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 235/ 642]                 blk.23.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 236/ 642]            blk.23.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 237/ 642]                 blk.23.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 238/ 642]            blk.23.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 239/ 642]            blk.23.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 240/ 642]                 blk.23.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 241/ 642]                 blk.23.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 242/ 642]            blk.24.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 243/ 642]            blk.24.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 244/ 642]              blk.24.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 245/ 642]               blk.24.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 246/ 642]               blk.24.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 247/ 642]                 blk.24.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 248/ 642]                 blk.24.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 249/ 642]            blk.24.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 250/ 642]                 blk.24.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 251/ 642]                 blk.24.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 252/ 642]               blk.25.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 253/ 642]            blk.25.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 254/ 642]                 blk.25.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 255/ 642]            blk.25.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 256/ 642]            blk.25.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 257/ 642]                 blk.25.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 258/ 642]                 blk.25.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 259/ 642]              blk.25.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 260/ 642]               blk.25.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 261/ 642]                 blk.25.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 262/ 642]              blk.26.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 263/ 642]               blk.26.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 264/ 642]               blk.26.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 265/ 642]                 blk.26.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 266/ 642]            blk.26.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 267/ 642]                 blk.26.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 268/ 642]            blk.26.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 269/ 642]            blk.26.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 270/ 642]                 blk.26.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 271/ 642]                 blk.26.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 272/ 642]            blk.27.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 273/ 642]            blk.27.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 274/ 642]              blk.27.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 275/ 642]               blk.27.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 276/ 642]               blk.27.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 277/ 642]                 blk.27.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 278/ 642]                 blk.27.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 279/ 642]            blk.27.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 280/ 642]                 blk.27.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 281/ 642]                 blk.27.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 282/ 642]               blk.28.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 283/ 642]            blk.28.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 284/ 642]                 blk.28.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 285/ 642]            blk.28.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 286/ 642]            blk.28.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 287/ 642]                 blk.28.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 288/ 642]                 blk.28.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 289/ 642]              blk.28.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 290/ 642]               blk.28.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 291/ 642]                 blk.28.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 292/ 642]              blk.29.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 293/ 642]               blk.29.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 294/ 642]               blk.29.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 295/ 642]                 blk.29.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 296/ 642]            blk.29.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 297/ 642]                 blk.29.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 298/ 642]            blk.29.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 299/ 642]            blk.29.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 300/ 642]                 blk.29.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 301/ 642]                 blk.29.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 302/ 642]            blk.30.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 303/ 642]            blk.30.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 304/ 642]              blk.30.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 305/ 642]               blk.30.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 306/ 642]               blk.30.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 307/ 642]                 blk.30.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 308/ 642]                 blk.30.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 309/ 642]            blk.30.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 310/ 642]                 blk.30.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 311/ 642]                 blk.30.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 312/ 642]               blk.31.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 313/ 642]            blk.31.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 314/ 642]                 blk.31.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 315/ 642]            blk.31.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 316/ 642]            blk.31.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 317/ 642]                 blk.31.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 318/ 642]                 blk.31.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 319/ 642]              blk.31.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 320/ 642]               blk.31.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 321/ 642]                 blk.31.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 322/ 642]              blk.32.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 323/ 642]               blk.32.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 324/ 642]               blk.32.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 325/ 642]                 blk.32.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 326/ 642]            blk.32.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 327/ 642]                 blk.32.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 328/ 642]            blk.32.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 329/ 642]            blk.32.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 330/ 642]                 blk.32.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 331/ 642]                 blk.32.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 332/ 642]            blk.33.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 333/ 642]            blk.33.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 334/ 642]              blk.33.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 335/ 642]               blk.33.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 336/ 642]               blk.33.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 337/ 642]                 blk.33.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 338/ 642]                 blk.33.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 339/ 642]            blk.33.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 340/ 642]                 blk.33.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 341/ 642]                 blk.33.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 342/ 642]               blk.34.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 343/ 642]            blk.34.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 344/ 642]                 blk.34.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 345/ 642]            blk.34.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 346/ 642]            blk.34.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 347/ 642]                 blk.34.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 348/ 642]                 blk.34.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 349/ 642]              blk.34.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 350/ 642]               blk.34.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 351/ 642]                 blk.34.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 352/ 642]              blk.35.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 353/ 642]               blk.35.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 354/ 642]               blk.35.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 355/ 642]                 blk.35.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 356/ 642]            blk.35.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 357/ 642]                 blk.35.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 358/ 642]            blk.35.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 359/ 642]            blk.35.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 360/ 642]                 blk.35.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 361/ 642]                 blk.35.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 362/ 642]            blk.36.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 363/ 642]            blk.36.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 364/ 642]              blk.36.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 365/ 642]               blk.36.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 366/ 642]               blk.36.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 367/ 642]                 blk.36.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 368/ 642]                 blk.36.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 369/ 642]            blk.36.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 370/ 642]                 blk.36.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 371/ 642]                 blk.36.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 372/ 642]               blk.37.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 373/ 642]            blk.37.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 374/ 642]                 blk.37.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 375/ 642]            blk.37.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 376/ 642]            blk.37.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 377/ 642]                 blk.37.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 378/ 642]                 blk.37.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 379/ 642]              blk.37.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 380/ 642]               blk.37.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 381/ 642]                 blk.37.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 382/ 642]              blk.38.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 383/ 642]               blk.38.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 384/ 642]               blk.38.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 385/ 642]                 blk.38.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 386/ 642]            blk.38.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 387/ 642]                 blk.38.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 388/ 642]            blk.38.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 389/ 642]            blk.38.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 390/ 642]                 blk.38.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 391/ 642]                 blk.38.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 392/ 642]            blk.39.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 393/ 642]            blk.39.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 394/ 642]              blk.39.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 395/ 642]               blk.39.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 396/ 642]               blk.39.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 397/ 642]                 blk.39.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 398/ 642]                 blk.39.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 399/ 642]            blk.39.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 400/ 642]                 blk.39.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 401/ 642]                 blk.39.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 402/ 642]               blk.40.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 403/ 642]            blk.40.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 404/ 642]                 blk.40.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 405/ 642]            blk.40.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 406/ 642]            blk.40.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 407/ 642]                 blk.40.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 408/ 642]                 blk.40.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 409/ 642]              blk.40.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 410/ 642]               blk.40.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 411/ 642]                 blk.40.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 412/ 642]              blk.41.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 413/ 642]               blk.41.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 414/ 642]               blk.41.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 415/ 642]                 blk.41.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 416/ 642]            blk.41.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 417/ 642]                 blk.41.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 418/ 642]            blk.41.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 419/ 642]            blk.41.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 420/ 642]                 blk.41.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 421/ 642]                 blk.41.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 422/ 642]            blk.42.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 423/ 642]            blk.42.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 424/ 642]              blk.42.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 425/ 642]               blk.42.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 426/ 642]               blk.42.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 427/ 642]                 blk.42.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 428/ 642]                 blk.42.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 429/ 642]            blk.42.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 430/ 642]                 blk.42.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 431/ 642]                 blk.42.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 432/ 642]               blk.43.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 433/ 642]            blk.43.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 434/ 642]                 blk.43.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 435/ 642]            blk.43.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 436/ 642]            blk.43.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 437/ 642]                 blk.43.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 438/ 642]                 blk.43.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 439/ 642]              blk.43.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 440/ 642]               blk.43.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 441/ 642]                 blk.43.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 442/ 642]              blk.44.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 443/ 642]               blk.44.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 444/ 642]               blk.44.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 445/ 642]                 blk.44.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 446/ 642]            blk.44.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 447/ 642]                 blk.44.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 448/ 642]            blk.44.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 449/ 642]            blk.44.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 450/ 642]                 blk.44.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 451/ 642]                 blk.44.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 452/ 642]            blk.45.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 453/ 642]            blk.45.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 454/ 642]              blk.45.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 455/ 642]               blk.45.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 456/ 642]               blk.45.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 457/ 642]                 blk.45.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 458/ 642]                 blk.45.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 459/ 642]            blk.45.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 460/ 642]                 blk.45.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 461/ 642]                 blk.45.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 462/ 642]               blk.46.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 463/ 642]            blk.46.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 464/ 642]                 blk.46.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 465/ 642]            blk.46.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 466/ 642]            blk.46.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 467/ 642]                 blk.46.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 468/ 642]                 blk.46.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 469/ 642]              blk.46.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 470/ 642]               blk.46.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 471/ 642]                 blk.46.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 472/ 642]              blk.47.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 473/ 642]               blk.47.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 474/ 642]               blk.47.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 475/ 642]                 blk.47.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 476/ 642]            blk.47.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 477/ 642]                 blk.47.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 478/ 642]            blk.47.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 479/ 642]            blk.47.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 480/ 642]                 blk.47.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 481/ 642]                 blk.47.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 482/ 642]            blk.48.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 483/ 642]            blk.48.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 484/ 642]              blk.48.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 485/ 642]               blk.48.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 486/ 642]               blk.48.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 487/ 642]                 blk.48.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 488/ 642]                 blk.48.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 489/ 642]            blk.48.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 490/ 642]                 blk.48.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 491/ 642]                 blk.48.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 492/ 642]               blk.49.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 493/ 642]            blk.49.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 494/ 642]                 blk.49.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 495/ 642]            blk.49.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 496/ 642]            blk.49.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 497/ 642]                 blk.49.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 498/ 642]                 blk.49.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 499/ 642]              blk.49.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 500/ 642]               blk.49.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 501/ 642]                 blk.49.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 502/ 642]              blk.50.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 503/ 642]               blk.50.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 504/ 642]               blk.50.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 505/ 642]                 blk.50.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 506/ 642]            blk.50.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 507/ 642]                 blk.50.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 508/ 642]            blk.50.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 509/ 642]            blk.50.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 510/ 642]                 blk.50.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 511/ 642]                 blk.50.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 512/ 642]            blk.51.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 513/ 642]            blk.51.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 514/ 642]              blk.51.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 515/ 642]               blk.51.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 516/ 642]               blk.51.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 517/ 642]                 blk.51.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 518/ 642]                 blk.51.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 519/ 642]            blk.51.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 520/ 642]                 blk.51.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 521/ 642]                 blk.51.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 522/ 642]               blk.52.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 523/ 642]            blk.52.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 524/ 642]                 blk.52.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 525/ 642]            blk.52.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 526/ 642]            blk.52.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 527/ 642]                 blk.52.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 528/ 642]                 blk.52.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 529/ 642]              blk.52.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 530/ 642]               blk.52.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 531/ 642]                 blk.52.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 532/ 642]              blk.53.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 533/ 642]               blk.53.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 534/ 642]               blk.53.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 535/ 642]                 blk.53.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 536/ 642]            blk.53.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 537/ 642]                 blk.53.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 538/ 642]            blk.53.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 539/ 642]            blk.53.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 540/ 642]                 blk.53.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 541/ 642]                 blk.53.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 542/ 642]            blk.54.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 543/ 642]            blk.54.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 544/ 642]              blk.54.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 545/ 642]               blk.54.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 546/ 642]               blk.54.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 547/ 642]                 blk.54.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 548/ 642]                 blk.54.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 549/ 642]            blk.54.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 550/ 642]                 blk.54.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 551/ 642]                 blk.54.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 552/ 642]               blk.55.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 553/ 642]            blk.55.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 554/ 642]                 blk.55.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 555/ 642]            blk.55.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 556/ 642]            blk.55.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 557/ 642]                 blk.55.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 558/ 642]                 blk.55.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 559/ 642]              blk.55.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 560/ 642]               blk.55.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 561/ 642]                 blk.55.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 562/ 642]              blk.56.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 563/ 642]               blk.56.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 564/ 642]               blk.56.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 565/ 642]                 blk.56.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 566/ 642]            blk.56.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 567/ 642]                 blk.56.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 568/ 642]            blk.56.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 569/ 642]            blk.56.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 570/ 642]                 blk.56.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 571/ 642]                 blk.56.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 572/ 642]            blk.57.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 573/ 642]            blk.57.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 574/ 642]              blk.57.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 575/ 642]               blk.57.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 576/ 642]               blk.57.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 577/ 642]                 blk.57.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 578/ 642]                 blk.57.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 579/ 642]            blk.57.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 580/ 642]                 blk.57.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 581/ 642]                 blk.57.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 582/ 642]               blk.58.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 583/ 642]            blk.58.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 584/ 642]                 blk.58.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 585/ 642]            blk.58.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 586/ 642]            blk.58.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 587/ 642]                 blk.58.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 588/ 642]                 blk.58.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 589/ 642]              blk.58.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 590/ 642]               blk.58.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 591/ 642]                 blk.58.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 592/ 642]              blk.59.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 593/ 642]               blk.59.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 594/ 642]               blk.59.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 595/ 642]                 blk.59.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 596/ 642]            blk.59.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 597/ 642]                 blk.59.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 598/ 642]            blk.59.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 599/ 642]            blk.59.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 600/ 642]                 blk.59.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 601/ 642]                 blk.59.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 602/ 642]            blk.60.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 603/ 642]            blk.60.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 604/ 642]              blk.60.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 605/ 642]               blk.60.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 606/ 642]               blk.60.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 607/ 642]                 blk.60.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 608/ 642]                 blk.60.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 609/ 642]            blk.60.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 610/ 642]                 blk.60.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 611/ 642]                 blk.60.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 612/ 642]               blk.61.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 613/ 642]            blk.61.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 614/ 642]                 blk.61.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 615/ 642]            blk.61.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 616/ 642]            blk.61.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 617/ 642]                 blk.61.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 618/ 642]                 blk.61.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 619/ 642]              blk.61.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 620/ 642]               blk.61.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 621/ 642]                 blk.61.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 622/ 642]              blk.62.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 623/ 642]               blk.62.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 624/ 642]               blk.62.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 625/ 642]                 blk.62.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 626/ 642]            blk.62.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 627/ 642]                 blk.62.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 628/ 642]            blk.62.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 629/ 642]            blk.62.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 630/ 642]                 blk.62.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 631/ 642]                 blk.62.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 632/ 642]            blk.63.attn_k_norm.weight - [  128,     8,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 8 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.00 MiB ->     0.00 MiB
[ 633/ 642]            blk.63.attn_q_norm.weight - [  128,    96,     1,     1], type =    f16,

llama_tensor_get_type : tensor cols 128 x 96 are not divisible by 256, required for q3_K - using fallback quantization iq4_nl
converting to iq4_nl .. size =     0.02 MiB ->     0.01 MiB
[ 634/ 642]              blk.63.attn_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
[ 635/ 642]               blk.63.ffn_down.weight - [33792, 12288,     1,     1], type =    f16, converting to q4_K .. size =   792.00 MiB ->   222.75 MiB
[ 636/ 642]               blk.63.ffn_gate.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 637/ 642]                 blk.63.ffn_up.weight - [12288, 33792,     1,     1], type =    f16, converting to q3_K .. size =   792.00 MiB ->   170.16 MiB
[ 638/ 642]                 blk.63.attn_k.weight - [12288,  1024,     1,     1], type =    f16, converting to q3_K .. size =    24.00 MiB ->     5.16 MiB
[ 639/ 642]            blk.63.attn_output.weight - [12288, 12288,     1,     1], type =    f16, converting to q4_K .. size =   288.00 MiB ->    81.00 MiB
[ 640/ 642]                 blk.63.attn_q.weight - [12288, 12288,     1,     1], type =    f16, converting to q3_K .. size =   288.00 MiB ->    61.88 MiB
[ 641/ 642]                 blk.63.attn_v.weight - [12288,  1024,     1,     1], type =    f16, converting to q4_K .. size =    24.00 MiB ->     6.75 MiB
[ 642/ 642]                   output_norm.weight - [12288,     1,     1,     1], type =    f32, size =    0.047 MB
llama_model_quantize_internal: model size  = 198004.67 MB
llama_model_quantize_internal: quant size  = 48607.44 MB
llama_model_quantize_internal: WARNING: 128 of 577 tensor(s) required fallback quantization

main: quantize time = 338672.88 ms
main:    total time = 338672.88 ms

N8python · 2024-04-04T22:00:28Z

So they work now?

github-actions · 2024-04-04T22:01:20Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 499 iterations 🚀

Expand details for performance related PR only

Concurrent users: 8, duration: 10m
HTTP request : avg=9414.96ms p(90)=26375.41ms fails=0, finish reason: stop=499 truncated=0
Prompt processing (pp): avg=243.1tk/s p(90)=734.8tk/s total=198.12tk/s
Token generation (tg): avg=97.59tk/s p(90)=260.19tk/s total=130.44tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=ec613b856c91d219b6d6efb7852e286b862fe797

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 499 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1712318663 --> 1712319289
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 378.73, 378.73, 378.73, 378.73, 378.73, 680.85, 680.85, 680.85, 680.85, 680.85, 676.32, 676.32, 676.32, 676.32, 676.32, 678.41, 678.41, 678.41, 678.41, 678.41, 697.47, 697.47, 697.47, 697.47, 697.47, 703.74, 703.74, 703.74, 703.74, 703.74, 711.3, 711.3, 711.3, 711.3, 711.3, 720.09, 720.09, 720.09, 720.09, 720.09, 726.53, 726.53, 726.53, 726.53, 726.53, 723.13, 723.13, 723.13, 723.13, 723.13, 729.69, 729.69, 729.69, 729.69, 729.69, 740.46, 740.46, 740.46, 740.46, 740.46, 754.32, 754.32, 754.32, 754.32, 754.32, 737.94, 737.94, 737.94, 737.94, 737.94, 706.59, 706.59, 706.59, 706.59, 706.59, 714.81, 714.81, 714.81, 714.81, 714.81, 715.69, 715.69, 715.69, 715.69, 715.69, 713.64, 713.64, 713.64, 713.64, 713.64, 722.62, 722.62, 722.62, 722.62, 722.62, 722.91, 722.91, 722.91, 722.91, 722.91, 721.62, 721.62, 721.62, 721.62, 721.62, 719.16, 719.16, 719.16, 719.16, 719.16, 719.07, 719.07, 719.07, 719.07, 719.07, 719.27, 719.27, 719.27, 719.27, 719.27, 726.47, 726.47, 726.47, 726.47, 726.47, 705.55, 705.55, 705.55, 705.55, 705.55, 707.87, 707.87, 707.87, 707.87, 707.87, 708.87, 708.87, 708.87, 708.87, 708.87, 713.01, 713.01, 713.01, 713.01, 713.01, 711.79, 711.79, 711.79, 711.79, 711.79, 711.47, 711.47, 711.47, 711.47, 711.47, 710.31, 710.31, 710.31, 710.31, 710.31, 707.69, 707.69, 707.69, 707.69, 707.69, 710.59, 710.59, 710.59, 710.59, 710.59, 709.37, 709.37, 709.37, 709.37, 709.37, 710.85, 710.85, 710.85, 710.85, 710.85, 712.17, 712.17, 712.17, 712.17, 712.17, 711.65, 711.65, 711.65, 711.65, 711.65, 721.68, 721.68, 721.68, 721.68, 721.68, 724.08, 724.08, 724.08, 724.08, 724.08, 723.86, 723.86, 723.86, 723.86, 723.86, 721.8, 721.8, 721.8, 721.8, 721.8, 721.5, 721.5, 721.5, 721.5, 721.5, 723.45, 723.45, 723.45, 723.45, 723.45, 721.35, 721.35, 721.35, 721.35, 721.35, 724.78, 724.78, 724.78, 724.78, 724.78, 724.94, 724.94, 724.94, 724.94, 724.94, 717.2, 717.2, 717.2, 717.2, 717.2, 715.39, 715.39, 715.39, 715.39, 715.39, 713.86, 713.86, 713.86, 713.86, 713.86, 712.93, 712.93, 712.93, 712.93, 712.93, 712.48, 712.48, 712.48, 712.48, 712.48, 718.16, 718.16, 718.16, 718.16, 718.16, 718.93, 718.93, 718.93, 718.93, 718.93, 720.05, 720.05, 720.05, 720.05, 720.05, 718.07, 718.07, 718.07, 718.07, 718.07, 719.49, 719.49, 719.49, 719.49, 719.49, 718.62, 718.62, 718.62, 718.62, 718.62, 719.37, 719.37, 719.37, 719.37, 719.37, 720.59, 720.59, 720.59, 720.59, 720.59, 721.52, 721.52, 721.52, 721.52]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 499 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1712318663 --> 1712319289
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 29.82, 29.82, 29.82, 29.82, 29.82, 22.94, 22.94, 22.94, 22.94, 22.94, 17.63, 17.63, 17.63, 17.63, 17.63, 18.69, 18.69, 18.69, 18.69, 18.69, 19.0, 19.0, 19.0, 19.0, 19.0, 19.6, 19.6, 19.6, 19.6, 19.6, 20.31, 20.31, 20.31, 20.31, 20.31, 20.46, 20.46, 20.46, 20.46, 20.46, 20.49, 20.49, 20.49, 20.49, 20.49, 20.39, 20.39, 20.39, 20.39, 20.39, 20.39, 20.39, 20.39, 20.39, 20.39, 19.9, 19.9, 19.9, 19.9, 19.9, 19.56, 19.56, 19.56, 19.56, 19.56, 19.3, 19.3, 19.3, 19.3, 19.3, 19.04, 19.04, 19.04, 19.04, 19.04, 18.38, 18.38, 18.38, 18.38, 18.38, 18.47, 18.47, 18.47, 18.47, 18.47, 18.64, 18.64, 18.64, 18.64, 18.64, 18.78, 18.78, 18.78, 18.78, 18.78, 18.61, 18.61, 18.61, 18.61, 18.61, 18.53, 18.53, 18.53, 18.53, 18.53, 18.44, 18.44, 18.44, 18.44, 18.44, 18.31, 18.31, 18.31, 18.31, 18.31, 18.23, 18.23, 18.23, 18.23, 18.23, 18.34, 18.34, 18.34, 18.34, 18.34, 18.31, 18.31, 18.31, 18.31, 18.31, 18.38, 18.38, 18.38, 18.38, 18.38, 18.42, 18.42, 18.42, 18.42, 18.42, 18.45, 18.45, 18.45, 18.45, 18.45, 18.43, 18.43, 18.43, 18.43, 18.43, 18.27, 18.27, 18.27, 18.27, 18.27, 18.23, 18.23, 18.23, 18.23, 18.23, 18.27, 18.27, 18.27, 18.27, 18.27, 18.32, 18.32, 18.32, 18.32, 18.32, 18.39, 18.39, 18.39, 18.39, 18.39, 18.48, 18.48, 18.48, 18.48, 18.48, 18.55, 18.55, 18.55, 18.55, 18.55, 18.51, 18.51, 18.51, 18.51, 18.51, 18.47, 18.47, 18.47, 18.47, 18.47, 18.44, 18.44, 18.44, 18.44, 18.44, 18.37, 18.37, 18.37, 18.37, 18.37, 18.34, 18.34, 18.34, 18.34, 18.34, 18.37, 18.37, 18.37, 18.37, 18.37, 18.43, 18.43, 18.43, 18.43, 18.43, 18.46, 18.46, 18.46, 18.46, 18.46, 18.52, 18.52, 18.52, 18.52, 18.52, 18.42, 18.42, 18.42, 18.42, 18.42, 18.32, 18.32, 18.32, 18.32, 18.32, 18.12, 18.12, 18.12, 18.12, 18.12, 17.97, 17.97, 17.97, 17.97, 17.97, 17.85, 17.85, 17.85, 17.85, 17.85, 17.66, 17.66, 17.66, 17.66, 17.66, 17.56, 17.56, 17.56, 17.56, 17.56, 17.64, 17.64, 17.64, 17.64, 17.64, 17.68, 17.68, 17.68, 17.68, 17.68, 17.71, 17.71, 17.71, 17.71, 17.71, 17.71, 17.71, 17.71, 17.71, 17.71, 17.69, 17.69, 17.69, 17.69, 17.69, 17.68, 17.68, 17.68, 17.68, 17.68, 17.64, 17.64, 17.64, 17.64, 17.64, 17.65, 17.65, 17.65, 17.65]

Details

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 499 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1712318663 --> 1712319289
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.07, 0.07, 0.07, 0.07, 0.07, 0.28, 0.28, 0.28, 0.28, 0.28, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.12, 0.18, 0.18, 0.18, 0.18, 0.18, 0.1, 0.1, 0.1, 0.1, 0.1, 0.09, 0.09, 0.09, 0.09, 0.09, 0.13, 0.13, 0.13, 0.13, 0.13, 0.23, 0.23, 0.23, 0.23, 0.23, 0.17, 0.17, 0.17, 0.17, 0.17, 0.33, 0.33, 0.33, 0.33, 0.33, 0.19, 0.19, 0.19, 0.19, 0.19, 0.15, 0.15, 0.15, 0.15, 0.15, 0.23, 0.23, 0.23, 0.23, 0.23, 0.2, 0.2, 0.2, 0.2, 0.2, 0.13, 0.13, 0.13, 0.13, 0.13, 0.16, 0.16, 0.16, 0.16, 0.16, 0.13, 0.13, 0.13, 0.13, 0.13, 0.27, 0.27, 0.27, 0.27, 0.27, 0.23, 0.23, 0.23, 0.23, 0.23, 0.26, 0.26, 0.26, 0.26, 0.26, 0.29, 0.29, 0.29, 0.29, 0.29, 0.16, 0.16, 0.16, 0.16, 0.16, 0.15, 0.15, 0.15, 0.15, 0.15, 0.26, 0.26, 0.26, 0.26, 0.26, 0.11, 0.11, 0.11, 0.11, 0.11, 0.15, 0.15, 0.15, 0.15, 0.15, 0.14, 0.14, 0.14, 0.14, 0.14, 0.27, 0.27, 0.27, 0.27, 0.27, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.22, 0.13, 0.13, 0.13, 0.13, 0.13, 0.12, 0.12, 0.12, 0.12, 0.12, 0.13, 0.13, 0.13, 0.13, 0.13, 0.15, 0.15, 0.15, 0.15, 0.15, 0.16, 0.16, 0.16, 0.16, 0.16, 0.18, 0.18, 0.18, 0.18, 0.18, 0.1, 0.1, 0.1, 0.1, 0.1, 0.25, 0.25, 0.25, 0.25, 0.25, 0.22, 0.22, 0.22, 0.22, 0.22, 0.23, 0.23, 0.23, 0.23, 0.23, 0.2, 0.2, 0.2, 0.2, 0.2, 0.12, 0.12, 0.12, 0.12, 0.12, 0.1, 0.1, 0.1, 0.1, 0.1, 0.14, 0.14, 0.14, 0.14, 0.14, 0.27, 0.27, 0.27, 0.27, 0.27, 0.42, 0.42, 0.42, 0.42, 0.42, 0.53, 0.53, 0.53, 0.53, 0.53, 0.39, 0.39, 0.39, 0.39, 0.39, 0.36, 0.36, 0.36, 0.36, 0.36, 0.31, 0.31, 0.31, 0.31, 0.31, 0.14, 0.14, 0.14, 0.14, 0.14, 0.11, 0.11, 0.11, 0.11, 0.11, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 0.2, 0.32, 0.32, 0.32, 0.32, 0.32, 0.16, 0.16, 0.16, 0.16, 0.16, 0.21, 0.21, 0.21, 0.21, 0.21, 0.13, 0.13, 0.13, 0.13, 0.13, 0.14, 0.14, 0.14, 0.14]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 499 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1712318663 --> 1712319289
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0]

slaren · 2024-04-04T22:01:30Z

The norm layer is probably being quantized if it is exported as a 2d tensor, but it needs to be f32. Exporting it as 1d (reshaped) may work.

N8python · 2024-04-04T22:15:21Z

Does the inference appear to be sane?

Noeda · 2024-04-04T22:20:23Z

Does the inference appear to be sane?

No. I don't get crashes but output looks random. Something is off. I'm using functionally identical code to the PR here.

I tried transposing the tensors; also I worked around the norm datatype problem by hacking quantizing to keep them at f16 or f32. These don't seem to help.

I started studying the transformers code again and comparing with llama.cpp. I think before the plus model, CohereLayerNorm was never used with 2D tensors; and I think slapping basic llm_build_norm won't work out of box. Need to figure out how is it diverging.

"hello world" prompt:

RefractAI · 2024-04-04T23:11:06Z

The norm layer is probably being quantized if it is exported as a 2d tensor, but it needs to be f32. Exporting it as 1d (reshaped) may work.

I am now exporting it in 1d f32 in the latest commit, and the issue remains - nonsense output because the Layer Norm should be 2D not 1D.

Noeda · 2024-04-05T01:15:23Z

In the current implementation seems like most of the values in the computation graph are zero. (Also I learned how to more systematically track intermediate computation values). It's very different compared to old Command-R model even before it hits the code path that uses the new norms.

Command-R+ (new model; 5 first and 5 last values from the first intermediate computed values)

tensor: 0x108178300 (inp_embd)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x108178300 (inp_embd)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x1081787b0 (norm-0)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x1081787b0 (norm-0)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x108178940 (attn_norm-0)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x108178940 (attn_norm-0)
0 0 0 0 0  ... 0 0 0 0 0

Command-R (old model, same tensors)

tensor: 0x108148300 (inp_embd)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x108148300 (inp_embd)
-0.000882447 0.00211787 0.000705957 -0.00141191 0.00123543  ... -0.0255127 -0.00927734 0.0742188 -0.00695801 0.00927734
tensor: 0x1081487b0 (norm-0)
0 0 0 0 0  ... 0 0 0 0 0
tensor: 0x1081487b0 (norm-0)
-0.0881477 0.0831225 0.0025248 -0.118372 0.032749  ... -1.10396 -0.398331 3.23062 -0.297527 0.408103
tensor: 0x108148940 (attn_norm-0)
-0.0881477 0.0831225 0.0025248 -0.118372 0.032749  ... -1.10396 -0.398331 3.23062 -0.297527 0.408103
tensor: 0x108148940 (attn_norm-0)
-0.00680045 0.00490344 0.000171977 -0.00778116 0.00199284  ... -0.0504679 -0.0273999 0.177759 -0.0168158 0.0285453

It's not zero across the board but it looks fairly broken. A bit surprising since it isn't that different of a model.

The quants I'm working with have a suspiciously good compression rates with zstd. They don't look like entire zeroes in hexedit but zstd compression % looks pretty anomalous for the first few gigabytes (only a few percentage %, then jumping up. Old Cohere models and other similar quants get barely any compression no matter where you are in the file).

Maybe worth checking if the GGUF converter isn't throwing away data somehow.

Although possibly I have corrupted files. Checksums from HF seem to match....hrm. It would be annoying if I've had trouble only because of corrupted files.

Edit: I don't get the zeroes in intermediate computations with f16. It's just so big the test workflow takes forever. I wonder if there might be another quant bug with tensors being larger than 2**31-1 like we found with the previous model, but more subtle this time.

Noeda · 2024-04-05T01:57:07Z

I verified with ddrescue that there are large blocks of zeroes in the quantized files, including the Q4 and Q8. I now suspect there possibly an overflow bug of some kind in quantizing somewhere, but of different nature than last time.

F16 also looks broken but broken in a different way (the Q4/Q8 were random with random symbols; this at least repeats words):

@Carolinabanana are you able to verify if your code branch has the same issue? That is, any quantized files have big blocks of zeroes. (Or anyone else who has .ggufs for that matter).

Some ways to test: You can run 1) zstd --compress --keep <path to .gguf> and check is the compression rate <95%. Or 2) you can run:

ddrescue -b 1000000 --generate-mode /dev/zero <path to .gguf> report.txt

This command detects long runs of zeroes in a file. You can interpret the output:

$ cat report.txt
# Mapfile. Created by GNU ddrescue version 1.28
# Command line: ddrescue -b 1000000 --generate-mode /dev/zero ./commandr_plus_Q4_K.gguf mapfile
# Start time:   2024-04-04 18:44:43
# Current time: 2024-04-04 18:44:48
# Finished
# current_pos  current_status  current_pass
0xE9C033CC0     +               1
#      pos        size  status
0x00000000  0x00A66600  +
0x00A66600  0xC737FE00  ?    <-------- this here says there's a block of zeroes around 0xC737FE00 bytes long (~3 gigs).
0xC7DE6400  0xDD430B140  +

A good file from ddrescue has just one line and it didn't "rescue" anything.

Seems to be at the beginning of the file. The embedding tensors are broken? Both Q4 and Q8 have a big block of zeroes there. That was the issue for the plain Command-R but it broke much more visibly that time. Hmm.

Edit: Found a very suspicious 2024-03-09 15:53:59 +0200 20360) const int n = nrows * n_per_row; in ggml.c that would overflow with these tensors. Will open a separate PR if I can confirm.

dranger003 · 2024-04-05T02:05:08Z

# Mapfile. Created by GNU ddrescue version 1.28
# Command line: ddrescue -b 1000000 --generate-mode /dev/zero /md0/models/CohereForAI/c4ai-command-r-plus/ggml-c4ai-command-r-plus-q8_0.gguf mapfile
# Start time:   2024-04-04 22:02:17
# Current time: 2024-04-04 22:02:42
# Finished
# current_pos  current_status  current_pass
0x19AF3A0E80     +               1
#      pos        size  status
0x00000000  0x00A7D8C0  +
0x00A7D8C0  0xC732DF80  ?
0xC7DAB840  0x18E76868E0  +

Noeda · 2024-04-05T02:06:46Z

@dranger003 Ah thanks! Yeah, that indicate that you also have a zero hole in your file. Okay good, so it's not just me. I think I may have found the part that overflows. It's the same tensor as last Command-R model that also had an overflow, but it overflows in a different part this time. Maybe the tensor is even larger this time.

Noeda · 2024-04-05T03:28:37Z

I'm getting coherent text now from Q8 after overflow fixes and some clean ups (non-colored text is AI output, and the stuff before is my prompt).

I'm doing a few cleanups and then closing for the day and write some notes if anyone wants to use my code. The code is not really something you'd want to merge because it has my graveyard of debugging code and other crap.

Not very fast. ~1 second per token on a Mac Studio. It's big so makes sense. Haven't tried the other quants.

N8python · 2024-04-05T03:43:11Z

Congrats on getting it working :D :D :D. My ballpark for a m3 max (my device) at Q8 would be 2-3 tok/sec (logic -> 3-5 tok/sec at Q3 for 120 b)... what mac studio do you have? Maybe there's a slower part of the inference code?

Noeda · 2024-04-08T17:23:11Z

Would you care uploading the Q4_K to huggingface?

Thnx for your work btw

I'll be honest; I straight up might not have time. I have some possibly high-stakes interviews this week and will spend time prepping for those instead, starting right about when I finish typing this comment :D I'm not sure if @dranger003 has an older version in their HF if you look in past commits if there's a working Q4_K_M one uploaded.

I'm testing on Metal and Q4_0 is broken - produces garbage and nan during perplexity:
# garbage
./quantize ./models/command-r-plus/ggml-model-f16.gguf ./models/command-r-plus/ggml-model-q4_0.gguf q4_0

# works (requires #6541)
./quantize --token-embedding-type q8_0 ./models/command-r-plus/ggml-model-f16.gguf ./models/command-r-plus/ggml-model-q4_0.gguf q4_0
Similar observations as @Noeda noted earlier, though Q4_0 + Q6 token_embd tensor does not work for me. Probably more integer overflow problems in the Metal backend. Looking into this

Just rechecked my setup:

quantize, gguf-dump.py, main and perplexity from Q4_0

# From history:
./bin/quantize /Volumes/T9/rplus_banana_f16.gguf /Volumes/T9/rplus_banana_Q4_0.gguf Q4_0

# token embd
(venv) mikkojuola@Mikkos-Mac-Studio ~/realllama/banana/llama.cpp> python gguf-py/scripts/gguf-dump.py /Volumes/T9/rplus_banana_Q4_0.gguf|grep token
     17: STRING     |        1 | tokenizer.ggml.model = 'gpt2'
     18: [STRING]   |   256000 | tokenizer.ggml.tokens
     19: [INT32]    |   256000 | tokenizer.ggml.token_type
     20: [STRING]   |   253333 | tokenizer.ggml.merges
     21: UINT32     |        1 | tokenizer.ggml.bos_token_id = 5
     22: UINT32     |        1 | tokenizer.ggml.eos_token_id = 255001
     23: UINT32     |        1 | tokenizer.ggml.padding_token_id = 0
     24: BOOL       |        1 | tokenizer.ggml.add_bos_token = True
     25: BOOL       |        1 | tokenizer.ggml.add_eos_token = False
      1: 3145728000 | 12288, 256000,     1,     1 | Q6_K    | token_embd.weight

# main test
./bin/main --model /Volumes/T9/rplus_banana_Q4_0.gguf --prompt "hello world! my name is" --top-k 1
<omitted log output>
hello world! my name is kate and i am a 20 year old college student. i am a huge fan of the twilight series and i am currently working on my first fanfic. i am a huge fan of the twilight series and i am currently working on my

# perplexity
$ mikkojuola@Mikkos-Mac-Studio ~/realllama/banana/llama.cpp/build> ./bin/perplexity --model /Volumes/T9/rplus_banana_Q4_0.gguf -f ~/llama.cpp/ci/wikitext-2-raw/wiki.test.raw -ngl 256
<omitted log output>
perplexity: calculating perplexity over 560 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 21.87 seconds per pass - ETA 51.02 minutes
[1]3.4319,[2]4.4709,[3]3.8271,[4]3.9028,[5]3.7290,[6]3.8120,[7]3.9015,[8]4.1230,^C

This is a case of "works for me". Wondering what could be different.

MacOS version and snippet from `llama.cpp` loading itself when using Q4_0.

$ sw_vers -productVersion
14.4.1
$ xcodebuild -version
Xcode 15.3

$
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 75000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Ultra
ggml_metal_init: picking default device: Apple M2 Ultra
ggml_metal_init: loading '/Users/mikkojuola/realllama/banana/llama.cpp/build/bin/default.metallib'
ggml_metal_init: GPU name:   Apple M2 Ultra
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 193986.56 MB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   512.00 MiB, (56981.12 / 185000.00)
llama_kv_cache_init:      Metal KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  =  512.00 MiB, K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     3.91 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size =   524.00 MiB, (57505.12 / 185000.00)
llama_new_context_with_model:      Metal compute buffer size =   524.00 MiB
llama_new_context_with_model:        CPU compute buffer size =    28.01 MiB
llama_new_context_with_model: graph nodes  = 2312
llama_new_context_with_model: graph splits = 2

dranger003 · 2024-04-08T19:25:58Z

@teis-e Why not use IQ4_XS?

teis-e · 2024-04-08T19:44:55Z

@teis-e Why not use IQ4_XS?

I saw your HuggingFace, but I don't understand, there are 2 files for IQ4_XS. Which one do I download?

dranger003 · 2024-04-08T19:46:55Z

@teis-e Why not use IQ4_XS?

I saw your HuggingFace, but I don't understand, there are 2 files for IQ4_XS. Which one do I download?

iq4_xs-00001-of-00002 and iq4_xs-00002-of-00002 because HF only allows for max 50G per file.

teis-e · 2024-04-08T19:53:24Z

@teis-e Why not use IQ4_XS?

I saw your HuggingFace, but I don't understand, there are 2 files for IQ4_XS. Which one do I download?

iq4_xs-00001-of-00002 and iq4_xs-00002-of-00002 because HF only allows for max 50G per file.

And then i merge them? ./gguf-split --merg

How exactly?

candre23 · 2024-04-08T19:59:33Z

@teis-e Why not use IQ4_XS?

I saw your HuggingFace, but I don't understand, there are 2 files for IQ4_XS. Which one do I download?

iq4_xs-00001-of-00002 and iq4_xs-00002-of-00002 because HF only allows for max 50G per file.

And then i merge them? ./gguf-split --merg

How exactly?

You don't need to merge them at all. Download both files and just point LCPP at the first one. It will load both parts properly.

If you want to merge them though, it's

gguf-split --merge /location/of/first-part.gguf /location/of/merged.gguf

dranger003 · 2024-04-08T20:00:59Z

copy /b file_name1 + file_name2 file_name_final

This is incorrect, you need to use gguf-split if you want to merge (which isn't needed to run the model).

candre23 · 2024-04-08T20:02:25Z

copy /b file_name1 + file_name2 file_name_final

Models that have been split with the now-built-in splitting utility can't simply be concatenated. You can either leave them in multiple pieces and LCPP will load them as-is, or you can use the utility to recombine the pieces into a single large GGUF.

teis-e · 2024-04-08T20:32:22Z

I got it now. Thnx for all the answers.

I'm now moving to the next step using it on LocalAI

Has anybody got that working?

dranger003 · 2024-04-08T21:00:17Z

Anyone able to run perplexity using CUDA? ./build/bin/perplexity -b 512 -ngl 65 -f /sdc1/models/wikitext-2-raw/wiki.test.raw -m /md0/models/CohereForAI/c4ai-command-r-plus/ggml-c4ai-command-r-plus-104b-iq4_xs.gguf

I have perplexity working again for this model using CUDA. I pushed the changes here dranger003@0bcfc87 and here dranger003@835d702

Quantization	Model size (GiB)	Perplexity	Delta (FP16)
IQ1_S	21.59	8.2530 +/- 0.05234	88.23%
IQ1_M	23.49	7.4267 +/- 0.04646	69.39%
IQ2_XXS	26.65	6.1138 +/- 0.03683	39.44%
IQ2_XS	29.46	5.6489 +/- 0.03309	28.84%
IQ2_S	31.04	5.5187 +/- 0.03210	25.87%
IQ2_M	33.56	5.1930 +/- 0.02989	18.44%
IQ3_XXS	37.87	4.8258 +/- 0.02764	10.07%
IQ3_XS	40.61	4.7263 +/- 0.02665	7.80%
IQ3_S	42.80	4.6321 +/- 0.02600	5.65%
IQ3_M	44.41	4.6202 +/- 0.02585	5.38%
Q3_K_M	47.48	4.5770 +/- 0.02609	4.39%
Q3_K_L	51.60	4.5568 +/- 0.02594	3.93%
IQ4_XS	52.34	4.4428 +/- 0.02508	1.33%
Q5_K_S	66.87	4.3833 +/- 0.02466	-0.03%
Q6_K	79.32	4.3672 +/- 0.02455	-0.39%
Q8_0	102.74	4.3858 +/- 0.02469	0.03%
FP16	193.38	4.3845 +/- 0.02468	-

EDIT: @Carolinabanana I'm running PPL on all the quants to test the code, looks like we'll need more updates, I'll continue to commit as I find them.

ghchris2021 · 2024-04-09T01:51:34Z

copy /b file_name1 + file_name2 file_name_final

Models that have been split with the now-built-in splitting utility can't simply be concatenated. You can either leave them in multiple pieces and LCPP will load them as-is, or you can use the utility to recombine the pieces into a single large GGUF.

The above comment caught my eye. Forgive a simplistic question but I've long seen "split" files which are for GGUF or any other model formats on, e.g. hf, and usually in the file name there's not a lot of information / format consistency other than typically something along the lines of ggml-c4ai-command-r-plus-104b-f16-00005-of-00005.gguf or pytorch_model-00001-of-00004.bin.

So if there's no reliably consistent nomenclature as to file naming / "extension" to indicate "specially split" files, and many files are just "ordinarily split" by sharding without other transformation, how should a user tell whether the gguf file can be trivially concatenated or whether it's been somehow altered / wrapped with headers / trailers / whatever and may / must be gguf-split processed or left sharded and loaded that way?

I assume there's some kind of identifiable header / magic flag ("/usr/bin/file") or command line option to gguf-using utilities that can check the format / integrity / usability of sharded / not files?

phymbert · 2024-04-09T02:00:18Z

how should a user tell whether the gguf file can be trivially concatenated or whether it's been somehow altered / wrapped with headers / trailers / whatever and may / must be gguf-split processed or left sharded and loaded that way?

The format is specified in llama_split_path in llama.h:

llama.cpp/llama.h

Lines 1038 to 1041 in cc4a954

    
           /// @details Build a split GGUF final path for this chunk. 
        
           ///          llama_split_path(split_path, sizeof(split_path), "/models/ggml-model-q4_0", 2, 4) => split_path = "/models/ggml-model-q4_0-00002-of-00004.gguf" 
        
           //  Returns the split_path length. 
        
           LLAMA_API int llama_split_path(char * split_path, size_t maxlen, const char * path_prefix, int split_no, int split_count);

I have tried to summarize all here:

How to use the `gguf-split` / Model sharding demo #6404

Feel free to add an improvment request with split label:
https://github.com/ggerganov/llama.cpp/issues?q=is%3Aopen+is%3Aissue+label%3Asplit

ggerganov · 2024-04-09T08:08:02Z

I'm testing on Metal and Q4_0 is broken - produces garbage and nan during perplexity:
# garbage
./quantize ./models/command-r-plus/ggml-model-f16.gguf ./models/command-r-plus/ggml-model-q4_0.gguf q4_0

# works (requires #6541)
./quantize --token-embedding-type q8_0 ./models/command-r-plus/ggml-model-f16.gguf ./models/command-r-plus/ggml-model-q4_0.gguf q4_0
Similar observations as @Noeda noted earlier, though Q4_0 + Q6 token_embd tensor does not work for me. Probably more integer overflow problems in the Metal backend. Looking into this

The problem was on my end - somehow I had the QK normalization tensors quantized to Q4_0 instead of keeping them in F32. I redid the conversion and quantization using the latest branch and I no longer observe issues with Metal. Q4_0 + F16 token_embd tensor also works correctly

I wouldn't be surprised if we have integer overflows in the Metal kernels (I'm actually more surprised that we don't 😄 ). We'll fix those as they occur

phymbert · 2024-04-09T08:22:10Z

convert-hf-to-gguf.py

@@ -160,7 +160,7 @@ def write_tensors(self):
                data = data.astype(np.float32)

            # TODO: Why cant we use these float16 as-is? There should be not reason to store float16 as float32
-            if self.ftype == 1 and data_dtype == np.float16 and n_dims == 1:
+            if self.ftype == 1 and data_dtype == np.float16 and (n_dims == 1 or new_name.endswith("_norm.weight")):


Would be nice to update the comment

phymbert · 2024-04-09T08:24:24Z

ggml-cuda.cu

@@ -1225,7 +1225,7 @@ static void ggml_cuda_op_mul_mat_cublas(

    // the main device has a larger memory buffer to hold the results from all GPUs
    // ldc == nrows of the matrix that cuBLAS writes into
-    int ldc = id == ctx.device ? ne0 : row_diff;
+    int64_t ldc = id == ctx.device ? ne0 : row_diff;


It would be great to update the PR description to summarize why we upcasted all int params to int64 in this cntext.

@phymbert I reverted that one in dranger003@835d702 but it looks like the PR got merged before it could be pulled. My level of knowledge of these is no where near to be on par with those that created the code and so I definitely rely on your reviews. I looked at some of the values through the debugger but since we have so many overflowing I had to change them in batches, so this means I most likely changed some that don't need to be changed. Hopefully this makes some sense. I can submit another PR to master with that last commit, otherwise perplexity was still broken using CUDA for this model without that commit.

Thanks for the explanation, please raise with @ggerganov as I am out of the subject regarding CommandR+

Thanks, I opened a PR as a follow-up (#6563)

kalomaze · 2024-04-09T19:52:42Z

@dranger003 is there a chance you could upload an imatrix q4_K_S quant? (and/or imatrix q3_K_L)
CPU decoding is extremely slow on IQ quants, and it might be bottlenecking partial offloading from being feasible speed wise on 2x3090 setups :/
I can get around ~90% of the layers, but those last few are bringing down my t/s pretty dramatically.
Also heard someone claim the 5 bit model was 2x faster compared to IQ3_M.

EDIT: Apparently IQ3 is quite slow, while q4_K_S is equivalent to IQ4_XS.
It may be worth it to add q3_K_L and q3_K_M.

dranger003 · 2024-04-09T23:20:13Z

EDIT: Apparently IQ3 is quite slow, while q4_K_S is equivalent to IQ4_XS.
It may be worth it to add q3_K_L and q3_K_M.

@kalomaze Sure, I'l see what I can do.

EDIT: I uploaded the new quants, I'll update the perplexity table shortly.

kalomaze · 2024-04-10T01:16:14Z

Thanks a lot. For now I'm using IQ_3XXS and it seems fairly servicable

yamikumo-DSD · 2024-04-10T09:25:51Z

I've tried IQ3_M variant for several hours on my Apple silicon.
Basically I felt it has high fluency in daily language ignoring the time before it spits out first token.
However, what I often observed was lower capability of coding task like missing indentation, unclosed bracket, etc., comparing to same size Command-R.
I'm not sure about imatrix quantization, but I read it extracts importance matrix using wikitext.
So, I suspect it's due to the bias and will try normal Q3 variants just uploaded.
Anybody else observed phenomenon like my case?

christianwengert · 2024-04-10T10:20:55Z

I've tried IQ3_M variant for several hours on my Apple silicon. Basically I felt it has high fluency in daily language ignoring the time before it spits out first token. However, what I often observed was lower capability of coding task like missing indentation, unclosed bracket, etc., comparing to same size Command-R. I'm not sure about imatrix quantization, but I read it extracts importance matrix using wikitext. So, I suspect it's due to the bias and will try normal Q3 variants just uploaded. Anybody else observed phenomenon like my case?

Can you show me your command line for that? When I use Q1_M I can use command-r on apple silicon M1 (64GB), but when I use a Q3_M I only get garbage and the logs (I use the server) show the following for each token:

ggml_metal_graph_compute: command buffer 3 failed with status 5

yamikumo-DSD · 2024-04-10T10:34:39Z

I've tried IQ3_M variant for several hours on my Apple silicon. Basically I felt it has high fluency in daily language ignoring the time before it spits out first token. However, what I often observed was lower capability of coding task like missing indentation, unclosed bracket, etc., comparing to same size Command-R. I'm not sure about imatrix quantization, but I read it extracts importance matrix using wikitext. So, I suspect it's due to the bias and will try normal Q3 variants just uploaded. Anybody else observed phenomenon like my case?

Can you show me your command line for that? When I use Q1_M I can use command-r on apple silicon M1 (64GB), but when I use a Q3_M I only get garbage and the logs (I use the server) show the following for each token:
ggml_metal_graph_compute: command buffer 3 failed with status 5

I'm currently using llama-cpp-python backed by the latest Llama.cpp, so I'm not using CLI now.
But I guess I had same errors when I missed 'sysctl iogpu.wired_limit_mb=foobar' on my Mac with older version of repository. Not sure i can reproduce same problem currently, so I'm gonna test it when I have time.

4cecoder · 2024-04-12T04:46:51Z

These weights are split by gguf-split, so you cannot use cat to merge them. There's no need to merge them manually. Simply pass the first split and llama.cpp will automatically load all splits. If, for any reason, you want to merge splits, you must use the gguf-split with --merge option:
./gguf-split --merge ggml-c4ai-command-r-plus-104b-iq4_xs-00001-of-00002.gguf ggml-c4ai-command-r-plus-104b-iq4_xs.gguf
That is correct, I tested all quants successfully using just the first chunk. And I also have this information on the model page in the bullet list.

Please help building gguf-split on windows

#6404 (reply in thread)

Sintayew4 · 2024-04-12T23:29:43Z

#6491

* Add Command R Plus GGUF * Add Command R Plus GGUF * Loading works up to LayerNorm2D * Export new tensors in 1D so they are not quantized. * Fix embedding layer based on Noeda's example * Whitespace * Add line * Fix unexpected tokens on MPS. Re-add F16 fix. ((Noeda) * dranger003: Fix block index overflow in CUDA dequantizing. * Reverted blocked multiplication code as it still has issues and could affect other Llama arches * export norms as f32 * fix overflow issues during quant and other cleanup * Type convention Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * dranger003: Fix more int overflow during quant. --------- Co-authored-by: S <seast@Ss-Mac-Studio.local> Co-authored-by: S <s@example.com> Co-authored-by: slaren <slarengh@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

$RefractAI$

S added 2 commits April 4, 2024 18:23

Add Command R Plus GGUF

2efcd87

Add Command R Plus GGUF

fbab984

$@RefractAI$ RefractAI mentioned this pull request Apr 4, 2024

Convert-hf-to-gguf fails with command-r-plus #6488

Closed

Loading works up to LayerNorm2D

e4b2e2d

acanis mentioned this pull request Apr 4, 2024

Add Command-R Model #6033

Merged

Export new tensors in 1D so they are not quantized.

c354db7

$@RefractAI$ RefractAI marked this pull request as draft April 5, 2024 00:48

he29-net mentioned this pull request Apr 8, 2024

Re-quantization of a split gguf file produces "invalid split file" #6548

Closed

dranger003: Fix more int overflow during quant.

ea1aeba

ggerganov merged commit 5dc9dd7 into ggerganov:master Apr 9, 2024
53 of 59 checks passed

phymbert reviewed Apr 9, 2024

View reviewed changes

dranger003 mentioned this pull request Apr 9, 2024

Fix more int overflow during quant (PPL/CUDA). #6563

Merged

ggerganov mentioned this pull request Apr 25, 2024

llamafile : improve sgemm.cpp #6796

Merged

TomoshibiAkira mentioned this pull request Apr 27, 2024

Command R Plus crashed on large context (~40K) with CUDA #6948

Closed

Add Command R Plus support #6491

Add Command R Plus support #6491

Conversation

RefractAI commented Apr 4, 2024

bartowski1182 commented Apr 4, 2024

N8python commented Apr 4, 2024

bartowski1182 commented Apr 4, 2024

Noeda commented Apr 4, 2024 • edited Loading

bartowski1182 commented Apr 4, 2024

candre23 commented Apr 4, 2024

Noeda commented Apr 4, 2024 • edited Loading

N8python commented Apr 4, 2024

sammcj commented Apr 4, 2024

RefractAI commented Apr 4, 2024 • edited Loading

sammcj commented Apr 4, 2024 • edited Loading

N8python commented Apr 4, 2024

github-actions bot commented Apr 4, 2024 • edited Loading

slaren commented Apr 4, 2024

N8python commented Apr 4, 2024

Noeda commented Apr 4, 2024

RefractAI commented Apr 4, 2024 • edited Loading

Noeda commented Apr 5, 2024 • edited Loading

Noeda commented Apr 5, 2024 • edited Loading

dranger003 commented Apr 5, 2024

Noeda commented Apr 5, 2024

Noeda commented Apr 5, 2024

N8python commented Apr 5, 2024

Noeda commented Apr 8, 2024

dranger003 commented Apr 8, 2024

teis-e commented Apr 8, 2024

dranger003 commented Apr 8, 2024 • edited Loading

teis-e commented Apr 8, 2024

candre23 commented Apr 8, 2024

dranger003 commented Apr 8, 2024

candre23 commented Apr 8, 2024

teis-e commented Apr 8, 2024

dranger003 commented Apr 8, 2024 • edited Loading

ghchris2021 commented Apr 9, 2024

phymbert commented Apr 9, 2024

ggerganov commented Apr 9, 2024 • edited Loading

phymbert Apr 9, 2024

Choose a reason for hiding this comment

phymbert Apr 9, 2024

Choose a reason for hiding this comment

dranger003 Apr 9, 2024 • edited Loading

Choose a reason for hiding this comment

phymbert Apr 9, 2024

Choose a reason for hiding this comment

dranger003 Apr 9, 2024

Choose a reason for hiding this comment

kalomaze commented Apr 9, 2024 • edited Loading

dranger003 commented Apr 9, 2024 • edited Loading

kalomaze commented Apr 10, 2024

yamikumo-DSD commented Apr 10, 2024 • edited Loading

christianwengert commented Apr 10, 2024

yamikumo-DSD commented Apr 10, 2024

4cecoder commented Apr 12, 2024 • edited Loading

Please help building gguf-split on windows

Sintayew4 commented Apr 12, 2024

$@RefractAI$ RefractAI commented Apr 4, 2024

Noeda commented Apr 4, 2024 •

edited

Loading

Noeda commented Apr 4, 2024 •

edited

Loading

RefractAI commented Apr 4, 2024 •

edited

Loading

sammcj commented Apr 4, 2024 •

edited

Loading

github-actions bot commented Apr 4, 2024 •

edited

Loading

RefractAI commented Apr 4, 2024 •

edited

Loading

Noeda commented Apr 5, 2024 •

edited

Loading

Noeda commented Apr 5, 2024 •

edited

Loading

dranger003 commented Apr 8, 2024 •

edited

Loading

dranger003 commented Apr 8, 2024 •

edited

Loading

ggerganov commented Apr 9, 2024 •

edited

Loading

dranger003 Apr 9, 2024 •

edited

Loading

kalomaze commented Apr 9, 2024 •

edited

Loading

dranger003 commented Apr 9, 2024 •

edited

Loading

yamikumo-DSD commented Apr 10, 2024 •

edited

Loading

4cecoder commented Apr 12, 2024 •

edited

Loading