this format is no longer supported #1408

Closed
apcameron opened this issue May 11, 2023 · 21 comments

@apcameron
Contributor

The latest merge breaks the old models.
We need instructions on how to update them.

./examples/chat.sh
main: build = 529 (b9fd7ee)
main: seed = 1683842865
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
error loading model: this format is no longer supported (see #1305)
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/7B/ggml-model-q4_0.bin'
main: error: unable to load model

@moh21amed

Same here. For now, just check out an older commit.

@redthing1

Which older commit?

@moh21amed

Run this in the llama.cpp directory from the command line:
git checkout cf348a6
It worked for me.

@javea7171

javea7171 commented May 12, 2023

Competing forks and ecosystem fracture are looming. The project needs a common file format and a plan for breaking changes. Without them, the community's time and money will be wasted.

@philpax

philpax commented May 12, 2023

I'm a maintainer of llm (a Rust version of llama.cpp and other models), and we're not entirely sure how we're going to handle this. We'd like to maintain compatibility with the previous models, but it doesn't seem like that's an option at all if we update to the latest version of GGML.

Is it possible to restore the old quantisation functions as an option for loading? I understand the desire to keep iterating, but there's an extant ecosystem of hundreds of models out there, and this is going to cause a lot of unnecessary disruption 😦

This will also cause a lot of unnecessary confusion for users, given the use of the same names for the quantisation methods - it's very unclear whether a q4_1 model released today was produced with the old or the new quantisation method.

@LostRuins
Collaborator

Hi @philpax, I think this is because the llama converter is still tagging new models as file version 1 instead of version 2. See: #1405 (comment)
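
For context, the version tag under discussion is a 32-bit field written right after the 4-byte "ggjt" magic (0x67676a74, the same value printed in the loader errors later in this thread). A simplified C sketch of what the converter is expected to write; the constant names and the helper are illustrative, not the actual llama.cpp converter code:

#include <stdint.h>
#include <stdio.h>

#define GGJT_MAGIC        0x67676a74u  /* "ggjt" */
#define GGJT_FILE_VERSION 2u           /* new quantization scheme; pre-#1405 files carry 1 */

/* Illustrative header writer: the bug described above amounts to the
   converter still writing 1 here for models quantized with the new scheme. */
static void write_ggjt_header(FILE *out) {
    const uint32_t magic   = GGJT_MAGIC;
    const uint32_t version = GGJT_FILE_VERSION;
    fwrite(&magic,   sizeof magic,   1, out);
    fwrite(&version, sizeof version, 1, out);
    /* hyperparameters, vocabulary and tensor data follow */
}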

@philpax

philpax commented May 12, 2023

Hi @philpax, I think this is because the llama converter is still tagging new models as file version 1 instead of version 2. See: #1405 (comment)

Good to know - I haven't tackled it yet - but even so, what are your plans for loading both v1 and v2? Will you keep both versions of ggml and switch based on the file version?

@LostRuins
Collaborator

For KoboldCPP I have already prepared a GGML shim that contains refactored versions of all the old quantizers:

https://github.com/LostRuins/koboldcpp/blob/concedo_experimental/ggml_v2.c

Once this is ironed out, I'll probably swap in the dequant functions as needed based on the file version, if possible.

I plan to maintain 100% backward compatibility with every single past ggml model.
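
A minimal C sketch of the dispatch idea described above, with placeholder function names (the actual shim code lives in the ggml_v2.c file linked above):

#include <stdint.h>

/* Placeholder dequantizers for one quantization type; these names are
   illustrative, not the real symbols in ggml_v2.c. */
static void dequantize_row_q4_0_v1(const void *src, float *dst, int k) {
    (void)src; (void)dst; (void)k;  /* stub: decode using the pre-#1405 Q4_0 block layout */
}
static void dequantize_row_q4_0_v2(const void *src, float *dst, int k) {
    (void)src; (void)dst; (void)k;  /* stub: decode using the current Q4_0 block layout */
}

typedef void (*dequantize_row_fn)(const void *src, float *dst, int k);

/* Pick the dequantizer that matches the file version read from the header. */
static dequantize_row_fn pick_q4_0_dequantizer(uint32_t file_version) {
    return file_version >= 2 ? dequantize_row_q4_0_v2 : dequantize_row_q4_0_v1;
}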

@batmac

batmac commented May 12, 2023

I'm just here to confirm that it would be really nice to continue using all the quantized models out there! (for instance, all the "ggml" models on huggingface) :)

@ggerganov
Member

Wouldn't it be better, instead of distributing the quantized models, to distribute the F16 models?
The F16 format is unlikely to change, while the quantized formats are likely to change again in the future.

If the F16 models are distributed, then your software can decide when to re-generate a quantized model and when to pick one directly from the cache.
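
One way to read that suggestion, as a rough C sketch: keep only the distributed F16 file around and regenerate the quantized file into a cache on demand. quantize_f16_to_q4_0() is a hypothetical wrapper around whatever quantization step the application uses (for example, invoking the quantize tool); it is not an existing llama.cpp API.

#include <stdio.h>

/* Hypothetical wrapper around the application's quantization step;
   returns 0 on success. */
int quantize_f16_to_q4_0(const char *f16_path, const char *q4_path);

/* Return a path to a quantized model, regenerating it from the F16 file
   whenever the cached copy is missing (e.g. after a format change). */
static const char *get_quantized_model(const char *f16_path, const char *cache_path) {
    FILE *cached = fopen(cache_path, "rb");
    if (cached) {
        fclose(cached);          /* a quantized copy is already in the cache */
        return cache_path;
    }
    if (quantize_f16_to_q4_0(f16_path, cache_path) != 0) {
        return NULL;             /* quantization failed */
    }
    return cache_path;
}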

@philpax

philpax commented May 12, 2023

For us, personally, we're not distributing anything - we're supporting what's out there, to make sure that users can use a model, no matter how old it is. Users prefer downloading the quantised models because they're smaller and ready to go - the f16 models are often not accessible to them, being up to four times as large and requiring additional work.

I'm likely to take the same workaround @LostRuins has and bundle an older ggml (or a part of it) to keep supporting old models. For better or worse, those hundreds of models have proliferated to tens of thousands of users in the few months this has been active - we need to be able to either keep using those models, or to be able to update those models in a seamless and convenient way.

@ZenKemuri

Sorry, I'm a newbie and yesterday was my first day using LLaMA.cpp; it looks like I decided to start at a very difficult moment. I don't know most of the terms, and I got lost trying to understand how to install these technologies on my PC.
Now... my question is: what if I want to try a model right now? What model should I download? Because I have the same issue - I did everything right, but it tells me that it can't load the model :(
So, what should I do? Do I have to quantize anything?

Sorry, I'm really interested, but now I am more lost than yesterday.

  • Ubuntu 22.04
  • i7-5410
  • 8 GB RAM
  • 4 GB VRAM GTM840-M

@grencez
Contributor

grencez commented May 13, 2023

Wouldn't it be better, instead of distributing the quantized models, to distribute the F16 models?

There are so many fine-tunes floating around. If someone wants to try a variety of them, grabbing the quantized versions directly makes a lot of sense storage-wise. Maybe if LoRAs loaded faster, more of those would be shared instead of full weights, but as I understand it we need to read an f16 model for good LoRA results... which is pretty slow.

Even with a good reason to distribute the quantized models, I think changing format like this is still reasonable because it's so easy to requantize. The popular models on huggingface were requantized within a day.

Anyway, the point of this bug is to give better advice in the error message, right? Would it make sense to link to docs on how to quantize an f16 model?

@MadaraLoL

Can someone please explain how to fix this?

@philpax

philpax commented May 14, 2023

Can someone please explain how to fix this?

There is currently no way to convert your existing quantized (qX_Y) models to the new quantization format. You will have to do one of the following:

  1. Downgrade the version of llama.cpp you're using
  2. Find/retrieve the original source models (f32, f16), and quantize them with the new scheme
  3. Hope someone's requantized the models for you (TheBloke has updated all of his models)

@titlestad

titlestad commented May 15, 2023

UPDATE - scratch this; it seems that when running the code in a different environment on the same computer, it is as fast as before.

Yes, please make sure that older code to handle older models can still be used. Things like this can break the momentum of the project. I lost productivity today because my old model didn't load, and the "fixed" model is many times slower with the new code - almost unusable (Vicuna). I'll have to look into downgrading.

@apcameron
Contributor Author

I just reran the conversion program to convert from the f16 format to the new format

@moh21amed

I just reran the conversion program to convert from the f16 format to the new format

The thing is, most of us don't have the original version; we only have the quantized version.

@ghost

ghost commented May 21, 2023

Follow these steps to easily acquire the (Alpaca/LLaMA) F16 model:

  1. Download and install the Alpaca-lora repo: https://github.com/tloen/alpaca-lora
  2. Once you've successfully downloaded the model weights, you should have them in a Hugging Face cache folder (on Linux, something like the path in the next step).
  3. Run: python convert-pth-to-ggml.py ~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348 1
  4. Once you get your f16 model, copy it into the llama.cpp/models folder.
  5. Run: ./quantize ./models/ggml-model-f16.bin ./models/ggml-model-q4_0.bin q4_0

Done.

@ghost

ghost commented May 21, 2023

Ah, never mind.

My alpaca model is now spitting out weird hallucinations and not even responding to any of my prompts. It's even typing things in Chinese and French.

I even downloaded the model from TheBloke, and it doesn't work:

llama.cpp: loading model from /home/twinlizzie/llm_models/7B/llama-7b.ggmlv3.q4_0.bin
error loading model: unknown (magic, version) combination: 67676a74, 00000003; is this really a GGML file?
llama_init_from_file: failed to load model

I should have never updated my pip installation. I guess it's time to spend another week on terminal runarounds.

Update: (I think?) It seems to work using llama.cpp, but the python bindings are now broken.
I'll have a look and see if I can switch to the python bindings of abetlen/llama-cpp-python and get it to work properly.

UPDATE2: My bad. It is working - but the python bindings I am using no longer work.
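
For anyone puzzled by the "unknown (magic, version) combination: 67676a74, 00000003" error above: the loader reads a 4-byte magic and a 4-byte version from the start of the file and rejects combinations it was not built for; here a ggjt v3 file was opened by bindings that only knew about earlier versions. A simplified illustration of that kind of check, not the actual llama.cpp code:

#include <stdint.h>
#include <stdio.h>

#define GGJT_MAGIC             0x67676a74u  /* "ggjt" */
#define GGJT_MAX_KNOWN_VERSION 2u           /* highest version the older bindings understand */

static int check_ggjt_header(FILE *f) {
    uint32_t magic = 0, version = 0;
    if (fread(&magic, sizeof magic, 1, f) != 1 ||
        fread(&version, sizeof version, 1, f) != 1) {
        return -1;  /* file too short to contain a header */
    }
    if (magic != GGJT_MAGIC || version > GGJT_MAX_KNOWN_VERSION) {
        fprintf(stderr, "unknown (magic, version) combination: %08x, %08x\n",
                (unsigned) magic, (unsigned) version);
        return -1;
    }
    return 0;
}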

Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
