this format is no longer supported #1408

Closed
apcameron opened this issue May 11, 2023 · 21 comments

@apcameron
Contributor

The latest merge breaks the old models.
We need instructions on how to update them.

./examples/chat.sh
main: build = 529 (b9fd7ee)
main: seed = 1683842865
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
error loading model: this format is no longer supported (see #1305)
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/7B/ggml-model-q4_0.bin'
main: error: unable to load model

@moh21amed

Same here. For now, just check out an older commit.

@redthing1

Which older commit?

@moh21amed

Run this in the llama.cpp directory from the command line:
git checkout cf348a6
It worked for me.

@javea7171

javea7171 commented May 12, 2023

Competing forks and ecosystem fracture are looming. The project needs a common file format and a plan for breaking changes. Without them, the community's time and money will be wasted.

@philpax

philpax commented May 12, 2023

I'm a maintainer of llm (a Rust version of llama.cpp and other models), and we're not entirely sure how we're going to handle this. We'd like to maintain compatibility with the previous models, but it doesn't seem like that's an option at all if we update to the latest version of GGML.

Is it possible to restore the old quantisation functions as an option for loading? I understand the desire to keep iterating, but there's an extant ecosystem of hundreds of models out there, and this is going to cause a lot of unnecessary disruption 😦

This will also cause a lot of unnecessary confusion for users, given the use of the same names for the quantisation methods - it's very unclear whether a q4_1 model released today was produced with the old or the new quantisation method.

@LostRuins
Collaborator

Hi @philpax, I think this is because the llama converter is still tagging new models as file version 1 instead of version 2. See: #1405 (comment)
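
For context, the version tag under discussion is a 32-bit field written right after the 4-byte "ggjt" magic (0x67676a74, the same value printed in the loader errors later in this thread). A simplified C sketch of what the converter is expected to write; the constant names and the helper are illustrative, not the actual llama.cpp converter code:

#include <stdint.h>
#include <stdio.h>

#define GGJT_MAGIC        0x67676a74u  /* "ggjt" */
#define GGJT_FILE_VERSION 2u           /* new quantization scheme; pre-#1405 files carry 1 */

/* Illustrative header writer: the bug described above amounts to the
   converter still writing 1 here for models quantized with the new scheme. */
static void write_ggjt_header(FILE *out) {
    const uint32_t magic   = GGJT_MAGIC;
    const uint32_t version = GGJT_FILE_VERSION;
    fwrite(&magic,   sizeof magic,   1, out);
    fwrite(&version, sizeof version, 1, out);
    /* hyperparameters, vocabulary and tensor data follow */
}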

@philpax

philpax commented May 12, 2023

Hi @philpax, I think this is because the llama converter is still tagging new models as file version 1 instead of version 2. See: #1405 (comment)

Good to know - I haven't tackled it yet - but even so, what are your plans for loading both v1 and v2? Will you keep both versions of ggml and switch based on the file version?

@LostRuins
Collaborator

For KoboldCPP I have already prepared a GGML shim that contains refactored versions of all the old quantizers:

https://github.com/LostRuins/koboldcpp/blob/concedo_experimental/ggml_v2.c

Once this is ironed out, I'll probably swap in the dequant functions as needed based on the file version, if possible.

I plan to maintain 100% backward compatibility with every single past ggml model.
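
A minimal C sketch of the dispatch idea described above, with placeholder function names (the actual shim code lives in the ggml_v2.c file linked above):

#include <stdint.h>

/* Placeholder dequantizers for one quantization type; these names are
   illustrative, not the real symbols in ggml_v2.c. */
static void dequantize_row_q4_0_v1(const void *src, float *dst, int k) {
    (void)src; (void)dst; (void)k;  /* stub: decode using the pre-#1405 Q4_0 block layout */
}
static void dequantize_row_q4_0_v2(const void *src, float *dst, int k) {
    (void)src; (void)dst; (void)k;  /* stub: decode using the current Q4_0 block layout */
}

typedef void (*dequantize_row_fn)(const void *src, float *dst, int k);

/* Pick the dequantizer that matches the file version read from the header. */
static dequantize_row_fn pick_q4_0_dequantizer(uint32_t file_version) {
    return file_version >= 2 ? dequantize_row_q4_0_v2 : dequantize_row_q4_0_v1;
}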

@batmac

batmac commented May 12, 2023

I'm just here to confirm that it would be really nice to continue using all the quantized models out there! (for instance, all the "ggml" models on huggingface) :)

@ggerganov
Member

Wouldn't it be better, instead of distributing the quantized models, to distribute the F16 models?
The F16 format is unlikely to change, while the quantized formats are likely to change again in the future.

If the F16 models are distributed, then your software can decide when to re-generate a quantized model and when to pick one directly from the cache.
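
One way to read that suggestion, as a rough C sketch: keep only the distributed F16 file around and regenerate the quantized file into a cache on demand. quantize_f16_to_q4_0() is a hypothetical wrapper around whatever quantization step the application uses (for example, invoking the quantize tool); it is not an existing llama.cpp API.

#include <stdio.h>

/* Hypothetical wrapper around the application's quantization step;
   returns 0 on success. */
int quantize_f16_to_q4_0(const char *f16_path, const char *q4_path);

/* Return a path to a quantized model, regenerating it from the F16 file
   whenever the cached copy is missing (e.g. after a format change). */
static const char *get_quantized_model(const char *f16_path, const char *cache_path) {
    FILE *cached = fopen(cache_path, "rb");
    if (cached) {
        fclose(cached);          /* a quantized copy is already in the cache */
        return cache_path;
    }
    if (quantize_f16_to_q4_0(f16_path, cache_path) != 0) {
        return NULL;             /* quantization failed */
    }
    return cache_path;
}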

@philpax

philpax commented May 12, 2023

For us, personally, we're not distributing anything - we're supporting what's out there, to make sure that users can use a model, no matter how old it is. Users prefer downloading the quantised models because they're smaller and ready to go - the f16 models are often not accessible to them, being up to four times as large and requiring additional work.

I'm likely to take the same workaround @LostRuins has and bundle an older ggml (or a part of it) to keep supporting old models. For better or worse, those hundreds of models have proliferated to tens of thousands of users in the few months this has been active - we need to be able to either keep using those models, or to be able to update those models in a seamless and convenient way.

@ZenKemuri

Sorry, I'm a newbie and yesterday was my first day using LLaMA.cpp; it looks like I decided to start at a very difficult moment. I don't know most of the terms, and I got lost trying to understand how to install these technologies on my PC.
Now... my question is: what if I want to try a model right now? What model should I download? Because I have the same issue - I did everything right, but it tells me that it can't load the model :(
So, what should I do? Do I have to quantize anything?

Sorry, I'm really interested, but now I am more lost than yesterday.

  • Ubuntu 22.04
  • i7-5410
  • 8 GB RAM
  • 4 GB VRAM GTM840-M

@grencez
Contributor

grencez commented May 13, 2023

Wouldn't it be better, instead of distributing the quantized models, to distribute the F16 models?

There are so many fine-tunes floating around. If someone wants to try a variety of them, grabbing the quantized versions directly makes a lot of sense storage-wise. Maybe if LoRAs loaded faster, more of those would be shared instead of full weights, but as I understand it we need to read an f16 model for good LoRA results... which is pretty slow.

Even with a good reason to distribute the quantized models, I think changing format like this is still reasonable because it's so easy to requantize. The popular models on huggingface were requantized within a day.

Anyway, the point of this bug is to give better advice in the error message, right? Would it make sense to link to docs on how to quantize an f16 model?

@MadaraLoL

Can someone please explain how to fix this?

@philpax

philpax commented May 14, 2023

Can someone please explain how to fix this?

There is currently no way to convert your existing quantized (qX_Y) models to the new quantization format. You will have to do one of the following:

  1. Downgrade the version of llama.cpp you're using
  2. Find/retrieve the original source models (f32, f16), and quantize them with the new scheme
  3. Hope someone's requantized the models for you (TheBloke has updated all of his models)

@titlestad

titlestad commented May 15, 2023

UPDATE - scratch this; it seems that when running the code in a different environment on the same computer, it is as fast as before.

Yes, please make sure that older code to handle older models can still be used. Things like this can break the momentum of the project. I lost productivity today because my old model didn't load, and the "fixed" model is many times slower with the new code - almost unusable (Vicuna). I'll have to look into downgrading.

@apcameron
Contributor Author

I just reran the conversion program to convert from the f16 format to the new format

@moh21amed

I just reran the conversion program to convert from the f16 format to the new format

The thing is, most of us don't have the original version; we only have the quantized version.

@ghost

ghost commented May 21, 2023

Follow these steps to easily acquire the (Alpaca/LLaMA) F16 model:

  1. Download and install the Alpaca-lora repo: https://github.com/tloen/alpaca-lora
  2. Once you've successfully downloaded the model weights, you should have them in a Hugging Face cache folder (on Linux, something like the path in the next step).
  3. Run: python convert-pth-to-ggml.py ~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348 1
  4. Once you get your f16 model, copy it into the llama.cpp/models folder.
  5. Run: ./quantize ./models/ggml-model-f16.bin ./models/ggml-model-q4_0.bin q4_0

Done.

@ghost

ghost commented May 21, 2023

Ah, never mind.

My alpaca model is now spitting out weird hallucinations and not even responding to any of my prompts. It's even typing things in Chinese and French.

I even downloaded the model from TheBloke, and it doesn't work:

llama.cpp: loading model from /home/twinlizzie/llm_models/7B/llama-7b.ggmlv3.q4_0.bin
error loading model: unknown (magic, version) combination: 67676a74, 00000003; is this really a GGML file?
llama_init_from_file: failed to load model

I should have never updated my pip installation. I guess it's time to spend another week on terminal runarounds.

Update: (I think?) It seems to work using llama.cpp, but the python bindings are now broken.
I'll have a look and see if I can switch to the python bindings of abetlen/llama-cpp-python and get it to work properly.

UPDATE2: My bad. It is working - but the python bindings I am using no longer work.
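
For anyone puzzled by the "unknown (magic, version) combination: 67676a74, 00000003" error above: the loader reads a 4-byte magic and a 4-byte version from the start of the file and rejects combinations it was not built for; here a ggjt v3 file was opened by bindings that only knew about earlier versions. A simplified illustration of that kind of check, not the actual llama.cpp code:

#include <stdint.h>
#include <stdio.h>

#define GGJT_MAGIC             0x67676a74u  /* "ggjt" */
#define GGJT_MAX_KNOWN_VERSION 2u           /* highest version the older bindings understand */

static int check_ggjt_header(FILE *f) {
    uint32_t magic = 0, version = 0;
    if (fread(&magic, sizeof magic, 1, f) != 1 ||
        fread(&version, sizeof version, 1, f) != 1) {
        return -1;  /* file too short to contain a header */
    }
    if (magic != GGJT_MAGIC || version > GGJT_MAX_KNOWN_VERSION) {
        fprintf(stderr, "unknown (magic, version) combination: %08x, %08x\n",
                (unsigned) magic, (unsigned) version);
        return -1;
    }
    return 0;
}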

Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
