this format is no longer supported #1408
Comments
Same here; for now I've just checked out an older commit.
Which older commit?
Run this in the llama.cpp directory from the command line.
Competing forks and ecosystem fracture are looming. Needs a common file format and a plan for breaking changes. Without this, the community's time and money will be wasted.
I'm a maintainer of a downstream project. Is it possible to restore the old quantisation functions as an option for loading? I understand the desire to keep iterating, but there's an extant ecosystem of hundreds of models out there, and this is going to cause a lot of unnecessary disruption 😦 It will also cause a lot of unnecessary confusion for users, given that the same names are used for the quantisation methods: it's very unclear whether a given quantised file was produced with the old or the new scheme.
Hi @philpax, I think this is because the llama converter is still tagging new models as file version 1 instead of version 2. See: #1405 (comment)
Good to know. I haven't tackled it yet, but even still, what are your plans for loading both v1 and v2? Will you keep both versions of the quantisation functions around?
For KoboldCPP I have already prepared a GGML shim that contains refactored versions of all the old quantizers: https://github.com/LostRuins/koboldcpp/blob/concedo_experimental/ggml_v2.c Once this is ironed out I'll probably swap the dequant functions as necessary based on file version, if possible. I plan to maintain 100% backward compatibility with every single past ggml model.
I'm just here to confirm that it would be really nice to continue using all the quantized models out there! (for instance, all the "ggml" models on huggingface) :)
Wouldn't it be better, instead of distributing the quantized models, to distribute the F16 models? If the F16 models are distributed, then your software can decide when to re-generate a quantized model and when to pick one directly from the cache.
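As a rough illustration of that idea (this is not an existing llama.cpp feature; the paths and the quantize arguments below are assumptions, and older builds expect a numeric ftype such as 2 for q4_0 instead of a name):

```sh
#!/usr/bin/env bash
# Sketch: treat the local q4_0 file as a cache derived from a distributed f16 model.
F16=./models/7B/ggml-model-f16.bin
Q4=./models/7B/ggml-model-q4_0.bin

# Regenerate the quantized file only if it is missing or older than the f16 source.
if [ ! -f "$Q4" ] || [ "$Q4" -ot "$F16" ]; then
  ./quantize "$F16" "$Q4" q4_0
fi

./main -m "$Q4" -p "Hello" -n 16
```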
For us, personally, we're not distributing anything - we're supporting what's out there, to make sure that users can use a model, no matter how old it is. Users prefer downloading the quantised models because they're smaller and ready to go - the f16 models are often not accessible to them, being up to four times as large and requiring additional work. I'm likely to take the same workaround @LostRuins has and bundle an older ggml (or a part of it) to keep supporting old models.
For better or worse, those hundreds of models have proliferated to tens of thousands of users in the few months this has been active - we need to be able to either keep using those models, or to update them in a seamless and convenient way.
Sorry, I'm a newbie and yesterday was my first day using llama.cpp; it looks like I picked a very difficult moment to start. I don't know most of the terms, and I got lost trying to understand how to install these tools on my PC. Sorry, I'm really interested, but now I'm more lost than yesterday.
There are so many fine-tunes floating around. If someone wants to try a variety of them, grabbing the quantized versions directly makes a lot of sense storage-wise. Maybe if LoRAs loaded faster, more of those would be shared instead of full weights, but as I understand it we need to read an f16 model for good LoRA results, which is pretty slow. Even with a good reason to distribute the quantized models, I think changing the format like this is still reasonable because it's so easy to requantize; the popular models on huggingface were requantized within a day. Anyway, the point of this bug is to give better advice in the error message, right? Would it make sense to link to docs on how to quantize an f16 model?
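For anyone following the error message here, the requantization step itself is a single command along these lines (example paths; the exact arguments differ between builds, and older ones expect a numeric ftype such as 2 for q4_0, so check the usage text that ./quantize prints):

```sh
# Rebuild the quantized model from an f16 GGML file produced by the conversion script.
./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
```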
Can someone please explain how to fix this?
There is currently no way to convert your existing quantized (q4_0, etc.) models to the new format; you need to requantize from the original f16 weights.
UPDATE: scratch this; when running the code in a different environment on the same computer, it is as fast as before. Yes, please make sure that older code to handle older models can still be used. Things like this can break the momentum of the project. I lost productivity today because my old model didn't load, and the "fixed" model is many times slower with the new code, almost to the point of being unusable (Vicuna). I'll have to look at downgrading.
I just reran the conversion program to convert from the f16 format to the new format |
The thing is, most of us don't have the original version; we only have the quantized version.
Follow these steps to easily acquire the (Alpaca/LLaMA) F16 model:
Done.
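As a rough sketch of the conversion step, assuming the original PyTorch checkpoints are under ./models/7B/ and your checkout ships the convert.py script (older checkouts use convert-pth-to-ggml.py, and flag names have changed between versions):

```sh
# Convert the original checkpoints to an f16 GGML file (illustrative paths).
python3 convert.py ./models/7B/ --outtype f16
```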
Ah, never mind. My alpaca model is now spitting out weird hallucinations and isn't responding to any of my prompts; it's even typing things in Chinese and French. I even downloaded the model from The Bloke, and it doesn't work:
I should never have updated my pip installation. I guess it's time to spend another week on terminal runarounds. Update (I think?): it seems to work using llama.cpp, but the python bindings are now broken. Update 2: my bad, it is working; it's just that the python bindings I'm using no longer work.
This issue was closed because it has been inactive for 14 days since being marked as stale.
The latest merge breaks the old models.
We need instructions on how to update them (see the sketch after the log below).
./examples/chat.sh
main: build = 529 (b9fd7ee)
main: seed = 1683842865
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
error loading model: this format is no longer supported (see #1305)
llama_init_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/7B/ggml-model-q4_0.bin'
main: error: unable to load model
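Once the model has been regenerated with the current conversion and quantization tools (see the sketches earlier in the thread), a quick way to confirm the new file loads is the usual invocation, with example paths:

```sh
# Sanity check that the regenerated q4_0 file loads with the current build.
./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello" -n 16
```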