ggml : remove bit shuffling #1305
Conversation
Yes, I'm still hesitating. But I think …
Somehow perplexity computation with … Edit: fixed
Yes, if this is going to be a completely non-backwards-compatible change, might I humbly suggest changing the magic in the file header at least, or utilizing the version field?
"End users" here are not state bureaucrats with IE8, but adventurous devs who are involved with an experimental approach to a new technology. Breakage is the name of the game. It takes a minute to cleanup and rerun the scripts. For my models I prefer minimal and fast. If anything I would like to have the possibility to break compatibility for the sake of performance and size. |
@hmih With all due respect, I know this project is all about optimization and experimentation. I am just suggesting that since this is such a major change, helping others identify older and newer models elsewhere would be very useful, since there is already a thriving ggml ecosystem beyond this repo. I understand that reconverting the models is easy enough for those familiar with this project, but a one-line change to the file magic, or simply incrementing the existing version header by 1, something that requires minimal effort, would make future maintenance and identifying these new formats easier. The in-place swap suggested by @digiwombat would be even better if possible.
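For reference, the suggestion amounts to something like the following bump in llama.h. This is a sketch of the proposal, not merged code: the constant names exist in llama.h of that era, but the bumped value and the check_header helper are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

// Sketch: bump the version constant so tools can tell old files from new.
#define LLAMA_FILE_MAGIC   0x67676a74u  // 'ggjt'
#define LLAMA_FILE_VERSION 2u           // was 1 before the packing change

// Hypothetical loader-side check against the file header.
static int check_header(uint32_t magic, uint32_t version) {
    if (magic != LLAMA_FILE_MAGIC || version != LLAMA_FILE_VERSION) {
        fprintf(stderr, "unsupported file (magic %08x, version %u)\n",
                magic, version);
        return 0;
    }
    return 1;
}
```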
Firstly, I agree very much in spirit with your statement and the speedup the new version will bring. I would also offer that it is fairly early in the lifecycle for this project, which is an argument in favor of just putting through the breaking change and letting it be. On the other side, ggml (especially via llama.cpp) is being used pretty widely, and I would ask that a thought be spared for the maintainers of supporting projects like @LostRuins with Koboldcpp or the fine folks doing llama-cpp-python (and downstream of it, oobabooga's webui), among others who will likely bear the brunt of user confusion on these issues. It will spread the burden over a wider radius, one the user-facing software people don't have a lot of control over, since they are generally not in control of the model repos and can't make the updates themselves. That's all. Just wanted to toss out a bit of an explanation, since the nature of the users was raised. I think the repos for front-end projects may see the target userbase for their projects much differently than core llama.cpp does, generally speaking.
Line 191 in 0e48eb6
Can we increment this value by 1? Edit: oh, it was all in llama.h/llama.cpp
That would make the unaffected formats (F16, Q8) incompatible. The clean way would be to define new formats Q4_4, Q4_5, etc., but that gets unwieldy quickly.
@sw It doesn't have to be, though; during loading, exceptions can be added in llama.cpp to treat the old F16 and Q8 formats with file version 1 or 2 as forward compatible.
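A minimal sketch of that exception, assuming the llama_ftype values from llama.h; the helper name and exact gating are illustrative, not the actual loader code:

```c
#include <stdbool.h>
#include <stdint.h>
#include "llama.h"  // for enum llama_ftype

// Version 2 files are always accepted; version 1 files are accepted only
// for types whose tensor layout did not change in the new packing.
static bool file_version_ok(uint32_t version, enum llama_ftype ftype) {
    if (version == 2) {
        return true;
    }
    if (version == 1) {
        return ftype == LLAMA_FTYPE_ALL_F32    ||
               ftype == LLAMA_FTYPE_MOSTLY_F16 ||
               ftype == LLAMA_FTYPE_MOSTLY_Q8_0;
    }
    return false;
}
```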
Closing in favor of #1405
@ProfessorSparrs If you have the F16 files, quantizing is very easy and WAY less resource-intensive than running the model. :) (check the …)
Implementation of #1241
Avoid unnecessary bit shuffling by packing the quants in a better way (see the sketch after the list of affected formats below).
Requires model re-quantization
Q4_0
Q4_1
Q5_0
Q5_1
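To illustrate the packing change, here is a minimal sketch for Q4_0, assuming QK4_0 = 32 and the block_q4_0 layout from ggml.c; the packing expressions are inferred from the description above, not copied from the diff:

```c
#include <stdint.h>

#define QK4_0 32  // quants per block

typedef struct {
    float   d;              // scaling factor
    uint8_t qs[QK4_0 / 2];  // 4-bit quants, two per byte
} block_q4_0;

// Old layout: adjacent quants shared a byte,
//   qs[i] = (q[2*i] & 0x0F) | (q[2*i + 1] << 4);
// so SIMD code had to shuffle the nibbles apart before using them.
//
// New layout: first half of the block in the low nibbles, second half in
// the high nibbles,
//   qs[i] = (q[i] & 0x0F) | (q[i + QK4_0/2] << 4);
// so extraction is a plain mask and shift, with no shuffle.
// Q4_0 stores values with an offset of 8: real value = d * (nibble - 8).
static void unpack_q4_0(const block_q4_0 *b, int8_t out[QK4_0]) {
    for (int i = 0; i < QK4_0 / 2; ++i) {
        out[i]             = (int8_t)(b->qs[i] & 0x0F) - 8;  // low nibble
        out[i + QK4_0 / 2] = (int8_t)(b->qs[i] >> 4)   - 8;  // high nibble
    }
}
```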
New timings vs. old timings: (benchmark tables omitted)
Overall, all these numbers seem to have about ±10% variability from run to run. Not an ideal benchmark, but I'm not sure what else to do.
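One generic way to tame that variability (not part of this PR, just a sketch): repeat the measurement and report the median, which is less sensitive to outlier runs than a single measurement or the mean. run_ms below is a placeholder for whatever is being timed.

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    const double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

// Run the benchmark n times into buf and return the median time.
static double median_ms(double (*run_ms)(void), double *buf, int n) {
    for (int i = 0; i < n; ++i) {
        buf[i] = run_ms();
    }
    qsort(buf, n, sizeof(double), cmp_double);
    return (n % 2) ? buf[n / 2] : 0.5 * (buf[n / 2 - 1] + buf[n / 2]);
}
```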