ggml : remove bit shuffling #1305
Conversation
Yes, I'm still hesitating. But I think …
Somehow perplexity computation with … Edit: fixed
Yes, if this is going to be a completely non-backwards-compatible change, might I humbly suggest changing the magic in the file header at least, or utilizing the version field?
"End users" here are not state bureaucrats with IE8, but adventurous devs who are involved with an experimental approach to a new technology. Breakage is the name of the game. It takes a minute to cleanup and rerun the scripts. For my models I prefer minimal and fast. If anything I would like to have the possibility to break compatibility for the sake of performance and size. |
@hmih With all due respect, I know this project is all about optimization and experimentation. I am just suggesting that since this is such a major change, helping others identify older and newer models elsewhere would be very useful, since there is already a thriving ggml ecosystem beyond this repo. I understand that reconverting the models is easy enough for those familiar with this project, but a one-line change to the file magic, or simply incrementing the existing version header by 1, something that requires minimal effort, would make future maintenance and identifying these new formats easier. The in-place swap suggested by @digiwombat would be even better if possible.
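For reference, the suggestion amounts to something like the following bump in llama.h. This is a sketch of the proposal, not merged code: the constant names exist in llama.h of that era, but the bumped value and the check_header helper are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

// Sketch: bump the version constant so tools can tell old files from new.
#define LLAMA_FILE_MAGIC   0x67676a74u  // 'ggjt'
#define LLAMA_FILE_VERSION 2u           // was 1 before the packing change

// Hypothetical loader-side check against the file header.
static int check_header(uint32_t magic, uint32_t version) {
    if (magic != LLAMA_FILE_MAGIC || version != LLAMA_FILE_VERSION) {
        fprintf(stderr, "unsupported file (magic %08x, version %u)\n",
                magic, version);
        return 0;
    }
    return 1;
}
```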
Firstly, I agree very much in spirit with your statement and the speedup the new version will bring. I would also offer that it is fairly early in the lifecycle for this project, which is an argument in favor of just putting through the breaking change and letting it be. On the other side, ggml (especially via llama.cpp) is being used pretty widely, and I would ask that a thought be spared for the maintainers of supporting projects like @LostRuins with Koboldcpp or the fine folks doing llama-cpp-python (and downstream of it, oobabooga's webui), among others who will likely bear the brunt of user confusion on these issues. It will spread the burden over a wider radius, one the user-facing software people don't have a lot of control over, since they are generally not in control of the model repos and can't make the updates themselves. That's all. Just wanted to toss out a bit of an explanation, since the nature of the users was raised. I think the repos for front-end projects may see the target userbase for their projects much differently than core llama.cpp does, generally speaking.
Line 191 in 0e48eb6
Can we increment this value by 1? Edit: oh, it was all in llama.h/llama.cpp
That would make the unaffected formats (F16, Q8) incompatible. The clean way would be to define new formats Q4_4, Q4_5, etc., but that gets unwieldy quickly.
@sw It doesn't have to be, though; during loading, exceptions can be added in llama.cpp to treat the old F16 and Q8 formats with file version 1 or 2 as forward compatible.
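A minimal sketch of that exception, assuming the llama_ftype values from llama.h; the helper name and exact gating are illustrative, not the actual loader code:

```c
#include <stdbool.h>
#include <stdint.h>
#include "llama.h"  // for enum llama_ftype

// Version 2 files are always accepted; version 1 files are accepted only
// for types whose tensor layout did not change in the new packing.
static bool file_version_ok(uint32_t version, enum llama_ftype ftype) {
    if (version == 2) {
        return true;
    }
    if (version == 1) {
        return ftype == LLAMA_FTYPE_ALL_F32    ||
               ftype == LLAMA_FTYPE_MOSTLY_F16 ||
               ftype == LLAMA_FTYPE_MOSTLY_Q8_0;
    }
    return false;
}
```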
Closing in favor of #1405
@ProfessorSparrs If you have the F16 files, quantizing is very easy and WAY less resource-intensive than running the model. :) (check the …)
Implementation of #1241
Avoid unnecessary bit shuffling by packing the quants in a better way (see the sketch after the list of affected formats below).
Requires model re-quantization
Q4_0
Q4_1
Q5_0
Q5_1
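To illustrate the packing change, here is a minimal sketch for Q4_0, assuming QK4_0 = 32 and the block_q4_0 layout from ggml.c; the packing expressions are inferred from the description above, not copied from the diff:

```c
#include <stdint.h>

#define QK4_0 32  // quants per block

typedef struct {
    float   d;              // scaling factor
    uint8_t qs[QK4_0 / 2];  // 4-bit quants, two per byte
} block_q4_0;

// Old layout: adjacent quants shared a byte,
//   qs[i] = (q[2*i] & 0x0F) | (q[2*i + 1] << 4);
// so SIMD code had to shuffle the nibbles apart before using them.
//
// New layout: first half of the block in the low nibbles, second half in
// the high nibbles,
//   qs[i] = (q[i] & 0x0F) | (q[i + QK4_0/2] << 4);
// so extraction is a plain mask and shift, with no shuffle.
// Q4_0 stores values with an offset of 8: real value = d * (nibble - 8).
static void unpack_q4_0(const block_q4_0 *b, int8_t out[QK4_0]) {
    for (int i = 0; i < QK4_0 / 2; ++i) {
        out[i]             = (int8_t)(b->qs[i] & 0x0F) - 8;  // low nibble
        out[i + QK4_0 / 2] = (int8_t)(b->qs[i] >> 4)   - 8;  // high nibble
    }
}
```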
New timings vs. old timings: (benchmark tables omitted)
Overall, all these numbers seem to have about ±10% variability from run to run. Not an ideal benchmark, but I'm not sure what else to do.
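One generic way to tame that variability (not part of this PR, just a sketch): repeat the measurement and report the median, which is less sensitive to outlier runs than a single measurement or the mean. run_ms below is a placeholder for whatever is being timed.

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b) {
    const double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

// Run the benchmark n times into buf and return the median time.
static double median_ms(double (*run_ms)(void), double *buf, int n) {
    for (int i = 0; i < n; ++i) {
        buf[i] = run_ms();
    }
    qsort(buf, n, sizeof(double), cmp_double);
    return (n % 2) ? buf[n / 2] : 0.5 * (buf[n / 2 - 1] + buf[n / 2]);
}
```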