ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 #1508
Conversation
Hi again, just wondering if backward compatibility is possible (e.g. transparently handling f32->f16 scaling factors when loading older quantizations during model load)? If not, and it breaks all old models, can I request an increase in LLAMA_FILE_VERSION so this change can be detected and handled if it ends up being merged? Edit: I also looked at the chart; it seems like q4_0 actually gets a regression in speed with this change?
Do I read […]
I believe that means "4 threads" @Green-Sky
Maybe we should use […]
Although this saves some space, it will have a negative impact on performance. If the architecture doesn't support F16, a lookup table is required, which results in memory accesses that look random to the CPU and wasted cache. And even if F16 is supported, as on x86, which has an F16 conversion instruction, there will still be extra instructions in the loop.
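For readers unfamiliar with the trade-off described here, below is a minimal sketch of the two F16 -> F32 conversion paths: a 64K-entry lookup table for CPUs without native F16 support, and the x86 F16C conversion instruction. This is not the actual ggml code; the names are illustrative.

```c
// Hedged sketch of the two F16 -> F32 conversion paths discussed above.
// Not the ggml implementation; names are illustrative.
#include <math.h>
#include <stdint.h>

// Path 1: no hardware F16 support -- precompute a 64K-entry table once,
// then every conversion becomes a table load whose address depends on the
// data, i.e. memory accesses that look random to the CPU.
static float f16_to_f32_table[1 << 16];

static float f16_decode(uint16_t h) {            // used only to fill the table
    const int s = (h >> 15) & 0x1;
    const int e = (h >> 10) & 0x1f;
    const int m =  h        & 0x3ff;
    float v;
    if      (e == 0)  v = ldexpf((float) m, -24);             // subnormal
    else if (e == 31) v = m ? NAN : INFINITY;                  // inf / NaN
    else              v = ldexpf((float)(m + 1024), e - 25);   // normal
    return s ? -v : v;
}

static void init_f16_table(void) {
    for (uint32_t i = 0; i < (1 << 16); ++i) {
        f16_to_f32_table[i] = f16_decode((uint16_t) i);
    }
}

static inline float f16_to_f32_lookup(uint16_t h) {
    return f16_to_f32_table[h];
}

// Path 2: x86 with F16C -- a dedicated conversion instruction, but still an
// extra instruction inside the hot dot-product loop.
#if defined(__F16C__)
#include <immintrin.h>
static inline float f16_to_f32_hw(uint16_t h) {
    return _cvtsh_ss(h);
}
#endif
```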
My thoughts on model format changes: it's better to keep the short and portable model format for distribution, but make a script to convert that format to one that's better suited for performance on a particular architecture. Also allow it to be loaded and converted in memory. Because, for example, for the performance of vector code it is better to fuse four Q4_0 blocks:

```c
typedef struct {
    float   d[4];
    uint8_t qs[4][QK4_0 / 2];
} block4_q4_0;
```

Or even 8 at once. So 4-8 […]
It kind of seems like the level of sacrifice required to allow `mmap` […] Without the limitation of models having to be `mmap`-able […]
This change breaks […]
The regression in […]
Regarding the performance concerns by @ilyakurdyukov and @KerfuffleV2: alternative memory layouts are being explored in #1073 and #1256.
My concern wasn't really about performance specifically, but the restriction of being tied to a `mmap`-able format. If the model file must be `mmap`-able […] Now that there's GPU support, it may be even more of an issue. From what I know, GPUs (generally) like 16-bit floats, while x86 CPUs at least (generally) like 32-bit floats. You have to choose one or the other at the moment, it seems, even if it's not ideal.
Hey Mr. @ggerganov, now that the F16 implementation has been merged, could you please increase LLAMA_FILE_VERSION so this change can be detected and handled for backwards compatibility with older models? Do you have any thoughts on transparently handling f32->f16 scaling factors when loading older quantizations during model load?
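To illustrate the kind of transparent handling being asked about, here is a hedged sketch of rewriting old Q4_0 blocks (F32 scale) into the new F16-scale layout while loading a pre-version-3 file. This is not code from llama.cpp: the old-layout struct and the function name are hypothetical, while `ggml_fp32_to_fp16()` is ggml's public conversion helper.

```c
// Hedged sketch: upgrade old-format Q4_0 blocks (F32 scale) to the new
// layout (F16 scale) at load time.  The "_old"/"_new" structs and the
// function name are hypothetical; ggml_fp32_to_fp16() comes from ggml.h.
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include "ggml.h"     // ggml_fp16_t, ggml_fp32_to_fp16()

#define QK4_0 32      // weights per Q4_0 block (mirrors ggml.c)

typedef struct {      // old on-disk layout, before this PR (hypothetical name)
    float   d;
    uint8_t qs[QK4_0 / 2];
} block_q4_0_old;

typedef struct {      // new layout introduced by this PR (mirrors ggml.c)
    ggml_fp16_t d;
    uint8_t     qs[QK4_0 / 2];
} block_q4_0_new;

// Convert n blocks read from a pre-version-3 file into the new layout.
static void upgrade_q4_0(const block_q4_0_old * src, block_q4_0_new * dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i].d = ggml_fp32_to_fp16(src[i].d);          // narrow the scale to F16
        memcpy(dst[i].qs, src[i].qs, sizeof(dst[i].qs)); // quantized nibbles unchanged
    }
}
```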
Loading older models now results in a crash instead of an error message, as expected.
* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0
* llama : bump LLAMA_FILE_VERSION to 3
* cuda : update Q4 and Q8 dequantize kernels
* ggml : fix AVX dot products
* readme : update performance table + hot topics
... I know I cheated a little: batch size of 1 so it fits into VRAM (it looks like VRAM is allocated based on batch size), and only a context size of 128.
So I can run 65B with 32 GiB of RAM and 8 GiB of VRAM at 9.2 sec/tok. It might get a little faster (plus one more layer on the GPU) when there is no desktop session running. 🎉
The precision for Q4_0 has degraded since #1508
@KerfuffleV2 I don't agree with your dislike of mmap-able file formats. If you actually need to read the file, you need to read the whole thing before you can even start doing anything. If you mmap it, you can start right away, and the parts that you're not using (yet) do not need to be read.
You're attacking a straw man here; in reality I never said anything about disliking `mmap`. All I said is that the tradeoffs might not be worth it at this point.
You need the entirety of the model data before you can generate the first token. So whether or not it's `mmap`ed […]
That's a weird way to look at it, since buffer cache isn't really used memory. Anyway, you're arguing like I said there weren't benefits to `mmap` […]
In practice, this takes a very small amount of time when everything's already in the buffer cache. I'd say, generally speaking, most people are going to care more about the time/memory required for actually generating tokens than about making loading the model file slightly faster. The majority of the time is spent doing inference, not loading the model (especially if the cache is hot). However, like I already mentioned, if it was decided that the tradeoff of being able to `mmap` […] So people who really care about load time can have their cake and eat it too.
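As a concrete illustration of the lazy-loading behaviour discussed in this exchange, here is a minimal POSIX sketch (not the actual llama.cpp loader) of mapping a model file so that pages are only faulted in when the corresponding data is first touched.

```c
// Minimal mmap sketch: no read() of the whole file up front; the kernel
// faults pages in on first access, and unused parts may never be read.
// Error handling is trimmed; this is illustrative, not llama.cpp code.
#include <stddef.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static void * map_model(const char * path, size_t * size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    void * addr = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after closing the descriptor

    if (addr == MAP_FAILED) return NULL;
    *size_out = (size_t) st.st_size;
    return addr;
}
```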
This is really amazing work! With this PR merged, should the Memory/Disk Requirements section in the README also be updated?
Since people will be able to refine a 4-bit model directly, a full-precision, non-quantized version of that model might not even exist. We really need some kind of way to convert a model to the new format without requantizing it from a bigger file.
Quantization is lossy "compression". To use those models without a large quality loss, it would probably be necessary for GGML to be able to run inference on them without converting the actual tensor data again. It may make sense for GGML to take the direction of coming up with its own format that accomplishes the same thing, especially since formats like Q5_0 and Q5_1 have substantially higher quality than 4-bit quantization and don't take up much more memory/disk.
No need to keep the scaling factor in F32 - use F16 instead to get smaller models and improve inference speed
This change makes the model files smaller and inference faster, and allows more GPU layers to fit in VRAM.
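As a back-of-envelope check of the size reduction (based on the block layouts in ggml.c): a Q4_0 block stores 32 weights as one scale plus 16 bytes of 4-bit quants, so it shrinks from 4 + 16 = 20 bytes to 2 + 16 = 18 bytes (5.0 -> 4.5 bits per weight); Q4_1 goes from 4 + 4 + 16 = 24 to 2 + 2 + 16 = 20 bytes, and Q8_0 from 4 + 32 = 36 to 2 + 32 = 34 bytes.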
New stats:
Old stats:
🤖 Generated by Copilot at d627025
Summary
💾🔄
Use 16-bit floats for quantization deltas in `ggml.c` to save memory and compress better.

Walkthrough

* Replace the `float` type with the `ggml_fp16_t` type for the `delta` field of the `block_q4_0`, `block_q4_1`, `block_q8_0`, and `block_q8_1` structs in `ggml.c` to reduce the memory footprint and improve the compression ratio of the quantized blocks (links)
* Use `GGML_FP32_TO_FP16` to convert the `delta` value from a 32-bit float to a 16-bit half-float before storing it in the `y[i].d` field of the `block_q4_0`, `block_q4_1`, `block_q8_0`, and `block_q8_1` structs in `ggml.c` for the various quantization schemes and SIMD implementations (links)
* Use `GGML_FP16_TO_FP32` to convert the `delta`, `min`, and `sum` values from 16-bit half-floats to 32-bit floats before using them in the dequantization and dot product computations in `ggml.c` for the various quantization schemes and SIMD implementations (links)
* Reorder the `d * sum` and `sum * d` multiplications in `ggml.c` to avoid a potential loss of precision when multiplying a 32-bit float by a 16-bit half-float (links)
* Replace the calls to `_mm256_broadcast_ss` and `_mm_broadcast_ss` with calls to `_mm256_set1_ps` and `_mm_set1_ps` in `ggml.c` to avoid loading the values from memory and instead use an immediate operand, which may improve performance and reduce memory accesses (links)
* Formatting changes in `ggml.c` to improve the readability and consistency of the code style (link)
* Adjust the `m4b` and `s8b` variable declarations in `ggml.c` to fix a minor formatting issue and align the code style with the rest of the file (link)
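To make the walkthrough above concrete, here is a simplified scalar sketch of a Q8_0 quantize/dequantize round trip with the F16 scale. The real ggml kernels are SIMD and differ in details; `ggml_fp32_to_fp16()`/`ggml_fp16_to_fp32()` are ggml's public helpers, standing in here for the internal `GGML_FP32_TO_FP16`/`GGML_FP16_TO_FP32` macros, and the struct name is local to the sketch.

```c
// Simplified scalar sketch of a Q8_0 round trip with the F16 scale.
// Not the actual ggml kernels; illustrative only.
#include <math.h>
#include <stdint.h>
#include "ggml.h"     // ggml_fp16_t, ggml_fp32_to_fp16(), ggml_fp16_to_fp32()

#define QK8_0 32      // weights per Q8_0 block (mirrors ggml.c)

typedef struct {      // mirrors ggml.c's block_q8_0 after this PR
    ggml_fp16_t d;    // scale, now stored as F16
    int8_t      qs[QK8_0];
} block_q8_0_sketch;

static void quantize_block_q8_0(const float * x, block_q8_0_sketch * y) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_0; ++i) {
        amax = fmaxf(amax, fabsf(x[i]));
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    y->d = ggml_fp32_to_fp16(d);              // the change made by this PR
    for (int i = 0; i < QK8_0; ++i) {
        y->qs[i] = (int8_t) roundf(x[i] * id);
    }
}

static void dequantize_block_q8_0(const block_q8_0_sketch * y, float * x) {
    const float d = ggml_fp16_to_fp32(y->d);  // widen back to F32 before use
    for (int i = 0; i < QK8_0; ++i) {
        x[i] = y->qs[i] * d;
    }
}
```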