ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 #1508

Merged
merged 5 commits into from
May 19, 2023

Conversation

ggerganov
Owner

@ggerganov ggerganov commented May 17, 2023

No need to keep the scaling factor in F32 - use F16 instead to get smaller models and improve inference speed

This change makes the model files smaller and inference faster, and allows more GPU layers to fit in VRAM

  • check AVX
  • implement CUDA
  • recalc PPL

New stats:

| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|---|
| 7B | perplexity | 5.9066 | 6.1565 | 6.0912 | 5.9862 | 5.9481 | 5.9070 |
| 7B | file size | 13.0G | 3.5G | 3.9G | 4.3G | 4.7G | 6.7G |
| 7B | ms/tok @ 4th | 127 | 55 | 54 | 76 | 83 | 72 |
| 7B | ms/tok @ 8th | 122 | 43 | 45 | 52 | 56 | 67 |
| 7B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |
| 13B | perplexity | 5.2543 | 5.3860 | 5.3608 | 5.2856 | 5.2706 | 5.2548 |
| 13B | file size | 25.0G | 6.8G | 7.6G | 8.3G | 9.1G | 13G |
| 13B | ms/tok @ 4th | - | 103 | 105 | 148 | 160 | 131 |
| 13B | ms/tok @ 8th | - | 73 | 82 | 98 | 105 | 128 |
| 13B | bits/weight | 16.0 | 4.5 | 5.0 | 5.5 | 6.0 | 8.5 |

Old stats:

| Model | Measure | F16 | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|---|
| 7B | perplexity | 5.9066 | 6.1565 | 6.0910 | 5.9862 | 5.9481 | 5.9069 |
| 7B | file size | 13.0G | 4.0G | 4.8G | 4.4G | 4.8G | 7.1G |
| 7B | ms/tok @ 4th | 128 | 50 | 54 | 75 | 83 | 75 |
| 7B | ms/tok @ 8th | 123 | 44 | 52 | 53 | 58 | 72 |
| 7B | bits/weight | 16.0 | 5.0 | 6.0 | 5.5 | 6.0 | 9.0 |
| 13B | perplexity | 5.2543 | 5.3860 | 5.3607 | 5.2856 | 5.2706 | 5.2548 |
| 13B | file size | 25.0G | 7.6G | 9.1G | 8.4G | 9.1G | 14G |
| 13B | ms/tok @ 4th | - | 93 | 101 | 150 | 164 | 141 |
| 13B | ms/tok @ 8th | - | 81 | 96 | 96 | 104 | 136 |
| 13B | bits/weight | 16.0 | 5.0 | 6.0 | 5.5 | 6.0 | 9.0 |
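
The bits/weight rows in both tables can be reproduced with a little arithmetic. The sketch below assumes a block of 32 weights (the value of ggml's `QK4_0`, `QK4_1` and `QK8_0`), one scale per block for Q4_0 and Q8_0, and two per-block floats (delta and min) for Q4_1. The helper name `bits_per_weight` is made up for this illustration.

```c
#include <assert.h>

#define QK 32  /* weights per block; assumed to match ggml's QK4_0 / QK4_1 / QK8_0 */

/* Bits per weight for a block of QK quants of `quant_bits` bits each,
 * plus `n_scales` per-block floats of `scale_bits` bits each. */
static double bits_per_weight(int quant_bits, int n_scales, int scale_bits) {
    assert(quant_bits > 0 && n_scales > 0 && scale_bits > 0);
    return (double)(QK * quant_bits + n_scales * scale_bits) / QK;
}
```

For example, `bits_per_weight(4, 1, 32)` gives 5.0 and `bits_per_weight(4, 1, 16)` gives 4.5, matching the old and new Q4_0 columns. Q4_1 goes from 6.0 to 5.0 and Q8_0 from 9.0 to 8.5 in the same way.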

🤖 Generated by Copilot at d627025

Summary

:compression:💾🔄

Use 16-bit floats for quantization deltas in ggml.c to save memory and compress better.

We're sailing on the sea of bits, we need to save some space
We'll change the delta field to ggml_fp16_t
Heave away, me hearties, heave away with grace
We'll convert and swap and compress at a faster rate

Walkthrough

  • Replace the float type with the ggml_fp16_t type for the delta field of the block_q4_0, block_q4_1, block_q8_0, and block_q8_1 structs in ggml.c to reduce the memory footprint and improve the compression ratio of the quantized blocks (link, link, link, link)
  • Add macro calls to GGML_FP32_TO_FP16 to convert the delta value from a 32-bit float to a 16-bit half-float before storing it in the y[i].d field of the block_q4_0, block_q4_1, block_q8_0, and block_q8_1 structs in ggml.c for various quantization schemes and SIMD implementations (link, link, link, link, link, link, link)
  • Add macro calls to GGML_FP16_TO_FP32 to convert the delta, min, and sum values from 16-bit half-floats to 32-bit floats before using them in the dequantization and dot product computations in ggml.c for various quantization schemes and SIMD implementations (link, link, link, link, link, link, link, link, link, link, link, link, link, link, link, link, link, link, link, link, link, link)
  • Swap the order of the multiplication operands in the expressions d * sum and sum * d in ggml.c to avoid a potential loss of precision when multiplying a 32-bit float by a 16-bit half-float (link, link, link, link, link, link)
  • Replace the calls to _mm256_broadcast_ss and _mm_broadcast_ss with the calls to _mm256_set1_ps and _mm_set1_ps in ggml.c to avoid loading the values from memory and instead use an immediate operand, which may improve the performance and reduce the memory access (link, link, link, link)
  • Add curly braces around the for loop body in ggml.c to improve the readability and consistency of the code style (link)
  • Remove an extra space character between the m4b and s8b variable declarations in ggml.c to fix a minor formatting issue and align the code style with the rest of the file (link)
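
To make the first three walkthrough items concrete, here is a minimal scalar sketch of a Q4_0 block with a 16-bit delta. It is modelled on the description above, not copied from ggml.c: `fp16_t`, `fp32_to_fp16` and `fp16_to_fp32` are portable stand-ins for `ggml_fp16_t`, `GGML_FP32_TO_FP16` and `GGML_FP16_TO_FP32`, and the rounding details of the quantizer are an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QK4_0 32

typedef uint16_t fp16_t;        /* stand-in for ggml_fp16_t */

typedef struct {
    fp16_t  d;                  /* delta (scale): 2 bytes instead of 4 */
    uint8_t qs[QK4_0 / 2];      /* 4-bit quants, two per byte */
} block_q4_0;

/* float -> half, round to nearest even (stand-in for GGML_FP32_TO_FP16) */
static fp16_t fp32_to_fp16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    const uint32_t sign = (x >> 16) & 0x8000u;
    const uint32_t e32  = (x >> 23) & 0xFFu;
    uint32_t man        = x & 0x7FFFFFu;
    const int32_t e16   = (int32_t)e32 - 127 + 15;
    if (e32 == 0xFFu) return (fp16_t)(sign | 0x7C00u | (man ? 0x200u : 0u)); /* inf / NaN */
    if (e16 >= 31)    return (fp16_t)(sign | 0x7C00u);                       /* overflow -> inf */
    if (e16 <= 0) {                                   /* half subnormal or zero */
        if (e16 < -10) return (fp16_t)sign;
        man |= 0x800000u;                             /* restore implicit leading 1 */
        const uint32_t shift = (uint32_t)(14 - e16);
        uint32_t h = man >> shift;
        const uint32_t rem  = man & ((1u << shift) - 1u);
        const uint32_t half = 1u << (shift - 1u);
        if (rem > half || (rem == half && (h & 1u))) h++;
        return (fp16_t)(sign | h);
    }
    uint32_t h = ((uint32_t)e16 << 10) | (man >> 13);
    const uint32_t rem = man & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) h++; /* carry may bump the exponent */
    return (fp16_t)(sign | h);
}

/* half -> float, exact (stand-in for GGML_FP16_TO_FP32) */
static float fp16_to_fp32(fp16_t h) {
    const uint32_t sign = ((uint32_t)h & 0x8000u) << 16;
    const uint32_t e16  = ((uint32_t)h >> 10) & 0x1Fu;
    uint32_t man        = (uint32_t)h & 0x3FFu;
    uint32_t bits;
    if (e16 == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);          /* inf / NaN */
    } else if (e16 != 0) {
        bits = sign | ((e16 + 112u) << 23) | (man << 13); /* normal */
    } else if (man == 0) {
        bits = sign;                                      /* zero */
    } else {                                              /* subnormal: renormalise */
        uint32_t e32 = 113u;
        while (!(man & 0x400u)) { man <<= 1; e32--; }
        bits = sign | (e32 << 23) | ((man & 0x3FFu) << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Quantize k floats (k a multiple of QK4_0); the delta is converted to F16 on store. */
static void quantize_row_q4_0(const float *x, block_q4_0 *y, int k) {
    assert(k % QK4_0 == 0);
    for (int i = 0; i < k / QK4_0; i++) {
        float amax = 0.0f, max = 0.0f;  /* largest magnitude and its signed value */
        for (int j = 0; j < QK4_0; j++) {
            const float v = x[i*QK4_0 + j];
            const float a = v < 0 ? -v : v;
            if (a > amax) { amax = a; max = v; }
        }
        const float d  = max / -8.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        y[i].d = fp32_to_fp16(d);
        for (int j = 0; j < QK4_0 / 2; j++) {
            int q0 = (int)(x[i*QK4_0 + j] * id + 8.5f);
            int q1 = (int)(x[i*QK4_0 + QK4_0/2 + j] * id + 8.5f);
            if (q0 > 15) q0 = 15;
            if (q1 > 15) q1 = 15;
            y[i].qs[j] = (uint8_t)(q0 | (q1 << 4));
        }
    }
}

/* Dequantize: the delta is converted back to F32 once per block before use. */
static void dequantize_row_q4_0(const block_q4_0 *x, float *y, int k) {
    assert(k % QK4_0 == 0);
    for (int i = 0; i < k / QK4_0; i++) {
        const float d = fp16_to_fp32(x[i].d);
        for (int j = 0; j < QK4_0 / 2; j++) {
            y[i*QK4_0 + j]           = (float)((x[i].qs[j] & 0x0F) - 8) * d;
            y[i*QK4_0 + j + QK4_0/2] = (float)((x[i].qs[j] >> 4) - 8) * d;
        }
    }
}
```

With this layout `sizeof(block_q4_0)` is 18 bytes for 32 weights, which is where the 4.5 bits/weight figure comes from. The extra F16 to F32 conversion happens once per block, not once per weight.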

@ggerganov ggerganov added the performance and breaking change labels May 17, 2023
@LostRuins
Collaborator

LostRuins commented May 18, 2023

Hi again, just wondering if backward compatibility is possible (e.g. transparently handle f32->f16 scaling factors when loading older quantizations during model load)?

If not, and it breaks all old models, can I request an increase in LLAMA_FILE_VERSION so this change can be detected and handled if it ends up being merged?

Edit: also looked at the chart, seems like q4_0 actually gets a regression in speed with this change?

@Green-Sky
Collaborator

Do I read ms/tok @ 4th correctly as "Milliseconds per token for the 4th token" or "Milliseconds per token after the 4th token" ?

@j-f1
Collaborator

j-f1 commented May 18, 2023

I believe that means "4 threads" @Green-Sky

@ivanstepanovftw
Collaborator

Maybe we should use ", 4T" instead of "@ 4th"?

@ilyakurdyukov
Contributor

Although this saves some space, it will have a negative impact on performance. If the architecture doesn't support F16, then a lookup table is required, which will result in random (to the CPU) memory accesses and wasted caches. And even if F16 is supported, as on x86, which has an F16 conversion instruction, there will be extra instructions in the loop.

@ilyakurdyukov
Contributor

My thoughts on model format changes: it's better to keep the short and portable model format for distribution, but make a script to convert that format to a format that's better suited for performance on a particular architecture. Also allow it to be loaded and converted in memory. Because, for example, for the performance of vector code, it is better to fuse four Q4_0:

typedef struct {
	float d[4];
	uint8_t qs[4][QK4_0 / 2];
} block4_q4_0;

Or even 8 at once, so that 4 to 8 d values can be read with a single vector load instruction, and each field is perfectly aligned for use with aligned vector load instructions. I think it's impossible to create an ideal model format for the best performance for every architecture.
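
A minimal sketch of the repacking this comment suggests, assuming the pre-PR `block_q4_0` layout with an F32 scale and `QK4_0 = 32`. The helper name `fuse_q4_0` is hypothetical and nothing like it is part of this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QK4_0 32

typedef struct {
    float   d;                  /* pre-PR layout: F32 scale */
    uint8_t qs[QK4_0 / 2];
} block_q4_0;

typedef struct {
    float   d[4];               /* four scales, contiguous: one 16-byte load */
    uint8_t qs[4][QK4_0 / 2];   /* each row is 16 bytes, starting at a multiple of 16 */
} block4_q4_0;

/* Repack n_blocks ordinary blocks (a multiple of 4) into fused groups of four. */
static void fuse_q4_0(const block_q4_0 *src, block4_q4_0 *dst, size_t n_blocks) {
    assert(n_blocks % 4 == 0);
    for (size_t i = 0; i < n_blocks / 4; i++) {
        for (int k = 0; k < 4; k++) {
            dst[i].d[k] = src[4*i + k].d;
            memcpy(dst[i].qs[k], src[4*i + k].qs, sizeof dst[i].qs[k]);
        }
    }
}
```

The fused struct is 80 bytes with `qs` at offset 16, so if the array base is 16-byte aligned, every `d` group and every `qs` row is too. The original 20-byte block cannot offer that.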

@KerfuffleV2
Collaborator

It kind of seems like the level of sacrifice required to allow mmaping models is getting bigger and bigger. Maybe that should be the special case, where there's a tool to convert the model to something that can be mmaped directly for people that really care about making loading a little bit faster.

Without the limitation of models having to be mmapable, doing small conversions online during the load process would be easy. Personally, I'm perfectly happy to trade slightly slower model loading for lower memory usage/faster inference and more options.
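
As an illustration of what such an online conversion could look like for this particular change, the sketch below narrows the F32 scale of old-layout Q4_0 blocks to F16 while copying the quants unchanged. This is only a sketch and not something the PR implements. The struct and function names are hypothetical, and the result is not bit-identical to requantizing from the original weights, because the quants were computed against the unrounded scale.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QK4_0 32

typedef struct { float    d; uint8_t qs[QK4_0 / 2]; } block_q4_0_f32; /* old layout */
typedef struct { uint16_t d; uint8_t qs[QK4_0 / 2]; } block_q4_0_f16; /* new layout */

/* float -> half, round to nearest even (portable stand-in for GGML_FP32_TO_FP16) */
static uint16_t fp32_to_fp16(float f) {
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    const uint32_t sign = (x >> 16) & 0x8000u;
    const uint32_t e32  = (x >> 23) & 0xFFu;
    uint32_t man        = x & 0x7FFFFFu;
    const int32_t e16   = (int32_t)e32 - 127 + 15;
    if (e32 == 0xFFu) return (uint16_t)(sign | 0x7C00u | (man ? 0x200u : 0u)); /* inf / NaN */
    if (e16 >= 31)    return (uint16_t)(sign | 0x7C00u);  /* too large for half -> inf */
    if (e16 <= 0) {                                       /* half subnormal or zero */
        if (e16 < -10) return (uint16_t)sign;
        man |= 0x800000u;
        const uint32_t shift = (uint32_t)(14 - e16);
        uint32_t h = man >> shift;
        const uint32_t rem  = man & ((1u << shift) - 1u);
        const uint32_t half = 1u << (shift - 1u);
        if (rem > half || (rem == half && (h & 1u))) h++;
        return (uint16_t)(sign | h);
    }
    uint32_t h = ((uint32_t)e16 << 10) | (man >> 13);
    const uint32_t rem = man & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) h++;
    return (uint16_t)(sign | h);
}

/* Convert n old-layout blocks to the new layout: narrow the scale, copy the quants. */
static void convert_q4_0_blocks(const block_q4_0_f32 *src, block_q4_0_f16 *dst, size_t n) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i].d = fp32_to_fp16(src[i].d);
        memcpy(dst[i].qs, src[i].qs, sizeof dst[i].qs);
    }
}
```

A loader doing this cannot mmap the old file directly, since the converted blocks have a different size (18 bytes versus 20). That is exactly the trade-off under discussion.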

@ggerganov
Owner Author

ggerganov commented May 19, 2023

> Hi again, just wondering if backward compatibility is possible (e.g. transparently handle f32->f16 scaling factors when loading older quantizations during model load)?
>
> If not, and it breaks all old models, can I request an increase in LLAMA_FILE_VERSION so this change can be detected and handled if it ends up being merged?
>
> Edit: also looked at the chart, seems like q4_0 actually gets a regression in speed with this change?

This change breaks Q4_0, Q4_1 and Q8_0 models

LLAMA_FILE_VERSION has been increased to 3 together with GGML_QNT_VERSION to 2
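
With the version bump, a loader can reject incompatible files with a message instead of misreading them. The sketch below considers only the break introduced by this PR. The enum, its values and the function name `file_is_loadable` are hypothetical and do not reflect llama.cpp's actual loader code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define LLAMA_FILE_VERSION 3  /* value after this PR, as stated above */

/* Hypothetical type tags for illustration; not ggml's real enum values. */
enum qtype { QT_F16, QT_Q4_0, QT_Q4_1, QT_Q5_0, QT_Q5_1, QT_Q8_0 };

/* Returns 1 if the file can be loaded, 0 otherwise (with a message in err). */
static int file_is_loadable(uint32_t file_version, enum qtype type,
                            char *err, size_t err_len) {
    assert(err != NULL && err_len > 0);
    err[0] = '\0';
    if (file_version > LLAMA_FILE_VERSION) {
        snprintf(err, err_len, "file version %u is newer than supported version %d",
                 (unsigned)file_version, LLAMA_FILE_VERSION);
        return 0;
    }
    /* Only the formats whose block layout changed in this PR are affected. */
    const int changed = type == QT_Q4_0 || type == QT_Q4_1 || type == QT_Q8_0;
    if (file_version < LLAMA_FILE_VERSION && changed) {
        snprintf(err, err_len,
                 "file version %u uses F32 scales for this quantization; "
                 "please requantize the model", (unsigned)file_version);
        return 0;
    }
    return 1;
}
```

Under these assumptions a version 2 Q4_0 file is refused, while a version 2 Q5_1 file still passes, because Q5_0 and Q5_1 are not among the formats this PR changes.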

The regression in Q4_0 occurs only when using 4 threads. My explanation is that at 4 threads we are not yet memory bound, so the F16 -> F32 conversion has a small negative impact on overall performance due to the extra compute. With more threads, the computation is memory bound, so we win by reducing the amount of memory used by the weights

Regarding the performance concerns raised by @ilyakurdyukov and @KerfuffleV2:
Since the computation on the CPU is memory-bound, this change actually improves performance.
Additionally, smaller weights allow more of them to be loaded into VRAM, which brings a further improvement when using the GPU

Alternative memory layouts are being explored in #1073 and #1256

@KerfuffleV2
Collaborator

My concern wasn't really about performance specifically, but the restriction of being tied to a mmapable file format. There isn't necessarily a one-size-fits-all approach to the model format that will be ideal for every CPU/GPU like ilyakurdyukov mentioned.

If the model file must be mmapable then you basically have to try to pick some middle ground even if there are small fixups that could easily be done. It also makes changes that break model file compatibility a lot more frequent, since the format must be directly evaluable exactly as it exists on disk.

Now that there's GPU support, it may be even more of an issue. From what I know, GPUs (generally) like 16bit floats, x86 CPUs at least (generally) like 32bit floats. You have to choose one or the other at the moment, it seems, even if it's not ideal.

@ggerganov ggerganov merged commit 2d5db48 into master May 19, 2023
@ggerganov ggerganov deleted the qnt-f16 branch May 19, 2023 19:17
@YellowRoseCx
Contributor

YellowRoseCx commented May 19, 2023

Hey Mr @ggerganov now that the F16 implementation has been merged, could you please increase LLAMA_FILE_VERSION so this change can be detected and handled for backwards compatibility with other models?

Do you have any thoughts on transparently handling f32->f16 scaling factors when loading older quantizations during model load?

@SlyEcho
Collaborator

SlyEcho commented May 19, 2023

Loading older models now results in a crash instead of an error message, as expected.

ggerganov added a commit to JohannesGaessler/llama.cpp that referenced this pull request May 20, 2023
* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0

* llama : bump LLAMA_FILE_VERSION to 3

* cuda : update Q4 and Q8 dequantize kernels

* ggml : fix AVX dot products

* readme : update performance table + hot topics
ggerganov added a commit that referenced this pull request May 20, 2023
…oadcasting for ggml_mul (#1483)

* Broadcasting for ggml_mul

* CUDA kernel for ggml_mul, norms in VRAM

* GPU weights not in RAM, direct loading with cuFile

* fixup! GPU weights not in RAM, direct loading with cuFile

* fixup! GPU weights not in RAM, direct loading with cuFile

* define default model path once, sync path with readme (#1366)

* ~7% faster Q5_1 AVX2 code (#1477)

* convert.py: Support models which are stored in a single pytorch_model.bin (#1469)

* Support models in a single pytorch_model.bin

* Remove spurious line with typo

* benchmark-matmul: Print the average of the test results (#1490)

* Remove unused n_parts parameter (#1509)

* Fixes #1511 lambda issue for w64devkit (mingw) (#1513)

* Fix for w64devkit and mingw

* make kv_f16 the default for api users (#1517)

* minor : fix compile warnings

* readme : adds WizardLM to the list of supported models (#1485)

* main : make reverse prompt option act as a stop token in non-interactive mode (#1032)

* Make reverse prompt option act as a stop token in non-interactive scenarios

* Making requested review changes

* Update gpt_params_parse and fix a merge error

* Revert "Update gpt_params_parse and fix a merge error"

This reverts commit 2bb2ff1.

* Update gpt_params_parse and fix a merge error take 2

* examples : add persistent chat (#1495)

* examples : add persistent chat

* examples : fix whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* tests : add missing header

* ggml : use F16 instead of F32 in Q4_0, Q4_1, Q8_0 (#1508)

* ggml : use F16 instead of F32 in Q4_0, Q4_1 and Q8_0

* llama : bump LLAMA_FILE_VERSION to 3

* cuda : update Q4 and Q8 dequantize kernels

* ggml : fix AVX dot products

* readme : update performance table + hot topics

* ggml : fix scalar implementation of Q4_1 dot

* llama : fix compile warnings in llama_set_state_data()

* llama : fix name shadowing and C4146 (#1526)

* Fix name shadowing and C4146

* Fix if macros not using defined when required

* Update llama-util.h

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Update llama-util.h

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Code style

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix for mingw (#1462)

* llama : add llama_init_backend() API (close #1527)

* feature : add blis and other BLAS implementation support (#1502)

* feature: add blis support

* feature: allow all BLA_VENDOR to be assigned in cmake arguments. align with whisper.cpp pr 927

* fix: version detection for BLA_SIZEOF_INTEGER, recover min version of cmake

* Fix typo in INTEGER

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Revert "feature : add blis and other BLAS implementation support (#1502)"

This reverts commit 07e9ace.

* GPU weights not in RAM, direct loading with cuFile

* llama : code style fixes + progress print fix

* ggml : ggml_mul better broadcast support

* cmake : workarounds for cufile when CMake version < 3.25

* gg rebase fixup

* Loop in llama.cpp, fixed progress callback

* Attempt clang-tidy fix

* llama : fix vram size computation

* Add forgotten fclose()

---------

Co-authored-by: András Salamon <ott2@users.noreply.github.com>
Co-authored-by: Ilya Kurdyukov <59548320+ilyakurdyukov@users.noreply.github.com>
Co-authored-by: Tom Jobbins <784313+TheBloke@users.noreply.github.com>
Co-authored-by: rankaiyx <rankaiyx@rankaiyx.com>
Co-authored-by: Stephan Walter <stephan@walter.name>
Co-authored-by: DannyDaemonic <DannyDaemonic@gmail.com>
Co-authored-by: Erik Scholz <Green-Sky@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: David Kennedy <dakennedyd@gmail.com>
Co-authored-by: Jason McCartney <jmac@theroot.org>
Co-authored-by: Evan Jones <evan.q.jones@gmail.com>
Co-authored-by: Maxime <672982+maximegmd@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zenix <zenixls2@gmail.com>
@Green-Sky
Collaborator

$ ./main -m /mnt/nvme/ml/llama/ggjt_3/65B/llama-65B-q4_0-ggjt3.bin -ngl 16 -b 1 -n 10 -t 8 -s 3 -p "Once" -c 128

... I know I cheated a little, batchsize of 1, so it fits into vram (idk looks like it allocates vram based on batch size), only context size of 128

WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 580 (3f008ca)
main: seed  = 3
llama.cpp: loading model from /mnt/nvme/ml/llama/ggjt_3/65B/llama-65B-q4_0-ggjt3.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 128
llama_model_load_internal: n_embd     = 8192
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 64
llama_model_load_internal: n_layer    = 80
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 22016
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size =    0,18 MB
llama_model_load_internal: mem required  = 31661,46 MB (+ 5120,00 MB per state)
llama_model_load_internal: [cublas] offloading 16 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 6949 MB
.....................
llama_init_from_file: kv self size  =  320,00 MB

system_info: n_threads = 8 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1,100000, presence_penalty = 0,000000, frequency_penalty = 0,000000, top_k = 40, tfs_z = 1,000000, top_p = 0,950000, typical_p = 1,000000, temp = 0,800000, mirostat = 0, mirostat_lr = 0,100000, mirostat_ent = 5,000000
generate: n_ctx = 128, n_batch = 1, n_predict = 10, n_keep = 0


 Once again we are happy to be able to offer a
llama_print_timings:        load time = 34773,73 ms
llama_print_timings:      sample time =    15,87 ms /    10 runs   (    1,59 ms per token)
llama_print_timings: prompt eval time =     0,00 ms /     1 tokens (    0,00 ms per token)
llama_print_timings:        eval time = 101378,22 ms /    11 runs   ( 9216,20 ms per token)
llama_print_timings:       total time = 119457,45 ms

so I can run 65B with 32gig ram and 8gig vram with 9.2sec/tok

might get a little faster + 1 more layer on gpu, when there is no desktop session running..

🎉

@Rhialto

Rhialto commented May 23, 2023

@KerfuffleV2 I don't agree with your dislike of mmap-able file formats. If you actually need to read the file, you need to read the whole thing before you can even start doing anything. If you mmap it, you can start right away, and the parts that you're not using (yet) do not need to be read.
Furthermore, mmaping saves you half the memory. When reading the file, you have one copy of the whole data in the process address space, and another sitting in your file cache. When mmap-ing, it's in memory only once. It can also be shared by multiple running copies of the program. And if you quit the program and restart it, pretty much everything should still be cached and you can do work right away. Without mmap, on the other hand, the program has to read all the data again, and even if it is all still in the file cache, that takes time.
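
For readers unfamiliar with the mechanism being discussed, here is a minimal POSIX sketch of mapping a file read-only. The helper names `map_file` and `unmap_file` are made up for this example, and llama.cpp's own loader is more involved.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns NULL on failure.
 * Pages are faulted in from the page cache on first access, not copied up front. */
static const void *map_file(const char *path, size_t *size_out) {
    assert(path != NULL && size_out != NULL);
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) return NULL;
    *size_out = (size_t)st.st_size;
    return p;
}

/* Release a mapping obtained from map_file(). */
static void unmap_file(const void *p, size_t size) {
    munmap((void *)p, size);
}
```

Because the mapping is read-only and backed by the file, the bytes the program sees are exactly the bytes on disk, so the on-disk format has to be directly evaluable.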

@KerfuffleV2
Collaborator

> I don't agree with your dislike of mmap-able file formats.

You're attacking a straw man here, in reality I never said anything about disliking mmap-able file formats. Also, in reality, I do not in fact dislike them.

All I said is that the tradeoffs might not be worth it at this point.

> If you actually need to read the file, you need to read the whole thing before you can even start doing anything.

You need the entirety of the model data before you can generate the first token. So whether or not it's mmaped, the entire thing must be read from disk at least once before you see a single output. Pretty sure this also applies to feeding a token from the prompt as well.

> Furthermore, mmaping saves you half the memory.

That's a weird way to look at it, since buffer cache isn't really used memory. Anyway, you're arguing like I said there weren't benefits to mmaping, which is not true. There are both pros and cons to both the mmap and non-mmap approach.

mmaping can use the buffer cache more effectively (it can also be faster in some cases). However, a trade off is that the exact data, byte for byte, must be evaluable. That means if a format is optimal for your architecture (CPU or GPU) but not another, then you have to make a choice. You can't convert it to the format that's most optimal for actual evaluation on your specific system, you have to live with whatever the common format is.

> And if you quit the program and restart it, pretty much everything should still be cached and you can do work right away.

In practice, this takes a very small amount of time when everything's already in the buffer cache. I'd say, generally speaking most people are going to care more about the time/memory required for actually generating tokens as opposed to making loading the model file slightly faster. The majority of the time is spent doing inference, not loading a model (especially if the cache is hot).

However, like I already mentioned, if it was decided that the tradeoff of being able to mmap and evaluate the models directly wasn't worth it anymore that wouldn't mean it isn't possible to make a simple tool that just creates a mmapable model from the more flexible distributable one.

So people who really care about load time can have their cake and eat it too.

@EwoutH
Contributor

EwoutH commented May 24, 2023

This is really amazing work!

With this PR merged, should the Memory/Disk Requirements in the Readme also be updated?

@Dwedit

Dwedit commented May 24, 2023

Since people will be able to refine a 4-bit model directly, a full precision non-quantized version of that model might not even exist. We really need some kind of way to convert a model to the new format without requantizing it from a bigger file.

@KerfuffleV2
Collaborator

> We really need some kind of way to convert a model to the new format without requantizing it from a bigger file.

Quantization is lossy "compression". To use those models without a large quality loss it would probably be necessary for GGML to be able to run inference on them without converting the actual tensor data again.

May make sense for GGML to take the direction of coming up with its own format that accomplishes the same thing, especially since formats like Q5_0, Q5_1 have substantially higher quality than 4bit quantization and don't take up much more memory/disk.
