threadpool: skip polling for unused threads #9461
Conversation
Haven't looked at the changes yet, but on Mac have you tried running the thread sanitizer? It detects data races when running in CPU-only mode, even with this PR. It started happening after #8672. Here are the steps to run it:

cmake \
    -DCMAKE_BUILD_TYPE=Debug \
    -DLLAMA_SANITIZE_THREAD=ON \
    -DGGML_METAL=OFF \
    -DGGML_LLAMAFILE=OFF ..

make -j && ./bin/llama-simple -m ${model} -p "hello" -ngl 0

You should see output like this:

We have to find a way to resolve these warnings.
Oh. I assumed those were benign. We did have a bunch of similar thread sanitizer warnings on x86-64 with OpenMP before the threadpool changes, so I figured it's some minor overlap in matmul kernels.
@ggerganov @slaren Quick update on the thread sanitizer warnings. Built for Mac with OpenMP and Thread Sanitizer using LLVM 18 installed via homebrew, and we're getting a bunch of those warnings. That's why I assumed they are sort of known / benign. Perhaps they are not? The threadpool support makes timing/behavior very similar to OpenMP, which is why those warnings are now showing up in the default builds (i.e. the threadpool is enabled by default on the Mac with Apple toolchains). As I mentioned earlier, we do see a bunch of those sanitizer warnings on x86-64 with openmp/threadpool as well. How do you guys want to proceed?
I don't think that the warnings reported by the thread sanitizer here are benign. OpenMP has known compatibility issues with Thread Sanitizer since it is not aware of the synchronization mechanism used by OpenMP, but this should not happen when using plain pthreads and atomics. I believe that this is due to using relaxed memory order in ggml_barrier. It could probably be done with more relaxed memory order, but these changes (on top of this PR) seem to fix the tsan warnings:

diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index c3b462b3..a49d3992 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -3188,7 +3188,7 @@ static void ggml_barrier(struct ggml_threadpool * threadpool) {
 }
 #else
 static void ggml_barrier(struct ggml_threadpool * threadpool) {
-    int n_threads = atomic_load_explicit(&threadpool->n_threads_cur, memory_order_relaxed);
+    int n_threads = atomic_load_explicit(&threadpool->n_threads_cur, memory_order_seq_cst);
     if (n_threads == 1) {
         return;
     }
@@ -3196,16 +3196,16 @@ static void ggml_barrier(struct ggml_threadpool * threadpool) {
     atomic_int * n_barrier = &threadpool->n_barrier;
     atomic_int * n_barrier_passed = &threadpool->n_barrier_passed;

-    int passed_old = atomic_load_explicit(n_barrier_passed, memory_order_relaxed);
+    int passed_old = atomic_load_explicit(n_barrier_passed, memory_order_seq_cst);

     if (atomic_fetch_add(n_barrier, 1) == n_threads - 1) {
         // last thread
         atomic_store(n_barrier, 0);
-        atomic_fetch_add_explicit(n_barrier_passed, 1, memory_order_relaxed);
+        atomic_fetch_add_explicit(n_barrier_passed, 1, memory_order_seq_cst);
     } else {
         // wait for other threads
         while (true) {
-            if (atomic_load_explicit(n_barrier_passed, memory_order_relaxed) != passed_old) {
+            if (atomic_load_explicit(n_barrier_passed, memory_order_seq_cst) != passed_old) {
                 return;
             }
             ggml_thread_cpu_relax();
@@ -12879,7 +12879,8 @@ UseGgmlGemm1:;
     if (ith == 0) {
         // Every thread starts at ith, so the first unprocessed chunk is nth. This save a bit of coordination right at the start.
-        atomic_store_explicit(&params->threadpool->current_chunk, nth, memory_order_relaxed);
+        //atomic_store_explicit(&params->threadpool->current_chunk, nth, memory_order_relaxed);
+        atomic_store(&params->threadpool->current_chunk, nth);
     }

     ggml_barrier(params->threadpool);
@@ -12990,7 +12991,8 @@ UseGgmlGemm2:;
             break;
         }
-        current_chunk = atomic_fetch_add_explicit(&params->threadpool->current_chunk, 1, memory_order_relaxed);
+        //current_chunk = atomic_fetch_add_explicit(&params->threadpool->current_chunk, 1, memory_order_relaxed);
+        current_chunk = atomic_fetch_add(&params->threadpool->current_chunk, 1);
     }
 }
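For readers following the memory-ordering argument, here is a minimal, self-contained sketch (not the ggml code; barrier_t and its fields are illustrative) of the edge tsan wants to see: the publishing side must use at least release semantics and the waiting side at least acquire semantics on the same atomic, otherwise the Op's memory writes are not formally ordered before the barrier exit.

#include <stdatomic.h>

typedef struct {
    atomic_int n_done;
    int        n_threads;
} barrier_t;

// one-shot barrier: every thread calls this after writing its slice of output
void barrier_wait(barrier_t * b) {
    // acq_rel: release this thread's prior writes, and acquire the writes of
    // every thread that incremented n_done before us
    int done = atomic_fetch_add_explicit(&b->n_done, 1, memory_order_acq_rel) + 1;
    if (done == b->n_threads) {
        return; // last thread: it has acquired everyone else's increments
    }
    // acquire: once the final count is observed, the release sequence on
    // n_done guarantees all threads' prior writes are visible here
    while (atomic_load_explicit(&b->n_done, memory_order_acquire) < b->n_threads) {
        // spin
    }
}

With both sides relaxed, the same code still "works" on strongly ordered hardware, but the language gives no happens-before guarantee, which is exactly what tsan reports.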
@slaren This patch fixes the sanitizer warnings on my end.
Yes, I agree. Reading on the internet about this, it appears that when OpenMP is enabled, the sanitizers can report issues, which is to be expected and there is not much we can do about it. We just have to make sure there are no issues when OpenMP is off.
I did a bunch of digging and actually convinced myself that the warnings are benign :). In our case, we just need to make sure that the thread sanitizer understands that ggml_barrier() enforces ordering. Take a look at the latest updates. I made everything explicit and documented. Thread sanitizer is happy now and performance looks good (same as before).
Quick update. In terms of the overall correctness, I further convinced myself that we should be good on that. As I mentioned above, the main thing we need to make sure is that all CPUs finish processing an Op (matmul, copy, etc.) and that all of the memory writes complete before we exit ggml_barrier. The updates to the threadpool state itself are done only from the main thread, under the mutex (which is also a full barrier), and while the worker threads are either spinning or paused. If you can think of a scenario where we do have a true race condition, do let me know. Maybe I'm missing something.
I am not convinced that the relaxed memory order in ggml_barrier is enough to guarantee that every thread sees all the writes once it leaves the barrier.
I'm going to try to convince you :) There is no need for the threads to complete Op processing at the same time. Re: just doing strict ordering everywhere: it's hard to measure the overhead with high-level tests.
I have no doubt that what you are saying is true in practice for the specific hardware. It certainly is for x86, where all atomic loads/stores have rel/acq semantics and, chances are, both versions of the code generate the exact same asm. I defer to your knowledge about the way this works on ARM. But ultimately we are not programming for any specific hardware, we are programming for the C virtual machine and the semantics specified thereof. Quoting cppreference.com:
The important part here is "and the load in thread B reads a value written by the store in thread A". Thread 0 in your example does not load a value written by thread 1 or thread 2, so there is no guarantee that it will see the writes that happened before the barrier.
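A toy example of the quoted clause (not ggml code; data and flag are illustrative): B synchronizes with A only because it reads the value A stored, while C, which never reads that value, gets no visibility guarantee.

#include <stdatomic.h>

int data = 0;
atomic_int flag;

void thread_A(void) {
    data = 42;                                             // plain write
    atomic_store_explicit(&flag, 1, memory_order_release); // publish it
}

void thread_B(void) {
    // B loads the value written by A's store -> "synchronizes-with" A,
    // so data == 42 is guaranteed after the loop
    while (atomic_load_explicit(&flag, memory_order_acquire) != 1) { }
}

void thread_C(void) {
    // C never reads the value A stored, so there is no synchronizes-with
    // edge with A; reading `data` here would be a data race
}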
@slaren Sorry for the delayed response. The threading/sequence example I provided above is actually generic and assumes the C/C++ memory order semantics (not a specific arch). Perhaps I shortened the ops a bit too much. Here is the reference from the same source (https://en.cppreference.com/w/c/atomic/memory_order):
On arm64 (armv8.2-a and up) that translates to the LDADDAL instruction. In other words, once all the threads go through that read-modify-write, they are fully synchronized. btw, the Thread Sanitizer issue I linked to earlier (about the fences) is similar in the sense that 'atomic_fetch_add_explicit(n_barrier, 1, memory_order_seq_cst)' is acting as a full fence, and OpenMP causes the exact same confusion for the Thread Sanitizer. Now, the new test that I added (it does tons of ggml_barriers) did highlight the need for making the memory ordering in ggml_barrier fully explicit. M2 Max and Snapdragon Gen 3 are looking good, but I didn't get a chance to do more testing on the Ryzen, EPYC and X-Elite yet. Will try to do that later today and provide an update.
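Putting the pieces together, a minimal sketch of the barrier shape described here (relaxed polling plus a seq-cst read-modify-write on entrance and exit); the struct and the cpu_relax helper are stand-ins, and the merged ggml_barrier may differ in details.

#include <stdatomic.h>

// minimal stand-in for the PR's threadpool state (sketch, not ggml's struct)
struct threadpool_sketch {
    atomic_int n_threads_cur;
    atomic_int n_barrier;
    atomic_int n_barrier_passed;
};

static inline void cpu_relax(void) { /* e.g. __asm__ volatile("yield") on arm64 */ }

static void barrier_sketch(struct threadpool_sketch * tp) {
    int n_threads = atomic_load_explicit(&tp->n_threads_cur, memory_order_relaxed);
    if (n_threads == 1) {
        return;
    }

    int passed_old = atomic_load_explicit(&tp->n_barrier_passed, memory_order_relaxed);

    // entrance: seq-cst read-modify-write acts as a full fence (LDADDAL on arm64)
    if (atomic_fetch_add_explicit(&tp->n_barrier, 1, memory_order_seq_cst) == n_threads - 1) {
        // last thread in: reset the counter and release the waiters
        atomic_store_explicit(&tp->n_barrier, 0, memory_order_relaxed);
        // exit: another full fence
        atomic_fetch_add_explicit(&tp->n_barrier_passed, 1, memory_order_seq_cst);
        return;
    }

    // cheap relaxed polling while waiting...
    while (atomic_load_explicit(&tp->n_barrier_passed, memory_order_relaxed) == passed_old) {
        cpu_relax();
    }
    // ...but go through a full fence before returning to the next Op
    atomic_fetch_add_explicit(&tp->n_barrier_passed, 0, memory_order_seq_cst);
}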
Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1). This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur). n_threads_cur is now an atomic_int to explicitly tell the thread sanitizer that it is written from one thread and read from other threads (not a race condition).
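A small sketch of the check this commit describes, assuming stand-in names for the per-worker state:

#include <stdatomic.h>
#include <stdbool.h>

// sketch: per-worker state, mirroring the names in the commit message
struct worker_state_sketch {
    int          ith;           // this worker's index
    atomic_int * n_threads_cur; // number of threads used by the current graph
};

static bool worker_should_poll(const struct worker_state_sketch * s) {
    // unused for this graph (ith >= n_threads_cur): skip polling entirely
    return s->ith < atomic_load_explicit(s->n_threads_cur, memory_order_relaxed);
}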
Avoid using strict memory order while polling, yet make sure that all threads go through a full memory barrier (memory fence) on ggml_barrier entrance and exit.
This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.
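As a rough illustration of what such a microbenchmark measures, a sketch using POSIX barriers (this is not the actual test-barrier source, which drives ggml graphs of small matmuls, and pthread_barrier_t is not available on macOS):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define N_THREADS 8
#define N_ROUNDS  100000

static pthread_barrier_t barrier;

static void * worker(void * arg) {
    (void) arg;
    for (int i = 0; i < N_ROUNDS; i++) {
        pthread_barrier_wait(&barrier); // each round is dominated by sync cost
    }
    return NULL;
}

int main(void) {
    pthread_t threads[N_THREADS];
    pthread_barrier_init(&barrier, NULL, N_THREADS);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N_THREADS; i++) pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++) pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.1f ns per barrier\n", sec / N_ROUNDS * 1e9);
    pthread_barrier_destroy(&barrier);
    return 0;
}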
Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order to keep it efficient; once the new graph is detected we do a full fence using a read-modify-write with strict memory order.
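A sketch of that polling pattern, with n_graph as an assumed generation counter bumped by the main thread when a new graph is submitted (names are illustrative):

#include <stdatomic.h>

static inline void cpu_relax2(void) { /* yield / pause */ }

static void wait_for_new_graph(atomic_int * n_graph, int * last_graph) {
    // relaxed polling: cheap while idle, no fence traffic
    while (atomic_load_explicit(n_graph, memory_order_relaxed) == *last_graph) {
        cpu_relax2();
    }
    // new graph detected: a seq-cst read-modify-write gives a full fence
    // before we touch the graph data written by the main thread
    *last_graph = atomic_fetch_add_explicit(n_graph, 0, memory_order_seq_cst);
}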
Do not use threadpool->ec (exit code) to decide whether to exit the compute loop. threadpool->ec is not atomic, which makes the thread sanitizer rightfully unhappy about it. Instead introduce an atomic threadpool->abort flag used for this. This is consistent with how we handle threadpool->stop or pause. While at it, add an explicit atomic_load for n_threads_cur for consistency.
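A sketch of the loop shape this describes; tp->abort follows the commit message, while the other names are stand-ins:

#include <stdatomic.h>

struct threadpool_abort_sketch {
    atomic_bool abort; // the new atomic abort flag
};

// sketch: compute_node() and barrier() stand in for the real ggml functions
static void compute_graph_sketch(struct threadpool_abort_sketch * tp, int n_nodes,
                                 void (*compute_node)(int), void (*barrier)(void)) {
    for (int node = 0; node < n_nodes; node++) {
        compute_node(node);
        // exit via the atomic flag instead of the non-atomic exit code (ec)
        if (atomic_load_explicit(&tp->abort, memory_order_relaxed)) {
            break;
        }
        barrier();
    }
}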
Fixes use-after-free detected by the gcc thread sanitizer on x86-64. For some reason the llvm sanitizer is not detecting this issue.
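The fix amounts to tearing down in dependency order; a two-line sketch assuming the threadpool API names from this PR series:

// sketch of the fixed teardown order in the test
ggml_threadpool_free(threadpool); // stop/join the workers first...
ggml_free(ctx);                   // ...then free the context/tensors they were using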
Take a look at the latest. Seems like this should be good to go:
Tested on M2 Max, Snapdragon Gen3 (S24 Ultra), Ryzen 3950X, EPYC 7543.
Performance-wise things are looking good. On arm64 our threadpool does quite a bit better than OpenMP. Here are some microbenchmark numbers using that new test.
S24 Ultra
Our default Android NDK armv8.7 build, with and without OpenMP.
AMD Ryzen 9 3950X
LLVM 18 build.
Looks good to me. I get these results with a 13900K:
Definitely makes sense to follow up on improving the n_threads > 8 cases on x86-64 in the next iteration (i.e. as part of the threadpool v3 that we discussed).
I've done a few tests on M1 Pro, M2 Ultra and Ryzen 9 5950X and all seems good. Thank you.
OpenMP with Metal is broken after this commit on an M2 with Sequoia 15.0, using clang 18.1.8.
Oh. Interesting. Can you please share how exactly you built it?
Ok, I was able to reproduce it on M2 Max with Sequoia 15 and llvm 18. Interestingly enough, if just a single layer runs on the CPU then it works fine. I'll try to figure out what exactly broke with OpenMP in this case. It's not immediately obvious.
Fixed in
* threadpool: skip polling for unused threads
Currently all threads do N polling rounds even if only 1 thread is active (n_threads_cur == 1). This commit adds a check to skip the polling for unused threads (ith >= n_threads_cur). n_threads_cur is now an atomic_int to explicitly tell the thread sanitizer that it is written from one thread and read from other threads (not a race condition).
* threadpool: further simplify and improve ggml_barrier
Avoid using strict memory order while polling, yet make sure that all threads go through a full memory barrier (memory fence) on ggml_barrier entrance and exit.
* threads: add simple barrier test
This test does lots of small, parallel matmul ops where the barriers in between dominate the overhead.
* threadpool: improve thread sync for new-graphs
Using the same tricks as ggml_barrier. All the polling is done with relaxed memory order to keep it efficient; once the new graph is detected we do a full fence using a read-modify-write with strict memory order.
* threadpool: improve abort handling
Do not use threadpool->ec (exit code) to decide whether to exit the compute loop. threadpool->ec is not atomic, which makes the thread sanitizer rightfully unhappy about it. Instead introduce an atomic threadpool->abort flag used for this. This is consistent with how we handle threadpool->stop or pause. While at it, add an explicit atomic_load for n_threads_cur for consistency.
* test-barrier: release threadpool before releasing the context
Fixes use-after-free detected by the gcc thread sanitizer on x86-64. For some reason the llvm sanitizer is not detecting this issue.
harmonize formatting of tensor type conditions
commit ce86019
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Wed Aug 21 12:25:38 2024 +0200
change function use_*_bits into difquant_*_tensors
this to clarify what it does, especially with the 5 additional levels of difquant
commit cfe866e
Merge: fddff02 fc54ef0
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Wed Aug 21 12:23:41 2024 +0200
Merge branch 'master' into pr/8836
commit fddff02
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Mon Aug 19 01:43:31 2024 +0200
Rework IQ3_XXS and IQ3_XS
and fix parenthesis mistake on IQ3_S
commit 207ffe6
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 18 23:28:13 2024 +0200
Reorder, corrections, settling lower IQ3 quants
commit 8c1a3c5
Merge: a7f9164 cfac111
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Tue Aug 20 00:48:05 2024 +0200
Merge branch 'master' into pr/8836
commit a7f9164
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Mon Aug 19 16:02:00 2024 +0200
Fix mistake
commit caeb839
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 18 17:58:17 2024 +0200
Boost embeddings and output weights for MOEs.
They are single and non-repeating, the boost is thus reasonable compared to the 4 or more experts size.
commit 503048a
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 18 17:44:11 2024 +0200
Correct IQ3_M
commit ddb1373
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 18 16:56:55 2024 +0200
IQ3_XXL and IQ3_XXXL
We now have a full range of quants between IQ3_M and IQ4_XS
commit a79633b
Merge: b02eaf6 554b049
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 18 22:12:39 2024 +0200
Merge branch 'master' into pr/8836
commit b02eaf6
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 17 14:58:25 2024 +0200
Mass use of the few/some/more/many bits bump logic
Add few bits logic and rework the 4 settings for 25/37.5/50/75% quant bump when used.
commit 4ba5618
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 17 12:31:36 2024 +0200
Adapt token embeddings and output.weight to vocab size
due to the huge increase of the embeddings and output weight size for models with huge vocab, they seem to quantize with less loss.
commit 17b7151
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 17 00:17:41 2024 +0200
Update IQ3_M attn_k and IQ3_XL token_embd
commit e4c506d
Merge: eeccd31 2fb9267
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 18 04:09:22 2024 +0200
Merge branch 'master' into pr/8836
commit eeccd31
Merge: 8c9017b 5fd89a7
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Thu Aug 15 02:30:10 2024 +0200
Merge branch 'master' into pr/8836
commit 8c9017b
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Mon Aug 12 22:20:02 2024 +0200
Simplify IQ4_XSR
But leave in place as a "demo" the more complex template set by Ikawrakow to customize the layers quants, with the added attn_q, attn_k, and attn_output tensors.
commit 8c10533
Merge: cd92ba6 fc4ca27
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Mon Aug 12 20:28:38 2024 +0200
Merge branch 'master' into pr/8836
commit cd92ba6
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Mon Aug 12 19:45:46 2024 +0200
IQ4_XSR (test FTYPE) and attention_wv logic for all attn_*.weights
Also, Advise iMatrix for IQ2_M and Q2_K FTypes
commit 3e2eb6d
Merge: df9e6fd df5478f
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Mon Aug 12 14:25:23 2024 +0200
Merge branch 'master' into pr/8836
commit df9e6fd
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 11 21:49:23 2024 +0200
Adjustments on output and embeddings
commit 1ad18f8
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 11 21:44:29 2024 +0200
Adjustments on attn_k
commit 8c2c03f
Merge: 91db53b 8cd1bcf
Author: Nexes the Old <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 11 16:46:15 2024 +0200
Merge b3569
b3569
commit 91db53b
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 11 16:41:23 2024 +0200
IQ1_XL and some corrections
notably on attn_q and parenthesis
commit 1268d58
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 11 02:13:08 2024 +0200
More adjustments
commit ef83a87
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 11 01:30:18 2024 +0200
Revert of ffn gate and up on IQ3_M
and indent
commit e2e2d77
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 11 01:13:12 2024 +0200
misplaced file lol
commit 8ad71f4
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 11 01:11:24 2024 +0200
IQ1_XS
and small adjustments.
commit 14f4f40
Merge: 8bc7a98 6e02327
Author: Nexes the Old <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 10 20:45:26 2024 +0200
Merge b3565
Merge b3565
commit 8bc7a98
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 10 20:40:27 2024 +0200
2 forgotten files
commit f0806ac
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 10 20:34:17 2024 +0200
IQ2_XL , IQ3_XL , Q2_K_L
Plus some adjustments on the FFNs
commit 49617b1
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 10 18:37:29 2024 +0200
Advancing on several tensors
- Progressivity for token embeddings and attn_qkv
- FFN down for IQ1 and IQ2 quants
- FFN gate and up for IQ2_S and IQ2_M, for progressivity in the IQ2 range.
commit 415d5e4
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 10 17:32:29 2024 +0200
Refactor furthermore attn.v
And also lower attn_q for IQ2_XS, in order to separate it more for the quite misnamed IQ2_S
commit 8c8e43c
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 10 16:38:11 2024 +0200
Settings for MOE >= 8 experts applied to >= 4 experts
commit aa4eb59
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 10 16:33:55 2024 +0200
Further refactor attn_k
With attn_k set for all quants bellow 3bpw except Q2_K_S.
commit 8f1b99f
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 10 13:09:11 2024 +0200
Shortening formatting
commit 7212098
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 10 12:52:57 2024 +0200
IQ1 and IQ2 refactor
Attn_q in Q3_K for experts >= 8
Attn_k in Q5_K for experts >= 8
Attn_v in Q6_K for experts >= 8, in IQ3_XXS for IQ2_XXS and IQ2_XS
Attn_output in Q4_K for experts >= 8
commit 1bc4dc5
Author: Nexesenex <124105151+Nexesenex@users.noreply.github.com>
Date: Fri Aug 9 22:49:42 2024 +0200
Bump IQ3_M
attn.v in Q5_K
attn.k in IQ4_XS
commit 1118c04
Author: Nexes the Old <124105151+Nexesenex@users.noreply.github.com>
Date: Thu Aug 8 18:56:20 2024 +0200
correct mistake in conditionality for attn.k
commit 8006b15
Author: Nexes the Old <124105151+Nexesenex@users.noreply.github.com>
Date: Thu Aug 8 18:50:48 2024 +0200
Avoid to shrink attn.k.weight for IQ3_XS and XXS when GQA or MOE
commit 59c5d47
Author: Nexes the Old <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 4 12:06:06 2024 +0200
attn_qkv.weight in IQ4_XS for FTYPE IQ3_M
If FTYPE IQ4_XS has attn_qkv.weight in IQ4_XS, then FTYPE IQ3_M should not have it in Q4_K (4.5BPW), but in IQ4_XS (4.25BPW) also.
commit 93c35f8
Author: Nexes the Old <124105151+Nexesenex@users.noreply.github.com>
Date: Sun Aug 4 11:59:52 2024 +0200
attn.output.tensor of FYPE IQ3_M in IQ4_XS
If FTYPE IQ4_XS has attn.output.tensor in IQ4_XS (4.5BPW), there's no reason to have FTYPE IQ3_M to have attn.output.tensor in Q4_K (4.5BPW).
In terms of perplexity, on a Llama 3.1 70b model, the proposed change reduces the size by 1%, and increases the preplexity by 0.25%.
commit d5779c2
Author: Nexes the Old <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 3 03:04:25 2024 +0200
More occurences of n_experts == 8 changed to >= in quant strategies
commit 7d337d0
Author: Nexes the Old <124105151+Nexesenex@users.noreply.github.com>
Date: Sat Aug 3 01:35:08 2024 +0200
Slight reorder of the attn.weight tree
And application of the attn.v.weight logic I used for IQ2 and IQ3, but only when such logic is already implied by the existing quant strategies, as a compromise to not disturb too much Ikawrakow's quant strategies.
commit 6398663
Author: Nexes the Old <124105151+Nexesenex@users.noreply.github.com>
Date: Fri Aug 2 23:49:03 2024 +0200
Apply the GQA2/Expert2 conditionality to the IQ3 quants
In coherence with the proposed modifications to the IQ2 quant strategies, which make even more sense for the IQ3 quant strategies.
commit b77cdd8
Author: Nexes the Old <124105151+Nexesenex@users.noreply.github.com>
Date: Fri Aug 2 20:40:04 2024 +0200
Small changes for IQ2 quant strategies (notably IQ2_S and IQ2_M)
Here's a few edits I consider useful to improve a bit the IQ2 model quant strategies for some models:
- The tensor attn.v.weight passed in Q4_K for models like Gemma (GQA 2), and the various franken MOEs having 2 experts, this to not sabotage them with a too small value head quant (Q2_K is meh for such important head) while the size of that head is low relatively to the total size of the affected models.
- The tensor attn.k.weight passed in Q4_K for models with 8 experts or more, rather than simply 8 experts.
- The tensor attn.output.weight passed in IQ3_XXS (instead of IQ3_S) for the quant strategies IQ2_S and IQ2_M, this to have a progressiveness between the IQ2_XS quant strategies (which use IQ2_XS for the attn.output.weight) and the IQ3_XXS quant strategies (which use.. IQ3_S quant for attn.output.weight). The benefit of an IQ3_S quant instead of an IQ3_XXS for that tensor is quasi-inexistant on IQ2_S and IQ2_M quant strategies, especially compared to the size bump it provokes.
More broadly, I think that the whole IQ2 quant strategies bunch should be harmonized/refactored like the rest of the quant strategies are established (tensor by tensor), rather than under an different kind of tree mixing these 5 quant strategies.
I'm using these settings (and many more edits) for a long time, with benefit, and I think they could be standard.
Currently all threads do N polling rounds even if only one thread is active (n_threads_cur == 1).
For smaller graphs/models/prompts, the unused threads may end up always polling and never sleeping, because we keep getting new graphs to work on.
This PR adds support for skipping the polling for unused threads (ith >= n_threads_cur). They simply go to sleep, and we wake them up when we get a new graph to work on.
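For illustration, here is a minimal sketch of the idea, assuming pthreads and C11 atomics. The struct layout and the helpers poll_for_work / wait_for_graph / submit_graph are hypothetical stand-ins for this description, not the actual ggml threadpool API:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct threadpool {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    atomic_int      n_threads_cur; // threads participating in the current graph
    atomic_int      n_graph;       // bumped by the main thread for each new graph
};

static bool poll_for_work(struct threadpool * tp, int last_graph) {
    for (int i = 0; i < 1000000; i++) { // bounded spin before falling back to sleep
        if (atomic_load_explicit(&tp->n_graph, memory_order_relaxed) != last_graph) {
            return true;
        }
    }
    return false;
}

static void wait_for_graph(struct threadpool * tp, int ith, int last_graph) {
    // Unused threads (ith >= n_threads_cur) skip the polling phase entirely.
    const int n_cur = atomic_load_explicit(&tp->n_threads_cur, memory_order_relaxed);
    if (ith < n_cur && poll_for_work(tp, last_graph)) {
        return; // picked up the new graph while polling
    }
    // Go to sleep until the main thread submits a new graph.
    pthread_mutex_lock(&tp->mutex);
    while (atomic_load_explicit(&tp->n_graph, memory_order_relaxed) == last_graph) {
        pthread_cond_wait(&tp->cond, &tp->mutex);
    }
    pthread_mutex_unlock(&tp->mutex);
}

// Main thread: publish the new thread count, bump the graph counter,
// and wake any sleeping workers.
static void submit_graph(struct threadpool * tp, int n_threads) {
    atomic_store_explicit(&tp->n_threads_cur, n_threads, memory_order_relaxed);
    pthread_mutex_lock(&tp->mutex);
    atomic_fetch_add_explicit(&tp->n_graph, 1, memory_order_relaxed);
    pthread_cond_broadcast(&tp->cond);
    pthread_mutex_unlock(&tp->mutex);
}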
n_threads_cur is now an atomic_int, to explicitly tell the compiler and the thread sanitizer that it is written from one thread and read from other threads (i.e. free from race conditions). All loads and stores use relaxed memory order, so there is no additional overhead (see the sketch after the scenarios below).
Here are some scenarios with the default build on M2 Max, with debug prints for n_threads updates and for threads going to sleep.
Full offload (Metal)
8 threads are started. Only 1 is active, so the other 7 skip the polling and go to sleep.
CPU only
8 threads are started, and they are all active. Hybrid polling, which is enabled by default, prevents them from going to sleep.
No KV offload
8 threads are started, and we alternate between using all 8 and just one for different parts of the graph.
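For reference, the sketch below shows the single-writer / multi-reader access pattern for n_threads_cur described above (illustrative only; the function names are made up for this note):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_int n_threads_cur; // written by the main thread only

// Main thread: the single writer.
static void set_n_threads_cur(int n) {
    atomic_store_explicit(&n_threads_cur, n, memory_order_relaxed);
}

// Worker threads: readers. With a single writer and atomic loads/stores
// there is no data race, and relaxed order keeps the hot path free of
// extra fences.
static bool thread_is_used(int ith) {
    return ith < atomic_load_explicit(&n_threads_cur, memory_order_relaxed);
}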