Performance e-core bug(?) - only 50% CPU utilization when using all threads - (Win11, Intel 13900k) #842
Let me try to explain this.
For any call to |
Thanks for the response. This should be relevant for most 12th and 13th+ gen Intel CPUs. Today I seem to get the best performance with 30 cores (yesterday that was not the case): inference goes down to 32 ms/token. |
You could try adding this piece of code #572 (comment). You could also lock the thread affinity for the process to only use P-cores in Task Manager and see if that improves performance, as that discussion thread suggests. While the point of the P/E architecture was that the Intel Thread Director, working in tandem with the OS thread scheduler, should know better which core to use for a given task, in reality it doesn't always know. It also doesn't help that all the Windows versions after 7 are garbage and want to eat your resources in the background. But maybe locking apps to the performance cores would make the background crap threads run only on the E-cores and keep the P-cores free for actual work. Also, if you have an AVX512-capable CPU, you could try enabling it (which also disables the E-cores, afaik) and see if AVX512 increases performance even further compared to AVX2. If there is a performance increase, it could be a good idea to add it as a compile flag. |
Looks like a good start, I'll be looking into that. |
Yeah, obviously the compile flag could be auto-detected in make/cmake like the other flags, and just like them the guard gives the option to leave that functionality out. That snippet can be extended with detection by checking the hybrid flag; then, if either a P or E core exists, it's an Intel 12th/13th gen. The actual model ID doesn't need to be matched against a list of current 12th/13th gen processors at this time, and I'm betting not in the future either (even if the architecture changes, I'd bet the current instruction values will stay reserved as they are now). Edit: here's a slightly updated version checking both the arch and the thread. Untested, as I don't have a 12th/13th gen Intel. Edit 2: fixed a bug.
// 1 = P core
// 0 = E core
// -1 = fail / not P/E arch
inline int is_thread_on_p_core(void) {
static unsigned const char a[] = {0x31,0xC9,0xB8,0x1A,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xE8,0x18,0x83,0xF8,0x40,0x0F,0x85,0x06,0x00,0x00,0x00,0xB8,0x01,0x00,0x00,0x00,0xC3,0x83,0xE8,0x20,0xF7,0xD8,0x19,0xC0,0xC3};
return ((int (*)(void)) (void*)((void*)a))();
}
// 1 = hybrid x86 cpu
inline int is_x86_hybrid_cpu(void) {
static unsigned const char a[] = {0x31,0xC9,0xB8,0x07,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xEA,0x0F,0x83,0xE2,0x01,0x89,0xD0,0xC3};
return ((int (*)(void)) (void*)((void*)a))();
}
// 1 = intel 12th/13th gen cpu
inline int is_intel_p_e_core_arch(void) {
return (is_x86_hybrid_cpu() && (is_thread_on_p_core() >= 0));
} |
Hello @anzz1 |
Here are the best options I found for Intel 13th gen: #229 (reply in thread). I tinkered a bit and this is what seemed best on my i5 13500:
Using a single ggml thread with 5 BLAS threads on the 5 other performance cores proceeds quite well, but of course inference is slow. It would be great to be able to set the ggml/BLAS thread counts differently depending on whether it is initial prompt ingestion or inference. Using more than 6 ggml threads is very slow; I believe the efficiency cores are the bottleneck. I opened a PR to OpenBLAS to improve the issue I had with it on Intel 13th gen: OpenMathLib/OpenBLAS#3970 |
@CyberTimon , the example code I posted is useful for debugging purposes only: it only tells you which type of core it's currently running on; it doesn't actually do anything. Since you're on Windows, for now you can simply try locking the "main.exe" process to the P-cores as discussed earlier. Locking the affinity might not be the most optimal route, though. There is some light reading to do to understand how the Thread Director is controlled and how it actually works. |
Hello, I see 100% util on llama.cpp, but a sister implementation based on ggml, llama-rs, is showing 50% as well. Perhaps we can share some findings. I do not have BLAS installed, so |
This is a contradictory statement. |
@cmp-nct , would you mind running @KASR's script https://gist.github.com/KASR/dc3dd7f920f57013486583af7e3725f1#file-benchmark_threads_llama_cpp-py to determine the exact |
@mqy A late-night update from my testing, with no conclusive results yet.
The best performance was at 10 threads when forcing those onto the 8 P-cores. All benchmarks were run with mt-preloading to ensure the entire memory is cached and available at first inference. Raw benchmark timings:
31 threads, default + preloading
16 threads, default + preloading
8 threads, 7 of them bound to performance
10 (modifications) |
@cmp-nct the
My test shows that with 6 physical cores, the |
Can you elaborate what you mean by "returns the intel core id 64 internally but not the extended core flags (p/e)"? That is exactly the hybrid enumeration leaf when HYBRID = 1 (on the extended feature flags leaf) on an Intel x86 processor, where:
P-core == Intel® Core™ "Golden Cove" 0x40 (64)
E-core == Intel Atom® "Gracemont" 0x20 (32)
References:
I am not saying that I'm 100% positive I understood the documentation right, that there isn't any error in my code, or that the documentation is even correct, as I have no P/E processor to verify it myself. But as your reasoning doesn't seem to make sense in light of the documentation, did you try the code I posted before declaring it "is not going to work"? And if there indeed is an error and you know what's up, a correction would be helpful. 👍 Here are the functions from the comment above, explained for anyone interested:
(int) is_thread_on_p_core(void):
0: 31 C9 xor ecx,ecx ; set ECX=0
2: B8 1A 00 00 00 mov eax,0x1a ; set EAX=0x1A (26)
7: 53 push rbx ; push RBX to stack (save RBX initial value)
8: 0F A2 cpuid ; cpuid : EAX=0x1A (hybrid info leaf), ECX=0 (sub-leaf 0)
A: 5B pop rbx ; pop RBX from stack (discard EBX return value, restore initial value)
B: C1 E8 18 shr eax,0x18 ; set EAX >>= 0x18 (shift right 24 bits)
E: 83 F8 40 cmp eax,0x40 ; compare EAX==0x40 (64) : Check for P-Core (Intel Core "Golden Cove")
11: 0F 85 06 00 00 00 jne 0x1d ; if EAX != 0x40, jump to 1D (+6)
17: B8 01 00 00 00 mov eax,0x1 ; set EAX=1
1C: C3 ret ; return EAX = 1
1D: 83 E8 20 sub eax,0x20 ; set EAX -= 0x20 (32) : Check for E-core (Intel Atom "Gracemont")
20: F7 D8 neg eax ; two's complement EAX and set CF = (EAX != 0)
22: 19 C0 sbb eax,eax ; subtract with borrow : EAX = EAX - EAX - CF (i.e. EAX = -CF)
24: C3 ret ; return EAX = 0 / 0xFFFFFFFF (-1)
(int) is_x86_hybrid_cpu(void):
0: 31 C9 xor ecx,ecx ; set ECX=0
2: B8 07 00 00 00 mov eax,0x7 ; set EAX=0x7 (7)
7: 53 push rbx ; push RBX to stack (save RBX initial value)
8: 0F A2 cpuid ; cpuid : EAX=0x7 (extended feature flags leaf), ECX=0 (sub-leaf 0)
A: 5B pop rbx ; pop RBX from stack (discard EBX return value, restore initial value)
B: C1 EA 0F shr edx,0xf ; set EDX >>= 0xF (shift right 15 bits)
E: 83 E2 01 and edx,0x1 ; set EDX &= 1 : Check for HYBRID feature flag
11: 89 D0 mov eax,edx ; set EAX=EDX
13: C3 ret ; return EAX = 0 / 1 |
@anzz1 What should work better is to set the affinity; you can specify all performance cores or just selected ones. I did not have time to dig into the issues from yesterday, but it's a new day. btw, here is the non-ASM approach:
|
Yeah, I thought about this too: how on earth can the OS scheduler know whether to assign a thread to a P-core or an E-core before actually doing the work? There is some information in "How 13th Gen Intel® Core™ Processors Work" on how the "behind-the-scenes magic that maximizes hybrid performance" works, namely the symbiotic relationship between the OS scheduler and the Intel Thread Director; I also remember "advanced AI" being thrown around in some other article too, whatever that means.
It should work, though: 'cpuid' returns information about the logical processor the instruction was executed on. So yeah, it cannot be used to know beforehand whether a thread is going to be executed on a P or E core (that's affinity, which you can force), but it can determine the current state, meaning while the thread is already running, after the work has been sent by the scheduler to a logical processor. So the functions posted are useful for information's sake, not for actually changing what is happening. They can show whether an affinity lock actually worked, and also how the threads are being assigned when the affinity isn't locked.
The reason to use the ASM approach was to support any and all compilers, as I've found it frustrating that there are multiple implementations depending on the compiler, like they couldn't decide on a standard for everyone and had to make up their own. Especially if you need sub-leaves other than 0, there are even more different intrinsics to cover; they aren't required in this particular case, though. The performance benefit of doing it in the least number of instructions possible saves just a few instructions and a dozen bytes of memory at best, so that wasn't really a consideration. As in my single-header (incomplete) cross-compiler feature flag library cpuid.h, I simply found writing the intrinsics from scratch easier than covering and testing every possible compiler (and version). Using simple inline assembly makes sure the code is always the same and doesn't allow for any compiler confusion. The compiler intrinsics will be translated to assembly at compile time anyway, so there isn't really a difference. But a non-ASM approach should be fine too; it just needs some more #ifdef's and work to support. Other than that, they're fundamentally identical. Intrinsics do have the upside of better readability, and of not needing an explanatory comment about what the ASM is doing.
Could you drop the stalling compiled binary as an attachment or put it on Google Drive/Dropbox/whatever, as I would be interested in taking a look. Maybe add a printf("HEY, OVER HERE") just before the call so I can find it quickly. |
I have a similar problem with an arm64 GCE t2a instance. Instead of 8 CPUs, only 4 are used. |
Same here: there are 64 cores on the PC I am using and it uses half of them. Is there a way to use more? |
I found a workaround: recreate the model from the Modelfile with the num_thread parameter. Set it to the real number of CPUs; then all are used and perf is 2x. |
My solution: set n_threads_batch to the system's CPU count minus one (via import multiprocessing). At line 234 of https://llama-cpp-python.readthedocs.io/en/latest/api-reference/ you can read that the default settings use half the CPUs: |
-t N, --threads N number of threads to use during generation (default: 32) |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I haven't dug deep into this yet, but my whole CPU utilization is only at 50%.
I compiled it with the current VS build tools, all defaults, release mode of course.
It might be related to the modern E-cores in Intel CPUs; they pack quite a punch but are weaker than the performance cores.
In the graph it looks like 16 cores (the number of E-cores) are much more utilized, and 8 cores (the number of performance cores) are mostly idle, despite using 24 threads. Increasing threads worsens performance; decreasing threads worsens token output.
I tested the small 7B model in 4-bit and 16-bit.
The only method to get CPU utilization above 50% is to use more threads than the total number of physical cores (like 32).
In that case I see up to 99% CPU utilization, but token performance drops below 2-core performance; some hyperthreading issue, I suppose.
I tried various settings (small/large batch size, context size); none of it influences things much.
The CPU was idle (as seen in screenshot).
Also memory is not full or swapping either.
Here is the command line:
.\Release\main.exe -m .\models\7B\ggml-model-f16.bin -p "Below I count from 1 to 100000000: 1 2 3 4 5 6" -c 1024 -t 24 -n 1024 -b 64