3x better performance when using performance cores only on Intel 12th-gen CPU #572
Replies: 11 comments 11 replies
-
Do you get better performance if you also use -t 6 without numactl? By default llama.cpp uses all threads, which is rarely optimal.
-
Numbers are also better without numactl; what I observe is that all worker threads stay on cores 0-5. The performance is similar to using numactl (1422.12 ms per run with the 65B model).
-
If you disable the E-cores and enable AVX512, how's the performance then?

To be honest, it's kind of strange to have to manually lock the app to performance cores. The whole point of Intel's P/E architecture and the "Thread Director" is that it should automatically pick the proper mix of logical processors for any given workload. Then again, the Thread Director works together with the OS thread scheduler, so it's possible the OS is at fault for poor scheduling and the architecture itself is not to blame.

Anyway, this is good information. For the people testing this, please post your OS type and distro/kernel versions too. It could also be worth adding per-core information to future debug/trace/benchmark output. A thread can detect whether it's running on a P or E core with the cpuid instruction, so this could easily be added in a platform-agnostic manner: https://www.intel.com/content/www/us/en/developer/articles/guide/12th-gen-intel-core-processor-gamedev-guide.html

If I read the doc correctly, this should tell whether a thread is running on a P/E core:

// 1 = P core
// 0 = E core
// -1 = fail / not P/E arch
inline int is_thread_on_p_core(void) {
    static unsigned const char a[] = {0x31,0xC9,0xB8,0x1A,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xE8,0x18,0x83,0xF8,0x40,0x0F,0x85,0x06,0x00,0x00,0x00,0xB8,0x01,0x00,0x00,0x00,0xC3,0x83,0xE8,0x20,0xF7,0xD8,0x19,0xC0,0xC3};
    return ((int (*)(void)) (void*)((void*)a))();
}

// 1 = hybrid x86 cpu
inline int is_x86_hybrid_cpu(void) {
    static unsigned const char a[] = {0x31,0xC9,0xB8,0x07,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xEA,0x0F,0x83,0xE2,0x01,0x89,0xD0,0xC3};
    return ((int (*)(void)) (void*)((void*)a))();
}

// 1 = intel 12th/13th gen cpu
inline int is_intel_p_e_core_arch(void) {
    return (is_x86_hybrid_cpu() && (is_thread_on_p_core() >= 0));
}
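For reference, the same checks can be written without hand-assembled byte arrays (which also need an executable data page to actually run). Below is a sketch using GCC/Clang's <cpuid.h> intrinsics, based on my reading of the Intel doc above: CPUID.07H:EDX[15] is the hybrid flag, and CPUID.1AH:EAX[31:24] is the core type (0x40 = Intel Core / P, 0x20 = Intel Atom / E). The `_portable` function names are made up for illustration:

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* 1 = hybrid (P/E) x86 CPU, 0 = not hybrid, -1 = cannot query */
static int is_x86_hybrid_cpu_portable(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(0x07, 0, &eax, &ebx, &ecx, &edx))
        return -1;
    return (edx >> 15) & 1;               /* CPUID.07H:EDX[15] = hybrid flag */
}

/* 1 = calling thread is on a P core, 0 = E core, -1 = unknown/not hybrid */
static int is_thread_on_p_core_portable(void) {
    unsigned eax, ebx, ecx, edx;
    if (is_x86_hybrid_cpu_portable() != 1)
        return -1;
    if (!__get_cpuid_count(0x1A, 0, &eax, &ebx, &ecx, &edx))
        return -1;
    unsigned core_type = (eax >> 24) & 0xFF;  /* CPUID.1AH:EAX[31:24] */
    if (core_type == 0x40) return 1;          /* Intel Core (P) */
    if (core_type == 0x20) return 0;          /* Intel Atom (E) */
    return -1;
}
#else
/* Non-x86 builds: report "not a P/E architecture". */
static int is_x86_hybrid_cpu_portable(void)   { return -1; }
static int is_thread_on_p_core_portable(void) { return -1; }
#endif
```

Note that leaf 0x1A is per-thread, so the answer depends on which core the OS has scheduled the calling thread on at that instant.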
-
I observed the same thing on Intel 13th gen; here are my findings and the best parameters I found: #229 (reply in thread). Make sure to use a recent enough kernel: with 6.2.6 I had no need for numactl to spread things correctly (I had to use cpuset on 5.16). TL;DR for now:
My guess is that the efficiency cores are the bottleneck, and somehow we wait for them to finish their work (which takes 2-3 times longer than on a performance core) instead of handing their remaining work back to a performance core. This is probably because the threads spinlock: threads that finish their work early are still stuck at 100%, preventing the Linux scheduler from moving a thread that still has work onto the performance cores. I see two potential fixes: implement a work-stealing algorithm (but I fear it may hurt performance), or find an alternative to busy-waiting when threads have no work left, so the Linux scheduler can move the threads whose work started on efficiency cores back onto performance cores. CC: @ggerganov, as this is related to what you described here: #578 (comment)
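The "alternative to busy waiting" idea can be sketched with a sleeping barrier: workers that finish their slice block on a condition variable instead of spinning, so the OS is free to reschedule the still-busy threads onto P-cores. This is an illustrative sketch, not llama.cpp's actual thread pool; `sleep_barrier` and `demo_run` are hypothetical names:

```c
#include <pthread.h>
#include <stdatomic.h>

/* A barrier where early finishers sleep instead of spinning at 100%. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int remaining;
} sleep_barrier;

static void sleep_barrier_init(sleep_barrier *b, int n) {
    pthread_mutex_init(&b->mu, NULL);
    pthread_cond_init(&b->cv, NULL);
    b->remaining = n;
}

/* Each worker calls this when its slice is done. */
static void sleep_barrier_wait(sleep_barrier *b) {
    pthread_mutex_lock(&b->mu);
    if (--b->remaining == 0) {
        pthread_cond_broadcast(&b->cv);        /* last worker wakes everyone */
    } else {
        while (b->remaining > 0)
            pthread_cond_wait(&b->cv, &b->mu); /* sleep, don't spin */
    }
    pthread_mutex_unlock(&b->mu);
}

static sleep_barrier g_b;
static atomic_int g_done;

static void *worker(void *arg) {
    (void)arg;
    atomic_fetch_add(&g_done, 1);  /* stand-in for a compute slice */
    sleep_barrier_wait(&g_b);      /* then park instead of spinning */
    return NULL;
}

/* Run n workers through one barrier round; returns slices completed. */
int demo_run(int n) {
    pthread_t t[16];
    atomic_store(&g_done, 0);
    sleep_barrier_init(&g_b, n);
    for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
    return atomic_load(&g_done);
}
```

The trade-off the comment above alludes to is real: a condvar wake-up costs microseconds where a spin-wait costs nanoseconds, which is why ggml spins in the first place.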
-
Does anyone in this thread who has a processor with performance/efficiency cores have the ability to test #1278, a patch for the Windows build that detects P/E cores? I was under the impression someone else was working on the Linux implementation, but it seems not, so I'll work on that one after this one is done.
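For the Linux side, one possible approach (an assumption on my part: it requires a kernel new enough to register the hybrid PMU devices, roughly 5.16+) is to read the CPU lists the kernel publishes under /sys/devices/cpu_core/cpus and /sys/devices/cpu_atom/cpus. `read_core_list` is a hypothetical helper:

```c
#include <stdio.h>
#include <string.h>

/* On hybrid Intel parts, recent kernels expose the P/E split in sysfs:
 *   /sys/devices/cpu_core/cpus  e.g. "0-11"  (P cores, incl. HT siblings)
 *   /sys/devices/cpu_atom/cpus  e.g. "12-19" (E cores)
 * kind is "cpu_core" or "cpu_atom". Returns 1 and fills out with the CPU
 * list on success, 0 if the node is absent (old kernel or not hybrid). */
int read_core_list(const char *kind, char *out, size_t outsz) {
    char path[128];
    snprintf(path, sizeof path, "/sys/devices/%s/cpus", kind);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    int ok = fgets(out, (int)outsz, f) != NULL;
    fclose(f);
    if (ok) out[strcspn(out, "\n")] = '\0';  /* trim trailing newline */
    return ok;
}
```

On a non-hybrid machine both reads simply fail, which doubles as the "is this a P/E CPU at all" check.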
-
A better solution may be to find out why the E-cores hurt performance. For example, some of the
-
Was this ever fixed?
-
I see similar performance losses when using E-cores on my 13600KF (6 P-cores and 8 E-cores, 20 threads in total). I compiled llama with OpenBLAS (edit: under Linux) and benchmarked various E-core/P-core and hyper-threading combinations by tweaking the BIOS, and observed no performance gain whatsoever when E-cores are involved.

The maximum TDP of my 13600KF is 180 W, which seems to be barely enough for the 6 P-cores alone at 100% load (I measured the power consumption with a watt-meter). Also, with some P-core/E-core combinations I observed multiple pauses during inference, as if the P-cores were waiting for the E-cores to finish.

In retrospect, I regret picking this expensive and power-hungry "14 P/E-core" Intel CPU instead of a more symmetrical multi-core CPU from AMD. I even get better performance on my AMD Ryzen 5 5500U laptop! (6 cores, 12 threads, 25 W max TDP)
-
Before we had CUDA offloading I was investing quite a bit of time into the scheduler and experimented with core affinity. So using performance cores makes sense, and so does manually binding threads to them. But if anything goes wrong in that logic, the result is a massive performance hit.
-
Something similar happens with Ryzen processors that have 3D V-Cache cores. Luckily, I can change the priority under UEFI so that the system prefers the C0 block.
-
Ever since I got my 12th-gen i7-12700F I had the E-cores disabled; today I enabled them for testing purposes. After the model is loaded to VRAM, eval time is 2-3 times slower, and looking at Task Manager, the E-cores are heavily used while the P-cores do almost nothing. Forcing the process affinity to P-cores does NOT fully solve the issue: the P-cores are heavily used this time, but the E-cores are still at 60-70% load too. I'm not willing to test further or try different configurations, and I can't tell whether the exact issue is llama, Windows, or 12th gen itself. Reverting to my previous UEFI configuration solves it all, i.e. disabling all E-cores.
-
I found that by restricting threads and cores to performance cores only on an Intel 12th-gen processor, performance is much better than the default.
My processor is an Intel Core i7-12700H, which has 6 performance cores and 8 efficiency cores. When using numactl to bind threads to the performance cores only, performance is better than using all the cores. I tried the 7B and 65B models with q4_0 quantization; both configurations show a performance improvement.
7B model
Use all cores:
./main -m ./models/7B/ggml-model-q4_0.bin -n 32 -p "Hiking is"
llama_print_timings: eval time = 14941.65 ms / 31 runs ( 481.99 ms per run)
Use performance cores only:
numactl -C 0-5 ./main -m ./models/7B/ggml-model-q4_0.bin -n 32 -p "Hiking is" -t 6
llama_print_timings: eval time = 6256.60 ms / 31 runs ( 201.83 ms per run)
A 2.4x performance improvement for the 7B 4-bit model.
65B model
./main -m ./models/65B/ggml-model-q4_0.bin -n 32 -p "Hiking is"
llama_print_timings: eval time = 126128.34 ms / 31 runs ( 4068.66 ms per run)
numactl -C 0-5 ./main -m ./models/65B/ggml-model-q4_0.bin -n 32 -p "Hiking is" -t 6
llama_print_timings: eval time = 42069.76 ms / 31 runs ( 1357.09 ms per run)
A 3x performance improvement for the 65B 4-bit model.
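If numactl isn't installed, taskset from util-linux can do the same CPU pinning (numactl additionally controls NUMA memory policy, which shouldn't matter on a single-socket laptop chip). A sketch; the ./main line is commented out since model paths vary:

```shell
# Show the affinity a pinned child inherits (here: logical CPU 0 only).
taskset -c 0 grep Cpus_allowed_list /proc/self/status

# The equivalent of the numactl run above would be:
#   taskset -c 0-5 ./main -m ./models/7B/ggml-model-q4_0.bin -n 32 -p "Hiking is" -t 6
```

Note that with hyper-threading enabled, logical CPUs 0-5 may map to only three physical P-cores; check `lscpu --extended` to see the logical-to-physical mapping on your machine.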