3x better performance when using performance cores only on Intel 12th-gen CPU #572
Replies: 11 comments 11 replies
-
Do you get better performance if you also use -t 6 without numactl? By default llama.cpp uses all threads, which is rarely optimal.
-
Numbers are also better without numactl; what I observe is that all worker threads stay on cores 0-5. The performance is similar to using numactl (1422.12 ms per run with the 65B model).
-
If you disable the E-cores and enable AVX512, how's the performance then?

To be honest, it's kind of strange to have to manually lock the app to performance cores. The whole point of Intel's P/E architecture and the "Thread Director" is that it should automatically pick the proper mix of logical processors for any given workload. Then again, the Thread Director works together with the OS thread scheduler, so it's possible the OS is at fault for poor scheduling and the architecture itself is not to blame.

Anyway, this is good information. For the people testing this, please post your OS type and distro/kernel versions too. It could also be worth adding per-core information to future debug/trace/benchmark output. A thread can detect whether it's running on a P or E core with the cpuid instruction, so this could easily be added in a platform-agnostic manner: https://www.intel.com/content/www/us/en/developer/articles/guide/12th-gen-intel-core-processor-gamedev-guide.html

If I read the doc correctly, this should tell whether a thread is running on a P/E core:

// 1 = P core
// 0 = E core
// -1 = fail / not P/E arch
inline int is_thread_on_p_core(void) {
    static unsigned const char a[] = {0x31,0xC9,0xB8,0x1A,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xE8,0x18,0x83,0xF8,0x40,0x0F,0x85,0x06,0x00,0x00,0x00,0xB8,0x01,0x00,0x00,0x00,0xC3,0x83,0xE8,0x20,0xF7,0xD8,0x19,0xC0,0xC3};
    return ((int (*)(void)) (void*)((void*)a))();
}

// 1 = hybrid x86 cpu
inline int is_x86_hybrid_cpu(void) {
    static unsigned const char a[] = {0x31,0xC9,0xB8,0x07,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xEA,0x0F,0x83,0xE2,0x01,0x89,0xD0,0xC3};
    return ((int (*)(void)) (void*)((void*)a))();
}

// 1 = intel 12th/13th gen cpu
inline int is_intel_p_e_core_arch(void) {
    return (is_x86_hybrid_cpu() && (is_thread_on_p_core() >= 0));
}
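For reference, the same checks can be written without hand-assembled byte arrays (which also need an executable data page to actually run). Below is a sketch using GCC/Clang's <cpuid.h> intrinsics, based on my reading of the Intel doc above: CPUID.07H:EDX[15] is the hybrid flag, and CPUID.1AH:EAX[31:24] is the core type (0x40 = Intel Core / P, 0x20 = Intel Atom / E). The `_portable` function names are made up for illustration:

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* 1 = hybrid (P/E) x86 CPU, 0 = not hybrid, -1 = cannot query */
static int is_x86_hybrid_cpu_portable(void) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid_count(0x07, 0, &eax, &ebx, &ecx, &edx))
        return -1;
    return (edx >> 15) & 1;               /* CPUID.07H:EDX[15] = hybrid flag */
}

/* 1 = calling thread is on a P core, 0 = E core, -1 = unknown/not hybrid */
static int is_thread_on_p_core_portable(void) {
    unsigned eax, ebx, ecx, edx;
    if (is_x86_hybrid_cpu_portable() != 1)
        return -1;
    if (!__get_cpuid_count(0x1A, 0, &eax, &ebx, &ecx, &edx))
        return -1;
    unsigned core_type = (eax >> 24) & 0xFF;  /* CPUID.1AH:EAX[31:24] */
    if (core_type == 0x40) return 1;          /* Intel Core (P) */
    if (core_type == 0x20) return 0;          /* Intel Atom (E) */
    return -1;
}
#else
/* Non-x86 builds: report "not a P/E architecture". */
static int is_x86_hybrid_cpu_portable(void)   { return -1; }
static int is_thread_on_p_core_portable(void) { return -1; }
#endif
```

Note that leaf 0x1A is per-thread, so the answer depends on which core the OS has scheduled the calling thread on at that instant.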
-
I observed the same thing on Intel 13th gen; here are my findings and the best parameters I found: #229 (reply in thread). Make sure to use a recent enough kernel: with 6.2.6 I had no need for numactl to spread things correctly (I had to use cpuset on 5.16). TL;DR for now:
My guess is that the efficiency cores are the bottleneck, and somehow we wait for them to finish their work (which takes 2-3 times longer than on a performance core) instead of handing their remaining work back to a performance core. This is probably because the threads spinlock: threads that finish their work early are still stuck at 100%, preventing the Linux scheduler from moving a thread that still has work onto the performance cores. I see two potential fixes: implement a work-stealing algorithm (but I fear it may hurt performance), or find an alternative to busy-waiting when threads have no work left, so the Linux scheduler can move the threads whose work started on efficiency cores back onto performance cores. CC: @ggerganov, as this is related to what you described here: #578 (comment)
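The "alternative to busy waiting" idea can be sketched with a sleeping barrier: workers that finish their slice block on a condition variable instead of spinning, so the OS is free to reschedule the still-busy threads onto P-cores. This is an illustrative sketch, not llama.cpp's actual thread pool; `sleep_barrier` and `demo_run` are hypothetical names:

```c
#include <pthread.h>
#include <stdatomic.h>

/* A barrier where early finishers sleep instead of spinning at 100%. */
typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    int remaining;
} sleep_barrier;

static void sleep_barrier_init(sleep_barrier *b, int n) {
    pthread_mutex_init(&b->mu, NULL);
    pthread_cond_init(&b->cv, NULL);
    b->remaining = n;
}

/* Each worker calls this when its slice is done. */
static void sleep_barrier_wait(sleep_barrier *b) {
    pthread_mutex_lock(&b->mu);
    if (--b->remaining == 0) {
        pthread_cond_broadcast(&b->cv);        /* last worker wakes everyone */
    } else {
        while (b->remaining > 0)
            pthread_cond_wait(&b->cv, &b->mu); /* sleep, don't spin */
    }
    pthread_mutex_unlock(&b->mu);
}

static sleep_barrier g_b;
static atomic_int g_done;

static void *worker(void *arg) {
    (void)arg;
    atomic_fetch_add(&g_done, 1);  /* stand-in for a compute slice */
    sleep_barrier_wait(&g_b);      /* then park instead of spinning */
    return NULL;
}

/* Run n workers through one barrier round; returns slices completed. */
int demo_run(int n) {
    pthread_t t[16];
    atomic_store(&g_done, 0);
    sleep_barrier_init(&g_b, n);
    for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++) pthread_join(t[i], NULL);
    return atomic_load(&g_done);
}
```

The trade-off the comment above alludes to is real: a condvar wake-up costs microseconds where a spin-wait costs nanoseconds, which is why ggml spins in the first place.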
-
Does anyone in this thread who has a processor with performance/efficiency cores have the ability to test #1278, a patch for the Windows build that detects P/E cores? I was under the impression someone else was working on the Linux implementation, but it seems not, so I'll work on that one after this one is done.
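For the Linux side, one possible approach (an assumption on my part: it requires a kernel new enough to register the hybrid PMU devices, roughly 5.16+) is to read the CPU lists the kernel publishes under /sys/devices/cpu_core/cpus and /sys/devices/cpu_atom/cpus. `read_core_list` is a hypothetical helper:

```c
#include <stdio.h>
#include <string.h>

/* On hybrid Intel parts, recent kernels expose the P/E split in sysfs:
 *   /sys/devices/cpu_core/cpus  e.g. "0-11"  (P cores, incl. HT siblings)
 *   /sys/devices/cpu_atom/cpus  e.g. "12-19" (E cores)
 * kind is "cpu_core" or "cpu_atom". Returns 1 and fills out with the CPU
 * list on success, 0 if the node is absent (old kernel or not hybrid). */
int read_core_list(const char *kind, char *out, size_t outsz) {
    char path[128];
    snprintf(path, sizeof path, "/sys/devices/%s/cpus", kind);
    FILE *f = fopen(path, "r");
    if (!f) return 0;
    int ok = fgets(out, (int)outsz, f) != NULL;
    fclose(f);
    if (ok) out[strcspn(out, "\n")] = '\0';  /* trim trailing newline */
    return ok;
}
```

On a non-hybrid machine both reads simply fail, which doubles as the "is this a P/E CPU at all" check.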
-
A better solution may be to find out why the E-cores hurt performance. For example, some of the
-
Was this ever fixed?
-
I see similar performance losses when using E-cores on my 13600KF (6 P-cores and 8 E-cores, 20 threads in total). I compiled llama with OpenBLAS (edit: under Linux) and benchmarked various E-core/P-core and hyper-threading combinations by tweaking the BIOS, and observed no performance gain whatsoever when E-cores are involved.

The maximum TDP of my 13600KF is 180 W, which seems to be barely enough for the 6 P-cores alone at 100% load (I measured the power consumption with a watt-meter). Also, with some P-core/E-core combinations I observed multiple pauses during inference, as if the P-cores were waiting for the E-cores to finish.

In retrospect, I regret picking this expensive and power-hungry "14 P/E-core" Intel CPU instead of a more symmetrical multi-core CPU from AMD. I even get better performance on my AMD Ryzen 5 5500U laptop! (6 cores, 12 threads, 25 W max TDP)
-
Before we had CUDA offloading I was investing quite a bit of time into the scheduler and experimented with core affinity. So using performance cores makes sense, and so does manually binding threads to them. But if anything goes wrong in that logic, the result is a massive performance hit.
-
Something similar happens with Ryzen processors that have 3D V-Cache cores. Luckily, I can change the priority under UEFI so that the system prefers the C0 block.
-
Ever since I got my 12th-gen i7-12700F I had the E-cores disabled; today I enabled them for testing purposes. After the model is loaded to VRAM, eval time is 2-3 times slower, and looking at Task Manager, the E-cores are heavily used while the P-cores do almost nothing. Forcing the process affinity to P-cores does NOT fully solve the issue: the P-cores are heavily used this time, but the E-cores are still at 60-70% load too. I'm not willing to test further or try different configurations, and I can't tell whether the exact issue is llama, Windows, or 12th gen itself. Reverting to my previous UEFI configuration solves it all, i.e. disabling all E-cores.
-
I found that by restricting threads and cores to performance cores only on an Intel 12th-gen processor, performance is much better than the default.
My processor is an Intel Core i7-12700H, which has 6 performance cores and 8 efficiency cores. When using numactl to bind threads to the performance cores only, performance is better than using all the cores. I tried the 7B and 65B models with q4_0 quantization; both configurations show a performance improvement.
7B model
Use all cores:
./main -m ./models/7B/ggml-model-q4_0.bin -n 32 -p "Hiking is"
llama_print_timings: eval time = 14941.65 ms / 31 runs ( 481.99 ms per run)
Use performance cores only:
numactl -C 0-5 ./main -m ./models/7B/ggml-model-q4_0.bin -n 32 -p "Hiking is" -t 6
llama_print_timings: eval time = 6256.60 ms / 31 runs ( 201.83 ms per run)
A 2.4x performance improvement for the 7B 4-bit model.
65B model
./main -m ./models/65B/ggml-model-q4_0.bin -n 32 -p "Hiking is"
llama_print_timings: eval time = 126128.34 ms / 31 runs ( 4068.66 ms per run)
numactl -C 0-5 ./main -m ./models/65B/ggml-model-q4_0.bin -n 32 -p "Hiking is" -t 6
llama_print_timings: eval time = 42069.76 ms / 31 runs ( 1357.09 ms per run)
A 3x performance improvement for the 65B 4-bit model.
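If numactl isn't installed, taskset from util-linux can do the same CPU pinning (numactl additionally controls NUMA memory policy, which shouldn't matter on a single-socket laptop chip). A sketch; the ./main line is commented out since model paths vary:

```shell
# Show the affinity a pinned child inherits (here: logical CPU 0 only).
taskset -c 0 grep Cpus_allowed_list /proc/self/status

# The equivalent of the numactl run above would be:
#   taskset -c 0-5 ./main -m ./models/7B/ggml-model-q4_0.bin -n 32 -p "Hiking is" -t 6
```

Note that with hyper-threading enabled, logical CPUs 0-5 may map to only three physical P-cores; check `lscpu --extended` to see the logical-to-physical mapping on your machine.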