Performance e-core bug(?) - only 50% CPU utilization when using all threads - (Win11, Intel 13900k) #842
Let me try to explain this.
For any call to |
Thanks for the response. This should be relevant for most 12th and 13th+ gen Intel CPUs. Today I seem to get the best performance with 30 cores (yesterday that was not the case): inference goes down to 32 ms/token. |
You could try adding this piece of code #572 (comment). You could also lock the thread affinity for the process to only use P-cores in Task Manager and see if that improves performance, as that discussion thread suggests. While the point of the P/E architecture was that the Intel Thread Director, working in tandem with the OS thread scheduler, should know better which core to use for a given task, in reality it doesn't always know. It also doesn't help that all the Windows versions after 7 are garbage and want to eat your resources in the background. But maybe locking apps to the performance cores would make the background crap threads run only on the E-cores and keep the P-cores free for actual work. Also, if you have an AVX512-capable CPU, you could try enabling it (which also disables the E-cores, afaik) and see if AVX512 increases performance even further compared to AVX2. If there is a performance increase, it could be a good idea to add it as a compile flag. |
Looks like a good start, I'll be looking into that. |
Yeah, obviously the compile flag could be auto-detected in make/cmake like the other flags, and just like them the guard gives the option to leave that functionality out. That snippet can be extended with detection by checking the hybrid flag; then, if either a P or E core exists, it's an Intel 12th/13th gen. The actual model ID doesn't need to be matched against a list of current 12th/13th gen processors at this time, and I'm betting not in the future either (even if the architecture changes, I'd bet the current instruction values will stay reserved as they are now). Edit: here's a slightly updated version checking both the arch and the thread. Untested, as I don't have a 12th/13th gen Intel. Edit 2: fixed a bug.
// 1 = P core
// 0 = E core
// -1 = fail / not P/E arch
inline int is_thread_on_p_core(void) {
static unsigned const char a[] = {0x31,0xC9,0xB8,0x1A,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xE8,0x18,0x83,0xF8,0x40,0x0F,0x85,0x06,0x00,0x00,0x00,0xB8,0x01,0x00,0x00,0x00,0xC3,0x83,0xE8,0x20,0xF7,0xD8,0x19,0xC0,0xC3};
return ((int (*)(void)) (void*)((void*)a))();
}
// 1 = hybrid x86 cpu
inline int is_x86_hybrid_cpu(void) {
static unsigned const char a[] = {0x31,0xC9,0xB8,0x07,0x00,0x00,0x00,0x53,0x0F,0xA2,0x5B,0xC1,0xEA,0x0F,0x83,0xE2,0x01,0x89,0xD0,0xC3};
return ((int (*)(void)) (void*)((void*)a))();
}
// 1 = intel 12th/13th gen cpu
inline int is_intel_p_e_core_arch(void) {
return (is_x86_hybrid_cpu() && (is_thread_on_p_core() >= 0));
} |
Hello @anzz1 |
Here are the best options I found for Intel 13th gen: #229 (reply in thread). I tinkered a bit and this is what seemed best on my i5 13500:
Using a single ggml thread with 5 BLAS threads on the 5 other performance cores proceeds quite well, but of course inference is slow. It would be great to be able to set the ggml/BLAS thread counts differently depending on whether it is initial prompt ingestion or inference. Using more than 6 ggml threads is very slow; I believe the efficiency cores are the bottleneck. I opened a PR to OpenBLAS to improve the issue I had with it on Intel 13th gen: OpenMathLib/OpenBLAS#3970 |
@CyberTimon , the example code I posted is useful for debugging purposes only: it only tells you which type of core it's currently running on; it doesn't actually do anything. Since you're on Windows, for now you can simply try locking the "main.exe" process to the P-cores as discussed earlier. Locking the affinity might not be the most optimal route, though. There is some light reading to do to understand how the Thread Director is controlled and how it actually works. |
Hello, I see 100% util on llama.cpp, but a sister implementation based on ggml, llama-rs, is showing 50% as well. Perhaps we can share some findings. I do not have BLAS installed, so |
This is a contradictory statement. |
@cmp-nct , would you mind running @KASR's script https://gist.github.com/KASR/dc3dd7f920f57013486583af7e3725f1#file-benchmark_threads_llama_cpp-py to determine the exact |
@mqy A late-night update from my testing, with no conclusive results yet.
The best performance was at 10 threads when forcing those onto the 8 P-cores. All benchmarks were run with mt-preloading to ensure the entire memory is cached and available at first inference. Raw benchmark timings:
31 threads, default + preloading
16 threads, default + preloading
8 threads, 7 of them bound to performance
10 (modifications) |
@cmp-nct the
My test shows that with 6 physical cores, the |
Can you elaborate what you mean by "returns the intel core id 64 internally but not the extended core flags (p/e)"? That is exactly the hybrid enumeration leaf when HYBRID = 1 (on the extended feature flags leaf) on an Intel x86 processor, where:
P-core == Intel® Core™ "Golden Cove" 0x40 (64)
E-core == Intel Atom® "Gracemont" 0x20 (32)
References:
I am not saying that I'm 100% positive I understood the documentation right, that there isn't any error in my code, or that the documentation is even correct, as I have no P/E processor to verify it myself. But as your reasoning doesn't seem to make sense in light of the documentation, did you try the code I posted before declaring it "is not going to work"? And if there indeed is an error and you know what's up, a correction would be helpful. 👍 Here are the functions from the comment above, explained for anyone interested:
(int) is_thread_on_p_core(void):
0: 31 C9 xor ecx,ecx ; set ECX=0
2: B8 1A 00 00 00 mov eax,0x1a ; set EAX=0x1A (26)
7: 53 push rbx ; push RBX to stack (save RBX initial value)
8: 0F A2 cpuid ; cpuid : EAX=0x1A (hybrid info leaf), ECX=0 (sub-leaf 0)
A: 5B pop rbx ; pop RBX from stack (discard EBX return value, restore initial value)
B: C1 E8 18 shr eax,0x18 ; set EAX >>= 0x18 (shift right 24 bits)
E: 83 F8 40 cmp eax,0x40 ; compare EAX==0x40 (64) : Check for P-Core (Intel Core "Golden Cove")
11: 0F 85 06 00 00 00 jne 0x1d ; if EAX != 0x40, jump to 1D (+6)
17: B8 01 00 00 00 mov eax,0x1 ; set EAX=1
1C: C3 ret ; return EAX = 1
1D: 83 E8 20 sub eax,0x20 ; set EAX -= 0x20 (32) : Check for E-core (Intel Atom "Gracemont")
20: F7 D8 neg eax ; two's complement EAX and set CF = (EAX != 0)
22: 19 C0 sbb eax,eax ; subtract with borrow : EAX = EAX - EAX - CF (i.e. EAX = -CF)
24: C3 ret ; return EAX = 0 / 0xFFFFFFFF (-1)
(int) is_x86_hybrid_cpu(void):
0: 31 C9 xor ecx,ecx ; set ECX=0
2: B8 07 00 00 00 mov eax,0x7 ; set EAX=0x7 (7)
7: 53 push rbx ; push RBX to stack (save RBX initial value)
8: 0F A2 cpuid ; cpuid : EAX=0x7 (extended feature flags leaf), ECX=0 (sub-leaf 0)
A: 5B pop rbx ; pop RBX from stack (discard EBX return value, restore initial value)
B: C1 EA 0F shr edx,0xf ; set EDX >>= 0xF (shift right 15 bits)
E: 83 E2 01 and edx,0x1 ; set EDX &= 1 : Check for HYBRID feature flag
11: 89 D0 mov eax,edx ; set EAX=EDX
13: C3 ret ; return EAX = 0 / 1 |
@anzz1 What should work better is to set the affinity; you can specify all performance cores or just selected ones. I did not have time to dig into the issues from yesterday, but it's a new day. btw, here is the non-ASM approach:
|
Yeah, I thought about this too: how on earth can the OS scheduler know whether to assign a thread to a P-core or an E-core before actually doing the work? There is some information in "How 13th Gen Intel® Core™ Processors Work" on how the "behind-the-scenes magic that maximizes hybrid performance" works, namely the symbiotic relationship between the OS scheduler and the Intel Thread Director; I also remember "advanced AI" being thrown around in some other article too, whatever that means.
It should work, though: 'cpuid' returns information about the logical processor the instruction was executed on. So yeah, it cannot be used to know beforehand whether a thread is going to be executed on a P or E core (that's affinity, which you can force), but it can determine the current state, meaning while the thread is already running, after the work has been sent by the scheduler to a logical processor. So the functions posted are useful for information's sake, not for actually changing what is happening. They can show whether an affinity lock actually worked, and also how the threads are being assigned when the affinity isn't locked.
The reason to use the ASM approach was to support any and all compilers, as I've found it frustrating that there are multiple implementations depending on the compiler, like they couldn't decide on a standard for everyone and had to make up their own. Especially if you need sub-leaves other than 0, there are even more different intrinsics to cover; they aren't required in this particular case, though. The performance benefit of doing it in the least number of instructions possible saves just a few instructions and a dozen bytes of memory at best, so that wasn't really a consideration. As in my single-header (incomplete) cross-compiler feature flag library cpuid.h, I simply found writing the intrinsics from scratch easier than covering and testing every possible compiler (and version). Using simple inline assembly makes sure the code is always the same and doesn't allow for any compiler confusion. The compiler intrinsics will be translated to assembly at compile time anyway, so there isn't really a difference. But a non-ASM approach should be fine too; it just needs some more #ifdef's and work to support. Other than that, they're fundamentally identical. Intrinsics do have the upside of better readability, and of not needing an explanatory comment about what the ASM is doing.
Could you drop the stalling compiled binary as an attachment or put it on Google Drive/Dropbox/whatever, as I would be interested in taking a look. Maybe add a printf("HEY, OVER HERE") just before the call so I can find it quickly. |
I have a similar problem with an arm64 GCE t2a instance. Instead of 8 CPUs, only 4 are used. |
Same here: there are 64 cores on the PC I am using and it uses half of them. Is there a way to use more? |
I found a workaround: recreate the model from the Modelfile with the num_thread parameter. Set it to the real number of CPUs; then all are used and perf is 2x. |
My solution: set n_threads_batch to the system's CPU count minus one (via import multiprocessing). At line 234 of https://llama-cpp-python.readthedocs.io/en/latest/api-reference/ you can read that the default settings use half the CPUs: |
-t N, --threads N number of threads to use during generation (default: 32) |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I haven't dug deep into this yet, but my whole CPU utilization is only at 50%.
I compiled it with the current VS build tools, all defaults, release mode of course.
It might be related to the modern E-cores in Intel CPUs; they pack quite a punch but are weaker than the performance cores.
In the graph it looks like 16 cores (the number of E-cores) are much more utilized, and 8 cores (the number of performance cores) are mostly idle, despite using 24 threads. Increasing threads worsens performance; decreasing threads worsens token output.
I tested the small 7B model in 4-bit and 16-bit.
The only method to get CPU utilization above 50% is to use more threads than the total number of physical cores (like 32).
In that case I see up to 99% CPU utilization, but token performance drops below 2-core performance; some hyperthreading issue, I suppose.
I tried various settings (small/large batch size, context size); none of it influences things much.
The CPU was idle (as seen in screenshot).
Also memory is not full or swapping either.
Here is the command line:
.\Release\main.exe -m .\models\7B\ggml-model-f16.bin -p "Below I count from 1 to 100000000: 1 2 3 4 5 6" -c 1024 -t 24 -n 1024 -b 64