threads: changing to a mutex/condvar based thread pool. #710
Conversation
[Benchmark / Activity Monitor screenshots, one per configuration:]
- thread pool, Mac, 7B, threads 6, n_predict 64
- master, Mac, 7B, threads 6, n_predict 64
- thread pool, Windows
- master, Windows
- thread pool, Mac, 65B, threads 8, n_predict 64
- master, Mac, 65B, threads 8, n_predict 64
Converting to draft since even the author of the PR does not think this should be merged as is:
I was thinking about how to easily show the CPU time spent spin-locking vs. being blocked in the thread pool. This change https://github.com/bogdad/llama.cpp/pull/7/files extracts the spinning portions of ggml_graph_compute and ggml_graph_compute_thread on master into separate functions that can be seen in the sampling profiler; hopefully they don't change master too much. An Instruments run with the large prompt (dan) on the 7B model, I think, shows that spinning amounts to about 20% with the large model on large prompts, or about 40% with the small model on large prompts, unless I am misreading the profiler output.
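For illustration, a minimal sketch of the kind of extraction described above: the busy wait is pulled into a named, non-inlined function so a sampling profiler such as Instruments attributes the spin time to it rather than to its caller. The function and field names here are hypothetical, not the ones used in the linked branch.

```c
#include <stdatomic.h>

// Deliberately not inlined so samples taken during the busy wait land in this
// frame and show up as a separate row in the profiler's call tree.
__attribute__((noinline))
static void ggml_graph_compute_thread_spin_wait(atomic_int * flag) {
    // worker spins here until the main thread publishes the next node to work on
    while (atomic_load(flag) == 0) {
        // intentionally empty: this is the busy wait we want to measure
    }
}
```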
I think this is the right direction. Spinning does take quite a bit of CPU; in my perf run it is about 30% on Windows 10. I suggest you look at spinning for a few cycles before entering the lock if the main thread is doing the preparation (not sure if this optimization is already done in the thread pool). You may also want to consider setting CPU affinity.
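A minimal sketch of that spin-then-block idea, assuming a pthread-based pool; SPIN_ITERS, the struct, and the field names are illustrative only and not part of this PR.

```c
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_ITERS 4096  // illustrative; would need tuning per platform

struct task_queue {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    atomic_bool     has_work;  // set (and cond signalled) by the producer while holding the mutex
};

static void wait_for_work(struct task_queue * q) {
    // spin briefly first: if the main thread publishes work almost immediately,
    // the worker avoids the cost of going to sleep and being woken up
    for (int i = 0; i < SPIN_ITERS; i++) {
        if (atomic_load(&q->has_work)) {
            return;
        }
    }
    // otherwise block on the condition variable and stop burning CPU
    pthread_mutex_lock(&q->mutex);
    while (!atomic_load(&q->has_work)) {
        pthread_cond_wait(&q->cond, &q->mutex);
    }
    pthread_mutex_unlock(&q->mutex);
}
```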
Very cool! Agreed, strange indeed; I would expect master to be faster than the thread pool, hm. With this thread pool change, the question for me is how to do it properly so as not to bring the thread pool dependency into ggml, and maybe keep it in the user code, i.e. llama.cpp. One way to do this is to extract some kind of "schedule work" interface that ggml could use, but I have not had a chance to work on this further yet.
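One hedged sketch of what such a "schedule work" interface could look like: ggml would only see a function pointer plus an opaque context, and the calling code (llama.cpp) would decide whether that maps onto a blocking thread pool, spinning threads, or something else. These names are hypothetical and are not part of the ggml API.

```c
// called with ith in [0, nth) so each worker knows which slice of the node to compute
typedef void (*ggml_task_fn)(void * task_data, int ith, int nth);

struct ggml_compute_backend {
    // run fn(task_data, i, n_tasks) for every i in [0, n_tasks) and return once all are done
    void (*parallel_for)(void * backend_ctx, ggml_task_fn fn, void * task_data, int n_tasks);
    void * backend_ctx;  // opaque state owned by the user code, e.g. its thread pool
};
```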
Check this branch: https://github.com/howard0su/llama.cpp/tree/tp_schedule. Overall, I believe we can make the thread pool faster, but the current thread pool implementation is suboptimal. We may need to look at a lock-free queue to replace it, but the first thing is implementing a better scheduling algorithm. Whether that belongs in ggml or llama.cpp is debatable; I would prefer having a better graph scheduler to replace the current one.
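As a rough illustration of the lock-free direction mentioned above (not the scheme in the linked branch): workers can claim chunks of a node with a single atomic counter instead of taking a lock per task. All names here are illustrative.

```c
#include <stdatomic.h>

struct work_state {
    atomic_int next_chunk;  // index of the next unclaimed chunk of the current node
    int        n_chunks;    // total number of chunks the node was split into
};

// each worker calls this in a loop, computing the returned chunk until it gets -1
static int claim_chunk(struct work_state * ws) {
    int chunk = atomic_fetch_add(&ws->next_chunk, 1);
    return chunk < ws->n_chunks ? chunk : -1;
}
```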
Can we get a short TLDR?
Can one of the admins verify this patch?
Oh, I missed that. The TLDR: this is just an exploration of how llama.cpp would behave if there were no busy waiting. It was not supposed to be merged, because it uses an external thread pool implementation. Since then, I think work scheduling in llama.cpp has moved on a lot, so I will close this. Feel free to use the patch, but it has very likely diverged a lot from main.
This is an attempt to change the threading in ggml from busy-wait spin locking to a mutex/condvar based thread pool.
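For context, a minimal sketch of the mutex/condvar pattern this PR explores; the PR itself reuses the C-Thread-Pool library rather than this code, and the names and single-slot job queue here are illustrative only.

```c
#include <pthread.h>
#include <stdbool.h>

struct pool {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    void          (*job)(void *);  // pending job, or NULL if none
    void           *arg;
    bool            stop;
};

static void * worker(void * p_) {
    struct pool * p = p_;
    for (;;) {
        pthread_mutex_lock(&p->mutex);
        // sleep instead of spinning: the thread consumes no CPU while waiting
        while (p->job == NULL && !p->stop) {
            pthread_cond_wait(&p->cond, &p->mutex);
        }
        if (p->stop) {
            pthread_mutex_unlock(&p->mutex);
            return NULL;
        }
        void (*job)(void *) = p->job;
        void * arg = p->arg;
        p->job = NULL;
        pthread_mutex_unlock(&p->mutex);
        job(arg);  // run the work item outside the lock
    }
}
```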
I don't think this should be merged: it adds a dependency on a copied https://github.com/Pithikos/C-Thread-Pool (also hacked to work on Windows) just to see what the effect on performance and energy usage would be. But maybe it will inspire further work on this.
The motivation is energy consumption: this PR reduces CPU usage from 700% to 400% in the 8-thread run on Mac, while making the time per token eval slightly worse. There may be similar CPU savings on other platforms.
I timed a few runs (below) and will also add Activity Monitor screenshots of thread pool vs. master in the comments.
I can't properly explain the CPU savings, though; the main thread also seems to take part in the computation, hm.