threads: changing to a mutex/condvar based thread pool. #710
Conversation
[Benchmark / Activity Monitor screenshots, one per configuration:]
- thread pool, Mac, 7B, threads 6, n_predict 64
- master, Mac, 7B, threads 6, n_predict 64
- thread pool, Windows
- master, Windows
- thread pool, Mac, 65B, threads 8, n_predict 64
- master, Mac, 65B, threads 8, n_predict 64
Converting to draft since even the author of the PR does not think this should be merged as is:
I was thinking about how to easily show the CPU time spent spin-locking vs. being blocked in the thread pool. This change https://github.com/bogdad/llama.cpp/pull/7/files extracts the spinning portions of ggml_graph_compute and ggml_graph_compute_thread on master into separate functions that can be seen in the sampling profiler; hopefully they don't change master too much. An Instruments run with the large prompt (dan) on the 7B model, I think, shows that spinning amounts to about 20% with the large model on large prompts, or about 40% with the small model on large prompts, unless I am misreading the profiler output.
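For illustration, a minimal sketch of the kind of extraction described above: the busy wait is pulled into a named, non-inlined function so a sampling profiler such as Instruments attributes the spin time to it rather than to its caller. The function and field names here are hypothetical, not the ones used in the linked branch.

```c
#include <stdatomic.h>

// Deliberately not inlined so samples taken during the busy wait land in this
// frame and show up as a separate row in the profiler's call tree.
__attribute__((noinline))
static void ggml_graph_compute_thread_spin_wait(atomic_int * flag) {
    // worker spins here until the main thread publishes the next node to work on
    while (atomic_load(flag) == 0) {
        // intentionally empty: this is the busy wait we want to measure
    }
}
```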
I think this is the right direction. Spinning does take quite a bit of CPU; in my perf run it is about 30% on Windows 10. I suggest you look at spinning for a few cycles before entering the lock if the main thread is doing the preparation (not sure if this optimization is already done in the thread pool). You may also want to consider setting CPU affinity.
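A minimal sketch of that spin-then-block idea, assuming a pthread-based pool; SPIN_ITERS, the struct, and the field names are illustrative only and not part of this PR.

```c
#include <pthread.h>
#include <stdatomic.h>

#define SPIN_ITERS 4096  // illustrative; would need tuning per platform

struct task_queue {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    atomic_bool     has_work;  // set (and cond signalled) by the producer while holding the mutex
};

static void wait_for_work(struct task_queue * q) {
    // spin briefly first: if the main thread publishes work almost immediately,
    // the worker avoids the cost of going to sleep and being woken up
    for (int i = 0; i < SPIN_ITERS; i++) {
        if (atomic_load(&q->has_work)) {
            return;
        }
    }
    // otherwise block on the condition variable and stop burning CPU
    pthread_mutex_lock(&q->mutex);
    while (!atomic_load(&q->has_work)) {
        pthread_cond_wait(&q->cond, &q->mutex);
    }
    pthread_mutex_unlock(&q->mutex);
}
```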
Very cool! Agreed, strange indeed; I would expect master to be faster than the thread pool, hm. With this thread pool change, the question for me is how to do it properly so as not to bring the thread pool dependency into ggml, and maybe keep it in the user code, i.e. llama.cpp. One way to do this is to extract some kind of "schedule work" interface that ggml could use, but I have not had a chance to work on this further yet.
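One hedged sketch of what such a "schedule work" interface could look like: ggml would only see a function pointer plus an opaque context, and the calling code (llama.cpp) would decide whether that maps onto a blocking thread pool, spinning threads, or something else. These names are hypothetical and are not part of the ggml API.

```c
// called with ith in [0, nth) so each worker knows which slice of the node to compute
typedef void (*ggml_task_fn)(void * task_data, int ith, int nth);

struct ggml_compute_backend {
    // run fn(task_data, i, n_tasks) for every i in [0, n_tasks) and return once all are done
    void (*parallel_for)(void * backend_ctx, ggml_task_fn fn, void * task_data, int n_tasks);
    void * backend_ctx;  // opaque state owned by the user code, e.g. its thread pool
};
```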
Check this branch: https://github.com/howard0su/llama.cpp/tree/tp_schedule. Overall, I believe we can make the thread pool faster, but the current thread pool implementation is suboptimal. We may need to look at a lock-free queue to replace it, but the first thing is implementing a better scheduling algorithm. Whether that belongs in ggml or llama.cpp is debatable; I would prefer having a better graph scheduler to replace the current one.
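As a rough illustration of the lock-free direction mentioned above (not the scheme in the linked branch): workers can claim chunks of a node with a single atomic counter instead of taking a lock per task. All names here are illustrative.

```c
#include <stdatomic.h>

struct work_state {
    atomic_int next_chunk;  // index of the next unclaimed chunk of the current node
    int        n_chunks;    // total number of chunks the node was split into
};

// each worker calls this in a loop, computing the returned chunk until it gets -1
static int claim_chunk(struct work_state * ws) {
    int chunk = atomic_fetch_add(&ws->next_chunk, 1);
    return chunk < ws->n_chunks ? chunk : -1;
}
```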
Can we get a short TLDR?
Can one of the admins verify this patch?
Oh, I missed that. The TLDR: this is just an exploration of how llama.cpp would behave if there were no busy waiting. It was not supposed to be merged, because it uses an external thread pool implementation. Since then, I think work scheduling in llama.cpp has moved on a lot, so I will close this. Feel free to use the patch, but it has very likely diverged a lot from main.
This is an attempt to change the threading in ggml from busy-wait spin locking to a mutex/condvar based thread pool.
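For context, a minimal sketch of the mutex/condvar pattern this PR explores; the PR itself reuses the C-Thread-Pool library rather than this code, and the names and single-slot job queue here are illustrative only.

```c
#include <pthread.h>
#include <stdbool.h>

struct pool {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    void          (*job)(void *);  // pending job, or NULL if none
    void           *arg;
    bool            stop;
};

static void * worker(void * p_) {
    struct pool * p = p_;
    for (;;) {
        pthread_mutex_lock(&p->mutex);
        // sleep instead of spinning: the thread consumes no CPU while waiting
        while (p->job == NULL && !p->stop) {
            pthread_cond_wait(&p->cond, &p->mutex);
        }
        if (p->stop) {
            pthread_mutex_unlock(&p->mutex);
            return NULL;
        }
        void (*job)(void *) = p->job;
        void * arg = p->arg;
        p->job = NULL;
        pthread_mutex_unlock(&p->mutex);
        job(arg);  // run the work item outside the lock
    }
}
```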
I don't think this should be merged: it adds a dependency on a copied https://github.com/Pithikos/C-Thread-Pool (also hacked to work on Windows) just to see what the effect on performance and energy usage would be. But maybe it will inspire further work on this.
The motivation is energy consumption: this PR reduces CPU usage from 700% to 400% in the 8-thread run on Mac, while making the time per token eval slightly worse. There may be similar CPU savings on other platforms.
I timed a few runs (below) and will also add Activity Monitor screenshots of thread pool vs. master in the comments.
I can't properly explain the CPU savings, though; the main thread also seems to take part in the computation, hm.