Another threadpool: Avoid creating hundreds of threads in GGML #7342
Conversation
Thanks for the pointer. Overall there seem to be occasional gains, but they are close to measurement noise. I've also cleaned up the code a bit and tried to fix the Windows build. Ran the following configurations:
- ThreadPool (semaphore)
- ThreadPool (active poll, as on Windows, measured on Linux)
- Master (27b0406)
Got it, I basically ran the same test twice 😆 ...
Thank you, I will look into it further.
It looks like you are always waiting on a variable to change. Have you tried replacing that busy wait with a futex?
That is a very good idea. Will look into it after finding myself a Windows VM. Also, I'm currently not freeing the threads; I will need to refcount init and release.
- Add OpenMP to dependency detection
- Add GGML_NO_OMP if OpenMP is disabled or not found
- Keep the previous approach if there is no OMP
After looking into the futex approach I realized it was about to add a lot of ifdefs and several lines of relatively complex code. I then realized that I was in fact trying to implement an OpenMP parallel region, and so resorted to trying OpenMP directly. This basically translated into a single `#pragma omp parallel for`. I hope it works well on Windows, let's see. Also, note the performance counters in the runs below (the timings for thread creation were moved):
- 72 cores (moved timings for thread creation)
- 192 cores (moved timings for thread creation)
- Old laptop, 4 cores (moved timings for thread creation)
- H100 (moved timings for thread creation); node is 96 cores.
```c
/* Loop is reversed as in the NO_OMP case we want threads to start
   before the main thread (j==0) */
#pragma omp parallel for shared(workers,state_shared)
```
Might need to add a `num_threads(n_threads)` here to make sure that OMP always launches all the threads; otherwise it will deadlock.
You are right, it is indeed fragile. I'm not even sure setting the thread count covers all cases (https://www.openmp.org/spec-html/5.0/openmpsu35.html#x55-880002.6.1).
Unfortunately the OMP version is slower for me, and it causes a significant performance regression without nkvo.
CPU only:
Thank you very much for checking. Indeed, it turns out it cannot be so simple; it is hard to beat those hundreds of threads! As the pools did perform better on GPU configs, maybe OMP is doing some busy waiting between parallel regions that interferes with CUDA. I will first try to fix the risky parallel-for logic and will play a bit more with it out of curiosity.
Superseded by #7606 using OpenMP 👍. I still think that there are gains to be had at the level of the barriers.
TL/DR:
When profiling the code I noticed hundreds of threads were being created, which made profiling impractical and may add some performance overhead. I then attempted a simple threadpool inside GGML to see if it made any difference at all. Gains at small scale are not notable, and inference does not change much. However, the more threads, the more overhead; and more importantly, the code becomes much easier to profile (no hundreds of threads to look at).
This is my first attempt at hacking the code; there may be cleaner or better ways. I have seen PRs #710 and #851, which are similar. I'm also not sure how to properly benchmark this; I did launch `./bin/benchmark`.
Perf on x86-64 (Genoa)
Perf on x86-64 (Ivy Bridge)