Threadpool: take 2 #8672
Conversation
Here are some perf figures (CPU backend):
W-2225 Xeon machine:
Intel 10th-gen CPU:
Mobile NVIDIA 3060:
Force-pushed from 043b9df to ef1ff14
@slaren Threadpool is back!
I tried to test this on macOS, but it seems to deadlock.
Fixed! |
On M2 Max (built with GGML_NO_METAL=1 GGML_NO_ACCELERATE=1):
Same thing, but with llama-v3 8B Q4_0_4_4 (for some reason my compiler AppleClang15 doesn't support INT8 matmul?)
@slaren lmk if it works for you this time
I tested this again on the M3 Max, but it still seems to deadlock. These are the call stacks of the threads:
Built with
Bummer...
@slaren Turns out there was a corner case: with a graph of only 1 node, ggml_barrier and wait_for_work deadlock on each other.
Thanks, I was able to run it now. Unfortunately the results are still not very good on my system. Under WSL this threadpool is much slower than OpenMP. A threadpool would be more important on macOS, since OpenMP is not available there, but for me it is also slower on the M3 Max.
M3 Max:
13900k + 3090Ti:
Threadpool (
build: bebe99c (3500)
Ooof...
@fmz @slaren I'm also working on another fix which is specific to llama-bench. Currently (in the
@fmz @slaren Here are the numbers from M2 Max.
Force-pushed from 4aa7a72 to 8ecdd36
The performance looks better now; with nkvo it is comparable to OpenMP, which is very good. There is still a performance drop when using the BLAS backend (this includes the default build on macOS, which uses Accelerate). I suspect that this is because the threads are spinning in
Results
build: 267bf57 (3554)
Awesome! Thanks for checking out the latest. We've been doing lots of profiling and tuning. BTW, we might just flip the default back to non-polling; technically, polling is only useful for llama-bench, to match OpenMP behavior/numbers in that case. When I looked at the original profiles, I saw that the threadpool was doing a lot more context switches than OpenMP during the token-gen test. Polling removes those context switches, and we get even better numbers now.
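To make the trade-off concrete, here is a minimal sketch of the two wait strategies (this is not the actual ggml code; the struct and function names are made up for illustration). An interrupt-driven wait parks the thread on a condition variable and pays a context switch on every wake-up, while a polling wait spins on an atomic flag for a bounded number of iterations before falling back to the blocking path.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct worker_sync {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    atomic_bool     have_work;  // set by the main thread when a new graph is ready
};

// Interrupt-driven: park on the condition variable until signalled.
// Cheap on CPU, but every wake-up costs a context switch.
static void wait_for_work_blocking(struct worker_sync * s) {
    pthread_mutex_lock(&s->mutex);
    while (!atomic_load_explicit(&s->have_work, memory_order_acquire)) {
        pthread_cond_wait(&s->cond, &s->mutex);
    }
    pthread_mutex_unlock(&s->mutex);
}

// Polling: spin on the flag first. If work arrives quickly (the common case
// during token generation), the thread never yields the CPU and the context
// switch is avoided; otherwise fall back to the blocking wait.
static void wait_for_work_polling(struct worker_sync * s, int spin_iters) {
    for (int i = 0; i < spin_iters; i++) {
        if (atomic_load_explicit(&s->have_work, memory_order_acquire)) {
            return;
        }
        // a cpu-relax hint (e.g. _mm_pause on x86) would typically go here
    }
    wait_for_work_blocking(s);
}
```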
This helps for long-running tests on platforms that are thermally limited (phones, laptops, etc.). --delay (disabled by default) introduces a sleep of N seconds before starting each test.
This avoids changing the overall process priority on Windows for the apps that use ggml/llama.cpp directly.
Force-pushed from adcd24c to 0778628
Force-pushed from 0778628 to e3c2202
All threadpool-related functions and structs use the ggml_threadpool prefix.
@slaren Process priority setting has been moved into a helper function in
As I mentioned above, performance looks pretty good across the board; see the report below (llama-v2-115M on key platforms). We can iterate further on automatic threadpool creation and reuse. I suggest we do that in Threadpool-V3 though, after we factor out the thread/cpu/numa stuff into ggml-thread.cpp.
M2 Max
build: 3246fe8 (3637)
build: c6328bc (3677)
Ryzen 9 3950X + RTX 3080
GGML_CUDA=1 make -j
build: 3246fe8 (3637)
GGML_CUDA=1 GGML_NO_OPENMP=1 make -j
build: c6328bc (3677)
Snapdragon X-Elite
build: 3246fe8 (3637)
build: c6328bc (3677)
Snapdragon Gen 3
Default Android build: armv8.7-a + openmp
build: 3246fe8 (3637)
Default Android build: armv8.7-a + no-openmp
build: c6328bc (3677)
enum ggml_sched_priority {
    GGML_SCHED_PRIO_NORMAL,
    GGML_SCHED_PRIO_MEDIUM,
    GGML_SCHED_PRIO_HIGH,
    GGML_SCHED_PRIO_REALTIME
};
Doesn't need to be done now, but it would be useful to have priorities below normal. I don't expect that increasing the priority of compute threads will be very useful outside of benchmarking; virtually every other thread is more important.
Sounds good. The main use cases we wanted to enable are benchmarking and low-latency LLM responses (i.e. using fewer cores but having the threads quickly get CPU cycles); also, bumping the priority a bit encourages the Windows scheduler to place the threads on the performance cores.
Will add lower priorities in threadpool V3.
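For context, here is a rough sketch of how a priority hint like the enum above could be applied to the current thread. This is illustrative only (the PR's actual mapping may differ); the helper name is hypothetical, but the OS calls are the standard Windows and POSIX ones.

```c
#include <stdbool.h>

#if defined(_WIN32)
#include <windows.h>

// Map an increasing priority level (0 = normal .. 3 = realtime) to the
// Windows per-thread priority classes. Slightly elevated priorities also
// nudge the scheduler towards the performance cores on hybrid CPUs.
static bool apply_priority_hint(int level) {
    int prio = THREAD_PRIORITY_NORMAL;
    switch (level) {
        case 1: prio = THREAD_PRIORITY_ABOVE_NORMAL;  break;
        case 2: prio = THREAD_PRIORITY_HIGHEST;       break;
        case 3: prio = THREAD_PRIORITY_TIME_CRITICAL; break;
        default: break;
    }
    return SetThreadPriority(GetCurrentThread(), prio) != 0;
}

#else
#include <pthread.h>
#include <sched.h>

// On POSIX, raising priority typically means switching to a real-time
// policy, which usually requires elevated privileges; fail gracefully.
static bool apply_priority_hint(int level) {
    if (level <= 0) {
        return true; // normal: keep the default policy
    }
    struct sched_param param = {
        .sched_priority = sched_get_priority_min(SCHED_FIFO) + level,
    };
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param) == 0;
}
#endif
```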
Co-authored-by: slaren <slarengh@gmail.com>
@mofosyne @ggerganov @slaren
Good job!
Thank you thank you!
Thank you for the great work and thorough review 👍
Excellent work! May I ask, have you tried this on an Android ARM board with the CPU backend? What is the performance?
That's the Snapdragon 8 Gen 3 :)
ref: original PR #7526
Added an API to support explicit management and fine-grained control of threadpools.
The API supports creating different threadpools for various parts of execution, e.g. batch, single-token, etc. Each threadpool can be created, paused, resumed, and released independently from any other threadpools. This mitigates the overhead of starting/stopping threads for each decode call and helps OSes keep track of scheduling history in order to make better scheduling decisions.
Each threadpool supports:
- Setting the number of threads (duh)
- Setting a CPU mask for the threads to be placed on
- Strict/relaxed placement: pinning specific threads to specific cores, or letting the OS decide (see the placement sketch after this list)
- Polling or interrupt-driven wait
- Setting thread priority
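As a rough illustration of the CPU-mask and strict/relaxed placement items above (not the PR's code; the helper name and the Linux-only approach are assumptions), the mask can be translated into a thread affinity set along these lines:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

// Apply a boolean CPU mask (one entry per logical core) to the calling thread.
// strict = true pins this thread to a single enabled core (chosen by its index
// in the pool); strict = false lets it float over the whole mask.
static bool apply_cpumask(const bool * mask, int n_cpus, bool strict, int thread_idx) {
    cpu_set_t set;
    CPU_ZERO(&set);

    if (strict) {
        int seen = 0;
        for (int c = 0; c < n_cpus; c++) {
            if (mask[c] && seen++ == thread_idx) {
                CPU_SET(c, &set);
                break;
            }
        }
    } else {
        for (int c = 0; c < n_cpus; c++) {
            if (mask[c]) {
                CPU_SET(c, &set);
            }
        }
    }
    // Fails (returns false) if the resulting set is empty, e.g. when strict
    // placement is requested but the mask has fewer enabled cores than threads.
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```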
Using threadpools explicitly is optional. If llama_decode is called with a llama_context that doesn't have a threadpool attached, a disposable threadpool is created (same as the current behavior).
If users choose to explicitly use threadpools, they have to manage them manually. See example in main.cpp.
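For readers who don't want to dig through main.cpp, here is a rough usage sketch of the lifecycle described above (create, attach, pause/resume, free). The names follow the ggml_threadpool prefix convention from this PR, but treat the exact signatures, fields, and thread counts as assumptions; the example in main.cpp is the authoritative version.

```c
#include "llama.h"  // assumed to expose the threadpool API added in this PR

// Hypothetical helper: ctx is an already-created llama_context.
static void setup_threadpools(struct llama_context * ctx,
                              struct ggml_threadpool ** out_tp,
                              struct ggml_threadpool ** out_tp_batch) {
    // One pool for batch/prompt processing, one for single-token generation
    // (8 and 4 threads here are arbitrary example values).
    struct ggml_threadpool_params tpp_batch = ggml_threadpool_params_default(8);
    struct ggml_threadpool_params tpp       = ggml_threadpool_params_default(4);

    struct ggml_threadpool * tp_batch = ggml_threadpool_new(&tpp_batch);
    struct ggml_threadpool * tp       = ggml_threadpool_new(&tpp);

    // Attach both to the context; llama_decode uses the batch pool for large
    // batches and the other pool for single-token decodes. Without this call,
    // a disposable pool is created per decode, as before.
    llama_attach_threadpool(ctx, tp, tp_batch);

    *out_tp       = tp;
    *out_tp_batch = tp_batch;
}

// While the application is idle (e.g. waiting for user input), a pool can be
// paused so its threads stop running, and resumed before the next decode:
//     ggml_threadpool_pause(tp);
//     ...
//     ggml_threadpool_resume(tp);
//
// When done, detach and release explicitly:
//     llama_detach_threadpool(ctx);
//     ggml_threadpool_free(tp);
//     ggml_threadpool_free(tp_batch);
```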
With all the bells and whistles enabled, we generally see a minor improvement vs OpenMP. Without polling, the threadpool runs on par with OpenMP.