Introduce ggml_threadpool #7526
Conversation
Added an API to support explicit management of threadpools.
This has rather catastrophic performance on my system (13900k under WSL2):
This is not a full review, but the changes to the interface of ggml-backend should not be done; instead, it should remain specific to the CPU backend. It is ok to add more CPU backend specific functions to configure this (e.g. we already have a …)
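For context, the CPU backend already exposes configuration through backend-specific setters in ggml-backend.h, such as ggml_backend_cpu_set_n_threads. A threadpool setter could plausibly follow the same pattern; note that the second declaration below is a hypothetical sketch for illustration, not an existing function:

```cpp
// Existing CPU-backend-specific setter (the pattern referred to above):
GGML_API void ggml_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads);

// Hypothetical analogue for passing an explicit threadpool to the CPU backend;
// the name and signature are illustrative only:
GGML_API void ggml_backend_cpu_set_threadpool(ggml_backend_t backend_cpu, struct ggml_threadpool * threadpool);
```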
Force-pushed from 0276aca to a67dbcc (Compare)
@slaren Thank you for the quick review and perf check. We had a bug with the default mask handling (i.e. if the mask is not specified). I pushed a few updates just now which resolve that. Not yet sure if it'll fix the regression you're seeing, though. Thanks for the suggestion for the backend interface. Adding more setters sounds good. We'll follow up on that.
Thanks for your comment! Definitely didn't really consider the GPU backend for this. The idea, though, is that ggml_threadpool can be used for any backend that needs CPU parallelism, and you can imagine that any backend might want to do some minor tasks in parallel on the CPU. I'll update the code to make threadpool explicitly CPU-specific.
Here are the results with the current commit:
So it is a lot better, but still significantly slower without NKVO. NKVO should be the best case for a thread pool, since there is a call to …. In the case without NKVO, there is only a …
It's odd that you're seeing any difference at all on GPU. It may be the case that the new thread-launch code has substantial overhead.
Yes, but to be clear, currently there are no threads launched at all when fully offloading to a GPU. There is a single …
Yes, I used the command line …
Ok, found the GPU issue. Turns out I reverted this line by mistake: …
Alright, after a couple of bug fixes here and there, here are the GPU numbers on my laptop.
When tightening things a little bit:
On the CPU side we get something that looks like this with default params:
and when tightening up a bit (6 threads, higher priority, polling):
Will collect some more numbers on other targets soon.
    int reps;
    bool verbose;
    output_formats output_format;
};
//
Debug leftovers
@@ -364,7 +380,7 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
            }
            auto p = split<std::string>(argv[i], split_delim);
            std::vector<ggml_type> types;
            for (const auto & t : p) {
That's what I get when I use an unconfigured IDE
@@ -1257,9 +1297,7 @@ static void test_prompt(llama_context * ctx, int n_prompt, int n_past, int n_bat
        llama_synchronize(ctx);
    }

static void test_gen(llama_context * ctx, int n_gen, int n_past, int n_threads) {
This should not have changed
@@ -97,6 +97,7 @@ extern "C" {

        // compute graph with a plan
        enum ggml_status (*GGML_CALL graph_plan_compute)(ggml_backend_t backend, ggml_backend_graph_plan_t plan);
Unnecessary
@@ -254,7 +254,10 @@ void ggml_backend_synchronize(ggml_backend_t backend) {
    backend->iface.synchronize(backend);
}

ggml_backend_graph_plan_t ggml_backend_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
ggml_backend_graph_plan_t ggml_backend_graph_plan_create(
revert most of the changed function headers below
@@ -67,12 +67,21 @@ extern "C" {

    GGML_API void ggml_backend_synchronize(ggml_backend_t backend);

    GGML_API ggml_backend_graph_plan_t ggml_backend_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph);
revert
I ran the test again on my system:
I also tested on the CPU only with a very small model, stories260k:
Generally on my system I am getting better performance using OpenMP as a thread pool (#7606 (comment), #7606 (comment)).
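As a rough, generic illustration of why OpenMP can act as a ready-made thread pool (this is a self-contained toy example, not the ggml code): OpenMP runtimes keep their worker threads alive between parallel regions, so repeatedly dispatching work avoids the per-call thread start/stop overhead that a hand-rolled pthread launch pays.

```cpp
// Generic illustration (not ggml): OpenMP workers persist across parallel
// regions, so a loop that repeatedly enters a parallel region reuses the same
// threads, much like a dedicated thread pool would across decode calls.
// Build with: g++ -O2 -fopenmp openmp_pool.cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data(1 << 20, 1.0);
    double total = 0.0;

    for (int step = 0; step < 100; ++step) {
        double sum = 0.0;
        // Each iteration reuses the OpenMP worker threads; after the first
        // region, no threads are created or destroyed here.
        #pragma omp parallel for reduction(+ : sum)
        for (long long i = 0; i < (long long) data.size(); ++i) {
            sum += data[i];
        }
        total += sum;
    }

    std::printf("total = %.0f\n", total);
    return 0;
}
```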
Interesting...
On my M2 Max (CPU only) with the stories260k model:
% LLAMA_NO_METAL=1 ./scripts/compare-commits.sh master threadpool
% ./build/bin/llama-bench -t 8
% ./build/bin/llama-bench -t 8 -mt 8 (8-thread threadpool instead of the default 12)
Which tells me that the default config for the number of threads in a threadpool isn't ideal at all.
% ./build/bin/llama-bench -t 8 -mt 8 --poll (+ polling)
Will report back on Intel 13th gen soon.
Now that OpenMP is merged, I'll close this until further notice.
To be clear, OpenMP has been merged because it solves the immediate problem, but it would still be good to have a thread pool that doesn't depend on external libraries. If there is a better alternative, we can replace the current implementation.
Absolutely. We didn't go the OpenMP route specifically because it's not available/desirable on all platforms. I will just need to do some rework AND make sure there are no regressions on targets such as Intel 13th gen. Expect another PR soon ;)
Added an API to support explicit management and fine-grained control of threadpools.
The API supports creating different threadpools for various parts of execution, e.g. batch, single-token, etc. Each threadpool can be created, paused, resumed, and released independently from any other threadpools. This mitigates the overhead of starting/stopping threads for each decode call and helps OSes keep track of scheduling history in order to make better scheduling decisions.
Each threadpool supports:
- Setting the number of threads (duh)
- Setting a CPU mask for threads to be placed on
- Strict/relaxed placement: pinning specific threads to specific cores, or letting the OS decide (a minimal OS-level sketch of this follows the list)
- Polling or interrupt-driven waiting
- Setting thread priority
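Not part of the PR itself, but as a minimal Linux/pthreads sketch of what "strict placement" and thread priority involve at the OS level (the actual ggml_threadpool implementation is cross-platform and may use different mechanisms):

```cpp
// Illustrative only: pin a worker thread to a specific core and adjust its
// priority on Linux. Build with: g++ -O2 -pthread pinning_sketch.cpp
#include <pthread.h>
#include <sched.h>
#include <sys/resource.h>
#include <cstdio>

static void * worker(void *) {
    // Strict placement: pin this thread to core 0. Under relaxed placement this
    // step is skipped and the OS scheduler is free to migrate the thread.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    // Thread priority via the per-thread nice value (Linux-specific behaviour);
    // raising priority (negative values) usually requires elevated privileges.
    if (setpriority(PRIO_PROCESS, 0, 5) != 0) {
        std::perror("setpriority");
    }

    std::puts("worker pinned to core 0 with adjusted priority");
    return nullptr;
}

int main() {
    pthread_t tid;
    pthread_create(&tid, nullptr, worker, nullptr);
    pthread_join(tid, nullptr);
    return 0;
}
```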
Using threadpools explicitly is optional. If llama_decode is called with a llama_context that doesn't have a threadpool attached, a disposable threadpool is created (same as the current behavior).
If users choose to use threadpools explicitly, they have to manage them manually (roughly as in the sketch below); see the examples in main.cpp and speculative.cpp.
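Roughly, the caller-side flow looks like this. All identifiers here (ggml_threadpool_params, ggml_threadpool_new/pause/resume/free, llama_attach_threadpool) are placeholders inferred from the description above and may not match the exact API in this PR:

```cpp
// Hypothetical sketch of explicit threadpool management; names are placeholders,
// and ctx/batch come from the usual llama.cpp setup.
struct ggml_threadpool_params params = ggml_threadpool_params_default(/*n_threads=*/8);
// params would also carry the CPU mask, strict/relaxed placement, polling and
// priority settings listed above.

struct ggml_threadpool * tp = ggml_threadpool_new(&params);

llama_attach_threadpool(ctx, tp);   // decode calls on this context reuse the pool

llama_decode(ctx, batch);           // no per-call thread start/stop

ggml_threadpool_pause(tp);          // e.g. while the app is idle
ggml_threadpool_resume(tp);

ggml_threadpool_free(tp);           // the caller owns the pool's lifetime
```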
With all the bells and whistles enabled, we generally see a 0.25-1 tok/s improvement across the board. Stay tuned for actual perf numbers!