Introduce ggml_threadpool #7526
Conversation
Added an API to support explicit management of threadpools.
This has rather catastrophic performance on my system (13900k under WSL2):
This is not a full review, but the changes to the interface of ggml-backend should not be done; instead, it should remain specific to the CPU backend. It is ok to add more CPU backend specific functions to configure this (e.g. we already have a …)
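For context, the CPU backend already exposes configuration through backend-specific setters in ggml-backend.h, such as ggml_backend_cpu_set_n_threads. A threadpool setter could plausibly follow the same pattern; note that the second declaration below is a hypothetical sketch for illustration, not an existing function:

```cpp
// Existing CPU-backend-specific setter (the pattern referred to above):
GGML_API void ggml_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads);

// Hypothetical analogue for passing an explicit threadpool to the CPU backend;
// the name and signature are illustrative only:
GGML_API void ggml_backend_cpu_set_threadpool(ggml_backend_t backend_cpu, struct ggml_threadpool * threadpool);
```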
Force-pushed from 0276aca to a67dbcc (Compare)
@slaren Thank you for the quick review and perf check. We had a bug with the default mask handling (i.e. if the mask is not specified). I pushed a few updates just now which resolve that. Not yet sure if it'll fix the regression you're seeing, though. Thanks for the suggestion for the backend interface. Adding more setters sounds good. We'll follow up on that.
Thanks for your comment! Definitely didn't really consider the GPU backend for this. The idea, though, is that ggml_threadpool can be used for any backend that needs CPU parallelism, and you can imagine that any backend might want to do some minor tasks in parallel on the CPU. I'll update the code to make threadpool explicitly CPU-specific.
Here are the results with the current commit:
So it is a lot better, but still significantly slower without NKVO. NKVO should be the best case for a thread pool, since there is a call to …. In the case without NKVO, there is only a …
It's odd that you're seeing any difference at all on GPU. It may be the case that the new thread-launch code has substantial overhead.
Yes, but to be clear, currently there are no threads launched at all when fully offloading to a GPU. There is a single …
Yes, I used the command line …
Ok, found the GPU issue. Turns out I reverted this line by mistake: …
Alright, after a couple of bug fixes here and there, here are the GPU numbers on my laptop.
When tightening things a little bit:
On the CPU side we get something that looks like this with default params:
and when tightening up a bit (6 threads, higher priority, polling):
Will collect some more numbers on other targets soon.
    int reps;
    bool verbose;
    output_formats output_format;
};
//
Debug leftovers
@@ -364,7 +380,7 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
            }
            auto p = split<std::string>(argv[i], split_delim);
            std::vector<ggml_type> types;
            for (const auto & t : p) {
That's what I get when I use an unconfigured IDE
@@ -1257,9 +1297,7 @@ static void test_prompt(llama_context * ctx, int n_prompt, int n_past, int n_bat
        llama_synchronize(ctx);
    }

static void test_gen(llama_context * ctx, int n_gen, int n_past, int n_threads) {
This should not have changed
@@ -97,6 +97,7 @@ extern "C" {

        // compute graph with a plan
        enum ggml_status (*GGML_CALL graph_plan_compute)(ggml_backend_t backend, ggml_backend_graph_plan_t plan);
Unnecessary
@@ -254,7 +254,10 @@ void ggml_backend_synchronize(ggml_backend_t backend) {
    backend->iface.synchronize(backend);
}

ggml_backend_graph_plan_t ggml_backend_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
ggml_backend_graph_plan_t ggml_backend_graph_plan_create(
revert most of the changed function headers below
@@ -67,12 +67,21 @@ extern "C" {

    GGML_API void ggml_backend_synchronize(ggml_backend_t backend);

    GGML_API ggml_backend_graph_plan_t ggml_backend_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph);
revert
I ran the test again on my system:
I also tested on the CPU only with a very small model, stories260k:
Generally on my system I am getting better performance using OpenMP as a thread pool (#7606 (comment), #7606 (comment)).
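As a rough, generic illustration of why OpenMP can act as a ready-made thread pool (this is a self-contained toy example, not the ggml code): OpenMP runtimes keep their worker threads alive between parallel regions, so repeatedly dispatching work avoids the per-call thread start/stop overhead that a hand-rolled pthread launch pays.

```cpp
// Generic illustration (not ggml): OpenMP workers persist across parallel
// regions, so a loop that repeatedly enters a parallel region reuses the same
// threads, much like a dedicated thread pool would across decode calls.
// Build with: g++ -O2 -fopenmp openmp_pool.cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data(1 << 20, 1.0);
    double total = 0.0;

    for (int step = 0; step < 100; ++step) {
        double sum = 0.0;
        // Each iteration reuses the OpenMP worker threads; after the first
        // region, no threads are created or destroyed here.
        #pragma omp parallel for reduction(+ : sum)
        for (long long i = 0; i < (long long) data.size(); ++i) {
            sum += data[i];
        }
        total += sum;
    }

    std::printf("total = %.0f\n", total);
    return 0;
}
```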
Interesting...
On my M2 Max (CPU only) with the stories260k model:
% LLAMA_NO_METAL=1 ./scripts/compare-commits.sh master threadpool
% ./build/bin/llama-bench -t 8
% ./build/bin/llama-bench -t 8 -mt 8 (8-thread threadpool instead of the default 12)
Which tells me that the default config for the number of threads in a threadpool isn't ideal at all.
% ./build/bin/llama-bench -t 8 -mt 8 --poll (+ polling)
Will report back on Intel 13th gen soon.
Now that OpenMP is merged, I'll close this until further notice.
To be clear, OpenMP has been merged because it solves the immediate problem, but it would still be good to have a thread pool that doesn't depend on external libraries. If there is a better alternative, we can replace the current implementation.
Absolutely. We didn't go the OpenMP route specifically because it's not available/desirable on all platforms. I will just need to do some rework AND make sure there are no regressions on targets such as Intel 13th gen. Expect another PR soon ;)
Added an API to support explicit management and fine-grained control of threadpools.
The API supports creating different threadpools for various parts of execution, e.g. batch, single-token, etc. Each threadpool can be created, paused, resumed, and released independently from any other threadpools. This mitigates the overhead of starting/stopping threads for each decode call and helps OSes keep track of scheduling history in order to make better scheduling decisions.
Each threadpool supports:
- Setting the number of threads (duh)
- Setting a CPU mask for threads to be placed on
- Strict/relaxed placement: pinning specific threads to specific cores, or letting the OS decide (a minimal OS-level sketch of this follows the list)
- Polling or interrupt-driven waiting
- Setting thread priority
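Not part of the PR itself, but as a minimal Linux/pthreads sketch of what "strict placement" and thread priority involve at the OS level (the actual ggml_threadpool implementation is cross-platform and may use different mechanisms):

```cpp
// Illustrative only: pin a worker thread to a specific core and adjust its
// priority on Linux. Build with: g++ -O2 -pthread pinning_sketch.cpp
#include <pthread.h>
#include <sched.h>
#include <sys/resource.h>
#include <cstdio>

static void * worker(void *) {
    // Strict placement: pin this thread to core 0. Under relaxed placement this
    // step is skipped and the OS scheduler is free to migrate the thread.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    // Thread priority via the per-thread nice value (Linux-specific behaviour);
    // raising priority (negative values) usually requires elevated privileges.
    if (setpriority(PRIO_PROCESS, 0, 5) != 0) {
        std::perror("setpriority");
    }

    std::puts("worker pinned to core 0 with adjusted priority");
    return nullptr;
}

int main() {
    pthread_t tid;
    pthread_create(&tid, nullptr, worker, nullptr);
    pthread_join(tid, nullptr);
    return 0;
}
```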
Using threadpools explicitly is optional. If llama_decode is called with a llama_context that doesn't have a threadpool attached, a disposable threadpool is created (same as the current behavior).
If users choose to use threadpools explicitly, they have to manage them manually (roughly as in the sketch below); see the examples in main.cpp and speculative.cpp.
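Roughly, the caller-side flow looks like this. All identifiers here (ggml_threadpool_params, ggml_threadpool_new/pause/resume/free, llama_attach_threadpool) are placeholders inferred from the description above and may not match the exact API in this PR:

```cpp
// Hypothetical sketch of explicit threadpool management; names are placeholders,
// and ctx/batch come from the usual llama.cpp setup.
struct ggml_threadpool_params params = ggml_threadpool_params_default(/*n_threads=*/8);
// params would also carry the CPU mask, strict/relaxed placement, polling and
// priority settings listed above.

struct ggml_threadpool * tp = ggml_threadpool_new(&params);

llama_attach_threadpool(ctx, tp);   // decode calls on this context reuse the pool

llama_decode(ctx, batch);           // no per-call thread start/stop

ggml_threadpool_pause(tp);          // e.g. while the app is idle
ggml_threadpool_resume(tp);

ggml_threadpool_free(tp);           // the caller owns the pool's lifetime
```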
With all the bells and whistles enabled, we generally see a 0.25-1 tok/s improvement across the board. Stay tuned for actual perf numbers!