
Introduce ggml_threadpool #7526

Closed · wants to merge 17 commits into from

Conversation

fmz (Contributor) commented May 24, 2024

Added an API to support explicit management and fine-grained control of threadpools.
The API supports creating different threadpools for various parts of execution, e.g. batch processing, single-token generation, etc. Each threadpool can be created, paused, resumed, and released independently of any other threadpool. This mitigates the overhead of starting and stopping threads for each decode call, and it helps the OS keep track of scheduling history so it can make better scheduling decisions.

Each threadpool supports:

- Setting the number of threads
- Setting a CPU mask for threads to be placed on
- Strict or relaxed placement: pinning specific threads to specific cores, or letting the OS decide
- Polling or interrupt-driven wait
- Setting thread priority

Using threadpools explicitly is optional. If llama_decode is called with a llama_context that doesn't have a threadpool attached, a disposable threadpool is created (same as the current behavior).
If users choose to use threadpools explicitly, they have to manage them manually. See the examples in main.cpp and speculative.cpp, and the simplified usage sketch below.
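To make the intended usage concrete, here is a minimal sketch of how an explicitly managed threadpool could be wired into a decode loop. The identifiers below (ggml_threadpool_params_default, ggml_threadpool_create, llama_set_threadpool, and the pause/resume/release calls) are illustrative assumptions rather than the exact names introduced by this PR; main.cpp and speculative.cpp show the real usage.

```c
// Hypothetical usage sketch -- these identifiers are illustrative assumptions,
// not necessarily the exact API names introduced by this PR.
struct ggml_threadpool_params tpp = ggml_threadpool_params_default();
tpp.n_threads = 6;      // number of worker threads
tpp.strict    = false;  // relaxed placement: let the OS decide where threads run
tpp.poll      = true;   // busy-poll for new work instead of sleeping
tpp.prio      = 0;      // default scheduling priority
// (a CPU mask field would go here to restrict which cores the workers may use)

struct ggml_threadpool * tp = ggml_threadpool_create(&tpp);
llama_set_threadpool(ctx, tp);              // attach the pool to the decoding context

for (int i = 0; i < n_batches; i++) {
    llama_decode(ctx, batches[i]);          // every call reuses the same workers
}

ggml_threadpool_pause(tp);                  // idle the workers between requests
ggml_threadpool_resume(tp);
ggml_threadpool_release(tp);                // tear the pool down when finished
```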

With all the bells and whistles enabled, we generally see a 0.25-1 tok/s improvement across the board. Stay tuned for actual perf numbers!
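As background on the placement and priority options listed above: on Linux they map onto standard scheduling primitives. The self-contained example below (an illustration only, not code from this PR) pins a worker thread to a single core through a CPU mask, which is essentially what strict placement means here; priority would similarly be set via pthread_setschedparam or setpriority.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

// Illustration of "strict placement": restrict a worker thread to core 2 via a CPU mask.
static void * worker(void * arg) {
    (void) arg;
    printf("worker is running on CPU %d\n", sched_getcpu());
    return NULL;
}

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);  // the CPU mask: only core 2 is allowed (assumes the machine has a core 2)

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(mask), &mask);  // strict placement

    pthread_t th;
    pthread_create(&th, &attr, worker, NULL);
    pthread_join(th, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}
```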

github-actions bot added the testing, Vulkan, examples, server, ggml, and SYCL labels on May 24, 2024
mofosyne added the Kompute and Review Complexity : Medium labels on May 25, 2024
slaren (Collaborator) commented May 26, 2024

This has rather catastrophic performance on my system (13900k under WSL2):

| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3090 Ti | llama 7B Q4_0 | 0 | pp512 | 4547.28 | 3185.66 | 0.70 |
| RTX 3090 Ti | llama 7B Q4_0 | 0 | tg128 | 155.19 | 21.53 | 0.14 |
| RTX 3090 Ti | llama 7B Q4_0 | 1 | pp512 | 501.72 | 153.43 | 0.31 |
| RTX 3090 Ti | llama 7B Q4_0 | 1 | tg128 | 21.54 | 0.85 | 0.04 |

This is not a full review, but the changes to the interface of ggml-backend should not be made; the threadpool should instead remain specific to the CPU backend. It is fine to add more CPU-backend-specific functions to configure this (e.g., we already have a ggml_backend_cpu_set_n_threads function), but other backends do not benefit from a thread pool, and the generic interface should not be changed.
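For illustration, keeping the generic ggml-backend interface untouched and extending only the CPU backend could look roughly like the sketch below. ggml_backend_cpu_set_n_threads and ggml_backend_is_cpu are existing functions; ggml_backend_cpu_set_threadpool and the threadpool field on the CPU backend context are assumed names for this sketch, not the PR's code.

```c
// Existing CPU-only configuration hook (already in ggml-backend.h):
//   GGML_API void ggml_backend_cpu_set_n_threads(ggml_backend_t backend_cpu, int n_threads);
//
// An additional CPU-only setter could follow the same pattern. The function name
// and the `threadpool` field are assumptions for this sketch.
GGML_API void ggml_backend_cpu_set_threadpool(ggml_backend_t backend_cpu,
                                              struct ggml_threadpool * threadpool);

void ggml_backend_cpu_set_threadpool(ggml_backend_t backend_cpu,
                                     struct ggml_threadpool * threadpool) {
    GGML_ASSERT(ggml_backend_is_cpu(backend_cpu));

    // stash the pool in the CPU backend's context so graph_compute can reuse
    // its workers instead of spawning fresh threads on every call
    struct ggml_backend_cpu_context * ctx =
        (struct ggml_backend_cpu_context *) backend_cpu->context;
    ctx->threadpool = threadpool;  // assumed field, added alongside n_threads
}
```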

max-krasnyansky (Collaborator) commented:

@slaren Thank you for the quick review and perf check. We had a bug with the default mask handling (i.e., when no mask is specified). I pushed a few updates just now that resolve it. Not yet sure whether that will fix the regression you're seeing, though.

Thanks for the suggestion for the backend interface. Adding more setters sounds good. We'll follow up on that.

fmz (Contributor, Author) commented May 28, 2024

> This has rather catastrophic performance on my system (13900k under WSL2): [...]
>
> This is not a full review, but the changes to the interface of ggml-backend should not be done, instead it should remain specific to the CPU backend. [...]

Thanks for your comment!

Admittedly, we didn't really consider the GPU backends for this. The idea, though, is that ggml_threadpool can be used by any backend that needs CPU parallelism, and one can imagine any backend wanting to run some minor tasks in parallel on the CPU.
That said, it makes total sense that such grand plans don't belong in a PR titled "introduce". We can let that develop more organically in the future.

I'll update the code to make the threadpool explicitly CPU-specific.

slaren (Collaborator) commented May 28, 2024

Here are the results with the current commit:

| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3090 Ti | llama 7B Q4_0 | 0 | pp512 | 4389.11 | 4162.12 | 0.95 |
| RTX 3090 Ti | llama 7B Q4_0 | 0 | tg128 | 153.74 | 105.83 | 0.69 |
| RTX 3090 Ti | llama 7B Q4_0 | 1 | pp512 | 506.90 | 522.35 | 1.03 |
| RTX 3090 Ti | llama 7B Q4_0 | 1 | tg128 | 21.45 | 30.73 | 1.43 |

So it is a lot better, but still significantly slower without NKVO. NKVO should be the best case for a thread pool, since there is a call to graph_compute on every layer, which causes the threads to be launched many times on every evaluation. See #7342 for the kind of speedup that we should expect with a thread pool in this case.

In the case without NKVO, there is only a get_rows operation being run on the CPU per evaluation, and it does not cause any additional threads to be launched. I suspect that it might be necessary to maintain the logic that prevents launching additional threads in this case to avoid a performance regression.
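To spell out the behavior being described, here is a tiny self-contained toy (not the actual ggml code) of the guard that clamps the thread count to the work available and skips the worker threads entirely when a single thread suffices:

```c
#include <stdio.h>

// Toy illustration (not the ggml source): pick how many threads to use based on
// how much parallel work the graph actually has, and skip the worker threads
// entirely when one thread is enough.
static void run_on_calling_thread(void)      { printf("ran on the main thread, no workers\n"); }
static void run_on_threadpool(int n_threads) { printf("kicked off %d pool workers\n", n_threads); }

static void graph_compute(int n_threads, int max_tasks) {
    int n = n_threads < max_tasks ? n_threads : max_tasks;  // clamp threads to available tasks
    if (n <= 1) {
        // e.g. a lone get_rows op when the model is fully offloaded to the GPU
        run_on_calling_thread();
    } else {
        run_on_threadpool(n);
    }
}

int main(void) {
    graph_compute(/*n_threads=*/8, /*max_tasks=*/1);   // fully offloaded: nothing to parallelize
    graph_compute(/*n_threads=*/8, /*max_tasks=*/512); // CPU-heavy graph: use the pool
    return 0;
}
```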

fmz (Contributor, Author) commented May 28, 2024

> Here are the results with the current commit: [...] I suspect that it might be necessary to maintain the logic that prevents launching additional threads in this case to avoid a performance regression.

It's odd that you're seeing any difference at all on GPU. It may be the case that the new thread-launch code has substantial overhead.
Are you using the compare-commits.sh script for this?

slaren (Collaborator) commented May 28, 2024

> It's odd that you're seeing any difference at all on GPU. It may be the case that the new thread-launch code has substantial overhead.

Yes, but to be clear, currently there are no threads launched at all when fully offloading to a GPU. There is a single get_rows operation run on the CPU backend, but it runs with only 1 thread, so ggml reuses the main thread without launching any additional threads.

> Are you using the compare-commits.sh script for this?

Yes, I used the command line `LLAMA_CUDA=1 scripts/compare-commits.sh master threadpool -nkvo 0,1`.

fmz (Contributor, Author) commented May 30, 2024

OK, found the GPU issue. It turns out I had reverted this line by mistake:
`cplan.n_threads = MIN(max_tasks, n_threads);`

fmz (Contributor, Author) commented May 31, 2024

Alright, after a couple of bug fixes here and there, here are the GPU numbers on my laptop.
I ran this with the default parameters:

- Allocate n_threads == n_virtual_cores
- No polling
- No explicit placement
- Default priority

| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | pp512 | 1391.88 | 1380.94 | 0.99 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | No | tg128 | 66.24 | 65.95 | 1.00 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | pp512 | 271.17 | 276.63 | 1.02 |
| RTX 3060 Laptop GPU | llama 7B Q4_0 | Yes | tg128 | 8.55 | 45.91 | 5.37 |

When tightening things up a little bit:

- 6-thread threadpool (number of physical cores)
- priority = 3
- poll = 1

| model | size | params | backend | ngl | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | pp512 | 291.06 ± 1.17 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1 | tg128 | 54.06 ± 0.25 |

On the CPU side, we get something that looks like this with default params:

| CPU | Model | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | pp512 | 14.92 | 14.95 | 1.00 |
| Intel(R) Core(TM) i7-10750H CPU @ 2.60GHz | llama 7B Q4_0 | tg128 | 7.07 | 7.35 | 1.04 |

and when tightening up a bit (6 threads, higher priority, polling):

| model | size | params | backend | threads | nkvo | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 6 | 1 | pp512 | 15.08 ± 0.14 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 6 | 1 | tg128 | 7.66 ± 0.01 |

fmz (Contributor, Author) commented May 31, 2024

Will collect some more numbers on other targets soon. In the meantime, I declare this PR ready for review!

@fmz fmz marked this pull request as ready for review May 31, 2024 18:27
```cpp
int reps;
bool verbose;
output_formats output_format;
};
//
```

fmz (Contributor, Author) commented on this diff:

Debug leftovers

```diff
@@ -364,7 +380,7 @@ static cmd_params parse_cmd_params(int argc, char ** argv) {
         }
         auto p = split<std::string>(argv[i], split_delim);
         std::vector<ggml_type> types;
         for (const auto & t : p) {
```

fmz (Contributor, Author) commented on this diff:

That's what I get when I use an unconfigured IDE

```diff
@@ -1257,9 +1297,7 @@ static void test_prompt(llama_context * ctx, int n_prompt, int n_past, int n_bat
     llama_synchronize(ctx);
 }

 static void test_gen(llama_context * ctx, int n_gen, int n_past, int n_threads) {
```

fmz (Contributor, Author) commented on this diff:

This should not have changed

```diff
@@ -97,6 +97,7 @@ extern "C" {

     // compute graph with a plan
     enum ggml_status (*GGML_CALL graph_plan_compute)(ggml_backend_t backend, ggml_backend_graph_plan_t plan);
```

fmz (Contributor, Author) commented on this diff:

Unnecessary

```diff
@@ -254,7 +254,10 @@ void ggml_backend_synchronize(ggml_backend_t backend) {
     backend->iface.synchronize(backend);
 }

-ggml_backend_graph_plan_t ggml_backend_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph) {
+ggml_backend_graph_plan_t ggml_backend_graph_plan_create(
```

fmz (Contributor, Author) commented on this diff:

revert most of the changed function headers below

```diff
@@ -67,12 +67,21 @@ extern "C" {

     GGML_API void ggml_backend_synchronize(ggml_backend_t backend);

     GGML_API ggml_backend_graph_plan_t ggml_backend_graph_plan_create(ggml_backend_t backend, struct ggml_cgraph * cgraph);
```

fmz (Contributor, Author) commented on this diff:

revert

slaren (Collaborator) commented May 31, 2024

I ran the test again on my system:

| GPU | Model | NKVO | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| RTX 3090 Ti | llama 7B Q4_0 | No | pp512 | 4529.64 | 4503.79 | 0.99 |
| RTX 3090 Ti | llama 7B Q4_0 | No | tg128 | 155.93 | 145.68 | 0.93 |
| RTX 3090 Ti | llama 7B Q4_0 | Yes | pp512 | 525.08 | 519.27 | 0.99 |
| RTX 3090 Ti | llama 7B Q4_0 | Yes | tg128 | 21.93 | 26.80 | 1.22 |

I also tested on the CPU only with a very small model stories260k:

| CPU | Model | Threads | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|---|---|
| i9-13900K | stories260K F32 | 1 | pp512 | 6194.17 | 6209.78 | 1.00 |
| i9-13900K | stories260K F32 | 1 | tg128 | 6538.99 | 2403.19 | 0.37 |
| i9-13900K | stories260K F32 | 2 | pp512 | 11904.73 | 11246.42 | 0.94 |
| i9-13900K | stories260K F32 | 2 | tg128 | 4455.47 | 1648.02 | 0.37 |
| i9-13900K | stories260K F32 | 3 | pp512 | 14733.55 | 14574.01 | 0.99 |
| i9-13900K | stories260K F32 | 3 | tg128 | 3640.73 | 1581.78 | 0.43 |
| i9-13900K | stories260K F32 | 4 | pp512 | 21441.80 | 18003.02 | 0.84 |
| i9-13900K | stories260K F32 | 4 | tg128 | 3248.96 | 1514.81 | 0.47 |
| i9-13900K | stories260K F32 | 5 | pp512 | 21877.86 | 19999.46 | 0.91 |
| i9-13900K | stories260K F32 | 5 | tg128 | 2880.80 | 1422.77 | 0.49 |
| i9-13900K | stories260K F32 | 6 | pp512 | 27980.37 | 20569.50 | 0.74 |
| i9-13900K | stories260K F32 | 6 | tg128 | 2401.13 | 1366.64 | 0.57 |
| i9-13900K | stories260K F32 | 7 | pp512 | 25827.94 | 21106.84 | 0.82 |
| i9-13900K | stories260K F32 | 7 | tg128 | 2168.76 | 1228.24 | 0.57 |
| i9-13900K | stories260K F32 | 8 | pp512 | 33610.55 | 19080.77 | 0.57 |
| i9-13900K | stories260K F32 | 8 | tg128 | 1928.88 | 1208.25 | 0.63 |
| i9-13900K | stories260K F32 | 16 | pp512 | 19288.04 | 19772.55 | 1.03 |
| i9-13900K | stories260K F32 | 16 | tg128 | 940.62 | 872.68 | 0.93 |
| i9-13900K | stories260K F32 | 24 | pp512 | 23801.20 | 23440.24 | 0.98 |
| i9-13900K | stories260K F32 | 24 | tg128 | 609.94 | 637.03 | 1.04 |
| i9-13900K | stories260K F32 | 32 | pp512 | 25995.12 | 26778.80 | 1.03 |
| i9-13900K | stories260K F32 | 32 | tg128 | 415.76 | 458.49 | 1.10 |

Generally, on my system I am getting better performance using OpenMP as a thread pool (see the comments in #7606).
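For context, what makes OpenMP work well as a thread pool is that the runtime keeps its worker team alive across parallel regions, so repeated graph evaluations do not pay thread start-up costs. A minimal self-contained illustration of that pattern (not the llama.cpp implementation) is:

```c
#include <omp.h>
#include <stdio.h>

// OpenMP keeps its worker team alive between parallel regions, so each
// "evaluation" below reuses the same threads -- the runtime acts as the pool.
static void evaluate_graph(int eval, int n_threads) {
    #pragma omp parallel num_threads(n_threads)
    {
        int ith = omp_get_thread_num();
        int nth = omp_get_num_threads();
        // a real implementation would process this worker's share of the graph here
        if (ith == 0) {
            printf("eval %d ran with %d threads\n", eval, nth);
        }
    }
}

int main(void) {
    for (int eval = 0; eval < 4; eval++) {
        evaluate_graph(eval, 6);
    }
    return 0;
}
```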

fmz (Contributor, Author) commented Jun 3, 2024

> I ran the test again on my system: [...] Generally on my system I am getting better performance using OpenMP as a thread pool.

Interesting... that definitely doesn't match what I'm seeing. I wonder if it has to do with Intel 13th gen introducing big/little cores (which, incidentally, is a primary motivator for this PR!).
I'll try to replicate your setup.

fmz (Contributor, Author) commented Jun 3, 2024

On my M2 Max (CPU only) with the stories260K model:

`% LLAMA_NO_METAL=1 ./scripts/compare-commits.sh master threadpool`

| Model | Test | t/s master | t/s threadpool | Speedup |
|---|---|---|---|---|
| llama ?B all F32 (guessed) | pp512 | 49150.60 | 46529.08 | 0.95 |
| llama ?B all F32 (guessed) | tg128 | 2361.79 | 2386.59 | 1.01 |

`% ./build/bin/llama-bench -t 8`

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B all F32 (guessed) | 1.12 MiB | 0.29 M | BLAS | 8 | pp512 | 44988.72 ± 5183.60 |
| llama ?B all F32 (guessed) | 1.12 MiB | 0.29 M | BLAS | 8 | tg128 | 2353.62 ± 119.28 |

`% ./build/bin/llama-bench -t 8 -mt 8` (8-thread threadpool instead of the default 12)

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B all F32 (guessed) | 1.12 MiB | 0.29 M | BLAS | 8 | pp512 | 43740.36 ± 5335.82 |
| llama ?B all F32 (guessed) | 1.12 MiB | 0.29 M | BLAS | 8 | tg128 | 3074.75 ± 87.71 |

This tells me that the default configuration for the number of threads in a threadpool isn't ideal at all.

`% ./build/bin/llama-bench -t 8 -mt 8 --poll` (+ polling)

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B all F32 (guessed) | 1.12 MiB | 0.29 M | BLAS | 8 | pp512 | 47442.18 ± 5958.66 |
| llama ?B all F32 (guessed) | 1.12 MiB | 0.29 M | BLAS | 8 | tg128 | 3525.15 ± 68.55 |

Will report back on Intel 13th gen soon

fmz (Contributor, Author) commented Jun 5, 2024

Now that OpenMP is merged, I'll close this until further notice.
Thanks for all the feedback, @slaren!

@fmz fmz closed this Jun 5, 2024
slaren (Collaborator) commented Jun 5, 2024

To be clear, OpenMP has been merged because it solves the immediate problem, but it would still be good to have a thread pool that doesn't depend on external libraries. If there is a better alternative we can replace the current implementation.

fmz (Contributor, Author) commented Jun 5, 2024

> To be clear, OpenMP has been merged because it solves the immediate problem, but it would still be good to have a thread pool that doesn't depend on external libraries. If there is a better alternative we can replace the current implementation.

Absolutely. We didn't go the OpenMP route specifically because it's not available/desirable on all platforms. I will just need to do some rework AND make sure there are no regressions on targets such as Intel 13th gen. Expect another PR soon ;)

fmz mentioned this pull request on Jul 25, 2024