Use Threadpool to schedule the work #851

Draft: wants to merge 15 commits into master

Conversation

howard0su
Collaborator

I cannot use this code to fully utilize all CPU cores. Based on PR #710:

  1. Remove the finalizer.
  2. Use a similar technique to PR Run several single thread operators parellel #850.
  3. Optimize the thread pool itself to avoid malloc/free on the critical path (see the sketch after this list).
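
For illustration, a minimal sketch of point 3 (not the PR's actual thpool code, and all names such as tpool_t and TPOOL_MAX_TASKS are hypothetical): the task slots are preallocated as a fixed ring buffer when the pool is created, so pushing and popping work on the critical path never touches malloc/free.

```c
#include <pthread.h>
#include <stdbool.h>

#define TPOOL_MAX_TASKS 256

typedef struct {
    void (*fn)(void *);   // work function
    void *arg;            // its argument
} tpool_task_t;

typedef struct {
    tpool_task_t tasks[TPOOL_MAX_TASKS]; // ring buffer, allocated once with the pool
    int head, tail, count;
    pthread_mutex_t mtx;
    pthread_cond_t  has_work;
    bool stop;
} tpool_t;

// Enqueue without any allocation; returns false if the ring is full.
static bool tpool_push(tpool_t *tp, void (*fn)(void *), void *arg) {
    pthread_mutex_lock(&tp->mtx);
    if (tp->count == TPOOL_MAX_TASKS) {
        pthread_mutex_unlock(&tp->mtx);
        return false;
    }
    tp->tasks[tp->tail] = (tpool_task_t){ fn, arg };
    tp->tail = (tp->tail + 1) % TPOOL_MAX_TASKS;
    tp->count++;
    pthread_cond_signal(&tp->has_work);
    pthread_mutex_unlock(&tp->mtx);
    return true;
}

// Worker loop: take a slot, run it; no per-task heap traffic.
// Pool creation (pthread_create, mutex/cond init) happens once and is omitted here.
static void *tpool_worker(void *p) {
    tpool_t *tp = p;
    for (;;) {
        pthread_mutex_lock(&tp->mtx);
        while (tp->count == 0 && !tp->stop) {
            pthread_cond_wait(&tp->has_work, &tp->mtx);
        }
        if (tp->stop && tp->count == 0) {
            pthread_mutex_unlock(&tp->mtx);
            return NULL;
        }
        tpool_task_t t = tp->tasks[tp->head];
        tp->head = (tp->head + 1) % TPOOL_MAX_TASKS;
        tp->count--;
        pthread_mutex_unlock(&tp->mtx);
        t.fn(t.arg);
    }
}
```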

bogdad and others added 11 commits April 1, 2023 20:16

@howard0su
Collaborator Author

howard0su commented Apr 8, 2023

Additional Ideas to improve this PR:

  1. Use a lock-free queue to replace the mutex, and only use a condition variable to wake workers when new work arrives; we may not even need this thread pool implementation (a sketch follows this list).
  2. Test more with BLAS to see if it helps, and tune the settings for when we use BLAS.
  3. Divide the work smartly across n threads. n should not always be the core count; it should depend on the size of the parameter tensors.
  4. Topo-sort the graph. The current challenge is that the graph is dynamic, but it is largely static except for some parameters. We need to evaluate how quickly we can do a topo sort.
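
A hedged sketch of idea 1 (an assumed shape, not what this PR implements): the rows of one graph op are claimed through a single atomic counter, so the hot path needs no mutex; a condition variable would only be used to wake sleeping workers when a new op is published. job_t and the chunking policy are illustrative.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int next_row;   // next unclaimed row index
    int        n_rows;     // total rows for this op
    int        chunk;      // rows claimed per grab; tune against tensor size (idea 3)
    void     (*kernel)(int row0, int row1, void *ctx);
    void      *ctx;
} job_t;

// Each worker repeatedly claims a chunk of rows with one atomic add;
// there is no lock and no allocation on the hot path.
static void job_run(job_t *job) {
    for (;;) {
        int row0 = atomic_fetch_add(&job->next_row, job->chunk);
        if (row0 >= job->n_rows) {
            break;
        }
        int row1 = row0 + job->chunk;
        if (row1 > job->n_rows) {
            row1 = job->n_rows;
        }
        job->kernel(row0, row1, job->ctx);
    }
}
```

A side effect of chunked claiming is some natural load balancing: faster threads simply grab more chunks.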

@ggerganov for comments.

@sw
Contributor

sw commented Apr 9, 2023

@howard0su : I think you may have added some commits accidentally? There's already a PR for 997c749, I believe: #809

@howard0su
Collaborator Author

It helps my local debugging. I will revert it from this PR once the PR is in a better state.

In the meantime, please help review that PR and merge it if it looks good.

@howard0su
Collaborator Author

howard0su commented Apr 10, 2023

Call for some testing: the current change shows both a performance improvement and an energy improvement. However, my devbox is a 10-core, 20-thread E5 without AVX2, which is not a very typical configuration. I need some help validating the performance.

@howard0su
Collaborator Author

howard0su commented Apr 13, 2023

This is a test result on my machine. Eval time still shows some regression. I believe this is because during eval (as opposed to prompt eval) the token output is 1, so each thread's work is smaller and the scheduling overhead has a bigger impact on performance.

Figure_1
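
One way to act on this observation (an assumption, not necessarily what this PR does) is to scale the thread count with the amount of work in a node, so a single-token eval does not pay scheduling overhead for threads that have almost nothing to do. The helper and its threshold below are hypothetical.

```c
// Hypothetical helper: choose the thread count from the node's element count.
// The 64*1024 threshold is a placeholder that would need tuning.
static int pick_n_threads(long long n_elements, int n_threads_max) {
    const long long min_work_per_thread = 64LL * 1024;
    long long n = n_elements / min_work_per_thread;
    if (n < 1) n = 1;
    if (n > n_threads_max) n = n_threads_max;
    return (int) n;
}
```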

@ggerganov
Member

I'll look in more detail into the threading work after I finish with the quantization improvement efforts and some other pending stuff.

But at the moment, I can immediately say that these extra thpool files cannot stay like this. At some point in the past, I had implemented a thread pool inside ggml.c and it was very compact. If we really do need a thread pool, it should be implemented in a minimalistic way inside ggml.c.
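
For reference, one minimalistic shape such an in-ggml.c pool could take (an assumed sketch based on this comment, not the historical implementation): workers are spawned once and then coordinate per graph node through a couple of atomics, with no queue and no per-node allocation.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int  generation; // bumped by the main thread when a new node is published
    atomic_int  n_done;     // workers finished with the current node (reset by the main thread)
    atomic_bool stop;
    // per-node parameters (op, tensors, this worker's slice) would live here
} shared_state_t;

// Worker body; threads are created once with pthread_create and then reused.
static void *worker_main(void *p) {
    shared_state_t *s = p;
    int seen = 0;
    while (!atomic_load(&s->stop)) {
        // wait for the main thread to publish the next node
        while (atomic_load(&s->generation) == seen && !atomic_load(&s->stop)) {
            // spin; a real pool might yield or back off here
        }
        if (atomic_load(&s->stop)) {
            break;
        }
        seen = atomic_load(&s->generation);
        // ... compute this worker's slice of the current node ...
        atomic_fetch_add(&s->n_done, 1); // main thread waits until n_done == n_workers
    }
    return NULL;
}
```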

ggerganov added the threading (Parallel processing and thread management) label on Apr 14, 2023
@bogdad
Contributor

bogdad commented Apr 15, 2023

I ran the benchmark script on an M1 Mac with up to 10 threads. Orange is threadpool, blue is master at the point the branch was forked (eeaa7b0492fc79baab8bb1fe195d6c87159f2bd3).

Token time: [plot: master (blue) vs. threadpool (orange)]

I can't explain why we don't see the Windows improvements here, except at 10 threads, where the thread pool is better.

Scripts: bench.py, plot.py

Token time by thread count:

threads  threadpool  master
4        85.79       82.356
6        60.352      57.628
8        50.604      47.834
10       60.008      140.958

@howard0su
Collaborator Author

It may be related to the pthread functions behaving differently. Do you mind checking whether switching to a lock-free queue helps?

When using all threads, my testing also shows the thread pool is significantly better, but the overall time is not lower than with (max - 2) threads.

@lexasub
Contributor

lexasub commented Feb 3, 2025

@howard0su , @besnardjb any updates?
