Fine tune MUL_MAT, new threading (spin+wait/notify), speedup q_f32 BLAS by splitting COMPUTE stage #1632

Closed
mqy wants to merge 24 commits

@mqy mqy commented May 29, 2023

Introduction

MUL_MAT takes most of the compute time (about 95%), so to speed up llama we have to focus on MUL_MAT.
BLAS, one of the fastest MUL_MAT solutions on CPU, is typically efficient at large matrix multiplications but tends to be very slow when run in parallel from multiple OS threads. Accelerate is the native BLAS implementation on macOS and has exactly this problem. OpenBLAS and BLIS are a bit slower than Accelerate; their authors claim multi-thread support, but I did not test that. So I assume that, for the big matrix sizes in llama, multi-threaded BLAS does not run faster than single-threaded BLAS.

We have three kinds of MUL_MAT to compute:

  1. mul_mat_f32: both src0 and src1 are F32.
  2. mul_mat_f16_f32: src0 is F16 and src1 is F32.
  3. mul_mat_q_f32: src0 is quantized (Q4_0, Q4_1, ...), and src1 is F32.

For every kind of MUL_MAT, there is a pure CPU solution with an optional INIT stage and a COMPUTE stage.
There are also optional solutions: CUDA/CL, which run on the GPU, and BLAS, which runs on the CPU.

  1. mul_mat_f32: has only one stage: COMPUTE.
    • The pure CPU with multi-threads
    • BLAS, CUDA and CL with single thread
  2. mul_mat_f16_f32:
    • The pure CPU has two stages: INIT with a single thread, COMPUTE with multi-threads.
    • BLAS, CUDA and CL with single thread
  3. mul_mat_q_f32: same as mul_mat_f16_f32, but the de-quantization time is significant.

As for BLAS, there are three known problems to solve:

  1. Spin-only threading. While spinning has been the simplest and perhaps the fastest solution, the community has long been seeking a practical threading infrastructure that can avoid busy spinning in certain situations.
  2. Single-threaded BLAS. This is because:
    The typical mul_mat time when N/K >= 4096 ranges from several ms to hundreds of ms. Given n_threads > 1, when BLAS runs in the main thread, the worker threads have nothing to do and keep spinning. This spinning overhead is not acceptable.
    Given M/N/K, n_threads (and even the src0 type), and due to the diversity of matrix dimensions and hardware/software stacks, we cannot be sure which solution is the fastest. At present, the master branch applies this rule: run CUDA/CL/BLAS in a single OS thread when both src0 and src1 are contiguous and M >= 32 && N >= 32 && K >= 32. For llama models, this rule is almost equal to M >= 32 && N >= 4096 && K >= 4096.
  3. For some N/K, de-quantization time may exceed mul_mat time when M < 128. This range covers the token counts of typical daily conversations. So we had better move de-quantization out of the for loops, so that it can run in multiple threads.
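
The size rule quoted in item 2 can be written as a single predicate. This is only a minimal sketch: the function name and parameter names are hypothetical and are not taken from ggml.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical helper mirroring the rule quoted above; not a ggml function.
// M, N and K are the mul_mat dimensions used throughout this description.
bool use_single_thread_blas(bool src0_contiguous, bool src1_contiguous,
                            int M, int N, int K) {
    assert(M >= 0 && N >= 0 && K >= 0);
    return src0_contiguous && src1_contiguous &&
           M >= 32 && N >= 32 && K >= 32;
}
```

For a 4096x4096 weight matrix the answer flips exactly at M = 32, which is why the location of that threshold matters so much in the results discussed later.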

Solutions

This PR tries to solve the above problems. They are tightly coupled, so it is hard to solve one without touching the others.

1. A new threading infrastructure that supports spin + wait/notify

Typical usages are:

  • When computing a task stage, if the main thread knows that the stage can only be run by itself and the stage is configured as idle wait, it issues a wait_now command; workers receive this command almost at once and go to wait.
  • Workers can be configured with wait_on_task_done: we look ahead a few task stages to see whether there is any immediate need for multiple threads. If not, workers are told to go waiting after finishing the current task. This optimization saves energy, but is hard to implement correctly and efficiently. In addition to a mutex, I had to use a spin lock.
  • Also, when computing a task stage, if the main thread knows that the current stage needs more workers, it executes a syscall to wake up all workers. I once implemented a threading framework that could suspend or wake up a given number of workers, but I finally discarded it because I found no evidence supporting the use of only part of the workers.
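
The spin + wait/notify idea described in these bullets can be sketched with POSIX threads and C11 atomics. This is an illustration under assumptions, not the PR's implementation: every name below (worker_pool, pool_command, CMD_WAIT and so on) is hypothetical, and the "work" is just a counter. Depending on the toolchain, linking may need `-pthread`.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

// All names here are hypothetical; this is not the PR's threading API.
enum worker_cmd { CMD_SPIN, CMD_WAIT, CMD_STOP };

struct worker_pool {
    atomic_int      cmd;      // current command, polled by spinning workers
    atomic_int      pending;  // work items not yet claimed
    atomic_int      done;     // work items completed
    pthread_mutex_t mu;       // protects the sleep/wake handshake
    pthread_cond_t  cv;       // workers sleep here while cmd == CMD_WAIT
};

void pool_init(struct worker_pool *p) {
    assert(p != NULL);
    atomic_init(&p->cmd, CMD_SPIN);
    atomic_init(&p->pending, 0);
    atomic_init(&p->done, 0);
    pthread_mutex_init(&p->mu, NULL);
    pthread_cond_init(&p->cv, NULL);
}

// Main thread: publish a command. Storing it under the mutex and then
// broadcasting means a worker about to sleep cannot miss the wake-up.
void pool_command(struct worker_pool *p, enum worker_cmd c) {
    pthread_mutex_lock(&p->mu);
    atomic_store(&p->cmd, c);
    pthread_cond_broadcast(&p->cv);
    pthread_mutex_unlock(&p->mu);
}

// Main thread: yield until at least n work items have completed.
void pool_wait_done(struct worker_pool *p, int n) {
    while (atomic_load(&p->done) < n) sched_yield();
}

void *worker_main(void *arg) {
    struct worker_pool *p = arg;
    for (;;) {
        int c = atomic_load(&p->cmd);
        if (c == CMD_STOP) break;
        if (c == CMD_WAIT) {
            // Idle wait: block in the kernel instead of burning CPU.
            pthread_mutex_lock(&p->mu);
            while (atomic_load(&p->cmd) == CMD_WAIT) {
                pthread_cond_wait(&p->cv, &p->mu);
            }
            pthread_mutex_unlock(&p->mu);
            continue;
        }
        // CMD_SPIN: busy-poll and claim one work item if any is pending.
        int n = atomic_load(&p->pending);
        if (n > 0 && atomic_compare_exchange_weak(&p->pending, &n, n - 1)) {
            atomic_fetch_add(&p->done, 1);  // stand-in for real work
        }
    }
    return NULL;
}
```

In this sketch the main thread would call `pool_command(&p, CMD_WAIT)` before a stage that only it can run, and `pool_command(&p, CMD_SPIN)` before a stage that needs the workers. Waking all workers with one broadcast matches the choice described above of not waking only part of them.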

2. A way to configure how to run task stage.

I want to explicitly define which part of the code to run, whether to run it single-threaded or multi-threaded, and whether workers should idle wait. This is not new, but it introduces the idle wait and makes the configuration more explicit. With this we can run benchmarks at will; it frees us from the current implicit #if defined(xxx) and allows us to build with all kinds of solutions. I formally defined task profiles for the three kinds of mul_mat. This took quite a lot of code, but it is very important for the whole solution.
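
To make the idea of a task profile concrete, here is a minimal sketch of what such a configuration could look like for the three q_f32 profiles described in this PR. The struct layout, field names and profile names are hypothetical; the real structures in the PR may differ, and the idle_wait setting of the plain BLAS profile is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical layout; the PR's real task profile structure may differ.
enum task_stage { STAGE_INIT = 0, STAGE_COMPUTE = 1, N_STAGES = 2 };

struct stage_conf {
    bool valid;      // does this profile run the stage at all?
    bool parallel;   // true: all threads; false: main thread only
    bool idle_wait;  // true: workers block instead of spinning
};

struct task_profile {
    const char       *name;
    struct stage_conf stages[N_STAGES];
};

// The three q_f32 profiles as described in the text:
//   0: pure CPU, INIT single-threaded, COMPUTE multi-threaded
//   1: BLAS only, COMPUTE single-threaded (idle_wait assumed here)
//   2: split BLAS, INIT (de-quantization) multi-threaded,
//      COMPUTE with BLAS single-threaded while workers idle wait
const struct task_profile q_f32_profiles[3] = {
    { "cpu",        { { true,  false, false }, { true, true,  false } } },
    { "blas",       { { false, false, false }, { true, false, true  } } },
    { "blas-split", { { true,  true,  false }, { true, false, true  } } },
};

// A profile needs worker threads if any stage it runs is parallel.
bool profile_needs_workers(const struct task_profile *p) {
    assert(p != NULL);
    for (int s = 0; s < N_STAGES; s++) {
        if (p->stages[s].valid && p->stages[s].parallel) return true;
    }
    return false;
}
```

Keeping the per-stage choices in data rather than behind preprocessor switches is what lets a single build carry all solutions and pick among them at run time.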

3. A flexible tune(bench) tool to generate bench data

This tool has the following features/benefits:

  • Supports all llama models and typical matrix sizes and types (attention layer, feed-forward layer, RoPE)
  • Supports all types (F32, F16, all of the Qx_x). NOTE: F32 and F16 are disabled as a workaround for an unfixed bug.
  • Able to write to and read from a file, so the result can be generated ahead of time and loaded into memory later.
  • The data file is designed to be self-contained: it includes the model, type, backend and all 6 typical shapes; every shape contains its task profiles and the per-stage execution time for every task profile.
  • Able to estimate the execution time for any M and n_threads, and provides the corresponding APIs for GGML.
  • Analyzes bench data for n_threads. The output is CSV blocks, so it can be easily visualized.
  • Should cover the typical M range. I used to generate M from a constant start value with a constant step (for example, from 16 in steps of 8). Now I generate M as powers of two (for n in [0, 10] M := 2^n). This balances two fundamental needs: (1) the M range should be reasonably large; (2) more Ms should be assigned for M <= 32, because I guess this is the typical conversation token size that will be executed frequently, and this M range is sensitive to profile selection under multi-threading.
  • Should run as fast as possible. On my device it takes about 75 seconds to bench 7B/Q4_0/Accelerate with 10 Ms ranging from 1 to 512 in 3 passes with 1 thread, about 35 seconds for one pass with 1 thread, and about 13 seconds for 1 pass with 4 threads and max-M 128. The current speed is not good enough for running the bench at program startup.
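
The power-of-two choice of M described in the list above can be sketched as follows. The function name is hypothetical, and m_num = 10 is assumed to match the "10 Ms ranging from 1 to 512" used for the timings.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: fill `out` with M = 2^n for n = 0 .. m_num-1
// and return the number of values written.
int gen_bench_ms(int *out, int m_num) {
    assert(out != NULL && m_num > 0 && m_num < 31);
    for (int n = 0; n < m_num; n++) {
        out[n] = 1 << n;
    }
    return m_num;
}
```

With m_num = 10 the values are 1, 2, 4, ..., 512, and six of the ten samples fall at or below 32, which is the range the text calls most sensitive to profile selection.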

4. Adapt llama and ggml to schedule with bench

After the bench data is loaded into the program, during graph computation we first match the shape by the given N/K, then estimate the time for every profile that the shape supports, and finally select the fastest profile. Since in practice we only bench a limited number of Ms (10 or so), we need some way to estimate the time for any M. Because the M-time curve is nearly linear, I use interpolation. This is not very cool, but it is the best affordable way I can think of. Non-contiguous matrices are not suitable for BLAS, so they are scheduled to the pure CPU profile. If both src0 and src1 are contiguous but no bench data is loaded, or for some unknown reason or bug we cannot find the corresponding shape for the given N/K, or cannot estimate, we fall back to the traditional logic: M >= 32 && N >= 32 && K >= 32. This is unfortunate, because estimation bias around 32 strongly affects performance. You will see this in the following section.
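
The estimate-then-select step can be illustrated with a small sketch. It assumes plain linear interpolation between the two nearest benchmarked Ms, extending the nearest segment outside the sampled range; the struct and function names are hypothetical and the PR's actual estimator may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical sample: measured total time (ms) of one profile at one M.
struct bench_point { int m; double ms; };

// Estimate the time at `m` by linear interpolation between the two nearest
// samples; outside the sampled range, extend the nearest segment.
// `pts` must be sorted by ascending m and hold at least two samples.
double estimate_ms(const struct bench_point *pts, size_t n, int m) {
    assert(pts != NULL && n >= 2);
    size_t i = 1;
    while (i < n - 1 && pts[i].m < m) i++;
    const struct bench_point a = pts[i - 1], b = pts[i];
    return a.ms + (b.ms - a.ms) * (double)(m - a.m) / (double)(b.m - a.m);
}

// Return the index of the smallest estimate, i.e. the fastest profile.
int pick_fastest(const double *est_ms, int n_profiles) {
    assert(est_ms != NULL && n_profiles > 0);
    int best = 0;
    for (int i = 1; i < n_profiles; i++) {
        if (est_ms[i] < est_ms[best]) best = i;
    }
    return best;
}
```

A caller would compute one estimate per profile for the current M and pass the array to `pick_fastest`. When no estimate is available, the fallback is the fixed M >= 32 && N >= 32 && K >= 32 rule.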

5. Split single thread BLAS

I separated de-quantization from the combined de-quantization + mul_mat for loops. Thus I can create a third task profile for the q_f32 BLAS solution: run de-quantization in the INIT stage with multiple threads, run mul_mat with BLAS in a single thread, and let workers idle wait.

Results

Due to the nature of prediction, it is a bit hard for me to bench end to end. I wrote a bench tool named prompt.sh that asks llama questions like this: 0+0=0.1+1=1.2+2=. Although this makes it easy to construct a prompt of almost any approximate size, such questions likely take llama too long to think about, resulting in unusual bench times that may be longer than for normal questions. I have to say that I do not know how to run the end-to-end bench efficiently and correctly. Anyway, I ran examples/chat.sh with 4 threads many times and often observed the prompt time decrease by about 35%, sometimes over 40%, compared to master.

So let me explain in a stricter but perhaps easier to understand way, with a bunch of images.
First of all, let's define several tokens that identify the task stages of the three q_f32 profiles.

  • #0_0_nth=1 : profile 0, stage 0, n_threads = 1
  • #0_1_nth=1 : profile 0, stage 1, n_threads = 1
  • #0___nth=1 : profile 0, total, n_threads = 1
  • #1_1_nth=1 : profile 1, stage 1, n_threads = 1
  • #1___nth=1 : profile 1, total, n_threads = 1
  • #2_0_nth=1 : profile 2, stage 0, n_threads = 1
  • #2___nth=1 : profile 2, total, n_threads = 1
  • #0___nth=2 : profile 0, total, n_threads = 2
  • ...
  • #2___nth=6 : profile 2, total, n_threads = 6

Where stage 0 is the INIT stage and stage 1 is the COMPUTE stage.
The values of n_threads are typical because:

  • apart from 1, we usually use an even n_threads.
  • personal computers often do not have that many physical cores, so n_threads = 6 is OK.
  • suppose the single-thread time is t1; when we increase n_threads we ideally get t2=0.5*t1 for n_threads=2, t4=0.25*t1 for n_threads=4, and t6≈0.17*t1 for n_threads=6. The 0.17 means -83%, which is a pretty good speedup, I think.
  • Too many threads cause a heavy spin + wait/notify burden. When the ROI (speedup rate vs. energy/heat) drops to a certain value, increasing n_threads helps little or even hurts.

All data in the following images were created from llama 7B. I will not show all models because that would be too lengthy and I can only run 7B/13B. Instead I'll try Q4_0, Q5_0 and Q8_0, which are enough to make the points.

I ran bench/analyze on my MacBook Pro 2018 with 32 GB of 2400 MHz DDR4 memory and a 2.6 GHz 6-core Intel Core i7-8850H.

The data are all plotted as 2-D lines, where the x-axis is M and the y-axis is the per-thread execution time in ms.

4096x4096, Q4_0

The M >=32 rule and bias

The next diagram shows the execution time of profile-0 at stage-0 and stage-1. The axis scale is logarithmic. The stage-0 time is very short and negligible compared to that of stage-1. We can anticipate that:

  • the overall compute time is almost the same as that of stage-1
  • when run with n threads, the per-thread execution time should be 1/n of the single-thread time.
pic-1

The next diagram shows the execution time of profile-1 at stage-1 (BLAS). The axis scale is logarithmic. The time is nearly constant when M <= 64; beyond that Δt/ΔM keeps growing until the time becomes linear in M. I guess the sharp increase when M > 64 comes from memory pressure: 4096x4096x64 is about 1 billion float32 values (about 4 GiB at 4 bytes each). When the working set exceeds available memory, the OS has to compress memory or use swap, which would greatly hurt performance.

pic-2

The next picture is used to explain the bias ranges in current master code. Let's first find the points where the blue line intersects the other lines. The blue line represents the overall execution time for profile-2, whereas the other 4 lines represent the overall execution time for profile-0 at each n_threads. Every profile-0 line intersects the profile-2 line at some point. So given n_threads and M, we can easily determine the fastest profile (line) by glancing at the intersection point. For Ms not on the x-axis, we can easily estimate the corresponding time.

Now let's focus on the vertical line at M=32. Given n_threads, we can find the corresponding lines for profile-0 and profile-2.
Let's recall the default profile selection policy in master: M >= 32 && N >= 32 && K >= 32. This means that for NxK = 4096x4096, we follow the profile-0 line when M < 32 and the profile-2 line otherwise.

This is ideal when the two lines intersect at M=32; otherwise an estimation bias shows up for the Ms between the intersection point and 32. For any profile-0 line, the bias grows from 0 (at the intersection point) to |t0-t2| (at M=32), where t0 is the profile-0 time and t2 is the profile-2 time. The max bias is large: it may reach 30% for n_threads=1 and 2, and 60% for n_threads=4 or 6. Of course, as n_threads increases, spinning, memory contention and cache misses cause some performance degradation, so the per-thread average time does not reach the ideal (small) value.

As I said before, M is the token count. Since white space and word stems are also counted as tokens, the prompt token count for a typical question or statement is likely to be close to 32.

Anyway, personal computers nowadays tend to have large memory and fast CPUs, so the bias may go unnoticed or be tolerable.

pic-3

Parallel de-quantizing

The next two pictures show the trend of de-quantization time at the INIT stage as a percentage of the whole execution time. In theory, de-quantization (INIT) time is determined by N/K only, so it can be seen as a constant. But BLAS time increases after M>64.

The important thing to learn from these plots is that the INIT time is close to or greater than the COMPUTE time over a pretty large M range: up to 128! It is about 1/3 of the overall time even at M=256. So if we run INIT with multiple threads, we can get far better performance than with a single thread. Ideally, we can speed up by over 50% when M <= 64, and by 30% to 40% when M is between 64 and 128.

pic-4 pic-5
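
The expected gain from running INIT in parallel can be estimated with a simple model. This is an idealized sketch, not a measurement: it assumes INIT scales perfectly with n_threads while the BLAS COMPUTE stage stays single-threaded, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

// Idealized model (an assumption): INIT scales perfectly with n_threads,
// while the BLAS COMPUTE stage stays single-threaded.
double split_total_ms(double init_ms, double compute_ms, int n_threads) {
    assert(n_threads >= 1);
    return init_ms / n_threads + compute_ms;
}

// Fraction of time saved compared with running INIT on one thread.
double split_saving(double init_ms, double compute_ms, int n_threads) {
    double before = init_ms + compute_ms;
    assert(before > 0.0);
    return 1.0 - split_total_ms(init_ms, compute_ms, n_threads) / before;
}
```

With made-up numbers, INIT = COMPUTE = 10 ms and 4 threads give 12.5 ms in total, a 37.5% saving; when INIT is one third of the total (10 ms against 20 ms), the saving drops to 25%. Real numbers will be lower than the model once spinning and memory contention are included.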

Finally, here is the multi-threaded plot; for simplicity I only show nth=1 and nth=4. From this picture we can see that M at the intersection point increases with n_threads. I have seen that there is no intersection point at all when n_threads=8, which means the pure CPU solution always runs faster than the BLAS solution even if both run with multiple threads.

With fine tuning, given the model, type, M, N, K and n_threads, we will be able to select the correct profile.

pic-7

Other images

I will not explain them. I list these images mainly to show the similarity and the minor differences.

pic-14

How to evaluate

Build with make or CMake

Make sure one of the BLAS vendors is enabled and compiled into the program.

# Accelerate
make clean; LLAMA_NO_ACCELERATE= make
# OpenBLAS
make clean; LLAMA_NO_ACCELERATE=1 LLAMA_OPENBLAS=1 make
# BLIS
make clean; LLAMA_NO_ACCELERATE=1 LLAMA_BLIS=1 make
# CLBlast
make clean; LLAMA_CLBLAST=1 make

# CUDA is supported, but not well tested; it may not run at all.

Evaluate:

NOTE: when GPU offloading is enabled (-ngl > 0), mul_mat tuning is disabled automatically.

# help
./mulmat-tune -h

# tune, use default config, 7B, Q4_0, n_threads=4, ...
./mulmat-tune

# tune and run
./main ... --tune

# tune and save file, exit.
./main ... --tune --tune-file=<FILE>

# load and run:
./main ... --tune-file=<FILE>

./perplexity ... --tune

Have a look at examples/mulmat-tune/README.md for details

Conclusion

Software systems are complicated, and it is hard to optimize when target platforms vary widely. I'm certain that the q_f32 speedup would not have become reality without the new threading infrastructure, the task config profiles and the mulmat tune tool. I'm happy that, after so long, I am finally able to show you working code. Enjoy!

@ggerganov @SlyEcho @0cc4m @JohannesGaessler @zenixls2 @slaren

EDITED on Jun 18

  • typos
  • hide 5 images
  • tune: sync with latest changes

EDITED on Jun 26

I haven't updated this PR for a few days, for the following reasons:

  • to support tuning, this PR introduced too many changes.
  • the threading implementation is ugly, full of tricks, and not well tested.
  • it is hard to test on Windows and CL/CUDA due to my limited personal devices.
  • the design of task profiles is controversial: intrusive.
  • it is hard to merge even pieces of the code, and it tends to become a trouble maker.
  • finally, in favor of ggml : get rid of BLAS and all it's variants ggml#293

Great thanks to @KerfuffleV2 for help testing and all of you who took time on this PR.

I'm sorry @ggerganov this took you time to review, so I close this PR?

@mqy changed the title from "Fine Tune MUL_MAT with Bench, new threading (spin+wait/notify), speedup CBLAS by splitting COMPUTE and paralllel" to "Fine Tune MUL_MAT with Bench, new threading (spin+wait/notify), speedup CBLAS by splitting COMPUTE and run INIT in parallel" on May 29, 2023
@github-actions github-actions bot left a comment

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 25 out of 32. Check the log or trigger a new build to see more.

mqy commented May 29, 2023

CMakeFiles does not work, perhaps should move mulmat-tune.[c,h] to root dir.

@JohannesGaessler

I was thinking recently that better threading would be nice to have.

Anyways, I didn't yet look at the PR in detail but I can already give you feedback regarding the way you represent your data to make it easier to understand:

  • Add units to the table: you can't tell at a glance what the numbers mean. Then you no longer need to go back and forth between the README and the image.
  • Label the plot axes: again, you cannot tell at a glance what the lines mean. There are benchmarks where lower is better and some where higher is better. Put the meaning of the axes directly in the image.
  • Markdown lets you create tables. That may be a little easier to use than an image.

Regarding the contents of the README: unless I'm misunderstanding something you are at one point talking about doing dequantization on the CPU and then doing the actual matrix multiplication on the GPU. This is not a viable approach. The weights are very large and become even larger when dequantized. Transferring that much data between CPU and GPU is very slow, slower than to just do everything on the CPU. My implementation only works because weights are stored in VRAM and thus don't need to be copied to the GPU.

SlyEcho commented May 29, 2023

dequantization on the CPU and then doing the actual matrix multiplication on the GPU. This is not a viable approach.

We started out that way, at first cuBLAS was used without the custom kernels. It did work but obviously was much slower than it is now.

SlyEcho commented May 29, 2023

CMakeFiles does not work, perhaps should move mulmat-tune.[c,h] to root dir.

I think so. It seems to be another part of ggml, so I would rename them to ggml-tune.{c,h}

@ggerganov added the performance and threading labels on May 29, 2023
mqy commented May 29, 2023

CMakeFiles does not work, perhaps should move mulmat-tune.[c,h] to root dir.

I think so. It seems to be another part of ggml, so I would rename them to ggml-tune.{c,h}

@mqy mqy closed this May 29, 2023
@mqy mqy reopened this May 29, 2023

mqy commented May 29, 2023

@JohannesGaessler, feedback from you and others corrected my misunderstandings. I have improved the README file a bit for now: fixed wrong terms, stopped using images, and pasted some example results. I will keep updating it.

As for the term backend: similar to the current enum ggml_backend, I had previously defined enum ggml_device for CPU and GPU. Honestly speaking, I have been confused by the terms BLAS and GPU ever since, sorry!

In this PR, the bench result is tightly bound to a specific implementation, so I named several backend vendors for validating the loaded bench file. Now I read the backend as "mixed implementation on top of hardware and software library spec", so I use it to explicitly control which part of the code to run. I'm aware that your PR Cuda refactor, multi GPU support #1670 is ready to merge, congratulations!

Thanks!

mqy commented May 29, 2023

I'll try to fix the CMake build. I'm not familiar with it, so I will refer to the configuration of ggml-opencl.

SlyEcho commented May 29, 2023

I'll try to fix the CMake build. I'm not familiar with it, so I will refer to the configuration of ggml-opencl.

Is it optional? Because ggml-opencl is optional.

Otherwise you can just add the files to the ggml library target.

mqy commented May 29, 2023

Is it optional? Because ggml-opencl is optional.

As far as I know, ggml-opencl is controlled by a compile flag named LLAMA_OPENCL, while mulmat-tune does not have any compile flags at present. I do not expect to define a compile flag for mulmat-tune, because the field struct ggml_mulmat_tune * mm_tune; was added to both struct llama_context and struct ggml_cgraph. On llama init, if mulmat-tune.txt exists and is successfully loaded and validated, mm_tune is set.

llama will pass mm_tune to every ggml_cgraph it creates.
In ggml_graph_compute_mul_mat_set_task_profile(), if cgraph->mm_tune is NULL, we fall back to the M >= 32 && N >= 32 && K >= 32 logic.

I expect that in the future the choice of whether to use mulmat tune will be controlled by two command line options: --mulmat-tune-file=FILE to load an existing file, or --mulmat-tune to run the bench at once and use the in-memory result.

I doubt the usefulness of --mulmat-tune because the bench time may be too long. With bench parameters (model=7B, type=Q4_0, m_num = 10, n_pass=3), it takes about 75 seconds on my device, while 1 pass takes about 35 seconds. One possible fix: given N/K (both > 0), do not run de-quantization for every M. I will try this later.

Thanks for the tip!

@mqy mqy force-pushed the blas-n_threads-fix-11 branch 2 times, most recently from 1701eeb to 9306367 Compare May 29, 2023 16:33
mqy added 9 commits June 18, 2023 14:27
…sk runer and profile id, many changes, see the f codes
* removed ggml_task_backend, infavour of ggml_task_profile.runner and newly added id and name.
* extracted mul_mat blas codes into ggml_compute_forward_mul_mat_blas,
  thus align with CUDA/CL a bit more and make it easier to fix profile and run tune.
* rewrote task profile and update/add some cuda/cl codes, finnaly made CL GPU offloading work.
* misc minor fix/update to tune, the data format was changed.
@mqy mqy closed this Jun 26, 2023
@KerfuffleV2

Oh man, after all that work. Hopefully you at least learned some useful stuff that will help you in future projects. (Also, unfortunately I wasn't going to be able to provide any further CUDA testing help since my GPU got fried by lightning.)

@JohannesGaessler

What's the current state of overhauling threading in llama.cpp? If no one else is working on it I'll maybe take a crack at it once I'm done with my current objectives.

@ggerganov

It's hard to say - it seems there could be improvements made to the threading, but it is not very clear what exactly.
Here are some things that need to be explored:

Do you have something specific in mind?

mqy commented Jul 9, 2023

Do you have something specific in mind?

  1. Multithreaded BLAS needs study. I briefly tested BLIS and MKL and will study them further; MKL threading looks good. There must be something to learn from these libraries.
  2. Current threading is good enough, because: (1) it maximizes the data that can be shared, so it is a near lock-free design; (2) it greatly reduces writes to shared atomics, and a spinning atomic_load is cheap, so there is no need to add NOP or mem_pause. Of course, NUMA support is incomplete.
  3. chunk/stride deserves further study, and should build on top of the multi-threaded BLAS that I mentioned.
  4. A session-level thread pool is a candidate, but not urgent; performance is the second consideration.
  5. I suggest delaying the study of wait/notify unless we have an exact use case that really needs it. I spent a lot of time on at least 3 or 4 versions in C/C++. It's hard, really. You may have a look at the c channel version and the cpp version, which has a corresponding test. One of the most important things I have learned is to avoid using spin locks in user space.

slaren commented Jul 9, 2023

Thread pools may become very important in the future for mixed GPU/CPU computations with graphs to allow keeping the k/v cache in the CPU, while still running the feed forward parts of the layer on the GPU. Essentially, to support this we will need a ggml_graph_compute per layer, so the overhead of starting the threads will become significant.

@JohannesGaessler

What I'm thinking about primarily is this: currently when you offload all layers to the GPU using CUDA you get better performance when you set the number of threads to 1. Presumably either the overhead from creating threads or the CPU load from the constantly spinning threads is the problem. To me this suggests that the performance of only partial offloading could be improved if you had a thread pool and were to control worker threads via wait/notify. It would also eliminate the need for users to manually set the optimal number of threads since waiting threads that are created only once should not have a performance impact.

slaren commented Jul 9, 2023

I think that's most definitely caused by the threads constantly spinning. It is also an issue when using BLAS, because it forces us to set the number of threads to 1 to not interfere with the BLAS library, but that also means that operations other than matrix multiplication are only run in 1 thread. This will not be as much of an issue when offloading at the graph level, since only one compute backend will be running at the same time, but should be fixed nonetheless, it is very inefficient.

mqy commented Jul 9, 2023

Coarse-grained wait/broadcast is not that difficult to implement.

One thing to consider is the wait/broadcast time. I wrote a test, https://github.com/mqy/compute.cpp/blob/main/testing/test_wait.c, which only works on *nix.

The actual response time may not be that small, and wakeup may take quite a long time. I suggest you give it a try. Here are my local results:

192:testing mqy$  gcc -O3 -std=c11 test_wait.c -o test_wait && ./test_wait
n_threads: 6, n_loops: 1
    avg_wait:   36.000 us
    avg_wakeup: 1547.000 us
192:testing mqy$  gcc -O3 -std=c11 test_wait.c -o test_wait && ./test_wait
n_threads: 6, n_loops: 1
    avg_wait:   28.000 us
    avg_wakeup: 58.000 us
192:testing mqy$  gcc -O3 -std=c11 test_wait.c -o test_wait && ./test_wait
n_threads: 6, n_loops: 1
    avg_wait:   27.000 us
    avg_wakeup: 26.000 us
192:testing mqy$  gcc -O3 -std=c11 test_wait.c -o test_wait && ./test_wait
n_threads: 6, n_loops: 1
    avg_wait:   33.000 us
    avg_wakeup: 75.000 us
192:testing mqy$  gcc -O3 -std=c11 test_wait.c -o test_wait && ./test_wait
n_threads: 6, n_loops: 1
    avg_wait:   36.000 us
    avg_wakeup: 85.000 us
192:testing mqy$  gcc -O3 -std=c11 test_wait.c -o test_wait && ./test_wait
n_threads: 6, n_loops: 1
    avg_wait:   22.000 us
    avg_wakeup: 30.000 us
192:testing mqy$  gcc -O3 -std=c11 test_wait.c -o test_wait && ./test_wait
n_threads: 6, n_loops: 1
    avg_wait:   35.000 us
    avg_wakeup: 75.000 us

Of course, it's just a naive test and may not match the actual situation.

Labels: high priority, performance, threading