CUDA: refactor ggml_cuda_op + lower GPU latency via quantization on main GPU and tiling #3110
Conversation
Unfortunately this seems to reduce performance under WSL2, even with a single GPU.

ggml_init_cublas: found 1 CUDA devices: […]
ggml_init_cublas: found 2 CUDA devices: […]
(branch force-pushed from c42b303 to 54f041b)
What could be happening is that waiting on CUDA events is much more expensive on Windows than it is on Linux. I've pushed a commit that removes some unnecessary event waiting. Hopefully that will fix the single GPU performance (because I don't know what else could be the problem). Also, please check how varying the define […].
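For context, a minimal sketch of the two kinds of event waits involved; the function and variable names are illustrative, not the actual llama.cpp code:

```cpp
#include <cuda_runtime.h>

// Make `consumer` wait for work previously queued on `producer`.
void sync_streams_via_event(cudaStream_t producer, cudaStream_t consumer) {
    cudaEvent_t ev;
    // timing disabled: ordering-only events are cheaper
    cudaEventCreateWithFlags(&ev, cudaEventDisableTiming);

    // mark the point on `producer` that must complete first
    cudaEventRecord(ev, producer);

    // device-side wait: `consumer` stalls on the GPU, the host keeps running.
    // The host-side alternative, cudaEventSynchronize(ev), blocks the CPU
    // thread -- the kind of wait suspected here of being costly on Windows.
    cudaStreamWaitEvent(consumer, ev, 0);

    cudaEventDestroy(ev); // deferred until the event's work completes
}
```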
The last commit didn't affect single or multi GPU performance, but setting […]:

ggml_init_cublas: found 1 CUDA devices: […]
ggml_init_cublas: found 2 CUDA devices: […]
build: 54da1a2 (1209)

With […]:
ggml_init_cublas: found 2 CUDA devices: […]
Wait, now that I look at your numbers, isn't the multi GPU performance terrible either way? ~5 t/s for 7b q4_0 token generation is really bad for an RTX 3090 Ti + an RTX 3080. So maybe for this PR it's sufficient to fix the single GPU performance?
Yes, multi GPU performance with WSL2 has always been very bad for me. The biggest issue is the degradation in single GPU performance.
The branch for this PR is rebased, with WIP commits removed. Can you check the single GPU performance on the branch […]?
Looks like the performance degradation started with 55f7419:

build: dbddcd1 (1221)
build: c26cc71 (1245)
build: 55f7419 (1258)
I don't understand what is causing the performance regression. Can you try to find the last commit with good performance?
I did a bisect between 55f7419 and c26cc71 and got this:

```
da6cb22db7b6478797c597dc8120ec7a2cf4e1ca is the first bad commit
commit da6cb22db7b6478797c597dc8120ec7a2cf4e1ca
Author: JohannesGaessler <johannesg@5d6.de>
Date:   Sat Sep 9 21:02:35 2023 +0200

    reorder loops
```
I've pushed a version that has the loops in the old order. Can you check performance?
This improved prompt processing (pp) performance, but it is still a bit slower than master. Token generation (tg) is still slower. Let me know if you need me to do another bisect.

build: bd79c94 (1211)
The only thing in the loop that could maybe make a difference for performance is […].
No change.

build: 866b502 (1212)
I also tested removing all the calls to […].
I noticed that […].

build: 866b502 (1212)
Okay, so my intuition was correct and I just did the implementation wrong lol. It's pretty ridiculous though that this is apparently not being cached in the CUDA libraries.
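For reference, the caching approach under discussion, sketched with a hypothetical name; this is not the exact PR code, and the comments note the pitfall the next messages run into:

```cpp
#include <cuda_runtime.h>

// Skip cudaSetDevice() when the requested device is already active, since
// the call itself is surprisingly expensive. Pitfall: the static cache is
// only correct if every device switch in the process goes through this
// wrapper -- otherwise it goes stale, as the crash below demonstrates.
cudaError_t ggml_cuda_set_device_cached(int device) {
    static int current_device = -1;
    if (device == current_device) {
        return cudaSuccess; // already active, skip the expensive call
    }
    current_device = device;
    return cudaSetDevice(device);
}
```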
With multi GPU it crashes in […].
The only change is this:

```diff
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 1551517..deb5571 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -416,6 +416,8 @@ cudaError_t ggml_cuda_set_device(int device) {
         return cudaSuccess;
     }
 
+    current_device = device;
+
     return cudaSetDevice(device);
 }
```
I am not able to reproduce the issue. Can you share the exact commands that you used?
The command is […].
This seems to fix the issue and doesn't affect performance:

```diff
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 1551517..611b080 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -410,7 +410,8 @@ struct ggml_tensor_extra_gpu {
 };
 
 cudaError_t ggml_cuda_set_device(int device) {
-    static int current_device = -1;
+    int current_device;
+    CUDA_CHECK(cudaGetDevice(&current_device));
 
     if (device == current_device) {
         return cudaSuccess;
```

Multi GPU performance is still very bad, but I guess that's expected since the […].

ggml_init_cublas: found 2 CUDA devices: […]
build: 866b502 (1212)
(branch force-pushed from 866b502 to c923de7)
So presumably the cause of the performance regression that you experienced should be fixed now. In the initial version […].
Review comment on ggml-cuda.cu (outdated), at the added line `current_device = device;`:

This line can be removed.
Yeah, it is very strange that […].
On native Windows using my RTX 3090 I am also experiencing a performance uplift.

master:
build: 1b0d092 (1213)

PR:
build: c923de7 (1212)
(branch force-pushed from c923de7 to a599006)
Use Nsight Systems […].
Unfortunately, Nsight Systems doesn't work under WSL2. I'll test on native Windows though.
The performance improvement under native Windows is similar (to WSL2) for me, which is good since it improves single GPU tg performance massively. Multi GPU tg is faster than master, but still very slow.

WINDOWS, MASTER:
ggml_init_cublas: found 1 CUDA devices: […]
build: 1b0d092 (1213)
ggml_init_cublas: found 2 CUDA devices: […]
build: 1b0d092 (1213)

PR:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6 […]
build: a599006 (1212)
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6 […]
build: a599006 (1212)

I'll check the multi GPU trace with Nsight Systems, but I think this can be merged already.
(branch force-pushed from a599006 to 9268745)
Btw, I wonder if the difference between your results and mine (there is no way that the 3090 Ti is that much faster than the 3090) is due to hardware-accelerated GPU scheduling; I have it disabled. I am also on Windows 11.
I had hardware-accelerated GPU scheduling enabled. This is the performance after disabling it:

build: c923de7 (1212)
build: c923de7 (1212)
Strange. I'm getting an out of memory error trying to offload 7B fully with this PR. On a previous commit, I was able to offload q3k_l on my RTX 2060 laptop with 6 GB VRAM, but this PR throws an out of memory error. I used `-n 180 -c 2048 -t 6 --gpu-layers 99` and a prompt with a context of around 1800 tokens for both.
Looks like this broke LoRA support.

```
#0  0x0000000000000000 in ?? ()
#1  0x000055555568eb4f in ggml_cuda_op_mul_mat_cublas (src0=0x7fff49070030, src1=0x7fff49138180, dst=0x7fff492002d0, src0_dd_i=0x1326a20400 "",
    src1_ddf_i=0x1326af2400, src1_ddq_i=0x0, dst_dd_i=0x1327400000, row_low=0, row_high=3200, src1_ncols=3200, src1_padded_row_size=512,
    stream=@0x7ffffffcef18: 0x5555566dad10) at /home/slaren/code/llama.cpp/ggml-cuda.cu:5637
```

The type of […].
This PR broke #2506.
This PR adds the following 2 optimizations to the CUDA multi GPU code:

1. For `mul_mat_q` and `mul_mat_vec_q`, the data from the hidden state is quantized by the main GPU and then distributed, instead of being distributed as f32 and then quantized on each device. This reduces latency and used PCIe bandwidth by a factor of up to 3.56 (sketched below).
2. Tiling […]

These are the results: […]
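A minimal sketch of the data flow behind optimization 1, under stated assumptions: the already-run `quantize_q8_1` kernel, the buffer handles, and the sizes are illustrative, not the actual ggml-cuda.cu interfaces:

```cpp
#include <cuda_runtime.h>

// Broadcast the hidden state to all GPUs *after* quantizing it once on the
// main GPU, so only the small q8_1 buffer (roughly 1/3.56 of the f32 size)
// crosses PCIe, instead of the full f32 tensor.
void broadcast_hidden_state(const void * src1_q8_main, // q8_1 data on main GPU
                            void ** src1_q8_per_dev,   // one buffer per device
                            size_t q8_bytes,
                            int main_device, int n_devices,
                            cudaStream_t main_stream) {
    cudaSetDevice(main_device);
    // assumption: a quantize_q8_1 kernel has already run on main_stream,
    // filling src1_q8_main from the f32 hidden state

    for (int d = 0; d < n_devices; ++d) {
        if (d == main_device) {
            continue; // the main GPU already holds the quantized data
        }
        // copy the quantized buffer instead of the f32 original; fewer bytes
        // over the bus is where the latency/bandwidth win comes from
        cudaMemcpyPeerAsync(src1_q8_per_dev[d], d,
                            src1_q8_main, main_device,
                            q8_bytes, main_stream);
    }
}
```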
As part of the above changes I have refactored `ggml_cuda_op`. I have split the function into a function `ggml_cuda_op_flatten` and a function `ggml_cuda_op_mul_mat`. All tensors other than matrix multiplication use `ggml_cuda_op_flatten`, and this function is equivalent to the old `ggml_cuda_op` with `flatten_rows=true` (but greatly simplified). `ggml_cuda_op_mul_mat` is more complicated due to the various performance considerations, but it should hopefully not be any more difficult to understand than the old `ggml_cuda_op`.
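A rough sketch of the resulting split; the signatures are simplified assumptions and the bodies only indicate the division of responsibilities:

```cpp
struct ggml_tensor; // opaque here; defined in ggml.h

// per-op launcher, simplified stand-in for the real callback signature
typedef void (*ggml_cuda_op_t)(const ggml_tensor * src0,
                               const ggml_tensor * src1, ggml_tensor * dst);

// Simple path: every op except matrix multiplication. The tensor is treated
// as flat rows and `op` is launched once on a single device/stream --
// equivalent to the old ggml_cuda_op with flatten_rows=true.
static void ggml_cuda_op_flatten(const ggml_tensor * src0,
                                 const ggml_tensor * src1,
                                 ggml_tensor * dst, ggml_cuda_op_t op) {
    op(src0, src1, dst);
}

// Complex path: matrix multiplication only. Splits src0's rows across the
// available GPUs, quantizes/broadcasts src1 from the main GPU, launches the
// per-device matmuls tile by tile, and gathers the dst pieces back.
static void ggml_cuda_op_mul_mat(const ggml_tensor * src0,
                                 const ggml_tensor * src1,
                                 ggml_tensor * dst, ggml_cuda_op_t op) {
    // ... multi-GPU splitting / tiling logic elided ...
    op(src0, src1, dst);
}
```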