Enabling CUDA GPU acceleration #6
Did you run it / benchmark it?
Tests on a 6-core Ryzen 5 and a GTX 1660:

- cuBLAS is fastest when using the F16 model, about 4.5 s (22%) faster than without BLAS.
- OpenBLAS is only faster (1.6 s) than without BLAS when using the F32 model and setting the OpenBLAS thread count to the number of physical cores.

Time for one sampling step:

Time for decode_first_stage stays at about 56 s in all tests. OpenBLAS environment variable to set the number of threads:
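For reference, the environment variable OpenBLAS reads for its thread count is `OPENBLAS_NUM_THREADS`. A minimal sketch of the setup described above (the `./sd` binary name and its flags are assumptions, not taken from this thread):

```shell
# OpenBLAS reads its thread count from the environment; pin it to the
# number of physical cores (6 on the Ryzen 5 mentioned above).
export OPENBLAS_NUM_THREADS=6
# Then run inference as usual, e.g. (invocation is an assumed example):
# ./sd -m sd-v1-4-ggml-model-f32.bin -p "a photo of a cat" --threads 2
```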
It may be possible to find an optimal setting by testing different combinations of OpenBLAS threads and --threads. I guess this also applies to llama.cpp when using OpenBLAS.
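The suggested sweep could be sketched as a simple shell loop; the commented-out `./sd` invocation is a placeholder assumption for the real benchmark command:

```shell
# Hypothetical sweep over OpenBLAS threads x ggml --threads.
# Replace the echo with the actual timed invocation, e.g.
#   OPENBLAS_NUM_THREADS=$ob ./sd -p "..." --threads $t
for ob in 1 2 3 6; do
  for t in 1 2 3 6; do
    echo "testing OPENBLAS_NUM_THREADS=$ob --threads $t"
  done
done
```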
GPU support is already on my TODO list, and I'm working on adding it. However, I need to make ggml_conv_2d work on the GPU first.
This is expected, as ggml's conv 2d will not be optimized by BLAS and will not run on the GPU. I am working on this issue.
Because ggml's threads are always busy-waiting, even when no computation task is being performed. As a result, they compete with the BLAS threads, sometimes resulting in negative optimization. This is the point that needs to be optimized.
Btw, I'm not sure if the CPU version of conv 2d is optimal - most likely it is not.
Great. But using cuBLAS is currently better than anything else if you have a CUDA GPU.
Don't know if OpenBLAS will have any real benefits over building without BLAS. And it looks like support for OpenBLAS will soon be removed. See ggerganov/llama.cpp#2372
@klosax I suspect something is off with your benchmarks, because it seems like the speed gain should be much higher with GPU vs CPU. Am I wrong? Using -DGGML_CLBLAST and applying the patch provided by the ggml creator in #48, the GPU does get activated, BUT CPU temps don't drop by much. The GPU stays hot after completion, so maybe those flags are not really doing anything. Did yesterday's CUDA update affect your speed at all?
So, I have a question: can the project only run on the CPU? I found that VRAM usage is low when using cuBLAS after enabling it:
@LeonNerd I think the main reason this project exists is that it can run with CPU only + low RAM. But when we look at README.md, GPU inference is still under development. And for shorter generation time, well, maybe just generate a 512 x 512 image only (don't use any BLAS), or get a better CPU(?)
Can we get some inspiration from clip.cpp? |
@klosax would be cool if you could rerun the benchmarks 😉 |
It is possible to use cuBLAS by enabling it when compiling:

```
-DGGML_CUBLAS=ON
```

Maybe add this to the readme?
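For completeness, a typical out-of-source CMake build with that flag enabled might look like the following; the directory layout is an assumption, and the CUDA toolkit must be installed:

```shell
# Assumed build steps for enabling cuBLAS via CMake.
mkdir -p build && cd build
cmake .. -DGGML_CUBLAS=ON
cmake --build . --config Release
```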