
Enabling CUDA GPU acceleration #6

Open
klosax opened this issue Aug 15, 2023 · 14 comments

klosax commented Aug 15, 2023

It is possible to use cuBLAS by enabling it when compiling:
-DGGML_CUBLAS=ON

Maybe add this to the readme?
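For anyone trying this, a minimal build sketch (assuming the usual CMake out-of-source flow and an installed CUDA toolkit; adjust paths to your setup):

```sh
# Configure with cuBLAS enabled, then build in Release mode.
mkdir build && cd build
cmake .. -DGGML_CUBLAS=ON
cmake --build . --config Release
```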

@Green-Sky (Contributor)

Did you run it / benchmark it?


klosax commented Aug 15, 2023

Tests on a 6-core Ryzen 5 and a GTX 1660. The --threads parameter is set to the number of physical cores.

cuBLAS is fastest when using the F16 model, about 4.5 s (22%) faster than without BLAS.

OpenBLAS is only faster (by 1.6 s) than no BLAS when using the F32 model and setting the OpenBLAS thread count to the number of physical cores.

Time for one sampling step (seconds):

| test | Q4_0 | Q8_0 | F16 | F32 | comment |
| --- | --- | --- | --- | --- | --- |
| cuBLAS | 16.12 | 16.60 | 16.05 | 16.28 | |
| w/o BLAS | 19.46 | 19.20 | 20.54 | 23.86 | |
| OpenBLAS | 20.02 | 19.86 | 20.77 | 22.28 | env var 6 threads |
| OpenBLAS | 30.86 | 29.30 | 32.26 | 29.68 | default 12 threads |

Time for decode_first_stage stays at about 56s in all tests.

OpenBLAS environment variable to set number of threads:
export OPENBLAS_NUM_THREADS=6


klosax commented Aug 15, 2023

It may be possible to find an optimal setting by testing different combinations of OpenBLAS threads and --threads. I guess this also applies to llama.cpp when using OpenBLAS.
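A rough sweep might look like the sketch below; the binary name ./sd and the -p prompt flag are placeholders, so adjust them to the actual CLI:

```sh
# Hypothetical grid search over OpenBLAS threads and ggml --threads.
# Each run prints its configuration before sampling.
for blas_threads in 1 2 4 6; do
  for ggml_threads in 2 4 6; do
    echo "OPENBLAS_NUM_THREADS=$blas_threads, --threads $ggml_threads"
    OPENBLAS_NUM_THREADS=$blas_threads ./sd --threads $ggml_threads -p "a photo of a cat"
  done
done
```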


leejet commented Aug 16, 2023

> It is possible to use cuBLAS by enabling it when compiling: -DGGML_CUBLAS=ON
>
> Maybe add this to the readme?

GPU support is already on my TODO list, and I'm working on adding it. However, I need to make ggml_conv_2d work on the GPU first.


leejet commented Aug 16, 2023

> Time for decode_first_stage stays at about 56s in all tests.

This is expected, as ggml's conv 2d is not optimized by BLAS and does not run on the GPU. I am working on this issue.


leejet commented Aug 16, 2023

> It may be possible to find an optimal setting by testing different combinations of OpenBLAS threads and --threads. I guess this also applies to llama.cpp when using OpenBLAS.

Because ggml's threads are always busy-waiting, even when no computation is being performed, they compete with the BLAS threads, sometimes resulting in a net slowdown. This is the point that needs to be optimized.
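As an illustration of one possible workaround (not measured in this thread; the binary name and prompt flag are placeholders): capping the OpenBLAS pool at a single thread keeps the two thread pools from oversubscribing the cores:

```sh
# Give OpenBLAS one thread so it does not fight ggml's busy-waiting
# pool; ggml keeps the six physical cores via --threads.
OPENBLAS_NUM_THREADS=1 ./sd --threads 6 -p "a photo of a cat"
```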

@ggerganov (Contributor)

> This is expected, as ggml's conv 2d is not optimized by BLAS and does not run on the GPU. I am working on this issue.

Btw, I'm not sure the CPU version of conv 2d is optimal; most likely it is not. There might be additional improvements possible if it is implemented properly.


klosax commented Aug 16, 2023

> GPU support is already on my TODO list, and I'm working on adding it. However, I need to make ggml_conv_2d work on the GPU first.

Great. But using cuBLAS is currently better than anything else if you have a CUDA GPU.

> Because ggml's threads are always busy-waiting, even when no computation is being performed, they compete with the BLAS threads, sometimes resulting in a net slowdown. This is the point that needs to be optimized.

I don't know if OpenBLAS will have any real benefit over building without BLAS. And it looks like OpenBLAS support will soon be removed; see ggerganov/llama.cpp#2372.

klosax mentioned this issue Aug 21, 2023

Happenedtostumblein commented Sep 5, 2023

@klosax I suspect something is off with your benchmarks because it seems like the speed gain should be much higher with GPU vs CPU. Am I wrong?

Using -DGGML_CLBLAST and applying the patch provided by the ggml creator in #48, the GPU does get activated, but CPU temps don't drop by much. The GPU stays hot after completion, so maybe those flags are not really doing anything.
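(For reference, a sketch of such a build, assuming CLBlast and an OpenCL runtime are installed; the flag spelling is taken from the comment above:)

```sh
# Configure with CLBlast enabled, then build in Release mode.
mkdir build && cd build
cmake .. -DGGML_CLBLAST=ON
cmake --build . --config Release
```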

Did yesterday's CUDA update affect your speed at all?

@LeonNerd

So, I have a question: can the project only run on the CPU? I found that VRAM usage is low when enabling cuBLAS with -DGGML_CUBLAS=ON, and it takes a long time when using the FP16 model. What should I do to get better inference time?

@juniofaathir

@LeonNerd I think the main reason this project exists is that it can run with CPU only and low RAM. But looking at README.md, GPU inference is still in development.

And for shorter generation time, maybe just generate 512 x 512 images only (don't use any BLAS), or get a better CPU(?)

@LeonNerd

Can we get some inspiration from clip.cpp?


FSSRepo commented Nov 29, 2023

@LeonNerd You can now activate the CUDA backend with -DSD_CUBLAS=ON. @klosax, you can close this issue.
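A minimal sketch with the renamed flag (standard CMake flow; an installed CUDA toolkit is assumed):

```sh
# Configure with the CUDA backend enabled, then build in Release mode.
mkdir build && cd build
cmake .. -DSD_CUBLAS=ON
cmake --build . --config Release
```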

@Green-Sky (Contributor)

@klosax would be cool if you could rerun the benchmarks 😉
