No cuBLAS #101

Closed
Priestru opened this issue Apr 21, 2023 · 34 comments
Labels
oobabooga (https://github.com/oobabooga/text-generation-webui), windows (A Windoze-specific issue)

Comments

@Priestru

Priestru commented Apr 21, 2023

AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

Is it possible to add an option to enable cuBLAS support, like in the original llama.cpp?

@jmtatsch

Yes, see the Docker PR for how to do it.

@Priestru
Author

It seems to be about OpenBLAS, not cuBLAS.

@Priestru changed the title from "No BLAS" to "No cuBLAS" on Apr 21, 2023
@jmtatsch

Well, you asked about OpenBLAS originally. cuBLAS should work exactly the same way.

@Priestru
Author

Yeah, I wrote BLAS without making it clear what I was talking about, my bad. But I'm struggling to figure out how to get the micromamba environment used in ooba's webui to build with the -DLLAMA_CUBLAS=ON parameter.

@snxraven

ggml-org/llama.cpp#1044

It needs to be compiled with an environment variable set:
make clean && LLAMA_CUBLAS=1 make

@Priestru
Author

Well, I can compile the original llama.cpp that way, but when it comes to llama-cpp-python inside the micromamba env, things become too complicated for my limited abilities. If there is anything I could read that would bring me closer to a solution, I would deeply appreciate it.

@abetlen
Owner

abetlen commented Apr 21, 2023

cuBLAS definitely works. I've tested it by installing with the LLAMA_CUBLAS=1 flag and then running python setup.py develop. It doesn't show up in that flags list because the function that prints the flags hasn't been updated yet in llama.cpp. It should work though (check nvidia-smi and you'll see some usage), and there's a good 25-30% speedup to eval times that should be pretty noticeable too.
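
A minimal sketch for sanity-checking this from Python, assuming a local GGML model file (the path and prompt are placeholders, not from this thread); run it while watching nvidia-smi in another terminal:

import time
from llama_cpp import Llama

# Hypothetical model path; point this at any local GGML model file.
llm = Llama(model_path="./models/ggml-model-q4_0.bin")

start = time.time()
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
print(f"generated in {time.time() - start:.1f}s")  # nvidia-smi should show some usage while this runs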

@Priestru
Author

I'm sorry, but I'm struggling to understand how to build llama_cpp_python-0.1.36-cp310-cp310-win_amd64.whl with an n_batch of 512 instead of 8 and cuBLAS enabled.

@Priestru
Author

Hard to believe, but with the help of GPT it seems like I managed to learn how to do it and build it.

@gjmulder
Contributor

Hard to believe, but with the help of GPT it seems like I managed to learn how to do it and build it.

Care to share? 😄

@Priestru
Author

Hard to believe, but with the help of GPT it seems like I managed to learn how to do it and build it.

Care to share? 😄

[screenshot]

Thank God it doesn't work. It would be too much of a culture shock if I had actually pulled it off on the first attempt.

@Priestru
Author

Priestru commented Apr 22, 2023

-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 12.1.105
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v12.1/bin/nvcc.exe - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- CMAKE_SYSTEM_PROCESSOR: AMD64
-- x86 detected
-- GGML CUDA sources found, configuring CUDA architecture
-- Configuring done (8.2s)
-- Generating done (0.0s)
-- Build files have been written to: E:/LLaMA/llama-cpp-python/_skbuild/win-amd64-3.10/cmake-build
[3/5] Building CUDA object vendor\llama.cpp\CMakeFiles\ggml.dir\ggml-cuda.cu.obj
ggml-cuda.cu
tmpxft_00002550_00000000-10_ggml-cuda.cudafe1.cpp
[4/5] Install the project...-- Install configuration: "Release"
-- Installing: E:/LLaMA/llama-cpp-python/_skbuild/win-amd64-3.10/cmake-install/llama_cpp/llama.dll

Though it looks just fine, I have no idea why my DLL isn't working.

creating 'dist\llama_cpp_python-0.1.36-cp310-cp310-win_amd64.whl' and adding '_skbuild\win-amd64-3.10\setuptools\bdist.win-amd64\wheel' to it
adding 'llama_cpp/__init__.py'
adding 'llama_cpp/llama.dll'
adding 'llama_cpp/llama.py'
adding 'llama_cpp/llama_cpp.py'
adding 'llama_cpp/llama_types.py'
adding 'llama_cpp/server/__main__.py'
adding 'llama_cpp_python-0.1.36.dist-info/LICENSE.md'
adding 'llama_cpp_python-0.1.36.dist-info/METADATA'
adding 'llama_cpp_python-0.1.36.dist-info/WHEEL'
adding 'llama_cpp_python-0.1.36.dist-info/top_level.txt'
adding 'llama_cpp_python-0.1.36.dist-info/RECORD'

The WHL looks perfect, setting aside the fact that it doesn't work at all.

@Priestru
Author

Priestru commented Apr 22, 2023

Traceback (most recent call last):
  File "E:\LLaMA\oobabooga-windows\installer_files\env\lib\site-packages\llama_cpp\llama_cpp.py", line 54, in _load_shared_library
    return ctypes.CDLL(str(lib_path))
  File "E:\LLaMA\oobabooga-windows\installer_files\env\lib\ctypes\__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'E:\LLaMA\oobabooga-windows\installer_files\env\lib\site-packages\llama_cpp\llama.dll' (or one of its dependencies). Try using the full path with constructor syntax.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\LLaMA\oobabooga-windows\text-generation-webui\server.py", line 101, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "E:\LLaMA\oobabooga-windows\text-generation-webui\modules\models.py", line 104, in load_model
    from modules.llamacpp_model_alternative import LlamaCppModel
  File "E:\LLaMA\oobabooga-windows\text-generation-webui\modules\llamacpp_model_alternative.py", line 9, in <module>
    from llama_cpp import Llama, LlamaCache
  File "E:\LLaMA\oobabooga-windows\installer_files\env\lib\site-packages\llama_cpp\__init__.py", line 1, in <module>
    from .llama_cpp import *
  File "E:\LLaMA\oobabooga-windows\installer_files\env\lib\site-packages\llama_cpp\llama_cpp.py", line 67, in <module>
    _lib = _load_shared_library(_lib_base_name)
  File "E:\LLaMA\oobabooga-windows\installer_files\env\lib\site-packages\llama_cpp\llama_cpp.py", line 56, in _load_shared_library
    raise RuntimeError(f"Failed to load shared library '{_lib_path}': {e}")
RuntimeError: Failed to load shared library 'E:\LLaMA\oobabooga-windows\installer_files\env\lib\site-packages\llama_cpp\llama.dll': Could not find module 'E:\LLaMA\oobabooga-windows\installer_files\env\lib\site-packages\llama_cpp\llama.dll' (or one of its dependencies). Try using the full path with constructor syntax.

When I use my wheel built with cuBLAS enabled, it produces a 720 KB DLL file that somehow fails to work. Could anyone help with some ideas?

@Priestru
Author

No luck. Regretfully, I can't do anything about it. My only hope is that cuBLAS wheels will be published later by someone of greater intellect.

@jmtatsch

Is the DLL file really at that location, E:\LLaMA\oobabooga-windows\installer_files\env\lib\site-packages\llama_cpp\llama.dll? Looks like a path issue.

@Priestru
Author

Yeah, it's definitely there.

@Priestru
Author

If I manually swap my DLL for the normal one, it immediately begins to load models, so it's more about the

(or one of its dependencies)

part of the error.

@Priestru
Author

This issue is entirely CUDA-related. If I build the DLL with cuBLAS off and then manually swap it in, it still loads all the models and everything. But with cuBLAS on, it doesn't.
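
One way to narrow this down (a sketch, not something from this thread) is to load the DLL directly with ctypes after adding the CUDA bin directory to the Windows DLL search path; the CUDA path below matches the nvcc location in the CMake log above, and the runtime DLL names in the comments are assumptions for CUDA 12.x:

import os
import ctypes

# Assumption: CUDA 12.1 bin directory, matching the nvcc path in the CMake output above.
os.add_dll_directory(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin")

# If this succeeds after adding the CUDA bin directory, the original failure was a
# missing CUDA runtime DLL (e.g. cublas64_12.dll, cudart64_12.dll) rather than llama.dll itself.
ctypes.CDLL(r"E:\LLaMA\oobabooga-windows\installer_files\env\lib\site-packages\llama_cpp\llama.dll")
print("llama.dll loaded OK")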

@abetlen
Owner

abetlen commented Apr 22, 2023

Unfortunately, cuBLAS will not work with any of the prebuilt wheels; llama.cpp has to be compiled from source for it to work, so you have to install either from PyPI or from GitHub.

@horenbergerb

@abetlen Could you elaborate on how you built it to get cuBLAS working? I've been trying this, but I'm not able to match the speeds I see when I just build llama.cpp with cuBLAS (150 ms/t vs 40 ms/t for prompt eval time).

I'm on Ubuntu. Here's essentially what I tried:

  1. git clone llama-cpp-python
  2. clone submodules
  3. build llama.cpp submodule with make clean && LLAMA_CUBLAS=1 make libllama.so
  4. build llama-cpp-python with LLAMA_CUBLAS=on python3 setup.py develop

I see that BLAS is enabled when loading models with llama-cpp-python, but the performance is still so slow compared to llama.cpp... I also tried copying the libllama.so file into _skbuild, just to see if that changed anything.

Any clue where I've gone wrong here? Thanks!

@Priestru
Author

Priestru commented Apr 23, 2023

@abetlen Could you elaborate on how you built it to get cuBLAS working? I've been trying this, but I'm not able to match the speeds I see when I just build llama.cpp with cuBLAS (150 ms/t vs 40 ms/t for prompt eval time).

I'm on Ubuntu. Here's essentially what I tried:

  1. git clone llama-cpp-python
  2. clone submodules
  3. build llama.cpp submodule with make clean && LLAMA_CUBLAS=1 make libllama.so
  4. build llama-cpp-python with LLAMA_CUBLAS=on python3 setup.py develop

I see that BLAS is enabled when loading models with llama-cpp-python, but the performance is still so slow compared to llama.cpp... I also tried copying the libllama.so file into _skbuild, just to see if that changed anything.

Any clue where I've gone wrong here? Thanks!

Do you have n_batch set to 32 or higher (512)? BLAS shows as active, but it isn't actually used with the default batch size of 8.

@Priestru
Author

Priestru commented Apr 23, 2023

I do the following:


E:\LLaMA>git clone https://github.com/abetlen/llama-cpp-python

E:\LLaMA>cd llama-cpp-python

E:\LLaMA\llama-cpp-python>cd vendor

E:\LLaMA\llama-cpp-python\vendor>git clone https://github.com/ggerganov/llama.cpp

Then I go to

E:\LLaMA\llama-cpp-python\vendor\llama.cpp\CMakeLists.txt

and change

option(LLAMA_CUBLAS "llama: use cuBLAS" OFF)

to

option(LLAMA_CUBLAS "llama: use cuBLAS" ON)

after that

E:\LLaMA\llama-cpp-python\vendor>cd ..

E:\LLaMA\llama-cpp-python>python setup.py bdist_wheel

Now in E:\LLaMA\llama-cpp-python\dist I have llama_cpp_python-0.1.36-cp310-cp310-win_amd64.whl, and its DLL won't work.

Any ideas what exactly I'm doing wrong?

@Priestru
Author

If I compile an .exe with CMake in literally the same way as described above, it works flawlessly:

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

llama_print_timings: prompt eval time = 10704.31 ms /   399 tokens (   26.83 ms per token)

Why on earth doesn't it work as a DLL?

@abetlen
Owner

abetlen commented Apr 23, 2023

4. LLAMA_CUBLAS=on python3 setup.py develop

This should be the same, i.e. LLAMA_CUBLAS=1 python3 setup.py develop

@abetlen
Owner

abetlen commented Apr 23, 2023

@Priestru you should follow the development install instructions, i.e.:

git clone git@github.com:abetlen/llama-cpp-python.git
git submodule update --init --recursive
# Will need to be re-run any time vendor/llama.cpp is updated
LLAMA_CUBLAS=1 python3 setup.py develop

@horenbergerb

horenbergerb commented Apr 23, 2023

Thanks, I gave it another shot this morning and got it to work. Part of the issue might be that I needed to raise the default n_batch to 512, but I also changed LLAMA_CUBLAS=on to match LLAMA_CUBLAS=1.
(Edit: btw, I know it's working because I'm seeing 17ms/token on prompt eval!)

@gjmulder
Contributor

I can see the PID running in nvidia-smi, but sadly the GPU utilisation stays at 0%:

$ nvidia-smi 
Sun Apr 23 16:57:23 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti      On | 00000000:09:00.0 Off |                  N/A |
| 25%   33C    P8               11W / 250W|    226MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     19286      C   python3                                     222MiB |
+---------------------------------------------------------------------------------------+

@Priestru
Author

Priestru commented Apr 23, 2023

I'm still struggling with this issue. I even asked my friend to help, but nothing came to fruition.

git clone git@github.com:abetlen/llama-cpp-python.git
git submodule update --init --recursive
# Will need to be re-run any time vendor/llama.cpp is updated
LLAMA_CUBLAS=1 python3 setup.py develop

I can't use git submodule update --init --recursive; for some reason I have SSH issues (no idea why, it works for everything else), so I cloned it over HTTP instead. It should still be fine.

E:\LLaMA\llama-cpp-python>git submodule update --init --recursive
Submodule path 'vendor/llama.cpp': checked out '0e018fe008eacebdbcfa2d61b6c988c245c961cd'

Then I do

LLAMA_CUBLAS=1 python3 setup.py develop

That command doesn't work for me on Windows, so I manually change LLAMA_CUBLAS to ON and use python setup.py develop.

Everything seems to be fine, but I have no idea how I can check whether anything works at all at this point. I tried randomly launching Python scripts everywhere, but no luck.

How can I load a model using llama-cpp-python directly, without the oobabooga GUI?
How can I know if my n_batch is 512?

@Priestru
Author

I'm trying to do it on a fresh WSL instance now.

@Priestru
Author

I can see the PID running in nvidia-smi, but sadly the GPU utilisation stays at 0%:

$ nvidia-smi 
Sun Apr 23 16:57:23 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080 Ti      On | 00000000:09:00.0 Off |                  N/A |
| 25%   33C    P8               11W / 250W|    226MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     19286      C   python3                                     222MiB |
+---------------------------------------------------------------------------------------+

This screams n_batch = 8 to me.

@horenbergerb

@Priestru n_batch is an argument you can set when initializing the LLM in your Python script. You should just be able to add n_batch=512 as an argument.
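
For example, a minimal sketch of loading a model directly (no webui), with a hypothetical model path; the n_batch=512 argument is what lets BLAS kick in during prompt eval:

from llama_cpp import Llama

# Hypothetical local model path; adjust to your own GGML file.
llm = Llama(
    model_path="./models/ggml-model-q4_0.bin",
    n_ctx=2048,    # context window
    n_batch=512,   # prompt-eval batch size; the default of 8 is too small for BLAS to be used
)

out = llm("Write one sentence about llamas.", max_tokens=32)
print(out["choices"][0]["text"])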

@Priestru
Author

Okay, I built a WSL version. I followed and copied the changes from ggml-org/llama.cpp#1128.

I also built a brand new libllama.so via LLAMA_CUBLAS=1 python3 setup.py develop.

It definitely updated my version to a newer one, because 3.4 can't load q4_3 but mine can.
But it doesn't have any CUDA in it. I know how the GPU should behave with n_batch = 8 and BLAS = 1, and that's not the case here.
It's also only 300 KB, but with the CUDA parts it should be larger. Furthermore, I can't see any nvcc activity during compilation, and more importantly it does load models and work, so I'm sure it has no CUDA.

Why, for the love of God, is this so hard?

@Priestru
Author

make clean && LLAMA_CUBLAS=1 make

Okay, this one worked. It required export PATH=/usr/local/cuda/bin${PATH:+:${PATH}} for no obvious reason, but it definitely worked:

BLAS = 1

But LLAMA_CUBLAS=1 python3 setup.py develop doesn't even begin to touch CUDA.
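
One way to check which build is actually being picked up, without going through the webui, is to print llama.cpp's system info from Python. This is a sketch assuming the low-level binding exposes llama_print_system_info (it wraps the llama.cpp function of the same name); look for BLAS = 1 in the output:

import llama_cpp

# Shows which installed copy of the package (and hence which libllama.so / llama.dll) got imported.
print(llama_cpp.__file__)

# Prints the same flags line llama.cpp shows at startup; BLAS = 1 means the
# library was built with a BLAS backend such as cuBLAS.
print(llama_cpp.llama_print_system_info().decode())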

@Priestru
Author

Priestru commented Apr 23, 2023

I did it on WSL. No idea how to do it on Windows, but it works with ooba now.

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |

For anyone who may need this info: I acted like a barbarian. I commented out the ifs in the Makefile, forcing it to always build with CUDA. I manually placed libllama.so into \\wsl$\Ubuntu\home\yuuru\.local\lib\python3.8\site-packages\llama_cpp without any pip or wheels. I also didn't make any venvs; it's a fresh Ubuntu install whose sole purpose is to make this work somehow. I believe anyone could do it better.

Description:    Ubuntu 20.04.6 LTS
Python 3.8.10
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
cmake version 3.26.3
python3 -c "import skbuild; print(skbuild.__version__)"
0.17.2
python3 -c "import torch; print(torch.__version__)"
2.0.0+cu117

[screenshot]

@gjmulder added the windows label on May 18, 2023
@gjmulder added the oobabooga label on Jun 9, 2023