
After installing with CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1, BLAS = 0 on model load #357


Closed
vmajor opened this issue Jun 10, 2023 · 11 comments
Labels
build, llama.cpp (Problem with llama.cpp shared lib)

Comments

@vmajor

vmajor commented Jun 10, 2023

Expected Behavior

CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
I may be misunderstanding the status output, but after making sure that OpenBLAS is installed on my system and testing the build with llama.cpp directly, I would expect the instruction/architecture line printed after the model loads to show
BLAS = 1
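
For reference, one way to surface this flag from Python (a minimal sketch, assuming a local GGML model at an illustrative path; verbose=True makes llama.cpp print its load log, which ends with this instruction/architecture line):

from llama_cpp import Llama

# With verbose=True the llama.cpp load log is printed; its last line reports
# the compiled-in features, e.g. "... | BLAS = 1 | SSE3 = 1 | ..."
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", verbose=True)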

Current Behavior

BLAS = 0

Environment and Context

$ lscpu
AMD Ryzen 9 3900XT 12-Core Processor

  • Operating System:

$ uname -a
DESKTOP-1TO72R9 5.15.68.1-microsoft-standard-WSL2+ #2 SMP

$ python3 --version
3.10.9
$ make --version
GNU Make 4.3
$ g++ --version
g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

OpenBLAS built from source and installed in default paths
llama.cpp built with OpenBLAS and tested

Environment info:

llama-cpp-python$ git log | head -1
commit 6b764cab80168831ec21b30b7bac6f2fa11dace2


@gjmulder added the build label Jun 10, 2023
@gfxblit

gfxblit commented Jun 11, 2023

I have the same issue after upgrading to llama-cpp-python-0.1.62.

Previous (llama-cpp-python-0.1.61):

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660 Ti
llama.cpp: loading model from /Users/billy/data/models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

llama_print_timings:        load time =   735.28 ms
llama_print_timings: prompt eval time =   735.23 ms /    48 tokens (   15.32 ms per token)
llama_print_timings:        eval time =  6391.50 ms /    83 runs   (   77.01 ms per token)
llama_print_timings:       total time =  7466.34 ms

llama_print_timings:        load time =   735.28 ms
llama_print_timings:      sample time =    28.64 ms /    90 runs   (    0.32 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  6934.34 ms /    90 runs   (   77.05 ms per token)

with llama-cpp-python-0.1.62 version:

llama.cpp: loading model from /Users/billy/data/models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.72 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

llama_print_timings:        load time =  2976.44 ms
llama_print_timings:      sample time =    24.18 ms /   102 runs   (    0.24 ms per token)
llama_print_timings:        eval time = 17107.52 ms /   101 runs   (  169.38 ms per token)

Maybe it's something upstream in llama.cpp, if the Python bindings pick up the latest llama.cpp?

@gfxblit

gfxblit commented Jun 11, 2023

OK, maybe my issue was different: I hadn't set the environment variables correctly for PowerShell (Windows). This works:

$env:FORCE_CMAKE=1
$env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"

and this doesn't:

SET CMAKE_ARGS="-DLLAMA_CUBLAS=on"
SET FORCE_CMAKE=1
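
A quick way to confirm the variables actually made it into the environment before running pip (a one-liner, nothing specific to llama-cpp-python; it should print both values rather than None):

python -c "import os; print(os.environ.get('CMAKE_ARGS'), os.environ.get('FORCE_CMAKE'))"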

@vmajor
Author

vmajor commented Jun 11, 2023

It is a different issue. The CUBLAS flag works; the OPENBLAS flag does not seem to.

@gjmulder added the llama.cpp (Problem with llama.cpp shared lib) label Jun 11, 2023
@snxraven

I can confirm this issue as well

@gjmulder
Contributor

I tested and confirmed that the openblas_simple/Dockerfile does produce a BLAS-enabled container:

$ cd docker/openblas_simple

$ docker build --no-cache --force-rm -t openblas_simple .

[..]

Step 6/7 : RUN LLAMA_OPENBLAS=1 pip install llama_cpp_python --verbose
 ---> Running in 1420dedc0cc8
Using pip 23.1.2 from /usr/local/lib/python3.11/site-packages/pip (python 3.11)
Collecting llama_cpp_python
  Downloading llama_cpp_python-0.1.62.tar.gz (1.4 MB)

[..]

$ docker run -e USE_MLOCK=0 -e MODEL=/var/model/7B/ggml-model-f16.bin -v /data/llama/:/var/model -t openblas_simple
llama.cpp: loading model from /var/model/7B/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 14645.09 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

@gjmulder
Contributor

However, I just confirmed that:

LLAMA_OPENBLAS=1 pip install llama_cpp_python

does work, but:

CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

does not.
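
For anyone debugging this, here is a rough way to check from Python whether the installed wheel was actually linked against OpenBLAS. It is only a sketch for Linux and assumes the shared library ships inside the llama_cpp package directory as libllama*.so, which may vary by version:

import pathlib
import subprocess

import llama_cpp

# Look for the bundled shared library next to the Python package
pkg_dir = pathlib.Path(llama_cpp.__file__).parent
for lib in pkg_dir.glob("libllama*.so*"):
    print(lib)
    # ldd lists the shared objects the binary links against;
    # a BLAS-enabled build should mention libopenblas (or another BLAS)
    print(subprocess.run(["ldd", str(lib)], capture_output=True, text=True).stdout)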

@iactix

iactix commented Jun 14, 2023

May I add: I guess it's OK to have Linux-only instructions on a cross-platform project, but at least say so.

In case anyone is interested, on Windows I solved this by doing a recursive checkout of the repo and then running a .cmd file that contains:

set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
python setup.py clean
python setup.py install

Doing pip uninstall llama-cpp-python multiple times before running that also helped in the past.

For the record, my system has all the dev tooling installed that could be needed; I am not saying that this is all one needs to do.

@Skidaadle

(Quoting @iactix's Windows workaround above: set CMAKE_ARGS=-DLLAMA_CUBLAS=on and FORCE_CMAKE=1, then run python setup.py clean and python setup.py install.)

I tried this (I'm on Windows as well) and had some difficulty figuring out what they were even referring to with the environment variables. I went digging and ended up finding a file called CMakeLists.txt from ggerganov's repo, and on line 70 changed

option(LLAMA_CUBLAS "llama: use cuBLAS" ON)
(from OFF to ON)

I then completely reinstalled llama-cpp-python and have been able to get it to use the GPU.
That file also contains all the other BLAS backends, so maybe y'all could also benefit from that find.
I'm new to this, so sorry for any bad formatting, but it worked for me and I thought y'all might get some use out of it.
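
If it helps anyone verify the result, a minimal check from Python (a sketch, assuming a local GGML model at an illustrative path; n_gpu_layers requests offloading, and with a cuBLAS build the verbose load log should include lines like "offloading 32 layers to GPU"):

from llama_cpp import Llama

# With a GPU-enabled build the load log reports the offloaded layers;
# with a CPU-only build those lines are absent.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_gpu_layers=32, verbose=True)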

@abetlen
Owner

abetlen commented Jun 16, 2023

This could be related to ggml-org/llama.cpp#1830, in which case it should be fixed shortly.

@okigan

okigan commented Jun 26, 2023

I wrote the issue linked above; I think the flags in llama-cpp-python are not correct. I'm trying to find time to make a PR for llama-cpp-python.

@gjmulder
Contributor

Closing. Please reopen if the problem is reproducible with the latest llama-cpp-python, which includes an updated llama.cpp.

@gjmulder closed this as not planned on Jul 10, 2023