
After installing with CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1, BLAS = 0 on model load #357


Closed
vmajor opened this issue Jun 10, 2023 · 11 comments
Labels
build, llama.cpp (Problem with llama.cpp shared lib)

Comments

@vmajor

vmajor commented Jun 10, 2023

Expected Behavior

CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir
I may be misunderstanding the status output, but after making sure that OpenBLAS is installed on my system and testing the build with llama.cpp directly, I would expect the instruction/architecture line printed after the model loads to show
BLAS = 1
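
For reference, one way to surface this flag from Python (a minimal sketch, assuming a local GGML model at an illustrative path; verbose=True makes llama.cpp print its load log, which ends with this instruction/architecture line):

from llama_cpp import Llama

# With verbose=True the llama.cpp load log is printed; its last line reports
# the compiled-in features, e.g. "... | BLAS = 1 | SSE3 = 1 | ..."
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", verbose=True)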

Current Behavior

BLAS = 0

Environment and Context

$ lscpu
AMD Ryzen 9 3900XT 12-Core Processor

  • Operating System:

$ uname -a
DESKTOP-1TO72R9 5.15.68.1-microsoft-standard-WSL2+ #2 SMP

$ python3 --version
3.10.9
$ make --version
GNU Make 4.3
$ g++ --version
g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

OpenBLAS built from source and installed in default paths
llama.cpp built with OpenBLAS and tested

Environment info:

llama-cpp-python$ git log | head -1
commit 6b764cab80168831ec21b30b7bac6f2fa11dace2


@gjmulder added the build label Jun 10, 2023
@gfxblit

gfxblit commented Jun 11, 2023

I have the same issue after upgrading to llama-cpp-python-0.1.62.

Previous (llama-cpp-python-0.1.61):

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660 Ti
llama.cpp: loading model from /Users/billy/data/models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  = 1932.72 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

llama_print_timings:        load time =   735.28 ms
llama_print_timings: prompt eval time =   735.23 ms /    48 tokens (   15.32 ms per token)
llama_print_timings:        eval time =  6391.50 ms /    83 runs   (   77.01 ms per token)
llama_print_timings:       total time =  7466.34 ms

llama_print_timings:        load time =   735.28 ms
llama_print_timings:      sample time =    28.64 ms /    90 runs   (    0.32 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  6934.34 ms /    90 runs   (   77.05 ms per token)

with llama-cpp-python-0.1.62 version:

llama.cpp: loading model from /Users/billy/data/models/WizardLM-7B-uncensored.ggmlv3.q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 5407.72 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  256.00 MB

llama_print_timings:        load time =  2976.44 ms
llama_print_timings:      sample time =    24.18 ms /   102 runs   (    0.24 ms per token)
llama_print_timings:        eval time = 17107.52 ms /   101 runs   (  169.38 ms per token)

Maybe it's something upstream in llama.cpp, if the Python bindings pick up the latest llama.cpp?

@gfxblit

gfxblit commented Jun 11, 2023

OK, maybe my issue was different: I hadn't set the environment variables correctly for PowerShell (Windows). This works:

$env:FORCE_CMAKE=1
$env:CMAKE_ARGS="-DLLAMA_CUBLAS=on"

and this doesn't:

SET CMAKE_ARGS="-DLLAMA_CUBLAS=on"
SET FORCE_CMAKE=1
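
A quick way to confirm the variables actually made it into the environment before running pip (a one-liner, nothing specific to llama-cpp-python; it should print both values rather than None):

python -c "import os; print(os.environ.get('CMAKE_ARGS'), os.environ.get('FORCE_CMAKE'))"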

@vmajor
Author

vmajor commented Jun 11, 2023

It is a different issue. The CUBLAS flag works; the OPENBLAS flag does not seem to.

@gjmulder added the llama.cpp (Problem with llama.cpp shared lib) label Jun 11, 2023
@snxraven

I can confirm this issue as well

@gjmulder
Contributor

I tested and confirmed that the openblas_simple/Dockerfile does produce a BLAS-enabled container:

$ cd docker/openblas_simple

$ docker build --no-cache --force-rm -t openblas_simple .

[..]

Step 6/7 : RUN LLAMA_OPENBLAS=1 pip install llama_cpp_python --verbose
 ---> Running in 1420dedc0cc8
Using pip 23.1.2 from /usr/local/lib/python3.11/site-packages/pip (python 3.11)
Collecting llama_cpp_python
  Downloading llama_cpp_python-0.1.62.tar.gz (1.4 MB)

[..]

$ docker run -e USE_MLOCK=0 -e MODEL=/var/model/7B/ggml-model-f16.bin -v /data/llama/:/var/model -t openblas_simple
llama.cpp: loading model from /var/model/7B/ggml-model-f16.bin
llama_model_load_internal: format     = ggjt v1 (pre #1405)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 1 (mostly F16)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 14645.09 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
INFO:     Started server process [7]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

@gjmulder
Contributor

However, I just confirmed that:

LLAMA_OPENBLAS=1 pip install llama_cpp_python

does work, but:

CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python

does not.
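
For anyone debugging this, here is a rough way to check from Python whether the installed wheel was actually linked against OpenBLAS. It is only a sketch for Linux and assumes the shared library ships inside the llama_cpp package directory as libllama*.so, which may vary by version:

import pathlib
import subprocess

import llama_cpp

# Look for the bundled shared library next to the Python package
pkg_dir = pathlib.Path(llama_cpp.__file__).parent
for lib in pkg_dir.glob("libllama*.so*"):
    print(lib)
    # ldd lists the shared objects the binary links against;
    # a BLAS-enabled build should mention libopenblas (or another BLAS)
    print(subprocess.run(["ldd", str(lib)], capture_output=True, text=True).stdout)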

@iactix

iactix commented Jun 14, 2023

May I add: I guess it's OK to have Linux-only instructions on a cross-platform project, but at least say so.

In case anyone is interested, on Windows I solved this by doing a recursive checkout of the repo and then running a .cmd file that contains:

set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
python setup.py clean
python setup.py install

Doing pip uninstall llama-cpp-python multiple times before running that also helped in the past.

For the record, my system has all the dev tooling installed that could be needed; I am not saying that this is all one needs to do.

@Skidaadle

(Quoting @iactix's Windows workaround above: set CMAKE_ARGS=-DLLAMA_CUBLAS=on and FORCE_CMAKE=1, then run python setup.py clean and python setup.py install.)

I tried this (I'm on Windows as well) and had some difficulty figuring out what they were even referring to with the environment variables. I went digging and ended up finding a file called CMakeLists.txt from ggerganov's repo, and on line 70 changed

option(LLAMA_CUBLAS "llama: use cuBLAS" ON)
(from OFF to ON)

I then completely reinstalled llama-cpp-python and have been able to get it to use the GPU.
That file also contains all the other BLAS backends, so maybe y'all could also benefit from that find.
I'm new to this, so sorry for any bad formatting, but it worked for me and I thought y'all might get some use out of it.
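
If it helps anyone verify the result, a minimal check from Python (a sketch, assuming a local GGML model at an illustrative path; n_gpu_layers requests offloading, and with a cuBLAS build the verbose load log should include lines like "offloading 32 layers to GPU"):

from llama_cpp import Llama

# With a GPU-enabled build the load log reports the offloaded layers;
# with a CPU-only build those lines are absent.
llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin", n_gpu_layers=32, verbose=True)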

@abetlen
Owner

abetlen commented Jun 16, 2023

This could be related to ggml-org/llama.cpp#1830, in which case it should be fixed shortly.

@okigan

okigan commented Jun 26, 2023

I wrote the issue linked above; I think the flags in llama-cpp-python are not correct. I'm trying to find time to make a PR for llama-cpp-python.

@gjmulder
Contributor

Closing. Please reopen if the problem is reproducible with the latest llama-cpp-python, which includes an updated llama.cpp.

@gjmulder closed this as not planned on Jul 10, 2023