Installing Llama.cpp without `nvcc` from `nvidia-cuda-toolkit` #3459

jamesbraza · 2023-10-03T21:26:50Z

On an AWS EC2 g4dn.4xlarge (Ubuntu 22.04.2, x86_64, cuda apt package installed for cuBLAS support, NVIDIA Tesla T4), I am trying to install Llama.cpp with cuBLAS acceleration.

I am getting an error during pip install (see dropdown below). abetlen/llama-cpp-python#205 hit this same issue, and the solution was apt install nvidia-cuda-toolkit for nvcc.

However, check this: https://forums.developer.nvidia.com/t/impossible-resolution-ubuntu-22-04/, currently it's impossible for apt to resolve cuda (for cuBLAS) and nvidia-cuda-toolkit (for nvcc).

Is there something I can do to install Llama.cpp with cuBLAS support, but not having nvcc?

pip install fail

CMAKE_ARGS="-DLLAMA_CUBLAS=on" python -m pip install -r llama-cpp-python[server]==0.2.11 \
    --progress-bar off --no-cache-dir
...
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [36 lines of output]
      *** scikit-build-core 0.5.1 using CMake 3.27.6 (wheel)
      *** Configuring CMake...
      loading initial cache file /tmp/tmpsezaa2fg/build/CMakeInit.txt
      -- The C compiler identification is GNU 11.4.0
      -- The CXX compiler identification is GNU 11.4.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Found Git: /usr/bin/git (found version "2.34.1")
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140")
      -- cuBLAS found
      -- The CUDA compiler identification is unknown
      CMake Error at /tmp/pip-build-env-xoh3dnhj/normal/lib/python3.11/site-packages/cmake/data/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
        Failed to detect a default CUDA architecture.



        Compiler output:

      Call Stack (most recent call first):
        vendor/llama.cpp/CMakeLists.txt:290 (enable_language)


      -- Configuring incomplete, errors occurred!

      *** CMake configuration failed
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

Environment and Context

ubuntu@ip-123:~$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            7
    BogoMIPS:            4999.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid
                          aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_si
                         ngle pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku o
                         spke avx512_vnni
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    8 MiB (8 instances)
  L3:                    35.8 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

ubuntu@ip-123:~$ uname -a
Linux ip-123 6.2.0-1012-aws #12~22.04.1-Ubuntu SMP Thu Sep  7 14:01:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@ip-123:~$ python3 --version
Python 3.10.12

ubuntu@ip-123:~$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

ubuntu@ip-123:~$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ubuntu@ip-123:~$ nvidia-smi
Tue Oct  3 21:21:11 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

The text was updated successfully, but these errors were encountered:

jamesbraza · 2023-10-03T23:00:38Z

Cc @abetlen and @gjmulder since they were responding a bunch in the other issue

jamesbraza · 2023-10-03T23:19:37Z

With a locally git cloned Llama.cpp b1317, make with cuBLAS acceleration errors due to nvcc not being present:

> make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native
I NVCCFLAGS: --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native "
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native    -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/console.cpp -o console.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/grammar-parser.cpp -o grammar-parser.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native  -c k_quants.c -o k_quants.o
nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native " -c ggml-cuda.cu -o ggml-cuda.o
/bin/sh: 1: nvcc: not found
make: *** [Makefile:411: ggml-cuda.o] Error 127

The relevant log:

/bin/sh: 1: nvcc: not found
make: *** [Makefile:411: ggml-cuda.o] Error 127

Is there any way we can get around nvcc not being present? For example, using a different tool like nvidia-smi as a stand-in for acquiring the info from nvcc

staviq · 2023-10-04T14:46:43Z

nvcc doesn't acquire any info, it is the compiler responsible for building that part of the code.

I might be wrong, but doesn't nvidia-cuda-toolkit already provide everything necesary to compile and run cuda ? Isn't installing cuda separately redundant here ?

Even if there are some system package shenanigans, you can simply install nvidia-cuda-toolkit, build the code, and uninstall it.

jamesbraza · 2023-10-04T17:20:28Z

Thanks @staviq I appreciate the follow-up. To explain the situation, perhaps more concisely:

cuBLAS gets installed by cuda as suggested in How to install with GPU support via cuBLAS and CUDA abetlen/llama-cpp-python#250 (comment)
- My machine is AWS EC2 g4dn.4xlarge, so it has one GPU. However, from https://stackoverflow.com/a/53493479
  - Before sudo apt install cuda: the command here has "no such directory", so cuBLAS isn't present
  - After sudo apt install cuda: the CUBLAS files are found
- I believe cuda is installing CUDA drivers (e.g. libnvidia-compute-535)
cuda and nvidia-cuda-toolkit currently can't be resolved together: https://forums.developer.nvidia.com/t/impossible-resolution-ubuntu-22-04/

I think I need cuda for LLAMA_CUBLAS=1, but I also need nvidia-cuda-toolkit for Llama.cpp build.

Even if there are some system package shenanigans, you can simply install nvidia-cuda-toolkit, build the code, and uninstall it.

Point 2 there is why I can't simply just temporarily install nvidia-cuda-toolkit.

Do you know of another way to get cuBLAS configured on EC2? I was observing nvidia-cuda-toolkit doesn't install cuBLAS support.

staviq · 2023-10-04T17:33:42Z

@1 comment you linked, specifically mentions "cuda toolkit" and explains it's needed for nvcc, which in your case is exactly what nvidia-cuda-toolkit provides.

Do not install cuda, install nvidia-cuda-toolkit.

jamesbraza · 2023-10-04T20:11:42Z

Okay, so before sudo apt install nvidia-cuda-toolkit:

> cat /usr/local/cuda/include/cublas.h | grep CUBLAS
cat: /usr/local/cuda/include/cublas.h: No such file or directory

Then running my apt installs (there are a few others):

> sudo apt update
...
> sudo apt install -y python3.11 python3.11-dev python3.11-venv \
    build-essential cmake make gcc \
    docker-ce docker-ce-cli docker-buildx-plugin docker-compose-plugin \
    nvidia-container-toolkit nvidia-cuda-toolkit
...
> sudo reboot

Afterwards, we see CUBLAS files aren't here, confirming they come from cuda:

> cat /usr/local/cuda/include/cublas.h | grep CUBLAS
cat: /usr/local/cuda/include/cublas.h: No such file or directory

Regardless though, continuing onwards:

> make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native
I NVCCFLAGS: --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native "
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native    -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/console.cpp -o console.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/grammar-parser.cpp -o grammar-parser.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native  -c k_quants.c -o k_quants.o
nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native " -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal   : Value 'native' is not defined for option 'gpu-architecture'
make: *** [Makefile:411: ggml-cuda.o] Error 1

Now let's try cmake (inspired by ggerganov/whisper.cpp#876 (comment)):

> mkdir build
> cd build
> cmake .. -DLLAMA_CUBLAS=ON
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDAToolkit: /usr/include (found version "11.5.119")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.5.119
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ubuntu/code/llama.cpp/build
> cmake --build . --config Release
...

Yaay so cmake build worked!

EDIT: I made a mistake here, skip the remainder of this comment

Erroneous `llama-cpp-python` install

Now let's go back to installing llama-cpp-python[server]:

> CMAKE_ARGS="-DLLAMA_METAL=on" python -m pip install -r dev-requirements.txt \
    --progress-bar off --no-cache-dir

And I get an error:

Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [29 lines of output]
      *** scikit-build-core 0.5.1 using CMake 3.22.1 (wheel)
      *** Configuring CMake...
      loading initial cache file /tmp/tmp9s7tyvil/build/CMakeInit.txt
      -- The C compiler identification is GNU 11.4.0
      -- The CXX compiler identification is GNU 11.4.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Found Git: /usr/bin/git (found version "2.34.1")
      -- Looking for pthread.h
      -- Looking for pthread.h - found
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      CMake Error at vendor/llama.cpp/CMakeLists.txt:174 (find_library):
        Could not find FOUNDATION_LIBRARY using the following names: Foundation


      -- Configuring incomplete, errors occurred!
      See also "/tmp/tmp9s7tyvil/build/CMakeFiles/CMakeOutput.log".

      *** CMake configuration failed
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python

So now I am failing with Could not find FOUNDATION_LIBRARY using the following names: Foundation. Any ideas there? Also, officially I think this is now a problem with llama-cpp-python (not Llama.cpp), so I can also migrate this issue there

slaren · 2023-10-04T20:17:31Z

You are building for Metal instead of CUDA. Try CMAKE_ARGS="DLLAMA_CUBLAS=ON" instead.

staviq · 2023-10-04T20:17:52Z

CMAKE_ARGS="-DLLAMA_METAL=on" python -m pip install -r dev-requirements.txt
--progress-bar off --no-cache-dir

Why are you specifying METAL, that's a Mac/OSX framework

jamesbraza · 2023-10-04T20:25:10Z

Yep that's a mix up, my bad guys. Now pasting in the right command...

> CMAKE_ARGS="-DLLAMA_CUBLAS=on" python -m pip install -r dev-requirements.txt \
    --progress-bar off --no-cache-dir

Works fine now, wow. So takeaways here are one doesn't need the cuda apt package for cuBLAS acceleration, nvidia-cuda-toolkit alone works for cuBLAS acceleration too.

Is there something I can do to help document or make this clear somewhere? Otherwise we can just close this out as resolved with user error

AmericanPresidentJimmyCarter · 2024-05-22T20:31:29Z

For everyone else looking for the compile time flag to indicate the right nvcc location, it's LLAMA_CUDA_NVCC. This should really be in the README.

eg

LLAMA_CUDA_NVCC=/usr/local/cuda-12.4/bin/nvcc make LLAMA_CUDA=1 -j4

Ifiht · 2024-09-13T02:16:44Z

For everyone else looking for the compile time flag to indicate the right nvcc location, it's LLAMA_CUDA_NVCC. This should really be in the README.

eg

LLAMA_CUDA_NVCC=/usr/local/cuda-12.4/bin/nvcc make LLAMA_CUDA=1 -j4

As of this message, the flag has changed to GGML_CUDA_NVCC:
GGML_CUDA_NVCC=/usr/local/cuda/bin/nvcc make -j 4 GGML_CUDA=1

jamesbraza mentioned this issue Oct 5, 2023

[Request] nice-ifying build failures around hardware acceleration abetlen/llama-cpp-python#793

Open

BarfingLemurs mentioned this issue Oct 6, 2023

readme : update models + clearer linux cuda and perplexity instructions #3510

Merged

ggerganov closed this as completed in #3510 Oct 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Installing Llama.cpp without `nvcc` from `nvidia-cuda-toolkit` #3459

Installing Llama.cpp without `nvcc` from `nvidia-cuda-toolkit` #3459

jamesbraza commented Oct 3, 2023 •

edited

Loading

jamesbraza commented Oct 3, 2023

jamesbraza commented Oct 3, 2023 •

edited

Loading

staviq commented Oct 4, 2023

jamesbraza commented Oct 4, 2023 •

edited

Loading

staviq commented Oct 4, 2023

jamesbraza commented Oct 4, 2023 •

edited

Loading

slaren commented Oct 4, 2023

staviq commented Oct 4, 2023

jamesbraza commented Oct 4, 2023

AmericanPresidentJimmyCarter commented May 22, 2024

Ifiht commented Sep 13, 2024

Installing Llama.cpp without nvcc from nvidia-cuda-toolkit #3459

Installing Llama.cpp without nvcc from nvidia-cuda-toolkit #3459

Comments

jamesbraza commented Oct 3, 2023 • edited Loading

Environment and Context

jamesbraza commented Oct 3, 2023

jamesbraza commented Oct 3, 2023 • edited Loading

staviq commented Oct 4, 2023

jamesbraza commented Oct 4, 2023 • edited Loading

staviq commented Oct 4, 2023

jamesbraza commented Oct 4, 2023 • edited Loading

slaren commented Oct 4, 2023

staviq commented Oct 4, 2023

jamesbraza commented Oct 4, 2023

AmericanPresidentJimmyCarter commented May 22, 2024

Ifiht commented Sep 13, 2024

Installing Llama.cpp without `nvcc` from `nvidia-cuda-toolkit` #3459

Installing Llama.cpp without `nvcc` from `nvidia-cuda-toolkit` #3459

jamesbraza commented Oct 3, 2023 •

edited

Loading

jamesbraza commented Oct 3, 2023 •

edited

Loading

jamesbraza commented Oct 4, 2023 •

edited

Loading

jamesbraza commented Oct 4, 2023 •

edited

Loading