WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) #1230

Closed
Priestru opened this issue Apr 29, 2023 · 17 comments · Fixed by #1233

@Priestru

Priestru commented Apr 29, 2023

Solution: #1230 (comment)

UPD:
Confirmed working just fine on Windows.

The issue below happened only on WSL.

#1207

First I pull and clean:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# git pull
Already up to date.
(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# make clean
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
removed 'common.o'
removed 'ggml-cuda.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'

Then I build fresh with cuBLAS:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include   -c ggml.c -o ggml.o
ggml.c: In function ‘ggml_compute_forward_mul_mat_use_blas’:
ggml.c:7921:36: warning: unused parameter ‘src0’ [-Wunused-parameter]
 7921 |         const struct ggml_tensor * src0,
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
ggml.c: In function ‘ggml_compute_forward_mul_mat_q_f32’:
ggml.c:8520:31: warning: unused variable ‘y’ [-Wunused-variable]
 8520 |                 const float * y = (float *) ((char *) src1->data + i02*nb12 + i03*nb13);
      |                               ^
ggml.c: In function ‘ggml_compute_forward_alibi_f32’:
ggml.c:9104:15: warning: unused variable ‘n_past’ [-Wunused-variable]
 9104 |     const int n_past = ((int32_t *) src1->data)[0];
      |               ^~~~~~
ggml.c: In function ‘ggml_compute_forward_alibi_f16’:
ggml.c:9165:15: warning: unused variable ‘n_past’ [-Wunused-variable]
 9165 |     const int n_past = ((int32_t *) src1->data)[0];
      |               ^~~~~~
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/main/main.cpp ggml.o llama.o common.o ggml-cuda.o -o main  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

====  Run ./main -h for help.  ====

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize/quantize.cpp ggml.o llama.o ggml-cuda.o -o quantize  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-cuda.o -o quantize-stats  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-cuda.o -o perplexity  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-cuda.o -o embedding  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include pocs/vdot/vdot.cpp ggml.o ggml-cuda.o -o vdot  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

Now trying to load a model that worked before the update:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# ./main -m /mnt/wsl/ggml-vic13b-q5_0.bin -b 512 -t 12 --no-mmap
main: seed = 1682770733
llama.cpp: loading model from /mnt/wsl/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 8740093.73 KB
llama_model_load_internal: mem required  = 10583.25 MB (+ 1608.00 MB per state)
CUDA error 2 at ggml-cuda.cu:359: out of memory

I haven't updated my libllama.so for llama-cpp-python yet, so it still uses the previous version and works with this very model just fine. Something must have changed.

RTX 3050 8GB

UPD 2:
The issue persists on WSL. I did a full clean, and yet it doesn't work when built with the current version.

UPD 3:
I found an older version of llama.cpp, did exactly the same thing, and everything worked fine. So I guess it's not me being especially dumb today.

@slaren
Member

slaren commented Apr 29, 2023

Looks like it's failing to allocate host pinned memory. I will add a patch to revert to normal pageable memory when this happens.

In the meantime, removing --no-mmap should work. You can also try adding more memory to WSL2 in .wslconfig.
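
For reference, WSL2 memory limits live in a .wslconfig file in the Windows user profile (C:\Users\<name>\.wslconfig); restart WSL with wsl --shutdown for changes to take effect. A minimal example follows; the sizes are only illustrative, adjust them to your machine:

[wsl2]
# RAM made available to the WSL2 VM (illustrative value)
memory=16GB
# optional swap size
swap=8GB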

@Priestru Priestru changed the title CUDA error 2 at ggml-cuda.cu:359: out of memory WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory Apr 29, 2023
@Priestru
Author

Looks like it's failing to allocate host pinned memory. I will add a patch to revert to normal pageable memory when this happens.

In the meantime, removing --no-mmap should work. You can also try adding more memory to WSL2 in .wslconfig.

Okay, I did both things:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/text-generation-webui# cat /proc/meminfo | grep MemTotal
MemTotal:       24619496 kB
(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/text-generation-webui# ../llama-cpp-python/vendor/llama.cpp/main -m /mnt/wsl/ggml-vic13b-q5_0.bin -b 512 -t 12
main: seed = 1682774575
llama.cpp: loading model from /mnt/wsl/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 10583.25 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  400.00 MB
CUDA error 2 at ggml-cuda.cu:359: out of memory

@slaren
Member

slaren commented Apr 29, 2023

That's weird, it looks like it is failing to allocate any amount of host pinned memory. It should still be solved by reverting to normal memory when the host pinned malloc fails, but you will lose some performance.
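
For context, the fallback described here is conceptually something like the sketch below. It is only an illustration of the pattern, not the actual ggml-cuda.cu code (the helper names host_malloc/host_free are made up; the real fix landed in #1233): try cudaMallocHost first, and if the driver refuses, fall back to ordinary pageable memory.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Try page-locked (pinned) host memory first; fall back to pageable memory on failure.
static void * host_malloc(size_t size, int * is_pinned) {
    void * ptr = NULL;
    cudaError_t err = cudaMallocHost(&ptr, size);
    if (err == cudaSuccess) {
        *is_pinned = 1;
        return ptr;
    }
    fprintf(stderr, "WARNING: failed to allocate %zu bytes of pinned memory: %s\n",
            size, cudaGetErrorString(err));
    cudaGetLastError(); // clear the sticky error state so later CUDA calls are unaffected
    *is_pinned = 0;
    return malloc(size); // pageable fallback: transfers are slower, but it works
}

static void host_free(void * ptr, int is_pinned) {
    if (is_pinned) {
        cudaFreeHost(ptr);
    } else {
        free(ptr);
    }
}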

@Priestru
Author

Fixed with #1233

@Priestru
Author

Priestru commented May 2, 2023

Okay, I found some more info on this topic:

microsoft/WSL#8447

Somehow this is a common issue.

@slaren
Member

slaren commented May 2, 2023

In case this is useful, I am using Ubuntu 22.04 in WSL2 under Windows 11, with a RTX 3080 and latest drivers. That works for me.

@Priestru
Author

Priestru commented May 2, 2023

In case this is useful, I am using Ubuntu 22.04 in WSL2 under Windows 11, with a RTX 3080 and latest drivers. That works for me.

llama_init_from_file: kv self size  =  400.00 MB
WARNING: failed to allocate 1024.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory

No idea why... it's a fresh Windows 11 install with Ubuntu 22.04...

@Priestru
Author

Priestru commented May 2, 2023

Tried 20.04, same result. Tried CUDA 11.8, same again. No idea what combination, in what order, will give me the ability to pin memory under WSL.

@slaren
Member

slaren commented May 2, 2023

NVIDIA is very vague about the limits, but they suggest that usually on Windows there is a limit of pinned memory equal to 50% of the total system memory, and this limit is likely to be lower under WSL2.

@Priestru
Author

Priestru commented May 3, 2023

NVIDIA is very vague about the limits, but they suggest that usually on Windows there is a limit of pinned memory equal to 50% of the total system memory, and this limit is likely to be lower under WSL2.

I believe 64 GB should be enough in general for a 13B model T_T.
I will try smaller models and the other available options.

@Priestru
Author

Priestru commented May 3, 2023

Okay, I tried this:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    size_t size = 1024 * 1024 * 100; // 100M floats (400 MB once multiplied by sizeof(float) below)
    float *pinned_mem;
    cudaError_t result = cudaMallocHost((void**)&pinned_mem, size * sizeof(float), cudaHostAllocDefault);
    if (result != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed with error code %d: %s\n", result, cudaGetErrorString(result));
        return 1;
    }
    printf("Pinned memory of size %zu bytes allocated to GPU.\n", size * sizeof(float));
    cudaFreeHost(pinned_mem);
    return 0;
}

Then I compiled it with:
nvcc -o pinned_mem pinned_mem.cu

And the result is:

yuuru@DESKTOP-L34CRT1:/mnt/c/Users/Yuuru$ ./pinned_mem
Pinned memory of size 419430400 bytes allocated to GPU.

So it seems I successfully allocated 400 MB of pinned memory (100M floats × 4 bytes). Good start (I guess?), I'll try a larger size.

Pinned memory of size 4294967296 bytes allocated to GPU.

At this point I wonder if it really does pin that much, but okay.

@Priestru
Author

Priestru commented May 3, 2023

AAAAAAAAAAAAAAAAAAAAAA

yuuru@DESKTOP-L34CRT1:/mnt/c/Users/Yuuru/llama-cpp-python/vendor/llama.cpp$ ./main -m /media/ggml-vic13b-q5_0.bin -b 512 -t 8
main: build = 488 (67c7779)
main: seed  = 1683098280
llama.cpp: loading model from /media/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  85.08 KB
llama_model_load_internal: mem required  = 10583.26 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

I believe the thing that did the trick for me was:

wsl.exe --update

Also, I installed CUDA via:

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

and I did it before installing Miniconda.

Some info for anyone who may fight this in the future:

Edition	Windows 11 Pro
Version	22H2
Installed on	‎5/‎2/‎2023
OS build	22621.1635
Experience	Windows Feature Experience Pack 1000.22641.1000.0

  NAME            STATE           VERSION
* Ubuntu-22.04    Running         2

C:\Users\Yuuru>wsl uname -r
5.15.90.1-microsoft-standard-WSL2

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

(The CUDA path is added in ~/.bashrc with the line: export PATH=/usr/local/cuda/bin:$PATH)

The GPU driver is installed on the Windows host only, and nvidia-smi returns:

yuuru@DESKTOP-L34CRT1:/mnt/c/Users/Yuuru/llama-cpp-python/vendor/llama.cpp$ nvidia-smi
Wed May  3 00:26:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.50                 Driver Version: 531.79       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3050         On | 00000000:2D:00.0  On |                  N/A |
|  0%   48C    P5                9W / 130W|   2774MiB /  8192MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        23      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

@Priestru Priestru changed the title WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (solved) May 3, 2023
@Priestru Priestru changed the title WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (solved) WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) May 3, 2023
@Celppu

Celppu commented May 18, 2023

I have a similar problem. More details: abetlen/llama-cpp-python#229

llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2020
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 72.75 KB
llama_model_load_internal: mem required = 5809.34 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 5 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 603 MB
llama_init_from_file: kv self size = 1010.00 MB
WARNING: failed to allocate 768.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
You: hello ai
CUDA error 2 at /tmp/pip-install-qnq06ym_/llama-cpp-python_6e0012385653427fb82da3730be5f065/vendor/llama.cpp/ggml-cuda.cu:781: out of memory

@Yuicchi-chan

Yuicchi-chan commented Jun 3, 2023

I honestly can't get this to work. I tried everything you did, reinstalled WSL and CUDA like twice, and still get the same error. Here are my nvcc and nvidia-smi outputs:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.50                 Driver Version: 531.79       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070         On | 00000000:26:00.0  On |                  N/A |
| 30%   47C    P0               41W / 239W|   1751MiB /  8192MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Any clue as to what could be wrong? I'm literally loading the smallest model, offloading just 20 layers of the 7B model, and no luck.

(PytorchEnv) yuicchi@DESKTOP-DJ3R5OF:/mnt/d/Yuicchi Text Model/llama.cpp$ ./main -ngl 20 --ctx_size 2048 -n 2048 -c 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --batch_size 512 --repeat_penalty 1.17647 --seed 1685501956 --model "./models/7B/ggml-model-q4_0.bin" --threads 8 --n_predict 4096 --color --prompt "Write out a detailed step by step method on how to create a website" --no-mmap
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 610 (d8bd001)
main: seed = 1685501956
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
WARNING: failed to allocate 3615.71 MB of pinned memory: out of memory
llama_model_load_internal: mem required = 3235.84 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 20 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 2171 MB
...................................................................................................
llama_init_from_file: kv self size = 1024.00 MB
WARNING: failed to allocate 768.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 256, repeat_penalty = 1.176470, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.500000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 4096, n_keep = 0

Write out a detailed step by step method on how to create a websiteCUDA error 2 at ggml-cuda.cu:565: out of memory

@Celppu

Celppu commented Jun 3, 2023

Due to a current CUDA bug, I think you need to disable pinned memory via an environment variable. The command for it is: export GGML_CUDA_NO_PINNED=1
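
(For anyone unsure where that goes: it is a shell environment variable, so export it in the same WSL terminal before launching main, or add the export line to ~/.bashrc to make it persistent. The model path and flags below are just placeholders:)

export GGML_CUDA_NO_PINNED=1
./main -m /path/to/model.bin -ngl 20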

@Yuicchi-chan

Thank you so much for that! It worked. Is there any place I can look into for this bug? What exactly might be going wrong here?

@ParthMakode

@Celppu how do you do that? Where do you execute that command?
