WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) #1230

Closed
Priestru opened this issue Apr 29, 2023 · 17 comments · Fixed by #1233

@Priestru

Priestru commented Apr 29, 2023

Solution: #1230 (comment)

UPD:
Confirmed working just fine on Windows.

The issue below happened only on WSL.

#1207

First I pull and clean:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# git pull
Already up to date.
(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# make clean
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

rm -vf *.o main quantize quantize-stats perplexity embedding benchmark-q4_0-matmult
removed 'common.o'
removed 'ggml-cuda.o'
removed 'ggml.o'
removed 'llama.o'
removed 'main'
removed 'quantize'
removed 'quantize-stats'
removed 'perplexity'
removed 'embedding'

Then I build fresh with cuBLAS:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include   -c ggml.c -o ggml.o
ggml.c: In function ‘ggml_compute_forward_mul_mat_use_blas’:
ggml.c:7921:36: warning: unused parameter ‘src0’ [-Wunused-parameter]
 7921 |         const struct ggml_tensor * src0,
      |         ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
ggml.c: In function ‘ggml_compute_forward_mul_mat_q_f32’:
ggml.c:8520:31: warning: unused variable ‘y’ [-Wunused-variable]
 8520 |                 const float * y = (float *) ((char *) src1->data + i02*nb12 + i03*nb13);
      |                               ^
ggml.c: In function ‘ggml_compute_forward_alibi_f32’:
ggml.c:9104:15: warning: unused variable ‘n_past’ [-Wunused-variable]
 9104 |     const int n_past = ((int32_t *) src1->data)[0];
      |               ^~~~~~
ggml.c: In function ‘ggml_compute_forward_alibi_f16’:
ggml.c:9165:15: warning: unused variable ‘n_past’ [-Wunused-variable]
 9165 |     const int n_past = ((int32_t *) src1->data)[0];
      |               ^~~~~~
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -c examples/common.cpp -o common.o
nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/main/main.cpp ggml.o llama.o common.o ggml-cuda.o -o main  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

====  Run ./main -h for help.  ====

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize/quantize.cpp ggml.o llama.o ggml-cuda.o -o quantize  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-cuda.o -o quantize-stats  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-cuda.o -o perplexity  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-cuda.o -o embedding  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include pocs/vdot/vdot.cpp ggml.o ggml-cuda.o -o vdot  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib

Now trying to load a model that worked before the update:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/llama-cpp-python/vendor/llama.cpp# ./main -m /mnt/wsl/ggml-vic13b-q5_0.bin -b 512 -t 12 --no-mmap
main: seed = 1682770733
llama.cpp: loading model from /mnt/wsl/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 8740093.73 KB
llama_model_load_internal: mem required  = 10583.25 MB (+ 1608.00 MB per state)
CUDA error 2 at ggml-cuda.cu:359: out of memory

I haven't updated my libllama.so for llama-cpp-python yet, so it still uses the previous version and works with this very model just fine. Something must have changed.

RTX 3050 8GB

UPD 2:
The issue persists on WSL. I did a full clean, and yet it doesn't work when built with the current version.

UPD 3:
I found an older version of llama.cpp, did exactly the same thing, and everything worked fine. So I guess it's not me being especially dumb today.

@slaren
Member

slaren commented Apr 29, 2023

Looks like it's failing to allocate host pinned memory. I will add a patch to revert to normal pageable memory when this happens.

In the meantime, removing --no-mmap should work. You can also try adding more memory to WSL2 in .wslconfig.
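
For reference, WSL2 memory limits live in a .wslconfig file in the Windows user profile (C:\Users\<name>\.wslconfig); restart WSL with wsl --shutdown for changes to take effect. A minimal example follows; the sizes are only illustrative, adjust them to your machine:

[wsl2]
# RAM made available to the WSL2 VM (illustrative value)
memory=16GB
# optional swap size
swap=8GB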

@Priestru Priestru changed the title CUDA error 2 at ggml-cuda.cu:359: out of memory WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory Apr 29, 2023
@Priestru
Author

Looks like it's failing to allocate host pinned memory. I will add a patch to revert to normal pageable memory when this happens.

In the meantime, removing --no-mmap should work. You can also try adding more memory to WSL2 in .wslconfig.

Okay, I did both things:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/text-generation-webui# cat /proc/meminfo | grep MemTotal
MemTotal:       24619496 kB
(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/text-generation-webui# ../llama-cpp-python/vendor/llama.cpp/main -m /mnt/wsl/ggml-vic13b-q5_0.bin -b 512 -t 12
main: seed = 1682774575
llama.cpp: loading model from /mnt/wsl/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 10583.25 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  400.00 MB
CUDA error 2 at ggml-cuda.cu:359: out of memory

@slaren
Member

slaren commented Apr 29, 2023

That's weird, it looks like it is failing to allocate any amount of host pinned memory. It should still be solved by reverting to normal memory when the host pinned malloc fails, but you will lose some performance.
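
For context, the fallback described here is conceptually something like the sketch below. It is only an illustration of the pattern, not the actual ggml-cuda.cu code (the helper names host_malloc/host_free are made up; the real fix landed in #1233): try cudaMallocHost first, and if the driver refuses, fall back to ordinary pageable memory.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Try page-locked (pinned) host memory first; fall back to pageable memory on failure.
static void * host_malloc(size_t size, int * is_pinned) {
    void * ptr = NULL;
    cudaError_t err = cudaMallocHost(&ptr, size);
    if (err == cudaSuccess) {
        *is_pinned = 1;
        return ptr;
    }
    fprintf(stderr, "WARNING: failed to allocate %zu bytes of pinned memory: %s\n",
            size, cudaGetErrorString(err));
    cudaGetLastError(); // clear the sticky error state so later CUDA calls are unaffected
    *is_pinned = 0;
    return malloc(size); // pageable fallback: transfers are slower, but it works
}

static void host_free(void * ptr, int is_pinned) {
    if (is_pinned) {
        cudaFreeHost(ptr);
    } else {
        free(ptr);
    }
}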

@Priestru
Author

Fixed with #1233

@Priestru
Author

Priestru commented May 2, 2023

Okay, I found some more info on this topic:

microsoft/WSL#8447

Somehow this is a common issue.

@slaren
Member

slaren commented May 2, 2023

In case this is useful, I am using Ubuntu 22.04 in WSL2 under Windows 11, with a RTX 3080 and latest drivers. That works for me.

@Priestru
Author

Priestru commented May 2, 2023

In case this is useful, I am using Ubuntu 22.04 in WSL2 under Windows 11, with a RTX 3080 and latest drivers. That works for me.

llama_init_from_file: kv self size  =  400.00 MB
WARNING: failed to allocate 1024.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory

No idea why... it's a fresh Windows 11 install with Ubuntu 22.04...

@Priestru
Author

Priestru commented May 2, 2023

Tried 20.04, same result. Tried CUDA 11.8, same again. No idea what combination, in what order, will give me the ability to pin memory under WSL.

@slaren
Member

slaren commented May 2, 2023

NVIDIA is very vague about the limits, but they suggest that usually on Windows there is a limit of pinned memory equal to 50% of the total system memory, and this limit is likely to be lower under WSL2.

@Priestru
Author

Priestru commented May 3, 2023

NVIDIA is very vague about the limits, but they suggest that usually on Windows there is a limit of pinned memory equal to 50% of the total system memory, and this limit is likely to be lower under WSL2.

I believe 64 GB should be enough in general for a 13B model T_T.
I will try smaller models and the other available options.

@Priestru
Author

Priestru commented May 3, 2023

Okay, I tried this:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    size_t size = 1024 * 1024 * 100; // 100M floats (400 MB once multiplied by sizeof(float) below)
    float *pinned_mem;
    cudaError_t result = cudaMallocHost((void**)&pinned_mem, size * sizeof(float), cudaHostAllocDefault);
    if (result != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed with error code %d: %s\n", result, cudaGetErrorString(result));
        return 1;
    }
    printf("Pinned memory of size %zu bytes allocated to GPU.\n", size * sizeof(float));
    cudaFreeHost(pinned_mem);
    return 0;
}

Then I compiled it with:
nvcc -o pinned_mem pinned_mem.cu

And the result is:

yuuru@DESKTOP-L34CRT1:/mnt/c/Users/Yuuru$ ./pinned_mem
Pinned memory of size 419430400 bytes allocated to GPU.

So it seems I successfully allocated 400 MB of pinned memory (100M floats × 4 bytes). Good start (I guess?), I'll try a larger size.

Pinned memory of size 4294967296 bytes allocated to GPU.

At this point I wonder if it really does pin that much, but okay.

@Priestru
Author

Priestru commented May 3, 2023

AAAAAAAAAAAAAAAAAAAAAA

yuuru@DESKTOP-L34CRT1:/mnt/c/Users/Yuuru/llama-cpp-python/vendor/llama.cpp$ ./main -m /media/ggml-vic13b-q5_0.bin -b 512 -t 8
main: build = 488 (67c7779)
main: seed  = 1683098280
llama.cpp: loading model from /media/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  85.08 KB
llama_model_load_internal: mem required  = 10583.26 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0

I believe the thing that did the trick for me was:

wsl.exe --update

Also, I installed CUDA via:

wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-12-1-local_12.1.1-1_amd64.deb
sudo cp /var/cuda-repo-wsl-ubuntu-12-1-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda

and I did it before installing Miniconda.

Some info for anyone who may fight this in the future:

Edition	Windows 11 Pro
Version	22H2
Installed on	‎5/‎2/‎2023
OS build	22621.1635
Experience	Windows Feature Experience Pack 1000.22641.1000.0

  NAME            STATE           VERSION
* Ubuntu-22.04    Running         2

C:\Users\Yuuru>wsl uname -r
5.15.90.1-microsoft-standard-WSL2

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

(The CUDA path is added in ~/.bashrc with the line: export PATH=/usr/local/cuda/bin:$PATH)

The GPU driver is installed on the Windows host only, and nvidia-smi returns:

yuuru@DESKTOP-L34CRT1:/mnt/c/Users/Yuuru/llama-cpp-python/vendor/llama.cpp$ nvidia-smi
Wed May  3 00:26:36 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.50                 Driver Version: 531.79       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3050         On | 00000000:2D:00.0  On |                  N/A |
|  0%   48C    P5                9W / 130W|   2774MiB /  8192MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        23      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

@Priestru Priestru changed the title WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (solved) May 3, 2023
@Priestru Priestru changed the title WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (solved) WSL: CUDA error 2 at ggml-cuda.cu:359: out of memory (Fix found) May 3, 2023
@Celppu

Celppu commented May 18, 2023

I have a similar problem. More details: abetlen/llama-cpp-python#229

llama_model_load_internal: format = ggjt v2 (latest)
llama_model_load_internal: n_vocab = 32001
llama_model_load_internal: n_ctx = 2020
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 72.75 KB
llama_model_load_internal: mem required = 5809.34 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 5 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 603 MB
llama_init_from_file: kv self size = 1010.00 MB
WARNING: failed to allocate 768.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
You: hello ai
CUDA error 2 at /tmp/pip-install-qnq06ym_/llama-cpp-python_6e0012385653427fb82da3730be5f065/vendor/llama.cpp/ggml-cuda.cu:781: out of memory

@Yuicchi-chan

Yuicchi-chan commented Jun 3, 2023

I honestly can't get this to work. I tried everything you did, reinstalled WSL and CUDA like twice, and still get the same error. Here are my nvcc and nvidia-smi outputs:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.50                 Driver Version: 531.79       CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3070         On | 00000000:26:00.0  On |                  N/A |
| 30%   47C    P0               41W / 239W|   1751MiB /  8192MiB |      2%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

Any clue as to what could be wrong? I'm literally loading the smallest model, offloading just 20 layers of the 7B model, and no luck.

(PytorchEnv) yuicchi@DESKTOP-DJ3R5OF:/mnt/d/Yuicchi Text Model/llama.cpp$ ./main -ngl 20 --ctx_size 2048 -n 2048 -c 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --batch_size 512 --repeat_penalty 1.17647 --seed 1685501956 --model "./models/7B/ggml-model-q4_0.bin" --threads 8 --n_predict 4096 --color --prompt "Write out a detailed step by step method on how to create a website" --no-mmap
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 610 (d8bd001)
main: seed = 1685501956
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
WARNING: failed to allocate 3615.71 MB of pinned memory: out of memory
llama_model_load_internal: mem required = 3235.84 MB (+ 1026.00 MB per state)
llama_model_load_internal: [cublas] offloading 20 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 2171 MB
...................................................................................................
llama_init_from_file: kv self size = 1024.00 MB
WARNING: failed to allocate 768.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 256, repeat_penalty = 1.176470, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.500000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 4096, n_keep = 0

Write out a detailed step by step method on how to create a websiteCUDA error 2 at ggml-cuda.cu:565: out of memory

@Celppu

Celppu commented Jun 3, 2023

Due to a current CUDA bug, I think you need to disable pinned memory via an environment variable. The command for it is: export GGML_CUDA_NO_PINNED=1
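
(For anyone unsure where that goes: it is a shell environment variable, so export it in the same WSL terminal before launching main, or add the export line to ~/.bashrc to make it persistent. The model path and flags below are just placeholders:)

export GGML_CUDA_NO_PINNED=1
./main -m /path/to/model.bin -ngl 20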

@Yuicchi-chan

Thank you so much for that! It worked. Is there any place I can look into for this bug? What exactly might be going wrong here?

@ParthMakode

@Celppu how do you do that? Where do you execute that command?
