Out of GPU memory when running streaming example #229
How much VRAM does your GPU have, and what model size are you using? A rough rule of thumb is that every 1B params needs 0.7 GB of VRAM for full offload.
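For a quick sanity check of that rule of thumb, here is a small sketch; the 0.7 GB per 1B parameters figure is the one quoted above and it ignores KV-cache and scratch buffers:

```python
def estimated_vram_gb(params_billion: float, gb_per_billion: float = 0.7) -> float:
    """Rough VRAM needed for full offload, per the 0.7 GB / 1B params rule of thumb."""
    return params_billion * gb_per_billion

# e.g. a 7B model needs roughly 4.9 GB, a 13B model roughly 9.1 GB for full offload
for size in (7, 13, 30):
    print(f"{size}B -> ~{estimated_vram_gb(size):.1f} GB")
```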
I faced this issue. It seems to hit a known WSL-specific limitation mentioned in the CUDA on WSL docs ("Known Limitations for CUDA Applications"). Doing a fresh reinstall of WSL fixed it for me; may be worth trying.
I have an RTX 2080 Ti with 11 GB of VRAM - plenty available - and I tried `n_gpu_layers=1` to test the minimum VRAM requirement.
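A minimal sketch of that test with llama-cpp-python, assuming a cuBLAS-enabled install; the model path is a placeholder, and `verbose=True` makes llama.cpp print its buffer sizes at load time so you can see how much VRAM a single offloaded layer actually takes:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # placeholder path
    n_gpu_layers=1,   # offload a single layer to measure the smallest GPU footprint
    verbose=True,     # prints llama.cpp's memory usage while loading
)
```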
`nvidia-smi` output (table not preserved here). I reinstalled everything: Windows 10, WSL 2, currently Ubuntu-22.04 and CUDA toolkit 12.1, then pip installed this repo with cuBLAS, so whatever version that pulls in is what gets used. Any hints on what steps and versions people have used successfully?
To add: tested with the setup below.
Got it working using this solution: ggml-org/llama.cpp#1233. Is there a performance penalty? I don't know, but it is at least 10x the speed compared to CPU.
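I have not re-verified what the linked issue recommends, but the commonly cited workaround for the WSL 2 pinned-memory failure is to disable pinned (page-locked) host buffers via the `GGML_CUDA_NO_PINNED` environment variable that llama.cpp checks. A sketch, assuming that is the fix meant here (model path and layer count are placeholders):

```python
import os

# Assumption: the workaround is disabling pinned host memory, which WSL 2 restricts.
# The variable must be set before the model is loaded.
os.environ["GGML_CUDA_NO_PINNED"] = "1"

from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # placeholder path
    n_gpu_layers=20,                               # placeholder layer count
)
```

If there is a performance penalty, it would mostly show up in host-to-device transfer speed, since pinned memory exists to make those copies faster.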
Setup:
* WSL 2, Ubuntu-22.04
* built with cuBLAS
* running the streaming example with any number of n_gpu_layers offloaded (see the sketch below)
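For context, the streaming call being referred to looks roughly like this; a sketch with placeholder prompt, path, and layer count:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/ggml-model-q4_0.bin",  # placeholder path
    n_gpu_layers=1,  # the failure occurs regardless of the value used here
)

# stream=True returns a generator of partial completions instead of one response
for chunk in llm("Q: Name the planets in the solar system. A:", max_tokens=64, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
```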
```
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
```
```
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
CUDA error 2 at /tmp/pip-install-dcpg9e3d/llama-cpp-python_0b20594a2c9f4aa6a0b4bdca1b250223/vendor/llama.cpp/ggml-cuda.cu:781: out of memory
```
I also checked and tried https://github.com/ggerganov/llama.cpp/issues/1230, with no luck.