Out of GPU memory when running streaming example #229

Closed
Celppu opened this issue May 18, 2023 · 5 comments
Labels
hardware (Hardware specific issue), model (Model specific issue)

Comments

@Celppu

Celppu commented May 18, 2023

Setup: WSL 2 with Ubuntu-22.04, built with cuBLAS.

Running the streaming example fails with any number of n_gpu_layers offloaded.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
CUDA error 2 at /tmp/pip-install-dcpg9e3d/llama-cpp-python_0b20594a2c9f4aa6a0b4bdca1b250223/vendor/llama.cpp/ggml-cuda.cu:781: out of memory

I also checked and tried https://github.com/ggerganov/llama.cpp/issues/1230, with no luck.
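
For reference, a minimal sketch of the kind of streaming call this is about, assuming the usual llama-cpp-python usage (the model path, prompt, and layer count are placeholders, not the exact example script):

```python
from llama_cpp import Llama

# Placeholder model path; substitute whatever GGML model you are loading.
llm = Llama(model_path="./models/7B/ggml-model.bin", n_gpu_layers=32)

# stream=True returns a generator that yields partial completions.
for output in llm("Q: Name the planets in the solar system. A: ", max_tokens=64, stream=True):
    print(output["choices"][0]["text"], end="", flush=True)
```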

@gjmulder
Contributor

On Linux there is the nvidia-smi command, which shows how much VRAM is in use.

How much VRAM does your GPU have, and what model size are you using? A rough rule of thumb is that every 1B params needs 0.7 GB of VRAM for full offload.
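
Applying that rule of thumb, a quick back-of-the-envelope sketch (the 0.7 GB per 1B params figure is only an approximation):

```python
# Rough VRAM estimate for fully offloading a model, using the ~0.7 GB per 1B params rule of thumb.
def estimate_vram_gb(params_billions: float, gb_per_billion: float = 0.7) -> float:
    return params_billions * gb_per_billion

for size_b in (7, 13, 30, 65):
    print(f"{size_b}B params -> ~{estimate_vram_gb(size_b):.1f} GB VRAM for full offload")
```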

gjmulder added the hardware (Hardware specific issue) and model (Model specific issue) labels on May 18, 2023
gjmulder changed the title from "pinned memory: out of memory" to "Out of GPU memory when running streaming example" on May 18, 2023
@aneeshjoy
Contributor

I faced this issue too. It seems to hit a known WSL-specific limitation mentioned in the CUDA on WSL docs.

Known Limitations for CUDA Applications

Doing a fresh reinstall of WSL fixed it for me. It may be worth trying.

@Celppu
Author

Celppu commented May 18, 2023

I have an RTX 2080 Ti with 11 GB of VRAM, so plenty should be available. I also tried n_gpu_layers=1 to test the minimum VRAM requirement.

llama_init_from_file: kv self size = 256.00 MB
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
CUDA error 2 at /tmp/pip-install-71q8lg55/llama-cpp-python_fd07c2516d2a4860a5413dbe4eca677f/vendor/llama.cpp/ggml-cuda.cu:781: out of memory

`nvidia-smi
Fri May 19 01:56:19 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.46 Driver Version: 531.61 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti On | 00000000:09:00.0 On | N/A |
| 0% 52C P8 37W / 250W| 1490MiB / 11264MiB | 6% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+`

I reinstalled everything: Windows 10, WSL 2, currently Ubuntu-22.04 with CUDA toolkit 12.1, then pip-installed with cuBLAS (whatever version of this repo that pulls in). Any hints on what steps and versions people have used successfully?

@Celppu
Author

Celppu commented May 18, 2023

To add, I tested that torch sees the GPU:
`python
Python 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> quit()`
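
As a further sanity check from the same install, free vs. total VRAM can be queried too (just a sketch; torch.cuda.mem_get_info needs a reasonably recent PyTorch):

```python
import torch

# Returns (free_bytes, total_bytes) for the current CUDA device.
free_b, total_b = torch.cuda.mem_get_info()
print(f"free: {free_b / 2**30:.2f} GiB / total: {total_b / 2**30:.2f} GiB")
```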

@Celppu
Author

Celppu commented May 18, 2023

Got it working using this solution: ggml-org/llama.cpp#1233
For a newbie-friendly solution, add the environment variable GGML_CUDA_NO_PINNED=1.

Is there a performance penalty? I don't know, but it is at least 10x the speed compared to CPU.
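
For anyone else hitting this, a sketch of applying the workaround from Python before the model loads (the model path is a placeholder; exporting GGML_CUDA_NO_PINNED=1 in the shell before launching Python also works):

```python
import os

# Disable pinned host memory in ggml's CUDA backend (workaround from ggml-org/llama.cpp#1233).
# Set it before the model is loaded so the allocation path sees it.
os.environ["GGML_CUDA_NO_PINNED"] = "1"

from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model.bin", n_gpu_layers=1)  # placeholder path
```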
