Out of GPU memory when running streaming example #229

Closed
Celppu opened this issue May 18, 2023 · 5 comments
Labels
hardware (Hardware specific issue), model (Model specific issue)

Comments

@Celppu

Celppu commented May 18, 2023

Setup: WSL 2 with Ubuntu-22.04, built with cuBLAS.

Running the streaming example fails with any number of n_gpu_layers offloaded.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
CUDA error 2 at /tmp/pip-install-dcpg9e3d/llama-cpp-python_0b20594a2c9f4aa6a0b4bdca1b250223/vendor/llama.cpp/ggml-cuda.cu:781: out of memory

I also checked and tried https://github.com/ggerganov/llama.cpp/issues/1230, with no luck.
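
For reference, a minimal sketch of the kind of streaming call this is about, assuming the usual llama-cpp-python usage (the model path, prompt, and layer count are placeholders, not the exact example script):

```python
from llama_cpp import Llama

# Placeholder model path; substitute whatever GGML model you are loading.
llm = Llama(model_path="./models/7B/ggml-model.bin", n_gpu_layers=32)

# stream=True returns a generator that yields partial completions.
for output in llm("Q: Name the planets in the solar system. A: ", max_tokens=64, stream=True):
    print(output["choices"][0]["text"], end="", flush=True)
```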

@gjmulder
Contributor

On Linux there is the nvidia-smi command, which shows how much VRAM is in use.

How much VRAM does your GPU have, and what model size are you using? A rough rule of thumb is that every 1B params needs 0.7 GB of VRAM for full offload.
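
Applying that rule of thumb, a quick back-of-the-envelope sketch (the 0.7 GB per 1B params figure is only an approximation):

```python
# Rough VRAM estimate for fully offloading a model, using the ~0.7 GB per 1B params rule of thumb.
def estimate_vram_gb(params_billions: float, gb_per_billion: float = 0.7) -> float:
    return params_billions * gb_per_billion

for size_b in (7, 13, 30, 65):
    print(f"{size_b}B params -> ~{estimate_vram_gb(size_b):.1f} GB VRAM for full offload")
```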

gjmulder added the hardware (Hardware specific issue) and model (Model specific issue) labels on May 18, 2023
gjmulder changed the title from "pinned memory: out of memory" to "Out of GPU memory when running streaming example" on May 18, 2023
@aneeshjoy
Contributor

I faced this issue too. It seems to hit a known WSL-specific limitation mentioned in the CUDA on WSL docs.

Known Limitations for CUDA Applications

Doing a fresh reinstall of WSL fixed it for me. It may be worth trying.

@Celppu
Author

Celppu commented May 18, 2023

I have an RTX 2080 Ti with 11 GB of VRAM, so plenty should be available. I also tried n_gpu_layers=1 to test the minimum VRAM requirement.

llama_init_from_file: kv self size = 256.00 MB
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
WARNING: failed to allocate 512.00 MB of pinned memory: out of memory
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
CUDA error 2 at /tmp/pip-install-71q8lg55/llama-cpp-python_fd07c2516d2a4860a5413dbe4eca677f/vendor/llama.cpp/ggml-cuda.cu:781: out of memory

`nvidia-smi
Fri May 19 01:56:19 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.46 Driver Version: 531.61 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti On | 00000000:09:00.0 On | N/A |
| 0% 52C P8 37W / 250W| 1490MiB / 11264MiB | 6% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+`

I reinstalled everything: Windows 10, WSL 2, currently Ubuntu-22.04 with CUDA toolkit 12.1, then pip-installed with cuBLAS (whatever version of this repo that pulls in). Any hints on what steps and versions people have used successfully?

@Celppu
Author

Celppu commented May 18, 2023

To add, I tested that torch sees the GPU:
`python
Python 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
>>> quit()`
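
As a further sanity check from the same install, free vs. total VRAM can be queried too (just a sketch; torch.cuda.mem_get_info needs a reasonably recent PyTorch):

```python
import torch

# Returns (free_bytes, total_bytes) for the current CUDA device.
free_b, total_b = torch.cuda.mem_get_info()
print(f"free: {free_b / 2**30:.2f} GiB / total: {total_b / 2**30:.2f} GiB")
```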

@Celppu
Author

Celppu commented May 18, 2023

Got it working using this solution: ggml-org/llama.cpp#1233
For a newbie-friendly solution, add the environment variable GGML_CUDA_NO_PINNED=1.

Is there a performance penalty? I don't know, but it is at least 10x the speed compared to CPU.
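
For anyone else hitting this, a sketch of applying the workaround from Python before the model loads (the model path is a placeholder; exporting GGML_CUDA_NO_PINNED=1 in the shell before launching Python also works):

```python
import os

# Disable pinned host memory in ggml's CUDA backend (workaround from ggml-org/llama.cpp#1233).
# Set it before the model is loaded so the allocation path sees it.
os.environ["GGML_CUDA_NO_PINNED"] = "1"

from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model.bin", n_gpu_layers=1)  # placeholder path
```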
