Freeze after offloading layers to GPU #3135

Closed · CRD716 opened this issue Sep 12, 2023 · 27 comments
CRD716 (Contributor) commented Sep 12, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

llama.cpp does not freeze and continues to run normally, without interfering with basic Windows operations.

Current Behavior

llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 4096
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 80
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 28672
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 70B
llm_load_print_meta: model ftype    = mostly Q5_K - Medium
llm_load_print_meta: model size     = 68.98 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 35995.03 MB (+ 1280.00 MB per state)
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloaded 18/83 layers to GPU
llm_load_tensors: VRAM used: 10500 MB

llama.cpp then freezes and will not respond. Task Manager shows 0% CPU and GPU load. It also cannot be stopped via Task Manager, requiring me to hard-reset my computer to end the program. It also causes general system instability; I am writing this with my desktop blacked out and File Explorer frozen.

Environment and Context

Windows 10
128 GB RAM
Threadripper 3970X
RTX 2080TI
CMake 3.27.4
CUDA 12.2

Failure Information (for bugs)

Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

Run a model with cuBLAS.
(My exact command: main -ngl 18 -m E:\largefiles\LLAMA-2\70B\uni-tianyan-70b.Q5_K_M.gguf --color -c 4096 --temp 0.6 --repeat_penalty 1.1 -n -1 --interactive-first)
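
A cuBLAS-enabled build of the kind used here looks roughly like the following sketch (assuming CMake and the CUDA toolkit are installed; the option name has changed across llama.cpp versions, so treat the flag as illustrative):

# hypothetical build sketch for a cuBLAS-enabled main.exe
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON        # enable the CUDA/cuBLAS backend (older option name)
cmake --build . --config Release  # binaries end up under build\bin\Release on Windows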

Failure Logs

I'd love to attach them, but file manager stopped working. I'll try and run it again tomorrow and upload the log before everything freezes.

KerfuffleV2 (Collaborator) commented:

How much RAM do you have? Running a Q5_K_M 70B model is going to require around 64GB RAM and pretty much nothing else running on your system. Also, at startup it's going to need to read about 50GB from storage. If you're using a hard drive, this can easily take several minutes.

> It also causes general system instability

If a user application can do this, that's basically a problem with the OS.
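
(As a rough sanity check of those figures, assuming Q5_K_M averages roughly 5.7 bits per weight: 68.98e9 weights x ~5.7 bits / 8 ≈ 49 GB on disk, which roughly matches the ~36 GB "mem required" plus ~10.5 GB of VRAM reported in the log above.)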

CRD716 (Contributor, Author) commented Sep 12, 2023

> How much RAM do you have? Running a Q5_K_M 70B model is going to require around 64GB RAM and pretty much nothing else running on your system. Also, at startup it's going to need to read about 50GB from storage. If you're using a hard drive, this can easily take several minutes.
>
> It also causes general system instability
>
> If a user application can do this, that's basically a problem with the OS.

I have 128 GB of RAM, and roughly 2-3 commits ago it loaded within a decent amount of time. I don't disagree with the OS problem comment, but I have a tendency to mess up Linux installs.

CRD716 changed the title from "[User] Insert summary of your issue or enhancement." to "Freeze after offloading layers to GPU" on Sep 12, 2023
CRD716 (Contributor, Author) commented Sep 12, 2023

80% sure #3110 is the commit that borked it.

JohannesGaessler (Collaborator) commented:

> 80% sure #3110 is the commit that borked it.

Did you actually track down the exact commit that caused the issue? Did you test whether or not you still get the issue with -ngl 0 or when compiling without cuBLAS entirely?
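
One way to pin that down, sketched here for illustration (the <last-good-commit> value is a placeholder for whatever build still worked), is a git bisect with a cuBLAS rebuild and a rerun of the failing command at each step:

git bisect start
git bisect bad HEAD                  # current build freezes
git bisect good <last-good-commit>   # placeholder: last commit known to work
# rebuild with cuBLAS, rerun the failing main command, then mark the result:
git bisect good   # or: git bisect bad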

CRD716 (Contributor, Author) commented Sep 16, 2023

> 80% sure #3110 is the commit that borked it.
>
> Did you actually track down the exact commit that caused the issue? Did you test whether or not you still get the issue with -ngl 0 or when compiling without cuBLAS entirely?

Sorry, I just recently had time to test. I've updated to the latest version and it gets further in the process before just crashing. It does this both with and without -ngl. The log cuts off after "warming up the model with an empty run".

[1694833318] Log start
[1694833318] Cmd: main -m E:\largefiles\LLAMA-2\70B\marcoroni-70b.Q5_K_S.gguf --mirostat 2 --color -c 4096 --temp 0.6 --repeat_penalty 1.1 -n -1 --interactive-first -ins
[1694833318] main: build = 1247 (e6616cf)
[1694833318] main: built with  for unknown
[1694833318] main: seed  = 1694833318
[1694833318] main: llama backend init
[1694833319] main: load the model and apply lora adapter, if any
[1694833323] warming up the model with an empty run

CRD716 (Contributor, Author) commented Sep 16, 2023

llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 4096
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 80
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 28672
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 70B
llm_load_print_meta: model ftype    = mostly Q5_K - Small
llm_load_print_meta: model size     = 68.98 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 38529.47 MB (+ 1280.00 MB per state)
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloaded 12/83 layers to GPU
llm_load_tensors: VRAM used: 6733 MB
....................................................................................
E:\Installs\llama.cpp\build\bin\Release>

Here's the command prompt output

Green-Sky (Collaborator) commented:

Did you try supplying a prompt, either with -p or -f?
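
For example, with illustrative paths and prompt text:

main -m model.gguf -ngl 18 -p "Hello, my name is"   # inline prompt
main -m model.gguf -ngl 18 -f prompt.txt            # prompt read from a file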

CRD716 (Contributor, Author) commented Sep 16, 2023

> Did you try supplying a prompt, either with -p or -f?

Yes, same issue.


Void-025 commented Nov 11, 2023

I'm having the same issue on EndeavourOS with an RX580 and hipBLAS; it just freezes after the section where it prints dots. Using it with no layers on the GPU works fine for me, though.

Void-025 commented:

Just tried it with CLBlast and that works properly, though very slowly. Has anyone figured out what causes cuBLAS and hipBLAS to freeze?


8XXD8 commented Nov 17, 2023

With hipBLAS I can only load large models with --no-mmap; otherwise it just loads forever.
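
For reference, that just means adding the flag to the usual invocation (model path illustrative):

main -m /path/to/model.gguf -ngl 18 --no-mmap   # load the file into RAM up front instead of mmap-ing it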

Void-025 commented:

Just tried that; it still seems to get stuck after printing a bunch of dots.


8XXD8 commented Nov 17, 2023

You need a ton of RAM, or swap, for it to work. I could not load a 70B Q3_S model with --no-mmap on a machine with 32 GB VRAM and 32 GB RAM, but with 40 GB of swap it works with 32 GB VRAM and 16 GB RAM.


Void-025 commented Nov 17, 2023

It's a 7B model, which I've been able to load with CLBlast, and I'm only offloading 1 layer anyway (for now, to test whether it works).


8XXD8 commented Nov 17, 2023

I still think it's worth a try.

Look at iotop during loading. After a while, I had a kworker process eat up all the HDD bandwidth instead of the llama process.
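
A convenient way to watch for that is something like:

sudo iotop -o -a   # -o: only processes currently doing I/O, -a: accumulated totals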

Void-025 commented:

So I shut down everything that was using disk bandwidth according to iotop while I was trying to check how much llama.cpp was using, and apparently that fixed it. It still took absolutely ages to load, but this time it actually did load, so apparently I/O was my problem. Thanks!
It then gave me a CUDA error 98 and crashed, but that's probably unrelated.


8XXD8 commented Nov 17, 2023

You have to compile it with make; cmake won't compile it for your GPU without setting your GPU arch with -DAMDGPU_TARGETS=<arch>.
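
A hipBLAS build targeting an RX580 would look roughly like this sketch (gfx803 is the Polaris architecture the RX580 uses; the option name has changed across llama.cpp versions, so treat the flag as illustrative):

cmake .. -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx803   # build the ROCm backend for Polaris (RX580)
cmake --build . --config Release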

Void-025 commented:

That actually was the make build, but I tried it using cmake with the argument you mentioned, and it seems to be working, sort of. It took even longer than before to load, froze for ages on every step, and now that it's finally finished it seems to be frozen again and won't let me enter any text (I'm using interactive mode). Still, progress is progress; I'll try it with a prompt in non-interactive mode next and see what happens.

By the way, how long is it expected to take to load? Is it meant to take much longer when offloading to GPU than not?


8XXD8 commented Nov 17, 2023

It's not supposed to be any slower. From an NVMe SSD it is really fast, but some of my models are on an HDD, and from there a ~32 GB 70B model takes about 4 minutes to load.
Have you tried increasing the swap size? That helped for me.
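
Adding swap on a typical Linux install can be sketched as follows (the size is illustrative):

sudo fallocate -l 32G /swapfile   # create a 32 GB swap file
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile             # active until reboot; add it to /etc/fstab to persist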

Void-025 commented:

Well, the models are all on a (SATA) SSD, and swap is barely being used, so I really have no idea why it's so slow. It also never seems to get to the actual "generating" part; I've let it run for hours and it never does anything, so either it's really slow or it's frozen.

Judging by iotop and RAM usage, it seems like the actual "loading" part happens pretty fast, because after the first few minutes it doesn't really seem to read anything from the disk or into memory, so I really have no idea what it's doing for the entire rest of the time. It seems to consistently use ~8% CPU right after the section where it prints dots, so it's clearly doing something, though. After that, it prints some more info ("llama_new_context_with_model") and starts using ~50% CPU. At this point in interactive mode it printed the interactive instructions and froze, but if I give it a prompt to start with, it instead prints the prompt and then freezes again (still with 50% CPU usage). So far it hasn't gotten past this last freeze.


8XXD8 commented Nov 17, 2023

What are your system specs, and how large a model are you trying to load?


Void-025 commented Nov 17, 2023

CPU: AMD Ryzen 5 2600
GPU: RX580 (8 GB VRAM)
16 GB RAM, 16 GB swap
The model is a 7B Q8_0.
I know from experimenting with CPU-only and CLBlast that the model should be able to fit (and run at around 3 and 5 tokens per second respectively) in both RAM and VRAM (barely). Either way, I'm only offloading 1 layer, which llama.cpp says takes up just under 300 MB of VRAM.
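
(As a rough check, assuming Q8_0 averages a little over 8 bits per weight: 7e9 weights x ~8.5 bits / 8 ≈ 7.4 GB, which indeed only barely fits in 8 GB of VRAM, or alongside the OS in 16 GB of RAM, consistent with the numbers above.)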

Void-025 commented:

Oh, it's not frozen, just very slow. It just generated its first token (it's been running for around 2 hours now): "1".


8XXD8 commented Nov 17, 2023

I think your hipBLAS/ROCm install is broken. It's not very robust; this week I updated from the official repo and could not compile anything, and had to reinstall the entire 18 GB package. It had installed an incompatible version of the device libs.

Void-025 commented:

Could be, I guess, but I've been installing and re-installing various packages for the last few days now, so you'd think at least one configuration would have worked by now. It also feels like "broken" would outright fail in some way rather than just being ridiculously slow, but I guess it's not that weird. Oh well, I guess I'll keep trying. Thanks for the help, though.


caiyesd commented Nov 20, 2023

I hit the same issue with an RX580.
When the batch size is 1, the speed is normal, but when the batch size is more than 1, the speed is very low.

./batched-bench models/llama-2-7b-chat.Q5_K_M.gguf $(((512+128)*2)) 0 999 0 64 16 1,2,3
PP   TG   B   N_KV   T_PP s   S_PP t/s   T_TG s    S_TG t/s   T s       S t/s
64   16   1     80    0.703      91.06     1.474      10.86     2.177    36.75
64   16   2    160    1.220     104.96   145.045       0.22   146.264     1.09
64   16   3    240    1.759     109.14   145.093       0.33   146.852     1.63

github-actions bot added the stale label on Mar 20, 2024

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 3, 2024