Freeze after offloading layers to GPU #3135

Closed · CRD716 opened this issue Sep 12, 2023 · 27 comments
CRD716 (Contributor) commented Sep 12, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

llama.cpp does not freeze and continues to run normally, without interfering with basic Windows operations.

Current Behavior

llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 4096
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 80
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 28672
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 70B
llm_load_print_meta: model ftype    = mostly Q5_K - Medium
llm_load_print_meta: model size     = 68.98 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 35995.03 MB (+ 1280.00 MB per state)
llm_load_tensors: offloading 18 repeating layers to GPU
llm_load_tensors: offloaded 18/83 layers to GPU
llm_load_tensors: VRAM used: 10500 MB

llama.cpp then freezes and will not respond. Task Manager shows 0% CPU and GPU load. It also cannot be stopped via Task Manager, requiring me to hard-reset my computer to end the program. It also causes general system instability; I am writing this with my desktop blacked out and File Explorer frozen.

Environment and Context

Windows 10
128 GB RAM
Threadripper 3970X
RTX 2080TI
CMake 3.27.4
CUDA 12.2

Failure Information (for bugs)

Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

Run a model with cuBLAS.
(My exact command: main -ngl 18 -m E:\largefiles\LLAMA-2\70B\uni-tianyan-70b.Q5_K_M.gguf --color -c 4096 --temp 0.6 --repeat_penalty 1.1 -n -1 --interactive-first)
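
A cuBLAS-enabled build of the kind used here looks roughly like the following sketch (assuming CMake and the CUDA toolkit are installed; the option name has changed across llama.cpp versions, so treat the flag as illustrative):

# hypothetical build sketch for a cuBLAS-enabled main.exe
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON        # enable the CUDA/cuBLAS backend (older option name)
cmake --build . --config Release  # binaries end up under build\bin\Release on Windows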

Failure Logs

I'd love to attach them, but file manager stopped working. I'll try and run it again tomorrow and upload the log before everything freezes.

KerfuffleV2 (Collaborator) commented:

How much RAM do you have? Running a Q5_K_M 70B model is going to require around 64GB RAM and pretty much nothing else running on your system. Also, at startup it's going to need to read about 50GB from storage. If you're using a hard drive, this can easily take several minutes.

> It also causes general system instability

If a user application can do this, that's basically a problem with the OS.
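
(As a rough sanity check of those figures, assuming Q5_K_M averages roughly 5.7 bits per weight: 68.98e9 weights x ~5.7 bits / 8 ≈ 49 GB on disk, which roughly matches the ~36 GB "mem required" plus ~10.5 GB of VRAM reported in the log above.)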

CRD716 (Contributor, Author) commented Sep 12, 2023

> How much RAM do you have? Running a Q5_K_M 70B model is going to require around 64GB RAM and pretty much nothing else running on your system. Also, at startup it's going to need to read about 50GB from storage. If you're using a hard drive, this can easily take several minutes.
>
> It also causes general system instability
>
> If a user application can do this, that's basically a problem with the OS.

I have 128 GB of RAM, and roughly 2-3 commits ago it loaded within a decent amount of time. I don't disagree with the OS problem comment, but I have a tendency to mess up Linux installs.

CRD716 changed the title from "[User] Insert summary of your issue or enhancement." to "Freeze after offloading layers to GPU" on Sep 12, 2023
CRD716 (Contributor, Author) commented Sep 12, 2023

80% sure #3110 is the commit that borked it.

JohannesGaessler (Collaborator) commented:

> 80% sure #3110 is the commit that borked it.

Did you actually track down the exact commit that caused the issue? Did you test whether or not you still get the issue with -ngl 0 or when compiling without cuBLAS entirely?
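
One way to pin that down, sketched here for illustration (the <last-good-commit> value is a placeholder for whatever build still worked), is a git bisect with a cuBLAS rebuild and a rerun of the failing command at each step:

git bisect start
git bisect bad HEAD                  # current build freezes
git bisect good <last-good-commit>   # placeholder: last commit known to work
# rebuild with cuBLAS, rerun the failing main command, then mark the result:
git bisect good   # or: git bisect bad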

CRD716 (Contributor, Author) commented Sep 16, 2023

> 80% sure #3110 is the commit that borked it.
>
> Did you actually track down the exact commit that caused the issue? Did you test whether or not you still get the issue with -ngl 0 or when compiling without cuBLAS entirely?

Sorry, I just recently had time to test. I've updated to the latest version and it gets further in the process before just crashing. It does this both with and without -ngl. The log cuts off after "warming up the model with an empty run".

[1694833318] Log start
[1694833318] Cmd: main -m E:\largefiles\LLAMA-2\70B\marcoroni-70b.Q5_K_S.gguf --mirostat 2 --color -c 4096 --temp 0.6 --repeat_penalty 1.1 -n -1 --interactive-first -ins
[1694833318] main: build = 1247 (e6616cf)
[1694833318] main: built with  for unknown
[1694833318] main: seed  = 1694833318
[1694833318] main: llama backend init
[1694833319] main: load the model and apply lora adapter, if any
[1694833323] warming up the model with an empty run

CRD716 (Contributor, Author) commented Sep 16, 2023

llm_load_print_meta: format         = GGUF V2 (latest)
llm_load_print_meta: arch           = llama
llm_load_print_meta: vocab type     = SPM
llm_load_print_meta: n_vocab        = 32000
llm_load_print_meta: n_merges       = 0
llm_load_print_meta: n_ctx_train    = 4096
llm_load_print_meta: n_ctx          = 4096
llm_load_print_meta: n_embd         = 8192
llm_load_print_meta: n_head         = 64
llm_load_print_meta: n_head_kv      = 8
llm_load_print_meta: n_layer        = 80
llm_load_print_meta: n_rot          = 128
llm_load_print_meta: n_gqa          = 8
llm_load_print_meta: f_norm_eps     = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: n_ff           = 28672
llm_load_print_meta: freq_base      = 10000.0
llm_load_print_meta: freq_scale     = 1
llm_load_print_meta: model type     = 70B
llm_load_print_meta: model ftype    = mostly Q5_K - Small
llm_load_print_meta: model size     = 68.98 B
llm_load_print_meta: general.name   = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.23 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  = 38529.47 MB (+ 1280.00 MB per state)
llm_load_tensors: offloading 12 repeating layers to GPU
llm_load_tensors: offloaded 12/83 layers to GPU
llm_load_tensors: VRAM used: 6733 MB
....................................................................................
E:\Installs\llama.cpp\build\bin\Release>

Here's the command prompt output

Green-Sky (Collaborator) commented:

Did you try supplying a prompt, either with -p or -f?
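
For example, with illustrative paths and prompt text:

main -m model.gguf -ngl 18 -p "Hello, my name is"   # inline prompt
main -m model.gguf -ngl 18 -f prompt.txt            # prompt read from a file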

CRD716 (Contributor, Author) commented Sep 16, 2023

> Did you try supplying a prompt, either with -p or -f?

Yes, same issue.


Void-025 commented Nov 11, 2023

I'm having the same issue on EndeavourOS with an RX580 and hipBLAS; it just freezes after the section where it prints dots. Using it with no layers on the GPU works fine for me, though.

Void-025 commented:

Just tried it with CLBlast and that works properly, though very slowly. Has anyone figured out what causes cuBLAS and hipBLAS to freeze?


8XXD8 commented Nov 17, 2023

With hipBLAS I can only load large models with --no-mmap; otherwise it just loads forever.
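
For reference, that just means adding the flag to the usual invocation (model path illustrative):

main -m /path/to/model.gguf -ngl 18 --no-mmap   # load the file into RAM up front instead of mmap-ing it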

Void-025 commented:

Just tried that; it still seems to get stuck after printing a bunch of dots.


8XXD8 commented Nov 17, 2023

You need a ton of RAM, or swap, for it to work. I could not load a 70B Q3_S model with --no-mmap on a machine with 32 GB VRAM and 32 GB RAM, but with 40 GB of swap it works with 32 GB VRAM and 16 GB RAM.


Void-025 commented Nov 17, 2023

It's a 7B model, which I've been able to load with CLBlast, and I'm only offloading 1 layer anyway (for now, to test whether it works).


8XXD8 commented Nov 17, 2023

I still think it's worth a try.

Look at iotop during loading. After a while, I had a kworker process eat up all the HDD bandwidth instead of the llama process.
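
A convenient way to watch for that is something like:

sudo iotop -o -a   # -o: only processes currently doing I/O, -a: accumulated totals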

Void-025 commented:

So I shut down everything that was using disk bandwidth according to iotop while I was trying to check how much llama.cpp was using, and apparently that fixed it. It still took absolutely ages to load, but this time it actually did load, so apparently I/O was my problem. Thanks!
It then gave me a CUDA error 98 and crashed, but that's probably unrelated.


8XXD8 commented Nov 17, 2023

You have to compile it with make; cmake won't compile it for your GPU without setting your GPU arch with -DAMDGPU_TARGETS=<arch>.
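
A hipBLAS build targeting an RX580 would look roughly like this sketch (gfx803 is the Polaris architecture the RX580 uses; the option name has changed across llama.cpp versions, so treat the flag as illustrative):

cmake .. -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx803   # build the ROCm backend for Polaris (RX580)
cmake --build . --config Release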

Void-025 commented:

That actually was the make build, but I tried it using cmake with the argument you mentioned, and it seems to be working, sort of. It took even longer than before to load, froze for ages on every step, and now that it's finally finished it seems to be frozen again and won't let me enter any text (I'm using interactive mode). Still, progress is progress; I'll try it with a prompt in non-interactive mode next and see what happens.

By the way, how long is it expected to take to load? Is it meant to take much longer when offloading to GPU than not?


8XXD8 commented Nov 17, 2023

It's not supposed to be any slower. From an NVMe SSD it is really fast, but some of my models are on an HDD, and from there a ~32 GB 70B model takes about 4 minutes to load.
Have you tried increasing the swap size? That helped for me.
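
Adding swap on a typical Linux install can be sketched as follows (the size is illustrative):

sudo fallocate -l 32G /swapfile   # create a 32 GB swap file
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile             # active until reboot; add it to /etc/fstab to persist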

Void-025 commented:

Well, the models are all on a (SATA) SSD, and swap is barely being used, so I really have no idea why it's so slow. It also never seems to get to the actual "generating" part; I've let it run for hours and it never does anything, so either it's really slow or it's frozen.

Judging by iotop and RAM usage, it seems like the actual "loading" part happens pretty fast, because after the first few minutes it doesn't really seem to read anything from the disk or into memory, so I really have no idea what it's doing for the entire rest of the time. It seems to consistently use ~8% CPU right after the section where it prints dots, so it's clearly doing something, though. After that, it prints some more info ("llama_new_context_with_model") and starts using ~50% CPU. At this point in interactive mode it printed the interactive instructions and froze, but if I give it a prompt to start with, it instead prints the prompt and then freezes again (still with 50% CPU usage). So far it hasn't gotten past this last freeze.


8XXD8 commented Nov 17, 2023

What are your system specs, and how large a model are you trying to load?


Void-025 commented Nov 17, 2023

CPU: AMD Ryzen 5 2600
GPU: RX580 (8 GB VRAM)
16 GB RAM, 16 GB swap
The model is a 7B Q8_0.
I know from experimenting with CPU-only and CLBlast that the model should be able to fit (and run at around 3 and 5 tokens per second respectively) in both RAM and VRAM (barely). Either way, I'm only offloading 1 layer, which llama.cpp says takes up just under 300 MB of VRAM.
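
(As a rough check, assuming Q8_0 averages a little over 8 bits per weight: 7e9 weights x ~8.5 bits / 8 ≈ 7.4 GB, which indeed only barely fits in 8 GB of VRAM, or alongside the OS in 16 GB of RAM, consistent with the numbers above.)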

Void-025 commented:

Oh, it's not frozen, just very slow. It just generated its first token (it's been running for around 2 hours now): "1".


8XXD8 commented Nov 17, 2023

I think your hipBLAS/ROCm install is broken. It's not very robust; this week I updated from the official repo and could not compile anything, and had to reinstall the entire 18 GB package. It had installed an incompatible version of the device libs.

Void-025 commented:

Could be, I guess, but I've been installing and re-installing various packages for the last few days now, so you'd think at least one configuration would have worked by now. It also feels like "broken" would outright fail in some way rather than just being ridiculously slow, but I guess it's not that weird. Oh well, I guess I'll keep trying. Thanks for the help, though.


caiyesd commented Nov 20, 2023

I hit the same issue with an RX580.
When the batch size is 1, the speed is normal, but when the batch size is more than 1, the speed is very low.

./batched-bench models/llama-2-7b-chat.Q5_K_M.gguf $(((512+128)*2)) 0 999 0 64 16 1,2,3
PP   TG   B   N_KV   T_PP s   S_PP t/s   T_TG s    S_TG t/s   T s       S t/s
64   16   1     80    0.703      91.06     1.474      10.86     2.177    36.75
64   16   2    160    1.220     104.96   145.045       0.22   146.264     1.09
64   16   3    240    1.759     109.14   145.093       0.33   146.852     1.63

github-actions bot added the stale label on Mar 20, 2024

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed on Apr 3, 2024