Freeze after offloading layers to GPU #3135
How much RAM do you have?
If a user application can do this, that's basically a problem with the OS.
I have 128 GB of RAM, and roughly 2-3 commits ago it loaded within a decent amount of time. Also, I don't disagree with the OS-problem comment, but I have a tendency to mess up Linux installs.
80% sure #3110 is the commit that borked it. https://github.com/ggml-org/ci/tree/results/llama.cpp/d5/4a4027a6ebda98ab0fef7fa0c2247d0bef132a/ggml-4-x86-cuda-v100
Did you actually track down the exact commit that caused the issue? Did you test whether or not you still get the issue with -ngl 0, or when compiling without cuBLAS entirely?
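(For anyone trying to pin down the regression, a standard `git bisect` run is the quickest way. A minimal sketch, assuming the Makefile CUDA build llama.cpp used around this time; the model path, layer count, and commit hash are placeholders, not values from this thread:)

```sh
# Sketch: bisect llama.cpp between a known-good commit and the current HEAD.
git bisect start
git bisect bad HEAD                   # current tree freezes
git bisect good <last-good-commit>    # e.g. the build from 2-3 commits earlier

# At each step git checks out a candidate commit; rebuild and test it:
make clean && make LLAMA_CUBLAS=1
./main -m model.gguf -ngl 18 -n 16 -p "hello" \
  && git bisect good || git bisect bad

git bisect reset                      # return to the original branch when done
```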
Sorry, just recently had time to test. I've updated to the latest version and it gets further in the process before just crashing. It does this both with and without -ngl. The log cuts off after "warming up the model with an empty run".
Here's the command prompt output:
Did you try supplying a prompt, either inline or from a file?
Yes, same issue. |
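(For reference, `main` accepts a prompt either inline with `-p` or from a file with `-f`; a minimal sketch, with the model path and layer count as placeholders:)

```sh
# Inline prompt
./main -m ./models/model.gguf -ngl 18 -n 32 -p "Once upon a time"

# Prompt read from a file
./main -m ./models/model.gguf -ngl 18 -n 32 -f prompt.txt
```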
I'm having the same issue on EndeavourOS with an RX580 and hipBLAS; it just freezes after the section where it prints dots. Using it with no layers on the GPU works fine for me, though.
Just tried it with CLBlast and that works properly, though very slowly. Has anyone figured out what causes cuBLAS and hipBLAS to freeze?
With hipBLAS I can only load large models with --no-mmap; otherwise it just loads forever.
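(A minimal sketch of what that looks like; `--no-mmap` makes llama.cpp read the whole model into RAM up front instead of memory-mapping the file. Model path and layer count are placeholders:)

```sh
./main -m ./models/model.gguf --no-mmap -ngl 1 -n 32 -p "test"
```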
Just tried that, seems to still get stuck after printing a bunch of dots.
You need a ton of RAM, or swap, for it to work. I could not load a 70B q3_s model with --no-mmap on a machine with 32 GB VRAM and 32 GB RAM, but with 40 GB of swap it works with 32 GB VRAM and 16 GB RAM.
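(For anyone who needs that extra headroom, adding a swap file on Linux is the standard recipe; the size and path below are just examples matching the comment above:)

```sh
# Create and enable a 40 GB swap file
sudo fallocate -l 40G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show        # verify the new swap is active
```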
It's a 7B model, which I've been able to load with CLBlast, and I'm only offloading 1 layer anyway (for now, to test if it works).
I still think it's worth a try. Look at iotop.
So I shut down everything that was using disk bandwidth according to iotop (while I was trying to check how much llama.cpp was using), and apparently that fixed it. It still took absolutely ages to load, but this time it actually did load, so apparently I/O was my problem. Thanks!
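(For reference, the kind of check described above; `iotop` needs root on most distros, and `iostat` comes from the sysstat package:)

```sh
# Show only processes currently doing I/O, accumulating totals over time
sudo iotop -o -a

# Alternatively, watch per-device utilization every 2 seconds
iostat -x 2
```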
You have to compile it with make; cmake won't compile it for your GPU without setting your GPU arch flag.
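(A sketch of the kind of build meant here, assuming the hipBLAS CMake options llama.cpp used around this time; `gfx803` is the Polaris ISA of the RX 580, and the ROCm clang paths may differ on your distro:)

```sh
mkdir build && cd build
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
  cmake .. -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx803
cmake --build . --config Release -j
```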
That actually was the make build, but I tried it using cmake and the argument you mentioned, and it seems to be working, sort of. It's taken even longer than before to load, froze for ages on every step, and now that it's finally finished it seems to be frozen again and not letting me enter any text (I'm using interactive mode). Still, progress is progress; I'll try it with a prompt in non-interactive mode next and see what happens. By the way, how long is it expected to take to load? Is it meant to take much longer when offloading to GPU than not?
It's not supposed to be any slower. From an NVMe SSD it is really fast, but some of my models are on an HDD, and from there a ~32 GB 70B model takes about 4 minutes to load.
Well, the models are all on a SATA SSD, and swap is barely being used, so I really have no idea why it's so slow. It also never seems to get to the actual "generating" part; I've let it run for hours and it never does anything, so either it's really slow or it's frozen. Judging by iotop and RAM usage, the actual "loading" part happens pretty fast, because after the first few minutes it doesn't really seem to read anything from the disk or into memory, so I have no idea what it's doing for the entire rest of the time.

It does consistently use ~8% CPU right after the section where it prints dots, so it's clearly doing something. After that, it prints some more info ("llama_new_context_with_model") and starts using ~50% CPU. At this point in interactive mode it printed the interactive instructions and froze, but if I give it a prompt to start with, it instead prints the prompt and then freezes again (still with 50% CPU usage). So far it hasn't gotten past this last freeze.
What are your system specs, and how large a model are you trying to load?
CPU: AMD Ryzen 5 2600
Oh, it's not frozen, just very slow. It just generated its first token (it's been running for around 2 hours now): "1".
I think your hipBLAS/ROCm install is broken. It's not very robust; this week I updated from the official repo and could not compile anything, and had to reinstall the entire 18 GB package. It had installed an incompatible version of the device libs.
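(If a broken install is suspected, a few quick sanity checks on the ROCm stack; paths assume the default `/opt/rocm` install location:)

```sh
/opt/rocm/bin/rocminfo | grep -i gfx   # should list your GPU's ISA (e.g. gfx803)
/opt/rocm/bin/rocm-smi                 # basic GPU status and utilization
hipcc --version                        # confirms the HIP compiler itself runs
```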
Could be, I guess, but I've been installing and re-installing various packages for the last few days now, so you'd think at least one configuration would have worked by now. It also feels like "broken" would fail outright in some way rather than just being ridiculously slow, but I guess it's not that weird. Oh well, I'll keep trying. Thanks for the help, though.
I'm seeing the same issue with an RX580.
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Expected Behavior
llama.cpp does not freeze and continues to run normally, without interfering with basic Windows operations.
Current Behavior
llama.cpp then freezes and will not respond. Task Manager shows 0% CPU or GPU load. It also somehow cannot be stopped via Task Manager, requiring me to hard-reset my computer to end the program. It causes general system instability as well; I am writing this with my desktop blacked out and File Explorer frozen.
Environment and Context
Windows 10
128 GB RAM
Threadripper 3970X
RTX 2080TI
CMake 3.27.4
CUDA 12.2
Failure Information (for bugs)
Steps to Reproduce
Run a model with cuBLAS. My exact command:

```
main -ngl 18 -m E:\largefiles\LLAMA-2\70B\uni-tianyan-70b.Q5_K_M.gguf --color -c 4096 --temp 0.6 --repeat_penalty 1.1 -n -1 --interactive-first
```

Failure Logs
I'd love to attach them, but File Explorer stopped working. I'll try to run it again tomorrow and upload the log before everything freezes.