"Not enough space in the context's memory pool" exception in 1.52 #563
Comments
Same error, different model (Goliath 120b). Worked fine until I updated, can't get it to load no matter what flags I set. |
Yeah, I get the same with multiple models I've tried. I figured it was the larger vram requirements noted in the changelog:
but reducing the offloaded layers even to extreme levels changes nothing. |
Also tried running --cublas lowvram but still no dice. I even tried --clblast instead of cublas and got the same error. Tried a lower context size: same. Tried --nommap, but that filled my RAM so completely that the desktop locked up and only a hard reboot was possible. It's unlikely to be the environment, as it looks like OP uses Windows and I'm on Arch Linux. If the entire release were completely broken, confused users would have been streaming in from the get-go, but for now that doesn't seem to be the case. What's going on? |
I can repro this and have found a solution, but it is a bit strange as to why it happens. I'll also open an issue upstream as they'll likely have the same problem. |
I think I fixed it, but it would be good if someone could verify with ggerganov#4461. |
@LostRuins From reading your other thread, this seems to be caused by the "partial per-layer KV offloading" in the latest version, but I have to ask: what exactly does it do/mean? I tried looking it up on the wiki here and more broadly online but came up blank. Also, is it preferred despite the increase in VRAM usage? |
Actually, this issue is not related to the partial KV offloading at all. It was caused by a numerical precision overflow, where a float cannot accurately represent a very large number. Previously, KV was only offloaded at the end, after all other layers had been offloaded. Partial KV offloading means offloading a portion of the KV progressively alongside the other layers, which usually results in faster speed if you're able to offload about 3/4 of all layers. I personally don't find it that useful, and usually disable it for myself. But some people seem to like it. |
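To illustrate the kind of precision loss described above, here is a minimal Python sketch. It is not the actual koboldcpp or upstream code, and the sizes are made-up values; it only shows that a 32-bit float cannot hold every large integer exactly, so a byte count routed through one can come back smaller than what was requested. The helper names `to_float32` and `float32_shortfall` are hypothetical.

```python
import struct


def to_float32(value: float) -> float:
    """Round-trip a number through IEEE 754 single precision (32-bit float)."""
    return struct.unpack("<f", struct.pack("<f", float(value)))[0]


def float32_shortfall(nbytes: int) -> int:
    """Bytes lost if a size is stored in a float32 and converted back to int.

    A positive result means the stored size is smaller than the requested one.
    """
    return nbytes - int(to_float32(nbytes))


# float32 has a 24-bit significand: above 2**24 not every integer is representable.
small = 2 ** 24 + 1
# Hypothetical pool size of roughly 3 GiB plus 100 bytes (illustrative only).
large = 3 * 2 ** 30 + 100

print(int(to_float32(small)), "stored for", small)
print(float32_shortfall(large), "bytes missing for", large)
```

If a memory pool were sized from the rounded value while the tensors needed the exact one, the pool would come up a few bytes short, which is one plausible way to end up with a "not enough space" error on large models only.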
So you use Also is there any objective way to benchmark this? I'm guessing... seeing how a model runs with the same amount of gpu memory taken? |
You are right, I am setting lowvram to be disabled by default in the next version. Benchmark by measuring the time taken with a bunch of prompts. I have done so before, see ggerganov#4309. Some people claim it's better for them. It is a trade-off, and I would rather have slower processing but faster generation. |
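The benchmarking suggestion above (time a bunch of prompts under each setting) could be sketched roughly as follows. This is only an assumption about how one might do it: `generate` is a hypothetical callable standing in for whatever sends a prompt to the running model, and nothing here is part of koboldcpp itself.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def benchmark(generate: Callable[[str], str],
              prompts: Iterable[str],
              repeats: int = 3) -> float:
    """Return the mean wall-clock seconds per call of `generate` over all prompts.

    `generate` is a placeholder for your own function that submits a prompt
    and waits for the full reply. At least one prompt is required.
    """
    timings = []
    for prompt in prompts:
        for _ in range(repeats):
            start = time.perf_counter()
            generate(prompt)
            timings.append(time.perf_counter() - start)
    return mean(timings)


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without a model loaded.
    def fake_generate(prompt: str) -> str:
        time.sleep(0.01)
        return prompt.upper()

    seconds = benchmark(fake_generate, ["short prompt", "a somewhat longer prompt"])
    print(f"mean seconds per call: {seconds:.4f}")
```

Running the same prompt set once with lowvram on and once with it off, with the same number of offloaded layers, gives two numbers to compare. Timing prompt processing and generation separately would need the backend to report them, which this sketch does not attempt.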
Isn't that already the case? And no no, I meant it as "is using this flag how you personally plan to counteract this change from now on in your own personal gens?" since you said you prefer the old behavior with faster generation vs processing (just like me) I was trying to figure out what i need to change in my settings so that this doesn't affect me negatively. |
I was referring to the GUI defaults, when running in GUI mode. I will leave the lowvram checkbox unchecked by default in the next version. If you're using the command line, then it will be whatever you set it. You don't have to change anything. All existing configs and command line args will work. |
Got it but my point was that if one wants to keep the slower processing and faster generation (the pre 1.52 behavior when not fully offloading to vram) the only way currently seems to be using |
Ah no, the scratch buffer thing was for pre-gguf model behavior. There is no difference anymore. Basically:
|
I don't understand why this is even necessary. The task of initial processing of a large context is much, MUCH better solved by preserving the model context on exit. Everything else is a disadvantage. Tested the new version (1.52.1) with the model from the initial post. Now it loads and works. |
So even with the reduced amount of layers given the size increase it should still generate faster if over 3/4 of all layers? How? I thought it only affected processing and nothing else?
But if one were to still use pre-gguf models (e.g. airoboros 33b), it would still be detrimental, no? 👀 I know I'm biased here since I value actual gen speed over processing, but wouldn't it be easier to have it (partial offloading) as an optional flag for people who specifically want to use it? You mentioned a very good point on the other thread, namely:
which leads me to believe this will be most likely detrimental for most kcpp users since I imagine almost everyone makes use of smart context/context shifting to reduce subsequent processing. |
Yeah, if you are using pre-gguf models, then you should toggle lowvram off/on as needed. Partial offloading is an optional flag; it's disabled with |
It is optional in the sense that it can be toggled, yes, but since it's not the default state, doesn't that make the lowvram flag the optional one in this context instead?
I understand, it's hard to please everyone. Given that the issue isn't really an issue per se and just a preference, I guess you should have the final say in how this behavior is set by default. |
Am I understanding correctly that if I didn't read this thread, I would suddenly lose ~20% of generation speed after upgrading to 1.52? :) |
Not exactly. You would now OOM a lot faster, and fixing that would net you some speed reduction. It's why I figured it should be pointed out as an issue but oh well. |
The new version of Koboldcpp (1.52) terminates with an error when trying to load the model "nethena-mlewd-xwin-23b.Q3_K_M.gguf":
This was not the case with previous versions of the program. I have 64GB of RAM and 8GB of VRAM.
The model is from here: https://huggingface.co/TheBloke/Nethena-MLewd-Xwin-23B-GGUF/tree/main