Change mmap parameters to avoid much swap thrashing #753
Conversation
You can't use
@prusnak Oops, copy-paste error with the PROT_READ. Updated the PR. Thanks for your input. Since I don't really know anything about this stuff and you obviously do, maybe you have some idea why I see an improvement in performance on my machine that I don't understand? I'm testing this by deleting the old
I personally don't care that much about understanding the technical details. I'm just happy that I can load and run the 65B model now with decent performance, and I wanted to share my "insights" (or lack thereof). Feel free to close if it doesn't seem salvageable without further effort from me, which I really lack the understanding/expertise for. Some comments:
Hmm, I compiled with
When I use
No idea what all this means. Maybe the
One thing I wonder about: is running multiple llama processes actually a common use case? Certainly not for me. Perhaps it would make sense to make any features tailored for that optional? I only know that the introduction of mmap caused my performance to degrade due to swap thrashing, and I can avoid this by either not using mmap (reverting to the older version), by using --mlock (but then the process dies running out of memory; apparently it can't accept swap as a substitute), or by using this "trick".
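For context on the --mlock failure mentioned above: mlock pins pages in physical RAM and cannot fall back to swap, so locking a model larger than available RAM fails with ENOMEM instead of spilling over. A minimal sketch of that failure mode (the file name is illustrative, and this is not the actual llama.cpp loader code):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    // Illustrative path; any file larger than free RAM shows the effect.
    int fd = open("ggml-model-q4_0.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    size_t size = (size_t) st.st_size;

    // File-backed, read-only mapping: pages are faulted in lazily and
    // can be evicted again under memory pressure (the thrashing case).
    void *addr = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    // mlock pins every page in physical RAM; swap cannot substitute.
    // If the file exceeds available RAM (or RLIMIT_MEMLOCK), this
    // fails with ENOMEM/EPERM instead of spilling to swap.
    if (mlock(addr, size) != 0) {
        perror("mlock");
    }

    munmap(addr, size);
    close(fd);
    return 0;
}
```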
Can you just try with MAP_POPULATE?
@prusnak When I omit
I don't know why my computer behaves like this. But the combination
Thank you for these tests! Although I can't really wrap my head around the MAP_ANONYMOUS behaviour, because it's supposed to behave as described here: https://stackoverflow.com/questions/34042915/what-is-the-purpose-of-map-anonymous-flag-in-mmap-system-call
ChatGPT also says:
Are you sure the inference gives back meaningful results when you provide the MAP_ANONYMOUS flag?
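That doubt is easy to check in isolation: with MAP_ANONYMOUS set, the mapping is not backed by the file at all; Linux ignores the fd and hands back zero-filled pages. A minimal demonstration (the file name is illustrative):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    int fd = open("ggml-model-q4_0.bin", O_RDONLY);  // illustrative path
    if (fd < 0) { perror("open"); return 1; }

    // With MAP_ANONYMOUS the mapping is not backed by the file: Linux
    // ignores fd (portable code should pass -1) and returns zero-filled
    // pages. MAP_POPULATE then merely prefaults those zero pages.
    unsigned char *p = mmap(NULL, 4096, PROT_READ,
                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
                            fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    // Prints "0 0 0 0", not the model file's magic bytes.
    printf("%u %u %u %u\n", p[0], p[1], p[2], p[3]);
    return 0;
}
```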
Closing this PR, because it seems to be garbage.
@trollkotze No worries, at least I learned about MAP_POPULATE thanks to you, and I updated my PR here: #740. Thanks again!
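For anyone skimming, the gist of #740 is to keep the ordinary file-backed mapping and only add MAP_POPULATE, which asks the kernel to prefault the whole file at mmap time instead of taking a page fault per page during the first inference pass. A rough sketch of the idea (the function name and structure are illustrative, not the actual diff):

```c
#include <stddef.h>
#include <sys/mman.h>

// Sketch only: fd/size would come from open()/fstat() on the model
// file, as in llama.cpp's loader.
void *map_model(int fd, size_t size) {
    int flags = MAP_PRIVATE;
#ifdef MAP_POPULATE
    // Linux-specific: prefault the whole mapping at mmap() time, so
    // the file is read ahead in one go instead of via per-page faults
    // scattered across the first inference pass.
    flags |= MAP_POPULATE;
#endif
    return mmap(NULL, size, PROT_READ, flags, fd, 0);
}
```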
BTW, @trollkotze, are you using zram?
@ivanstepanovftw No, I hadn't been until now. Just trying it out now. Interesting stuff. It makes it possible for me to load the 65B model with 32 GB of RAM and 50% of that configured as zram swap. Still very slow, and strangely only little RAM is actually consumed without random tweaks to the mmap params that I don't understand. Long pauses between short bursts where the CPU gets busy.
I have disabled my zram and left my
I have 15 GiB, and 64 GiB swap with
Here are my options:
If you find some godlike settings for you, please share ^_^ I have also tried settings from here, but still no luck with zram...
The introduction of loading the model through mmap, merged in PR #613 (picked up from #586), while probably being a great idea, led to issues with swap thrashing for many users. While previously the model was completely loaded into RAM, it now used swap (unless using --mlock, which requires certain user privileges), even when enough RAM was available, making performance unbearably slow.
I don't really know anything about these things. So I asked ChatGPT what could be the problem, and it gave me a hint that works wonders for me. I don't know how universally viable or useful this is, so it would be great if someone who has more expertise in these matters could double-check. But I believe it is probably useful for many.
I can now even load the 65B model with my 32 GB of RAM and performance seems to be similar to what it was for the 30B model before.
Props to @slaren or @jart for introducing mmap loading. With this tweak it is now finally as performant as before for me (maybe better), while being able to combine RAM and swap memory in a useful way (first using RAM, then swap; that's what seems to be happening with this change now).
For reference, here is my conversation with ChatGPT, which suggested this tweak:
trollkotze:
ChatGPT:
trollkotze:
ChatGPT:
I have just used the flags that ChatGPT suggested here:
MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE
Probably MAP_POPULATE is the most important one. Not sure about the others, but I just kept them all. And the swap thrashing issue is gone with that. So I'm happy. Thanks, ChatGPT! :D
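In hindsight, the two flag sets discussed in this thread do very different things, which likely explains both the apparent speed-up and the garbage results. A hedged side-by-side sketch (fd is assumed to be an open read-only descriptor for the model file; neither function is the actual llama.cpp code):

```c
#include <stddef.h>
#include <sys/mman.h>

// As in this PR: MAP_ANONYMOUS means the mapping is NOT backed by the
// file. Linux ignores fd here, the pages are zero-filled, and no model
// data is ever read from disk (hence no thrashing, but garbage weights).
static void *map_as_in_this_pr(int fd, size_t size) {
    return mmap(NULL, size, PROT_READ,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, fd, 0);
}

// As in #740: file-backed and prefaulted. This keeps the part of the
// combination that actually helps (MAP_POPULATE) while still mapping
// the real model file.
static void *map_as_in_pr_740(int fd, size_t size) {
    return mmap(NULL, size, PROT_READ,
                MAP_PRIVATE | MAP_POPULATE, fd, 0);
}
```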