
Change mmap parameters to avoid much swap thrashing #753


Closed
wants to merge 2 commits

Conversation

trollkotze

@trollkotze trollkotze commented Apr 4, 2023

The introduction of loading the model through mmap, merged in PR #613 (picked up from #586), while probably a great idea, led to swap-thrashing issues for many users. While previously the model was loaded completely into RAM, it now goes through swap (unless using --mlock, which requires certain user privileges), even when enough RAM is available, making performance unbearably slow.

I don't really know anything about these things. So I asked ChatGPT what could be the problem. And it gave me a hint that works wonders for me. I don't know how universally viable or useful this is, so it would be great if someone who has more expertise in these matters could double check. But I believe it is probably useful for many.

I can now even load the 65B model with my 32 GB of RAM and performance seems to be similar to what it was for the 30B model before.

Props to @slaren or @jart for introducing mmap loading. With this tweak it is now finally as performant as before for me (maybe better), while combining RAM and swap memory in a useful way (first using RAM, then swap; that's what seems to be happening with this change).

For reference, here is my conversation with ChatGPT, which suggested this tweak:

trollkotze:

How can I avoid swap thrashing when allocating a huge chunk of memory with mmap?

ChatGPT:

When allocating a large chunk of memory with mmap, swap thrashing can occur if the system tries to allocate more memory than what's available in both physical RAM and swap space. To avoid swap thrashing, consider the following steps:

  1. Check available memory: Before allocating memory, check the available physical memory and swap space on your system. You can use commands like free -h or vmstat on Linux to check memory usage.
  2. Request appropriate memory size: Be conservative in your memory allocation request. Allocate only the amount of memory you need and ensure that it fits within the available physical memory and swap space.
  3. Use the MAP_NORESERVE flag: When using mmap, you can use the MAP_NORESERVE flag to prevent the system from reserving swap space for the allocated memory. This can be helpful if you don't expect the mapped memory to be fully used, but be cautious, as this may lead to a sudden out-of-memory (OOM) condition if you actually use more memory than what's available.
  4. Use memory in smaller chunks: Instead of allocating a huge chunk of memory, consider allocating smaller chunks and managing them within your program. This can help reduce the impact on the system and avoid swap thrashing.
  5. Monitor memory usage: Keep an eye on the memory usage of your program using system monitoring tools or custom code. If the memory usage starts to approach the limits of available physical memory and swap space, you can take appropriate action to avoid swap thrashing.
  6. Increase available memory: If you find that your system is consistently running out of memory, consider adding more RAM or increasing the swap space.
  7. Adjust the swappiness value: On Linux systems, you can modify the swappiness value to control how aggressively the kernel will use the swap space. A lower value will make the system less likely to swap out memory, which could help avoid swap thrashing. To temporarily change the swappiness value, use the following command:
sudo sysctl vm.swappiness=10
To make the change permanent, edit the /etc/sysctl.conf file and add or modify the following line:
vm.swappiness=10

Remember, it's crucial to balance memory usage and system resources. Always allocate memory carefully and monitor your program's behavior to avoid swap thrashing and other performance issues.

trollkotze:

I already reduced swappiness to 5. The strange thing is that before using mmap everything was loaded into RAM. Now almost no RAM seems to be used. Instead there's swap thrashing and very slow performance.

ChatGPT:

It's possible that using mmap() with the wrong flags or settings is causing the memory to be swapped out aggressively, despite your low swappiness setting. When using mmap(), the kernel may decide to swap out memory pages depending on the flags and access patterns. Here are some suggestions to address the issue:

  1. Use MAP_POPULATE flag: When calling mmap(), try adding the MAP_POPULATE flag. This flag tells the kernel to pre-fault and populate the memory mappings in advance. This may help reduce swap thrashing by making sure that the pages are already in memory when you access them.
void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
  2. Access memory in a sequential manner: Accessing memory in a random or non-sequential manner can cause the kernel to swap out pages more aggressively. Try to access the mapped memory in a sequential manner, if possible, to avoid triggering swapping.
  3. Incremental allocation: Instead of allocating a large chunk of memory at once, consider allocating smaller chunks incrementally and managing them within your program. This can help reduce the impact on the system and avoid swap thrashing.
  4. Lock the memory with mlock(): If it's crucial to keep the mapped memory in RAM and avoid swapping, you can use the mlock() function to lock the memory. Note that this may require elevated privileges and can lead to resource exhaustion if not used carefully. Be cautious when using this approach and ensure you have a good understanding of your memory requirements.
// After mmap() call
if (mlock(addr, size) != 0) {
    perror("mlock");
    // Handle the error
}
  5. Review your program's overall memory usage: Check if there are other parts of your program that could be consuming memory and causing the swap thrashing. Optimize your program's memory usage to ensure there is enough RAM available for the mapped memory.

Remember, proper memory management is key to avoiding swap thrashing and maintaining good performance. Ensure that you use appropriate mmap() flags and access patterns, and monitor your program's memory usage to avoid issues.

I have just used the flags that ChatGPT suggested here: MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE.
Probably MAP_POPULATE is the most important one. Not sure about the others, but I just kept them all. And the swap thrashing issue is gone with that. So I'm happy. Thanks, ChatGPT! :D

@prusnak
Collaborator

prusnak commented Apr 4, 2023

  1. You duplicated the PROT_READ argument. The code does not compile because of this. Did you really submit the code you actually tested?

  2. By using MAP_PRIVATE you make it no longer possible to share the memory between multiple llama.cpp instances which effectively ruins one of the reasons why mmap was introduced in the first place.

@prusnak
Collaborator

prusnak commented Apr 4, 2023

You can't use MAP_ANONYMOUS on a file mapping. How exactly are you testing this?

@trollkotze
Author

trollkotze commented Apr 4, 2023

@prusnak Oops, copy-paste error with the PROT_READ. Updated the PR.

Thanks for your input. I don't really know anything about this stuff and you obviously do, yet I see an improvement in performance on my machine that I don't understand. Maybe you have some idea what is going on here?

I'm testing this by deleting the old main, running make and then running the Miku script with parameters adjusted like this:

GEN_OPTIONS="-c 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 500 --repeat_penalty 1.2"
 
./main -m "$MODEL" -t 6 -n 2048 $GEN_OPTIONS --color -i -r "${user_name}:" -p \ "
# ...

I personally don't care that much about understanding the technical details. I'm just happy that I can load and run the 65B model now with decent performance and wanted to share my "insights" (or lack thereof). Feel free to close this if it doesn't seem salvageable without further effort from me, for which I really lack the understanding/expertise.

Some comments:

  1. MAP_POPULATE only works on Linux and only with MAP_PRIVATE.

You can't use MAP_ANONYMOUS on a file mapping

Hmm, I compiled with MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, as master ChatGPT suggested, and it ran okay for me. Actually not consuming all the RAM at once, it seems, but enough to be fast, as compared to the current state of the master branch, which used just MAP_SHARED and thrashed swap like crazy.

When I use MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE instead it behaves differently and is ultra-aggressive in swallowing up RAM, so that the whole rest of the system becomes unresponsive while the inference is running.

No idea what all this means. Maybe the MAP_ANONYMOUS which I brainlessly copied from GPT is also doing something important here?

One thing I wonder about: Is running multiple llama processes actually a common use-case? Certainly not for me. Perhaps it would make sense to make any features tailored for that optional? I only know that the introduction of mmap caused my performance to degrade due to swap thrashing, and I can avoid this by either not using mmap (reverting to older version), by using --mlock (but then the process dies running out of memory, apparently can't accept swap as a substitute), or by using this "trick".
So these are just some uninformed ideas around this whole issue.

@prusnak
Collaborator

prusnak commented Apr 4, 2023

Can you just try with MAP_SHARED | MAP_POPULATE and nothing else? I feel that MAP_ANONYMOUS is ignored anyway when you provide fd.

@trollkotze
Author

trollkotze commented Apr 4, 2023

@prusnak When I omit MAP_ANONYMOUS it becomes extremely slow again, similar to the current state of the master branch. I blamed this on swap thrashing earlier, but the funny thing is: not even much swap is used. The process is just too modest about using any kind of memory at all, even when there is a lot available:
MAP_SHARED | MAP_POPULATE
[htop screenshot]
As you can see, memory consumption, both RAM and swap, is extremely modest here. The 6 cores assigned to llama don't have much material to work with and are idling a lot. Output is extremely slow.

MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE
[htop screenshot]
This eats up all my RAM at once, making the rest of the system quite unresponsive. Llama output is relatively fast, but the assigned cores are not running at 100% for some reason.

MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE
[htop screenshot]
This is the best of all worlds for me. Memory usage according to htop is still surprisingly low, but enough is loaded to keep the CPU cores working at 100%. Llama output speed is very good while the system stays responsive.

MAP_PRIVATE | MAP_POPULATE
[htop screenshot]
Quite similar to MAP_SHARED | MAP_POPULATE without MAP_ANONYMOUS, but even more modest in RAM usage, and even slower. The CPU cores have nothing to do; almost nothing of the model is loaded. Very silly.

I don't know why my computer behaves like this. But the combination MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE seems to be best for me.

@prusnak
Collaborator

prusnak commented Apr 4, 2023

Thank you for these tests!

Although I can't really wrap my head around the MAP_ANONYMOUS behaviour, because it's supposed to behave as described here: https://stackoverflow.com/questions/34042915/what-is-the-purpose-of-map-anonymous-flag-in-mmap-system-call

@prusnak
Collaborator

prusnak commented Apr 4, 2023

ChatGPT also says:

If you provide a valid file descriptor along with the MAP_ANONYMOUS flag, the file descriptor will be disregarded, and the memory mapping will not be associated with any file. The mapping will still be anonymous, and the memory region will not reflect the contents of the specified file.

Are you sure the inference gives back meaningful results when you provide the MAP_ANONYMOUS flag?

@trollkotze
Author

You're welcome. Thanks for taking the time to look at them.

Are you sure the inference gives back meaningful results when you provide the MAP_ANONYMOUS flag?

Hmm. Now that you say it, it seems that Miku has suffered some brain trauma:
[screenshot of garbled Miku output]

Sorry for probably having wasted your time.

@trollkotze
Author

Closing this PR, because it seems to be garbage.

@trollkotze trollkotze closed this Apr 4, 2023
@prusnak
Collaborator

prusnak commented Apr 4, 2023

@trollkotze no worries, at least I learned about MAP_POPULATE thanks to you and I updated my PR here: #740

thanks again!

@ivanstepanovftw
Collaborator

BTW, @trollkotze, are you using zram?

@trollkotze
Author

@ivanstepanovftw No, I hadn't been until now; just trying it out. Interesting stuff. It makes it possible for me to load the 65B model with 32 GB of RAM and 50% of that configured as zram swap. Still very slow, though, and strangely only a little RAM is actually consumed without random tweaks to the mmap params that I don't understand. Long pauses between short bursts where the CPU gets busy.

@ivanstepanovftw
Collaborator

I have disabled my zram and left my vm.* options untouched. Please be sure to make a backup of your vm options by running sysctl -a. I am very unlucky with zram: compilation is slower and the system is not very responsive (even with high swappiness like 180, a writeback device, and lz4 compression).

I have 15 GiB of RAM and 64 GiB of swap, with zswap enabled in the kernel boot options.

Here are my options:

kernel.sysrq = 1
kernel.yama.ptrace_scope = 0
vm.swappiness = 20
#vm.watermark_boost_factor = 0
vm.vfs_cache_pressure = 100
vm.watermark_scale_factor = 125
#vm.page-cluster = 1
#vm.dirty_ratio = 0
vm.dirty_bytes = 268435456
#vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 134217728

sudo sysctl -p /etc/sysctl.d/100-ivanstepanovftw.conf

If you find some godlike settings for you, please share ^_^

I have also tried settings from here, but still, no luck with zram...
https://www.reddit.com/r/Fedora/comments/mzun99/new_zram_tuning_benchmarks/
