
Change mmap parameters to avoid much swap thrashing #753


Closed
wants to merge 2 commits

Conversation

trollkotze

@trollkotze trollkotze commented Apr 4, 2023

The introduction of loading the model through mmap, merged in PR #613 (picked up from #586), while probably a great idea, led to swap-thrashing issues for many users. While previously the model was loaded completely into RAM, it now goes through swap (unless using --mlock, which requires certain user privileges), even when enough RAM is available, making performance unbearably slow.

I don't really know anything about these things. So I asked ChatGPT what could be the problem. And it gave me a hint that works wonders for me. I don't know how universally viable or useful this is, so it would be great if someone who has more expertise in these matters could double check. But I believe it is probably useful for many.

I can now even load the 65B model with my 32 GB of RAM and performance seems to be similar to what it was for the 30B model before.

Props to @slaren or @jart for introducing mmap loading. With this tweak it is now finally as performant as before for me (maybe better), while combining RAM and swap memory in a useful way (first using RAM, then swap; that's what seems to be happening with this change).

For reference, here is my conversation with ChatGPT, which suggested this tweak:

trollkotze:

How can I avoid swap thrashing when allocating a huge chunk of memory with mmap?

ChatGPT:

When allocating a large chunk of memory with mmap, swap thrashing can occur if the system tries to allocate more memory than what's available in both physical RAM and swap space. To avoid swap thrashing, consider the following steps:

  1. Check available memory: Before allocating memory, check the available physical memory and swap space on your system. You can use commands like free -h or vmstat on Linux to check memory usage.
  2. Request appropriate memory size: Be conservative in your memory allocation request. Allocate only the amount of memory you need and ensure that it fits within the available physical memory and swap space.
  3. Use the MAP_NORESERVE flag: When using mmap, you can use the MAP_NORESERVE flag to prevent the system from reserving swap space for the allocated memory. This can be helpful if you don't expect the mapped memory to be fully used, but be cautious, as this may lead to a sudden out-of-memory (OOM) condition if you actually use more memory than what's available.
  4. Use memory in smaller chunks: Instead of allocating a huge chunk of memory, consider allocating smaller chunks and managing them within your program. This can help reduce the impact on the system and avoid swap thrashing.
  5. Monitor memory usage: Keep an eye on the memory usage of your program using system monitoring tools or custom code. If the memory usage starts to approach the limits of available physical memory and swap space, you can take appropriate action to avoid swap thrashing.
  6. Increase available memory: If you find that your system is consistently running out of memory, consider adding more RAM or increasing the swap space.
  7. Adjust the swappiness value: On Linux systems, you can modify the swappiness value to control how aggressively the kernel will use the swap space. A lower value will make the system less likely to swap out memory, which could help avoid swap thrashing. To temporarily change the swappiness value, use the following command:
sudo sysctl vm.swappiness=10
To make the change permanent, edit the /etc/sysctl.conf file and add or modify the following line:
vm.swappiness=10

Remember, it's crucial to balance memory usage and system resources. Always allocate memory carefully and monitor your program's behavior to avoid swap thrashing and other performance issues.

trollkotze:

I already reduced swappiness to 5. The strange thing is that before using mmap everything was loaded into RAM. Now almost no RAM seems to be used. Instead there's swap thrashing and very slow performance.

ChatGPT:

It's possible that using mmap() with the wrong flags or settings is causing the memory to be swapped out aggressively, despite your low swappiness setting. When using mmap(), the kernel may decide to swap out memory pages depending on the flags and access patterns. Here are some suggestions to address the issue:

  1. Use MAP_POPULATE flag: When calling mmap(), try adding the MAP_POPULATE flag. This flag tells the kernel to pre-fault and populate the memory mappings in advance. This may help reduce swap thrashing by making sure that the pages are already in memory when you access them.
void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
  2. Access memory in a sequential manner: Accessing memory in a random or non-sequential manner can cause the kernel to swap out pages more aggressively. Try to access the mapped memory in a sequential manner, if possible, to avoid triggering swapping.
  3. Incremental allocation: Instead of allocating a large chunk of memory at once, consider allocating smaller chunks incrementally and managing them within your program. This can help reduce the impact on the system and avoid swap thrashing.
  4. Lock the memory with mlock(): If it's crucial to keep the mapped memory in RAM and avoid swapping, you can use the mlock() function to lock the memory. Note that this may require elevated privileges and can lead to resource exhaustion if not used carefully. Be cautious when using this approach and ensure you have a good understanding of your memory requirements.
// After mmap() call
if (mlock(addr, size) != 0) {
    perror("mlock");
    // Handle the error
}
  5. Review your program's overall memory usage: Check if there are other parts of your program that could be consuming memory and causing the swap thrashing. Optimize your program's memory usage to ensure there is enough RAM available for the mapped memory.

Remember, proper memory management is key to avoiding swap thrashing and maintaining good performance. Ensure that you use appropriate mmap() flags and access patterns, and monitor your program's memory usage to avoid issues.

I have just used the flags that ChatGPT suggested here: MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE.
Probably MAP_POPULATE is the most important one. Not sure about the others, but I just kept them all. And the swap thrashing issue is gone with that. So I'm happy. Thanks, ChatGPT! :D

@prusnak
Collaborator

prusnak commented Apr 4, 2023

  1. You duplicated the PROT_READ argument. The code does not compile because of this. Did you really submit the code you actually tested?

  2. By using MAP_PRIVATE you make it no longer possible to share the memory between multiple llama.cpp instances which effectively ruins one of the reasons why mmap was introduced in the first place.

@prusnak
Collaborator

prusnak commented Apr 4, 2023

You can't use MAP_ANONYMOUS on a file mapping. How exactly are you testing this?

@trollkotze
Author

trollkotze commented Apr 4, 2023

@prusnak Oops, copy-paste error with the PROT_READ. Updated the PR.

Thanks for your input. I don't really know anything about this stuff and you obviously do, yet I see an improvement in performance on my machine that I don't understand. Maybe you have some idea what is going on here?

I'm testing this by deleting the old main, running make and then running the Miku script with parameters adjusted like this:

GEN_OPTIONS="-c 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 500 --repeat_penalty 1.2"
 
./main -m "$MODEL" -t 6 -n 2048 $GEN_OPTIONS --color -i -r "${user_name}:" -p \ "
# ...

I personally don't care that much about understanding the technical details. I'm just happy that I can load and run the 65B model now with decent performance and wanted to share my "insights" (or lack thereof). Feel free to close this if it doesn't seem salvageable without further effort from me, for which I really lack the understanding/expertise.

Some comments:

  1. MAP_POPULATE only works on Linux and only with MAP_PRIVATE.

You can't use MAP_ANONYMOUS on a file mapping

Hmm, I compiled with MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, as master ChatGPT suggested, and it ran okay for me. Actually not consuming all the RAM at once, it seems, but enough to be fast, as compared to the current state of the master branch, which used just MAP_SHARED and thrashed swap like crazy.

When I use MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE instead it behaves differently and is ultra-aggressive in swallowing up RAM, so that the whole rest of the system becomes unresponsive while the inference is running.

No idea what all this means. Maybe the MAP_ANONYMOUS which I brainlessly copied from GPT is also doing something important here?

One thing I wonder about: Is running multiple llama processes actually a common use-case? Certainly not for me. Perhaps it would make sense to make any features tailored for that optional? I only know that the introduction of mmap caused my performance to degrade due to swap thrashing, and I can avoid this by either not using mmap (reverting to older version), by using --mlock (but then the process dies running out of memory, apparently can't accept swap as a substitute), or by using this "trick".
So these are just some uninformed ideas around this whole issue.

@prusnak
Collaborator

prusnak commented Apr 4, 2023

Can you just try with MAP_SHARED | MAP_POPULATE and nothing else? I feel that MAP_ANONYMOUS is ignored anyway when you provide fd.

@trollkotze
Author

trollkotze commented Apr 4, 2023

@prusnak When I omit MAP_ANONYMOUS it becomes extremely slow again, similar to the current state of the master branch. I blamed this on swap thrashing earlier, but the funny thing is: not even much swap is used. The process is just too modest about using any kind of memory at all, even when there is a lot available:
MAP_SHARED | MAP_POPULATE
[htop screenshot]
As you can see, memory consumption, both RAM and swap, is extremely modest here. The 6 cores assigned to llama don't have much material to work with and are idling a lot. Output is extremely slow.

MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE
[htop screenshot]
This eats up all my RAM at once, making the rest of the system quite unresponsive. Llama output is relatively fast, but the assigned cores are not running at 100% for some reason.

MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE
[htop screenshot]
This is the best of all worlds for me. Memory usage according to htop is still surprisingly low, but enough is loaded to keep the CPU cores working at 100%. Llama output speed is very good while the system stays responsive.

MAP_PRIVATE | MAP_POPULATE
[htop screenshot]
Quite similar to MAP_SHARED | MAP_POPULATE without MAP_ANONYMOUS, but even more modest in RAM usage, and even slower. The CPU cores have nothing to do; almost nothing of the model is loaded. Very silly.

I don't know why my computer behaves like this. But the combination MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE seems to be best for me.

@prusnak
Collaborator

prusnak commented Apr 4, 2023

Thank you for these tests!

Although I can't really wrap my head around the MAP_ANONYMOUS behaviour, because it's supposed to behave as described here: https://stackoverflow.com/questions/34042915/what-is-the-purpose-of-map-anonymous-flag-in-mmap-system-call

@prusnak
Collaborator

prusnak commented Apr 4, 2023

ChatGPT also says:

If you provide a valid file descriptor along with the MAP_ANONYMOUS flag, the file descriptor will be disregarded, and the memory mapping will not be associated with any file. The mapping will still be anonymous, and the memory region will not reflect the contents of the specified file.

Are you sure the inference gives back meaningful results when you provide the MAP_ANONYMOUS flag?

@trollkotze
Author

You're welcome. Thanks for taking the time to look at them.

Are you sure the inference gives back meaningful results when you provide the MAP_ANONYMOUS flag?

Hmm. Now that you say it, it seems that Miku has suffered some brain trauma:
[screenshot of garbled Miku output]

Sorry for probably having wasted your time.

@trollkotze
Author

Closing this PR, because it seems to be garbage.

@trollkotze trollkotze closed this Apr 4, 2023
@prusnak
Collaborator

prusnak commented Apr 4, 2023

@trollkotze no worries, at least I learned about MAP_POPULATE thanks to you and I updated my PR here: #740

thanks again!

@ivanstepanovftw
Collaborator

BTW, @trollkotze, are you using zram?

@trollkotze
Author

@ivanstepanovftw No, I hadn't been until now; just trying it out. Interesting stuff. It makes it possible for me to load the 65B model with 32 GB of RAM and 50% of that configured as zram swap. Still very slow, though, and strangely only a little RAM is actually consumed without random tweaks to the mmap params that I don't understand. Long pauses between short bursts where the CPU gets busy.

@ivanstepanovftw
Collaborator

I have disabled my zram and left my vm.* options untouched. Please be sure to make a backup of your vm options by running sysctl -a. I am very unlucky with zram: compilation is slower and the system is not very responsive (even with high swappiness like 180, a writeback device, and lz4 compression).

I have 15 GiB of RAM and 64 GiB of swap, with zswap enabled in the kernel boot options.

Here are my options:

kernel.sysrq = 1
kernel.yama.ptrace_scope = 0
vm.swappiness = 20
#vm.watermark_boost_factor = 0
vm.vfs_cache_pressure = 100
vm.watermark_scale_factor = 125
#vm.page-cluster = 1
#vm.dirty_ratio = 0
vm.dirty_bytes = 268435456
#vm.dirty_background_ratio = 0
vm.dirty_background_bytes = 134217728

sudo sysctl -p /etc/sysctl.d/100-ivanstepanovftw.conf

If you find some godlike settings for you, please share ^_^

I have also tried settings from here, but still, no luck with zram...
https://www.reddit.com/r/Fedora/comments/mzun99/new_zram_tuning_benchmarks/
