After rsync of ~2 TiB of data, a large amount of SUnreclaim (ARC) keeps growing without limit (slabtop), slowing the system to a halt #3157
Are you by any chance rsyncing over NFS? I've had similar problems; NFS seems to use some caches in the SLAB and pagecache, which pressures out the ARC. My workaround was to set up a 5-minute cron with
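A minimal sketch of what such a cron entry could look like, assuming a drop_caches-based workaround (the original snippet isn't preserved, so the exact command and value are assumptions):

```
# /etc/cron.d/drop-caches -- hypothetical reconstruction of the 5-minute workaround
# Every 5 minutes, flush dirty data and drop clean dentry/inode caches so they
# don't squeeze the ARC out of memory.
*/5 * * * * root sync; echo 2 > /proc/sys/vm/drop_caches
```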
No, only via USB 3.0 :( I had tried pasting that; it would free up SUnreclaim in 20-50 KiB steps, but it partially looked like memory was also still growing. Since I don't have time to wait 5+ days for the box to be usable again, after 1 hour I did a reboot (via the magic sysrq key). I have the impression that there are still issues with memory pressure, despite using all the recent stuff and #2129.
I've done several rsync transfers of the 2 TB (albeit incremental, so at most 10-30 GB per import and export), and despite now using a pre-set value for the ARC, memory keeps growing.

Can anybody shed light on what the problem with transparent hugepages and ZFSonLinux is? https://groups.google.com/a/zfsonlinux.org/forum/#!msg/zfs-discuss/7a77qQcG4C0/Bpc-VHKSjycJ The advice to disable it keeps popping up when searching for solutions to an ever-growing ARC or ZFS slabs. What is causing SPL, ZFS, or the ARC to continually grow?

This is after a short uptime of 12 hours and 2 rsync transfers (currently on the 2nd incremental run). I've read that exporting pools is supposed to reset memory consumption, but how can that be the solution? Presumably programs, or their working state, have to be preserved.

@behlendorf, @tuxoko, you two are the memory and/or ARC "cracks" in regard to ZFSonLinux. Meanwhile I keep looking for reports of experience and settings that might have helped in dealing with this problem (besides disabling THP). Sorry for bothering in any case - I just want to avoid repeating an experience similar to #3142. Many thanks in advance!
@kernelOfTruth I've been meaning to give this issue a more serious look but haven't yet had a chance to do so. Looking at the numbers from your initial posting, this sticks out:
That's pretty much blowing your 4GiB ARC size limit right there. This value is the sum of a few other things, the sizes of which aren't necessarily readily available. It contains the dnode cache "dnode_t", the dbuf cache "dmu_buf_impl_t" and some of the 512-byte zio buffers "zio_buf_512". Here are the relevant slab lines from above:
As you can see, they've all got a ton of items. Also, they're all somewhat sparsely populated, which likely means there's a fair bit of slab fragmentation. The common thread in these related problems seems to be the use of
AFAIK, right now the only way to tame the kernel's dentry cache is to set

At the point where the values look like the above, I'm not sure what can be done to lower them and to reduce any fragmentation which might have occurred.
No, it deals with ARC buffers.
@dweeezil, @kpande, @snajpa thanks for taking a look into this! #2129 is already on board. /proc/sys/vm/vfs_cache_pressure previously was at 1000 (#3142); for this issue it's at 10000, so OK, I'll raise it again one step. Related to the vfs_cache_pressure settings, what stuck in my memory was something Andrew Morton (?) wrote about having it set at >= 100000 - I'll give that a try, thanks! Wandering around the web with this issue in mind, I found the following information:
Questions, for improving memory reclaiming and/or growth limiting: what is the "optimal" setting for spl_kmem_cache_expire in this connection? And what is the "optimal" setting for spl_kmem_cache_reclaim?
so 0 == reclaim once and then no more ?
So 1 = reclaim once? It should reclaim; I wouldn't care if latency was a little higher, as long as memory growth doesn't get out of control. I'm irregularly running memory compaction manually, but it might not address this kind of fragmentation issue - I'll take a look and see what can be tweaked in that regard. I'll give @snajpa's suggestion of
and see if that works here. Thanks!
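For reference, the knobs being discussed can be poked at runtime via procfs/sysfs; a sketch with placeholder values (the "optimal" values are exactly what's being asked above, so these are illustrative, not recommendations):

```sh
echo 100000 > /proc/sys/vm/vfs_cache_pressure               # reclaim dentries/inodes far more aggressively
echo 1 > /sys/module/spl/parameters/spl_kmem_cache_expire   # 0x1 = age out SPL slab objects
echo 1 > /sys/module/spl/parameters/spl_kmem_cache_reclaim  # reclaim-behaviour flag discussed above
```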
Setting spl_kmem_alloc_max to 65536 per #3041 (default: 2097152).
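For reference, the module-option form of that setting (an illustrative /etc/modprobe.d entry, applied at module load):

```
# /etc/modprobe.d/spl.conf -- value taken from the comment above
options spl spl_kmem_alloc_max=65536
```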
something's fishy: even though
are set, it could be observed that the settings seem to apply on a per-zpool basis (is that true?). After a scrub of an additional pool, and now after the export of the mentioned pool (only the pool containing /home is currently imported), the settings are at:
I should have copied arc_meta_max and arc_meta_limit before, but I'm sure the values were significantly lower for at least one of them (arc_meta_max? at 6 GB?). Are those values constantly rising with each subsequently imported pool, and not reset after export? Also, SUnreclaim was at a value of around 18-20 GB. Well, I would understand if it was around 14-15 GB, but that is three times the value of 6 GB. Weird... Copying arcstats for good measure. The values should be for /home + l2arc, after import & export of one additional pool (2.7 TB, 1.7 TB with ditto blocks), rsync to that additional pool, and after rsync to a btrfs volume (1.7 TB):
Exporting all pools seems to reclaim and/or reset memory usage, but that can't be the only solution. At the next opportunity I'll try this out without l2arc and see if that makes a change... ...and disable transparent hugepages as a last resort.
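The ARC metadata counters mentioned here can be read straight out of arcstats, e.g.:

```sh
# Inspect the ARC metadata counters discussed above
grep -E '^(arc_meta_used|arc_meta_limit|arc_meta_max|other_size)' /proc/spl/kstat/zfs/arcstats
```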
Without addressing a few of the specifics in @kernelOfTruth's last couple of postings, I'd like to summarize the problem: unlike most (all?) other native Linux filesystems, ZFS carries quite a bit of baggage corresponding to the kernel's dentry cache. As of at least 302f753, ZoL is completely reliant on the kernel's shrinker callback mechanism to shed memory. Due to the nature of Linux's dentry cache (it can grow to a lot of entries very easily) and the fact that ZFS requires a lot of metadata to be associated with each entry, the ARC can easily blow past the administrator-set limit when lots of files are being traversed. A quick peek through the kernel makes me think that

In summary, if the kernel's dcache is large, ZFS will consume a correspondingly large (actually, several times larger) amount of memory, which will show up in arcstats as "other_size".

That all said, however, the shrinker pressure mechanism does work... to a point. If I max out the memory on a system by traversing lots of files and causing other_size to get very large, the ARC will shrink if I apply pressure from a normal userland program trying to allocate memory. The manner in which the pressure is applied depends on the kernel's overcommit policy and the quantity of swap space. In particular, userland programs may find it difficult to allocate memory in large chunks, but the same amount may succeed if the program "nibbles" away at the memory, causing the shrinkers to engage.

I'm not sure of the best solution to this issue at the moment, but it's not unique to ZFS. There are plenty of reports around the Internet of dcache-related memory problems being caused by rsync on ext4-only systems. The difference, however, is that ext4 doesn't add its own extras to the dcache, so with ZFS the effects are a lot more severe. Postings in which people are complaining about this problem usually mention
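A rough way to watch how large the kernel's dentry cache and the related ZFS slabs have grown (illustrative commands, not from the original comment):

```sh
cat /proc/sys/fs/dentry-state    # nr_dentry, nr_unused, age_limit, want_pages, ...
slabtop -o | grep -E 'dentry|dnode_t|dmu_buf_impl_t|zio_buf_512'
```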
@kernelOfTruth A bit more testing shows me that you might have better success if you set the module parameters in the
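Presumably this refers to setting the limits at module load time rather than at runtime; a sketch of that approach, assuming an /etc/modprobe.d layout and the 4 GiB limit mentioned earlier:

```
# /etc/modprobe.d/zfs.conf -- illustrative; applied when the zfs module is loaded
options zfs zfs_arc_max=4294967296
```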
@dweeezil would you please elaborate on how setting the arc limit at run-time doesn't take properly, as you say? I've only seen so far that if I limit the ARC to a smaller size than it already is, it may in some cases never shrink (we run into a deadlock sooner than it has a chance to).
Internally,

There should be something in the documentation as to the difference between setting
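One way to compare the requested limit with the ceiling the ARC is actually enforcing (an illustration, not part of the original comment):

```sh
cat /sys/module/zfs/parameters/zfs_arc_max                    # the limit as set via the module parameter
awk '$1 == "c_max" {print $3}' /proc/spl/kstat/zfs/arcstats   # the ARC's internal arc_c_max
```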
Like mentioned in #3155, it would be nice if we could avoid having to use two caches, dentry & dnode, anyway.

@dweeezil coincidentally I also made the observation that at least two settings can't be set dynamically once spl/zfs is already loaded, and I have started to put some settings into spl.conf & zfs.conf (applied when the modules are loaded): spl_kmem_cache_kmem_threads. Also, zfs_arc_max seemingly can't be set to the same value during load (and/or that error coincided with a different value of spl_kmem_cache_max_size), otherwise it would lead to lots of segmentation faults of mount.

So the testing settings right now are:

zfs.conf
spl.conf
currently I also have transparent hugepages disabled via
Thanks!
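For reference, the usual sysfs toggle for disabling THP looks like this (the exact mechanism used above isn't preserved, so this is an assumed sketch):

```sh
# Disable transparent hugepages until the next reboot
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
```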
Regarding #3155, I was clearly wrong about other filesystems not hanging onto a lot of stuff in the Linux slab. Here are some slabinfo entries after stat(2)ing about a million files in a 3-level nested set of 1000 directories on an EXT4 filesystem:
and this is after doing the same on a ZFS file system (with an intervening drop_caches to clean everything up):
ZFS is definitely grabbing more *node-related stuff but it's not like EXT4 doesn't add on its own stuff.
@dweeezil I wonder why ZFS doesn't reclaim more aggressively. I'd like to investigate this, but currently I'm busy with other stuff...
@tuxoko Right, I mainly wanted to point out that ZFS isn't the only filesystem that uses a lot of inode-related storage. Also, it's not clear to me that the kernel is handling large dentry cache sizes very well. Finally, I wanted to point out that ZFS can behave much better if the arc limit is set during module load rather than after the fact.

For my part, I'm not going to be able to look into this much further now, either. I do plan on investigating related issues more closely as part of #3115 (speaking of which, and on a totally unrelated subject, I have a feeling it may be a major pain to merge ABD into that).
Seems like the issue is resolved (reclaim seems to work fine). I'm not really sure which of the modified settings made that change possible, but I guess it's a combination. Posting the data here for reference in case anyone else encounters an ever-growing ARC. Keep in mind that this is tailored toward a desktop, home-backup and workstation kind of setup.

Kernel: running 3.19 with the following mentionable additional patchsets that give memory allocations a higher chance of success:

- http://www.eenyhelp.com/patch-0-3-rfc-mm-vmalloc-fix-possible-exhaustion-vmalloc-space-help-215610311.html
- [PATCH V4] Allow compaction of unevictable pages
- enhanced compaction algorithm
- swap on ZRam with LZ4 compression

/etc/modprobe.d/spl.conf
/etc/modprobe.d/zfs.conf
<-- several of those parameters, for both the ZFS and SPL kernel modules, have to be specified while the modules are being loaded - otherwise the behavior seems to be that they aren't adhered to.

slub_nomerge is appended to the kernel command line for safety reasons (buggy drivers; igb had that problem of memory corruption afaik). intel_iommu=on is appended to the kernel per advice from @ryao. CONFIG_PARAVIRT_SPINLOCKS is enabled in the kernel configuration; if I remember correctly there was an issue where @ryao mentioned that certain codepaths (the slowpath is removed?) go away with that configuration option and thus lockups tend to occur less often. #3091
Disabling THP - transparent hugepages - (which seems to work fine with the recent tweaks to ZFS, though) and regularly running
might raise stability in certain cases (if I remember correctly it was also mentioned related to OpenVZ)
is also set here as a preventative & stability-enhancing measure.

Code changes & commits:

kernelOfTruth@fa8f5cd - higher ZFS_OBJ_MTX_SZ (512; double the value), which leads to the following error messages during mount/import: http://pastebin.com/cWm5Hvn0 - but works fine in operation.

kernelOfTruth@086f234 - so the ARC doesn't grow that aggressively and more objects are scanned through and recycled or evicted at the same time. It might not address SUnreclaim directly, but changes in #3115 should refine the ARC's behavior in that regard (arc_evict_iterations replaced with zfs_arc_evict_batch_limit).

Code changes & commits in SPL:

Additional manually set settings:
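The regularly-run command referenced above is presumably the manual memory compaction mentioned earlier in the thread; as a sketch:

```sh
# Trigger manual memory compaction (assumed to be the periodically-run command)
echo 1 > /proc/sys/vm/compact_memory
```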
Below follows the output of /proc/slabinfo, /proc/meminfo and /proc/spl/kstat/zfs/arcstats during the restore operation of 1.7 TB from an external USB 3.0 disk (both were ZFS pools):

- ZFS ARC stats, 1 TB in, mainly "larger" data (hundreds of MB to GB): http://pastebin.com/uASLYsqW
- ZFS ARC stats, 1.3 TB, more larger data: http://pastebin.com/Fi5CMc65
- ZFS ARC stats, 1.6 TB, mixed (large + little data): http://pastebin.com/uDYUuBGY
- ZFS ARC stats, 1.7 TB, heavily mixed, close to end of backup: http://pastebin.com/tHHT1cXX
- ZFS ARC stats, 1.7 TB, heavily mixed, after rsync: http://pastebin.com/BEmqGFQX

Mark how other_size doesn't seem to grow out of proportion anymore; the only swap used during the backup was from ZRAM.

Will post the stats later of several imports and exports of pools + Btrfs partitions and small incremental rsync updates - this was always a problem in the past, where SUnreclaim would grow almost unstoppably ever larger.
So here is the data after: rsync (1.7 TB) - ZFS /home to ZFS bak (several hundred megabytes transferred). other_size is twice the size of data_size; meta_size is close to the size of data_size. zio_buf_16384 never got blown out of proportion (e.g. 4 GB) and always stayed around 1-1.3 GB.

Will post the data after updatedb with /home + another additional ZFS pool imported, and an rsync job after that - this was usually the worst-case scenario for me in the recent past, where things really seemed to wreak havoc (despite using #2129).

If things don't change I'll re-add the l2arc device and see how things go over the next days - with it, memory consumption always was greater (improved with #3115?); but even without those changes it should behave far more civilly with an L2ARC.
ok, decided to add l2arc
updatedb with the additional imported pool, then another rsync: http://pastebin.com/DgxWtjR5

After export of the additional pool, zio_buf_16384 now even went down to 553472K - otherwise it would only ever grow; dnode_t is at around 1156928K. SUnreclaim: 3544568 kB.

Seems like everything works as intended now 👍
@kernelOfTruth lucky you, my SUnreclaim still just keeps on growing. But unlike you I'm stuck with the RHEL6 kernel and can't move on to anything newer (OpenVZ).
Thanks :) @snajpa that's unfortunate :/ Does the support contract allow compiling a different kernel from the sources, as long as you're staying on that version? (I'm eye-balling the paravirt stability issues, since you also mentioned lockup problems in #3160 and that RHEL6 kernels aren't compiled with it - I also only recently added that support, since I only use VirtualBox for virtualization purposes.)

From what I read there seem to be at least 2 significant landmarks: 3.10 (RHEL7 seems to contain it) and 3.12, where some locking & dentry-handling changes were also introduced (http://permalink.gmane.org/gmane.linux.kernel.commits.head/407250). Would experimenting with all of the options I summarized above be possible?

Anyway: good luck - if it can be made to work here, I'm sure you'll also figure it out. I don't have that much knowledge or expertise in the kernel or code department compared to you guys, I'm sure (doing this as a mere hobby & from experience, Gentoo user).

edit: I remember having read that disabling THP (which seems to work fine with the recent tweaks to ZFS, though), plus the regularly-run command and the setting from my summary above, might raise stability in certain cases - if I remember correctly it was also mentioned in relation to OpenVZ. The I/O scheduler (default: noop) is worth a look too; BFQ is also set here, and where that isn't supported, CFQ could make a difference related to latency or perhaps even stability. Experimenting between noop, deadline & cfq might also be of help.
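For reference, the I/O scheduler can be checked and switched per block device; a sketch (the device name is an example, and BFQ is only available on kernels that ship it):

```sh
cat /sys/block/sda/queue/scheduler          # available schedulers; the active one is shown in brackets
echo bfq > /sys/block/sda/queue/scheduler   # or noop / deadline / cfq, as discussed above
```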
With the recent upstream changes in master (#3202) this doesn't seem to appear anymore, though it surely still needs a few days (or weeks+) of testing. Appears to be fixed - therefore closing.
@kernelOfTruth Excellent news, thanks for the update.
Posting the data here before the system goes "boom" (it's getting slower and slower) - hope it's useful
Symptoms:
opening chromium, firefox, konqueror, etc. takes several seconds to load
besides that the system is (still) working fine
I'm not really convinced that SUnreclaim should be that huge
should spl_kmem_cache_reclaim be something else ?
slub_nomerge is used during bootup
Below follows the output of the system - no suspicious output in dmesg
will attempt "echo 3 > /proc/sys/vm/drop_caches" and see how it goes ...
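A few commands that capture the numbers being discussed (illustrative; the actual output is in the attached data):

```sh
grep -E 'Slab|SReclaimable|SUnreclaim' /proc/meminfo   # overall slab usage, reclaimable vs unreclaimable
slabtop -o | head -25                                  # largest slab caches
cat /proc/spl/kstat/zfs/arcstats                       # ARC breakdown (data, meta, other_size)
echo 3 > /proc/sys/vm/drop_caches                      # the command mentioned above
```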