
Still locks up after a while: Debian jessie 3.16.0-4-amd64 #3160

Closed
yarikoptic opened this issue Mar 8, 2015 · 70 comments
@yarikoptic

The hope was that #2983 would be fixed by #3132, and I was recently happy to report that the system seemed to have survived relatively heavy bombarding, but after 4 days the beast locked up again. Here are the gory details: http://www.onerussian.com/tmp//zfs_system_details_20150307/ Let me know if I should provide more information of any kind.

@yarikoptic yarikoptic changed the title Still locks up after a while on Still locks up after a while: Debian jessie 3.16.0-4-amd64 Mar 8, 2015
@yarikoptic
Author

In case it's of any help -- here is a historical overview for today/this week from munin:
http://www.onerussian.com/tmp/munin-locdown-20150307.html
Note (if it's relevant at all; it could just be an effect of the deadlock) that there was a spike of interrupts right around the deadlock point.

@yarikoptic
Author

Please ignore the above comment about interrupts -- the plot rendering was a bit confusing; they actually went down.

@yarikoptic
Author

A fresh dump of the system info (now with stacktraces.txt, thanks @DeHackEd for spotting their omission) is available from http://www.onerussian.com/tmp//zfs_system_details_20150308/ . If I don't hear back I will reboot the beast in a few hours -- I need to get it usable again.

@yarikoptic
Author

BTW -- this stall feels different from before. Unlike before, I can still perform some actions on the mounted ZFS partitions/datasets (e.g. modify files, git commit), while other actions get stuck (docker can't do its thing even though it isn't directly associated with ZFS; the ZFS snapshot operations, arc_adapt, l2arc_feed, txg_sync, and that lengthy rm -r of a directory all seem to be stuck in D state).

behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 9, 2015
A debug patch for openzfs#3160 designed to log the holder of the ARC
hash_lock if it cannot be acquired for over 60 seconds.  This
patch may impact performance and is for debug purposes only.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3160
@behlendorf
Contributor

@yarikoptic it appears that #2983 was significantly improved by #3121, but we didn't quite get every possible deadlock. It's clear from the backtraces you provided that the threads are all blocked on the ARC hash lock, but what isn't clear is which task is holding that lock and why it's not giving it up. Normally I'm highly suspicious of direct reclaim paths like the one taken by the l2arc_feed thread, but in this case I don't see where it would have taken the lock. I've put together a debug patch to try to catch the offending task; can you try running with the patch in #3161?
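
While waiting for the debug patch to catch the offending task, a simple way to watch for its output and for the kernel's own lockup reports is to follow the kernel log; a sketch (the hash_lock pattern is an assumption about what the patch prints, while the soft-lockup and hung-task strings are standard kernel messages):

# follow the kernel log on a console that survives the hang (serial/IPMI if possible)
tail -F /var/log/kern.log | grep -Ei 'hash_lock|soft lockup|blocked for more than'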

@behlendorf behlendorf added this to the 0.6.4 milestone Mar 9, 2015
@behlendorf
Contributor

NOTE: This may be a blocker for 0.6.4. There is clearly a potential deadlock still remaining in the code, although to my knowledge @yarikoptic's workload is the only one which has triggered it.

@dweeezil
Contributor

dweeezil commented Mar 9, 2015

Is there a description of the workload which triggers this problem? My current testing rigs for the recent illumos ARC work (purposely not referring to those pull requests to minimize non-relevant links) would allow me to test this type of thing pretty easily.

One thing that sticks out to me in the original posting is the other_size of 35627695184 out of an ARC size of 40485186048 (ouch). It would have been interesting to see /proc/slabinfo as well.
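
For anyone who wants to capture the same data points on their own box, a minimal sketch, assuming the standard ZFS-on-Linux proc paths (run as root):

# snapshot ARC statistics and kernel slab caches for later analysis
ts=$(date +%Y%m%d-%H%M%S)
cat /proc/spl/kstat/zfs/arcstats > arcstats.$ts
cat /proc/slabinfo > slabinfo.$ts
# the fields discussed in this thread
grep -E 'other_size|arc_meta|^size |^c_max ' arcstats.$ts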

@yarikoptic
Author

The workload was primarily a

tar -cf- hcp-ls.20141020-500subjects-addurl |pv | pigz -9 >| hcp-ls.20141020-500subjects-addurl.tar.gz

command, where that directory contains millions of files, symlinks and subdirectories (probably around 40 million entries altogether). Each file in there is very small (it contains just a path). The resulting tar.gz is 16GB.
In parallel with, or before, that load I may also have had find/chmod/rm sweeps running on similarly "large" directories.
If necessary -- I could share the generated tarball (there was a successful run after all) for your testing so you could simulate a similar load, but it usually took a few days for the deadlock to happen.

Unfortunately I have rebooted the beast, so I guess slabinfo wouldn't be of interest at the moment, but I have now added it to the http://github.com/yarikoptic/zfs-system-details/blob/HEAD/zfs-system-details script so it will be reported next time.
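
For anyone trying to approximate this kind of metadata-heavy tree for testing, a rough sketch (the /tank/stress path and the counts are made up, not taken from the reporter's dataset; scale the loops up to reach tens of millions of entries):

#!/bin/sh
# create a wide tree of tiny files on a ZFS dataset to stress metadata handling
base=/tank/stress                                    # hypothetical dataset mountpoint
for d in $(seq 1 1000); do
    dir=$base/dir$d
    mkdir -p "$dir"
    for f in $(seq 1 1000); do
        printf '%s\n' "$dir/file$f" > "$dir/file$f"  # tiny file containing just a path
    done
done
# then archive it the same way as the original workload
tar -cf - "$base" | pigz -9 > /tmp/stress.tar.gz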

@behlendorf
Contributor

@yarikoptic you might also want to pick up #3163. The automated testing uncovered another issue, which looks different from yours, but is still possible.

@yarikoptic
Author

gotcha, rebuilding with 3163 and will restart/resume bombardment shortly

@dweeezil
Contributor

Another workload involving the traversal (statting) of millions of files, and other_size is way out of control. YABITP (yet another big inode traversal problem). Of course, the deadlock you encountered shouldn't have happened, but with c_max at 67GB, size at 86.8GB, and other_size being 35.6GB of that, it's no wonder that Bad Things™ happen.

@yarikoptic
Author

Ain't I a unique, precious user, exposing myself to those Bad Things, @dweeezil? ;) (The beast has been rebooted with both patches in place, and the tar/chmod runs are going again.)

@yarikoptic
Author

OK -- I found myself unable to ssh into the box today, and attaching to the ipmi console showed that it seems we hit the target:

[133273.994931] BUG: soft lockup - CPU#7 stuck for 23s! [l2arc_feed:1098]

The problem is that it seems I can't even log in (I should have left an ipmi console session logged in, I guess :-/) -- I will keep trying.

@yarikoptic
Author

OK -- an attempt to log in from the physical console succeeded!
http://www.onerussian.com/tmp//zfs_system_details_20150311/
has some logs for you guys, including the messages in kern.log from that debug patch. Looking at the munin reports, oddly enough the L2ARC stats report a 270GB L2ARC size, which seems to exceed the amount of available L2ARC space (I have a 200GB Intel SSD drive): http://www.onerussian.com/tmp/munin-lockdown-20150311.html#zfs
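
One possible explanation for the 270GB figure, offered as a guess rather than a diagnosis: the l2_size counter in arcstats is the logical (uncompressed) size of what sits in the L2ARC, while l2_asize is what is physically allocated on the cache device, so with compression in play l2_size can legitimately exceed the device capacity (it could also simply be an accounting artifact). The module's own view can be checked directly, assuming your build reports these fields:

grep -E '^l2_(size|asize|hdr_size)[[:space:]]' /proc/spl/kstat/zfs/arcstats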

@dweeezil
Contributor

@yarikoptic Could you please add /proc/slabinfo?

@yarikoptic
Author

@dweeezil http://www.onerussian.com/tmp//zfs_system_details_20150311-2/ should now contain all the previous information (updated) plus slabinfo and other goodies ;) I am about to try flushing the caches, as was suggested on IRC, in an attempt to release unevictable data.

@yarikoptic
Author

And FWIW, http://www.onerussian.com/tmp//zfs_system_details_20150311-3/ should have the same info after I issued echo 3 >/proc/sys/vm/drop_caches, which I believe has not returned yet.
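
For reference, the drop_caches values behave as follows on a stock Linux kernel (nothing ZFS-specific here):

sync                                  # flush dirty pages first so more becomes reclaimable
echo 1 > /proc/sys/vm/drop_caches     # drop the page cache only
echo 2 > /proc/sys/vm/drop_caches     # drop dentries and inodes (slab-side caches)
echo 3 > /proc/sys/vm/drop_caches     # drop both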

@yarikoptic
Author

Since the box is not reachable via the network, and I won't be able to run down there for another 24h, I am inclined to reboot it within half an hour. If there is any other information I could provide before then, please let me know. After the reboot I will not bring the same load back up (since I know it would lock the box up and kill remote access) unless there is a new patch to check or a need for more information.

@prakashsurya
Member

Points of interest from zfs_system_details_20150311-2:

zio_buf_512       24203409 24249872    512    8    1 : tunables   54   27    8 : slabdata 3031234 3031234      0

zio_buf_16384     2221818 2221818  16384    1    4 : tunables    8    4    0 : slabdata 2221818 2221818      0

zfs_znode_cache                       0x00020 14141956096 11751261120     8192     1136  1726313 1726313 1983777  10357878 10344420 11902662      1     0     0

c_max                           4    67685584896
size                            4    81237672816

mru_size                        4    28278091264
mru_evict_data                  4    512
mru_evict_metadata              4    49152

mfu_size                        4    8521976320
mfu_evict_data                  4    5632
mfu_evict_metadata              4    3311104

arc_meta_used                   4    81051423600
arc_meta_limit                  4    50764188672
arc_meta_max                    4    81997413160

l2_hdr_size                     4    6246311920

I'm inclined to believe the workload breaks the ARC, as it seems to consist mainly of 512-byte buffers (perhaps spill blocks). The interesting thing is how much data is contained in the MRU and MFU yet is unevictable, so the system will just spin trying to reclaim without making any progress. These ARC buffers are likely unevictable because the dbuf cache is holding references. Some of this unevictable data is probably blocks for the dnode object (pinned by dnode_t structs), or indirect blocks (pinned by data blocks). This is all speculation, though, from just quickly glancing at the linked files.
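
The evictable fraction can be computed straight from arcstats; a small sketch using the field names shown above (0.6.x naming):

# print how much of the MRU/MFU lists is actually evictable
awk '$1=="mru_size"{ms=$3} $1=="mru_evict_data"{md=$3} $1=="mru_evict_metadata"{mm=$3}
     $1=="mfu_size"{fs=$3} $1=="mfu_evict_data"{fd=$3} $1=="mfu_evict_metadata"{fm=$3}
     END{printf "mru evictable: %.6f%%\nmfu evictable: %.6f%%\n",
         100*(md+mm)/ms, 100*(fd+fm)/fs}' /proc/spl/kstat/zfs/arcstats

With the numbers above this works out to essentially zero percent for both lists, which is exactly what makes the reclaim spin.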

@dweeezil
Contributor

Spill blocks generally wind up in kmalloc-512, which is low in this case. The zio_buf_impl_t entries are likely bonus buffers, which should generally track dnode_t. I've been running my recent testing with a patch which breaks down the components of other_size to help in matters like this. I've also considered putting the bonus buffers in their own cache instead of letting them share the zio cache (mainly to track them better).
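
A quick way to eyeball the caches mentioned here, sketched with the names from the excerpts above (the SPL path is an assumption; it is only useful if your SPL build exposes /proc/spl/kmem/slab):

# Linux slabs: zio_buf_* and kmalloc-512 live here
grep -E '^(zio_buf_512|zio_buf_16384|kmalloc-512)[[:space:]]' /proc/slabinfo
# SPL-managed slabs such as zfs_znode_cache and dnode_t
grep -E 'zfs_znode_cache|dnode_t' /proc/spl/kmem/slab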

@snajpa
Contributor

snajpa commented Mar 11, 2015

We're having problems with ZFS-induced lock-ups really often too; there's at least one lock-up a day, and lately it has gotten worse than it had been. Since about ~4 weeks ago, on every lockup I can't even SSH into the box to get the traces (the rootfs isn't on ZFS, it's ext4; ZFS is only for containers).

It happens mostly when the workload shifts -- we're balancing with all of the RAM used for applications and the rest going to ARC. During the daytime it all works OK, but the problems come in the early morning hours when the backups run. We back up with an rsync+NFS combo, since we haven't implemented send/recv backups yet. People also do their own backups during the night.

This needs to be resolved soon, as it is now beginning to impact our reputation; we've had so much down-time that people are becoming nervous and are thinking about moving to another host. Although we're a nonprofit org, so people cut us some slack, it still isn't a good position, and if nothing else it causes a lot of stress for me as an admin.

If anyone has any suggestions on what I can do when the box locks up and SSH connections time out, I'm all game.
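
One generic approach when SSH is dead but a serial/IPMI console still responds is to have the kernel dump task state itself; a sketch using standard Linux facilities (nothing ZFS-specific, and hung-task reporting requires a kernel built with CONFIG_DETECT_HUNG_TASK):

# set these ahead of time, e.g. from /etc/sysctl.d/
sysctl -w kernel.sysrq=1                       # allow all SysRq functions
sysctl -w kernel.hung_task_timeout_secs=120    # report long-blocked D-state tasks with stacks
# when the box wedges, from the IPMI/serial console:
echo w > /proc/sysrq-trigger    # dump blocked (D-state) tasks
echo t > /proc/sysrq-trigger    # dump all task states (very verbose)
echo l > /proc/sysrq-trigger    # backtraces of all active CPUs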

@kernelOfTruth
Contributor

@snajpa I know it's not really a solution, but would rate-limiting help as a temporary mitigation?

(provided, of course, that the backups are not so large that they can no longer finish under the limit)

rsync --bwlimit=<kb/second> <source> <dest>

I remember having read about a similar (rsync- and load-related) issue with Btrfs where this temporarily helped.
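
For example, applying that suggestion (the 20 MB/s figure and the paths are placeholders only):

# --bwlimit is in KB/s, so 20480 caps the transfer at roughly 20 MB/s
rsync -a --bwlimit=20480 /tank/containers/ backuphost:/backups/containers/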

@snajpa
Contributor

snajpa commented Mar 11, 2015

@kernelOfTruth I've thought about that, but then we wouldn't manage to back up all the CTs in time before the daily peak load, when the IO capacity is needed for more useful things than doing backups. Also, this doesn't do anything about the custom backups which people do on their own inside their containers.

kernelOfTruth pushed two commits to kernelOfTruth/zfs that referenced this issue Mar 19, 2015 (earlier revisions of the "Restructure per-filesystem reclaim" and "Fix arc_adjust_meta() behavior" patches; the final commit messages appear below)
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 20, 2015
The arc_meta_max value should be increased when space is consumed, not when
it is returned.  This ensures that arc_meta_max is always up to date.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue openzfs#3160
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 20, 2015
Originally when the ARC prune callback was introduced the idea was
to register a single callback for the ZPL.  The ARC could invoke this
callback if it needed the ZPL to drop dentries, inodes, or other
cache objects which might be pinning buffers in the ARC.  The ZPL
would iterate over all ZFS super blocks and perform the reclaim.

For the most part this design has worked well but due to limitations
in 2.6.35 and earlier kernels there were some problems.  This patch
is designed to address those issues.

1) iterate_supers_type() is not provided by all kernels which makes
it impossible to safely iterate over all zpl_fs_type filesystems in
a single callback.  The most straight forward and portable way to
resolve this is to register a callback per-filesystem during mount.
The arc_*_prune_callback() functions have always supported multiple
callbacks so this is functionally a very small change.

2) Commit 050d22b removed the non-portable shrink_dcache_memory()
and shrink_icache_memory() functions and didn't replace them with
equivalent functionality.  This meant that for Linux 3.1 and older
kernels the ARC had no mechanism to drop dentries and inodes from
the caches if needed.  This patch adds that missing functionality
by calling shrink_dcache_parent() to release dentries which may be
pinning inodes.  This will result in all unused cache entries being
dropped which is a bit heavy handed but it's the only interface
available for old kernels.

3) A zpl_drop_inode() callback is registered for kernels older than
2.6.35 which do not support the .evict_inode callback.  This ensures
that when the last reference on an inode is dropped it is immediately
removed from the cache.  If this isn't done then inodes can end up on
the global unused LRU with no mechanism available to ZFS to drop them.
Since the ARC buffers are not dropped the hottest inodes can still
be recreated without performing disk IO.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue openzfs#3160
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 20, 2015
The goal of this function is to evict enough meta data buffers from the
ARC in order to enforce the arc_meta_limit.  Achieving this is slightly
more complicated than it appears because it is common for data buffers
to have holds on meta data buffers.  In addition, dnode meta data buffers
will be held by the dnodes in the block preventing them from being freed.
This means we can't simply traverse the ARC and expect to always find
enough unheld meta data buffers to release.

Therefore, this function has been updated to make alternating passes
over the ARC releasing data buffers and then newly unheld meta data
buffers.  This ensures forward progress is maintained and arc_meta_used
will decrease.  Normally this is sufficient, but if required the ARC
will call the registered prune callbacks causing dentry and inodes to
be dropped from the VFS cache.  This will make dnode meta data buffers
available for reclaim.  The number of total restarts is limited by
zfs_arc_meta_adjust_restarts to prevent spinning in the rare case
where all meta data is pinned.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue openzfs#3160
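
Once you are on a build that carries this change, the restart cap should be visible alongside the other ARC tunables as a module parameter; a sketch for checking (parameter name taken from the commit message above, availability depends on your zfs version):

# does this build expose the tunable at all?
modinfo -p zfs | grep -i arc_meta_adjust_restarts
# read the current value if it is there
cat /sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts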
@behlendorf
Contributor

Refreshed patches are available in #3202; they address @snajpa's issue and should help others as well.

behlendorf added a commit that referenced this issue Mar 20, 2015 (Fix arc_meta_max accounting)
behlendorf added a commit that referenced this issue Mar 20, 2015 (Restructure per-filesystem reclaim)
behlendorf added a commit that referenced this issue Mar 20, 2015 (Fix arc_adjust_meta() behavior)
@behlendorf
Contributor

These changes have been merged to master to address this issue. If you're experiencing something similar, please update to zfs-0.6.3-245 or newer.

bc88866 Fix arc_adjust_meta() behavior
2cbb06b Restructure per-filesystem reclaim
596a893 Fix arc_meta_max accounting
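
To confirm which build you actually ended up on after updating, a quick sketch (standard ZFS-on-Linux locations):

cat /sys/module/zfs/version             # version of the loaded module
dmesg | grep 'ZFS: Loaded module'       # version logged when the module loaded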

kernelOfTruth pushed commits to kernelOfTruth/zfs that referenced this issue on Mar 21 and Mar 22, 2015 (cherry-picks of the commits above; messages identical)
DeHackEd pushed commits to DeHackEd/zfs that referenced this issue on Mar 22, 2015 (cherry-picks of the commits above; messages identical)
@yarikoptic
Author

FWIW -- I'm running master as of bc88866. It seems to have been stable for the past few days, and there are no crazy ARC consumption jumps as before.

kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this issue Mar 24, 2015
in accordance with

40749aa "Use MUTEX_FSTRANS on l2arc_buflist_mtx"

Original commit message below:

Use MUTEX_FSTRANS on l2arc_buflist_mtx to prevent the following deadlock
scenario:
1. arc_release() -> hash_lock -> l2arc_buflist_mtx
2. l2arc_write_buffers() -> l2arc_buflist_mtx -> (direct reclaim) ->
   arc_buf_remove_ref() -> hash_lock

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue openzfs#3160
@behlendorf
Contributor

@yarikoptic Good news, thanks for the update.

DeHackEd pushed commits to DeHackEd/zfs that referenced this issue on Apr 4 and Apr 5, 2015 (cherry-picks of the commits above; messages identical)