
Still locks up after a while: Debian jessie 3.16.0-4-amd64 #3160

Closed
yarikoptic opened this issue Mar 8, 2015 · 70 comments
@yarikoptic

The hope was that #2983 would be fixed by #3132, and I was recently happy to report that the system seemed to have survived relatively heavy bombarding, but after 4 days the beast locked up again. Here are the gory details: http://www.onerussian.com/tmp//zfs_system_details_20150307/ Let me know if I should provide more information of any kind.

@yarikoptic yarikoptic changed the title Still locks up after a while on Still locks up after a while: Debian jessie 3.16.0-4-amd64 Mar 8, 2015
@yarikoptic
Author

In case it's of any help -- here is a historical overview for today/this week from munin:
http://www.onerussian.com/tmp/munin-locdown-20150307.html
Note (if it's relevant at all; it could just be an effect of the deadlock) that there was a spike of interrupts right around the deadlock point.

@yarikoptic
Author

Please ignore the above comment about interrupts -- the plot rendering was a bit confusing; they actually went down.

@yarikoptic
Author

A fresh dump of the system info (now with stacktraces.txt, thanks @DeHackEd for spotting their omission) is available from http://www.onerussian.com/tmp//zfs_system_details_20150308/ . If I don't hear back I will reboot the beast in a few hours -- I need to get it usable again.

@yarikoptic
Author

BTW -- this stall feels different from before. Unlike before, I can still perform some actions on the mounted ZFS partitions/datasets (e.g. modify files, git commit), while other actions get stuck (docker can't do its thing even though it isn't directly associated with ZFS; the ZFS snapshot operations, arc_adapt, l2arc_feed, txg_sync, and that lengthy rm -r of a directory all seem to be stuck in D state).

behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 9, 2015
A debug patch for openzfs#3160 designed to log the holder of the ARC
hash_lock if it cannot be acquired for over 60 seconds.  This
patch may impact performance and is for debug purposes only.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#3160
@behlendorf
Contributor

@yarikoptic it appears that #2983 was significantly improved by #3121, but we didn't quite get every possible deadlock. It's clear from the backtraces you provided that the threads are all blocked on the ARC hash lock, but what isn't clear is which task is holding that lock and why it's not giving it up. Normally I'm highly suspicious of direct reclaim paths like the one taken by the l2arc_feed thread, but in this case I don't see where it would have taken the lock. I've put together a debug patch to try to catch the offending task; can you try running with the patch in #3161?
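
While waiting for the debug patch to catch the offending task, a simple way to watch for its output and for the kernel's own lockup reports is to follow the kernel log; a sketch (the hash_lock pattern is an assumption about what the patch prints, while the soft-lockup and hung-task strings are standard kernel messages):

# follow the kernel log on a console that survives the hang (serial/IPMI if possible)
tail -F /var/log/kern.log | grep -Ei 'hash_lock|soft lockup|blocked for more than'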

@behlendorf behlendorf added this to the 0.6.4 milestone Mar 9, 2015
@behlendorf
Contributor

NOTE: This may be a blocker for 0.6.4. There is clearly a potential deadlock still remaining in the code, although to my knowledge @yarikoptic's workload is the only one which has triggered it.

@dweeezil
Contributor

dweeezil commented Mar 9, 2015

Is there a description of the workload which triggers this problem? My current testing rigs for the recent illumos ARC work (purposely not referring to those pull requests to minimize non-relevant links) would allow me to test this type of thing pretty easily.

One thing that sticks out to me in the original posting is the other_size of 35627695184 out of an ARC size of 40485186048 (ouch). It would have been interesting to see /proc/slabinfo as well.
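
For anyone who wants to capture the same data points on their own box, a minimal sketch, assuming the standard ZFS-on-Linux proc paths (run as root):

# snapshot ARC statistics and kernel slab caches for later analysis
ts=$(date +%Y%m%d-%H%M%S)
cat /proc/spl/kstat/zfs/arcstats > arcstats.$ts
cat /proc/slabinfo > slabinfo.$ts
# the fields discussed in this thread
grep -E 'other_size|arc_meta|^size |^c_max ' arcstats.$ts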

@yarikoptic
Author

The workload was primarily a

tar -cf- hcp-ls.20141020-500subjects-addurl |pv | pigz -9 >| hcp-ls.20141020-500subjects-addurl.tar.gz

command, where that directory contains millions of files, symlinks and subdirectories (probably around 40 million entries altogether). Each file in there is very small (it contains just a path). The resulting tar.gz is 16GB.
In parallel with, or before, that load I may also have had find/chmod/rm sweeps running on similarly "large" directories.
If necessary -- I could share the generated tarball (there was a successful run after all) for your testing so you could simulate a similar load, but it usually took a few days for the deadlock to happen.

Unfortunately I have rebooted the beast, so I guess slabinfo wouldn't be of interest at the moment, but I have now added it to the http://github.com/yarikoptic/zfs-system-details/blob/HEAD/zfs-system-details script so it will be reported next time.
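
For anyone trying to approximate this kind of metadata-heavy tree for testing, a rough sketch (the /tank/stress path and the counts are made up, not taken from the reporter's dataset; scale the loops up to reach tens of millions of entries):

#!/bin/sh
# create a wide tree of tiny files on a ZFS dataset to stress metadata handling
base=/tank/stress                                    # hypothetical dataset mountpoint
for d in $(seq 1 1000); do
    dir=$base/dir$d
    mkdir -p "$dir"
    for f in $(seq 1 1000); do
        printf '%s\n' "$dir/file$f" > "$dir/file$f"  # tiny file containing just a path
    done
done
# then archive it the same way as the original workload
tar -cf - "$base" | pigz -9 > /tmp/stress.tar.gz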

@behlendorf
Contributor

@yarikoptic you might also want to pick up #3163. The automated testing uncovered another issue, which looks different from yours, but is still possible.

@yarikoptic
Author

gotcha, rebuilding with 3163 and will restart/resume bombardment shortly

@dweeezil
Contributor

Another workload involving the traversal (statting) of millions of files, and other_size is way out of control. YABITP (yet another big inode traversal problem). Of course, the deadlock you encountered shouldn't have happened, but with c_max at 67GB, size at 86.8GB, and other_size being 35.6GB of that, it's no wonder that Bad Things™ happen.

@yarikoptic
Author

Ain't I a unique, precious user, exposing myself to those Bad Things, @dweeezil? ;) (The beast has been rebooted with both patches in place, and the tar/chmod runs are going again.)

@yarikoptic
Author

OK -- I found myself unable to ssh into the box today, and attaching to the ipmi console showed that it seems we hit the target:

[133273.994931] BUG: soft lockup - CPU#7 stuck for 23s! [l2arc_feed:1098]

The problem is that it seems I can't even log in (I should have left an ipmi console session logged in, I guess :-/) -- I will keep trying.

@yarikoptic
Author

OK -- an attempt to log in from the physical console succeeded!
http://www.onerussian.com/tmp//zfs_system_details_20150311/
has some logs for you guys, including the messages in kern.log from that debug patch. Looking at the munin reports, oddly enough the L2ARC stats report a 270GB L2ARC size, which seems to exceed the amount of available L2ARC space (I have a 200GB Intel SSD drive): http://www.onerussian.com/tmp/munin-lockdown-20150311.html#zfs
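
One possible explanation for the 270GB figure, offered as a guess rather than a diagnosis: the l2_size counter in arcstats is the logical (uncompressed) size of what sits in the L2ARC, while l2_asize is what is physically allocated on the cache device, so with compression in play l2_size can legitimately exceed the device capacity (it could also simply be an accounting artifact). The module's own view can be checked directly, assuming your build reports these fields:

grep -E '^l2_(size|asize|hdr_size)[[:space:]]' /proc/spl/kstat/zfs/arcstats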

@dweeezil
Contributor

@yarikoptic Could you please add /proc/slabinfo?

@yarikoptic
Author

@dweeezil http://www.onerussian.com/tmp//zfs_system_details_20150311-2/ should now contain all the previous information (updated) plus slabinfo and other goodies ;) I am about to try flushing the caches, as was suggested on IRC, in an attempt to release unevictable data.

@yarikoptic
Author

And FWIW, http://www.onerussian.com/tmp//zfs_system_details_20150311-3/ should have the same info after I issued echo 3 >/proc/sys/vm/drop_caches, which I believe has not returned yet.
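
For reference, the drop_caches values behave as follows on a stock Linux kernel (nothing ZFS-specific here):

sync                                  # flush dirty pages first so more becomes reclaimable
echo 1 > /proc/sys/vm/drop_caches     # drop the page cache only
echo 2 > /proc/sys/vm/drop_caches     # drop dentries and inodes (slab-side caches)
echo 3 > /proc/sys/vm/drop_caches     # drop both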

@yarikoptic
Author

Since the box is not reachable via the network, and I won't be able to run down there for another 24h, I am inclined to reboot it within half an hour. If there is any other information I could provide before then, please let me know. After the reboot I will not bring the same load back up (since I know it would lock the box up and kill remote access) unless there is a new patch to check or a need for more information.

@prakashsurya
Member

Points of interest from zfs_system_details_20150311-2:

zio_buf_512       24203409 24249872    512    8    1 : tunables   54   27    8 : slabdata 3031234 3031234      0

zio_buf_16384     2221818 2221818  16384    1    4 : tunables    8    4    0 : slabdata 2221818 2221818      0

zfs_znode_cache                       0x00020 14141956096 11751261120     8192     1136  1726313 1726313 1983777  10357878 10344420 11902662      1     0     0

c_max                           4    67685584896
size                            4    81237672816

mru_size                        4    28278091264
mru_evict_data                  4    512
mru_evict_metadata              4    49152

mfu_size                        4    8521976320
mfu_evict_data                  4    5632
mfu_evict_metadata              4    3311104

arc_meta_used                   4    81051423600
arc_meta_limit                  4    50764188672
arc_meta_max                    4    81997413160

l2_hdr_size                     4    6246311920

I'm inclined to believe the workload breaks the ARC, as it seems to consist mainly of 512-byte buffers (perhaps spill blocks). The interesting thing is how much data is contained in the MRU and MFU yet is unevictable, so the system will just spin trying to reclaim without making any progress. These ARC buffers are likely unevictable because the dbuf cache is holding references. Some of this unevictable data is probably blocks for the dnode object (pinned by dnode_t structs), or indirect blocks (pinned by data blocks). This is all speculation, though, from just quickly glancing at the linked files.
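
The evictable fraction can be computed straight from arcstats; a small sketch using the field names shown above (0.6.x naming):

# print how much of the MRU/MFU lists is actually evictable
awk '$1=="mru_size"{ms=$3} $1=="mru_evict_data"{md=$3} $1=="mru_evict_metadata"{mm=$3}
     $1=="mfu_size"{fs=$3} $1=="mfu_evict_data"{fd=$3} $1=="mfu_evict_metadata"{fm=$3}
     END{printf "mru evictable: %.6f%%\nmfu evictable: %.6f%%\n",
         100*(md+mm)/ms, 100*(fd+fm)/fs}' /proc/spl/kstat/zfs/arcstats

With the numbers above this works out to essentially zero percent for both lists, which is exactly what makes the reclaim spin.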

@dweeezil
Contributor

Spill blocks generally wind up in kmalloc-512, which is low in this case. The zio_buf_impl_t entries are likely bonus buffers, which should generally track dnode_t. I've been running my recent testing with a patch which breaks down the components of other_size to help in matters like this. I've also considered putting the bonus buffers in their own cache instead of letting them share the zio cache (mainly to track them better).
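
A quick way to eyeball the caches mentioned here, sketched with the names from the excerpts above (the SPL path is an assumption; it is only useful if your SPL build exposes /proc/spl/kmem/slab):

# Linux slabs: zio_buf_* and kmalloc-512 live here
grep -E '^(zio_buf_512|zio_buf_16384|kmalloc-512)[[:space:]]' /proc/slabinfo
# SPL-managed slabs such as zfs_znode_cache and dnode_t
grep -E 'zfs_znode_cache|dnode_t' /proc/spl/kmem/slab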

@snajpa
Contributor

snajpa commented Mar 11, 2015

We're having problems with ZFS-induced lock-ups really often too; there's at least one lock-up a day, and lately it has gotten worse than it had been. Since about ~4 weeks ago, on every lockup I can't even SSH into the box to get the traces (the rootfs isn't on ZFS, it's ext4; ZFS is only for containers).

It happens mostly when the workload shifts -- we're balancing with all of the RAM used for applications and the rest going to ARC. During the daytime it all works OK, but the problems come in the early morning hours when the backups run. We back up with an rsync+NFS combo, since we haven't implemented send/recv backups yet. People also do their own backups during the night.

This needs to be resolved soon, as it is now beginning to impact our reputation; we've had so much down-time that people are becoming nervous and are thinking about moving to another host. Although we're a nonprofit org, so people cut us some slack, it still isn't a good position, and if nothing else it causes a lot of stress for me as an admin.

If anyone has any suggestions on what I can do when the box locks up and SSH connections time out, I'm all game.
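
One generic approach when SSH is dead but a serial/IPMI console still responds is to have the kernel dump task state itself; a sketch using standard Linux facilities (nothing ZFS-specific, and hung-task reporting requires a kernel built with CONFIG_DETECT_HUNG_TASK):

# set these ahead of time, e.g. from /etc/sysctl.d/
sysctl -w kernel.sysrq=1                       # allow all SysRq functions
sysctl -w kernel.hung_task_timeout_secs=120    # report long-blocked D-state tasks with stacks
# when the box wedges, from the IPMI/serial console:
echo w > /proc/sysrq-trigger    # dump blocked (D-state) tasks
echo t > /proc/sysrq-trigger    # dump all task states (very verbose)
echo l > /proc/sysrq-trigger    # backtraces of all active CPUs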

@kernelOfTruth
Contributor

@snajpa I know it's not really a solution, but would rate-limiting help as a temporary mitigation?

(provided, of course, that the backups are not so large that they can no longer finish under the limit)

rsync --bwlimit=<kb/second> <source> <dest>

I remember having read about a similar (rsync- and load-related) issue with Btrfs where this temporarily helped.
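
For example, applying that suggestion (the 20 MB/s figure and the paths are placeholders only):

# --bwlimit is in KB/s, so 20480 caps the transfer at roughly 20 MB/s
rsync -a --bwlimit=20480 /tank/containers/ backuphost:/backups/containers/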

@snajpa
Contributor

snajpa commented Mar 11, 2015

@kernelOfTruth I've thought about that, but then we wouldn't manage to back up all the CTs in time before the daily peak load, when the IO capacity is needed for more useful things than doing backups. Also, this doesn't do anything about the custom backups which people do on their own inside their containers.

kernelOfTruth pushed two commits to kernelOfTruth/zfs that referenced this issue Mar 19, 2015 (earlier revisions of the "Restructure per-filesystem reclaim" and "Fix arc_adjust_meta() behavior" patches; the final commit messages appear below)
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 20, 2015
The arc_meta_max value should be increased when space is consumed, not when
it is returned.  This ensures that arc_meta_max is always up to date.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue openzfs#3160
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 20, 2015
Originally when the ARC prune callback was introduced the idea was
to register a single callback for the ZPL.  The ARC could invoke this
callback if it needed the ZPL to drop dentries, inodes, or other
cache objects which might be pinning buffers in the ARC.  The ZPL
would iterate over all ZFS super blocks and perform the reclaim.

For the most part this design has worked well but due to limitations
in 2.6.35 and earlier kernels there were some problems.  This patch
is designed to address those issues.

1) iterate_supers_type() is not provided by all kernels which makes
it impossible to safely iterate over all zpl_fs_type filesystems in
a single callback.  The most straight forward and portable way to
resolve this is to register a callback per-filesystem during mount.
The arc_*_prune_callback() functions have always supported multiple
callbacks so this is functionally a very small change.

2) Commit 050d22b removed the non-portable shrink_dcache_memory()
and shrink_icache_memory() functions and didn't replace them with
equivalent functionality.  This meant that for Linux 3.1 and older
kernels the ARC had no mechanism to drop dentries and inodes from
the caches if needed.  This patch adds that missing functionality
by calling shrink_dcache_parent() to release dentries which may be
pinning inodes.  This will result in all unused cache entries being
dropped which is a bit heavy handed but it's the only interface
available for old kernels.

3) A zpl_drop_inode() callback is registered for kernels older than
2.6.35 which do not support the .evict_inode callback.  This ensures
that when the last reference on an inode is dropped it is immediately
removed from the cache.  If this isn't done then inodes can end up on
the global unused LRU with no mechanism available to ZFS to drop them.
Since the ARC buffers are not dropped the hottest inodes can still
be recreated without performing disk IO.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue openzfs#3160
behlendorf added a commit to behlendorf/zfs that referenced this issue Mar 20, 2015
The goal of this function is to evict enough meta data buffers from the
ARC in order to enforce the arc_meta_limit.  Achieving this is slightly
more complicated than it appears because it is common for data buffers
to have holds on meta data buffers.  In addition, dnode meta data buffers
will be held by the dnodes in the block preventing them from being freed.
This means we can't simply traverse the ARC and expect to always find
enough unheld meta data buffers to release.

Therefore, this function has been updated to make alternating passes
over the ARC releasing data buffers and then newly unheld meta data
buffers.  This ensures forward progress is maintained and arc_meta_used
will decrease.  Normally this is sufficient, but if required the ARC
will call the registered prune callbacks causing dentry and inodes to
be dropped from the VFS cache.  This will make dnode meta data buffers
available for reclaim.  The number of total restarts is limited by
zfs_arc_meta_adjust_restarts to prevent spinning in the rare case
where all meta data is pinned.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Issue openzfs#3160
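
Once you are on a build that carries this change, the restart cap should be visible alongside the other ARC tunables as a module parameter; a sketch for checking (parameter name taken from the commit message above, availability depends on your zfs version):

# does this build expose the tunable at all?
modinfo -p zfs | grep -i arc_meta_adjust_restarts
# read the current value if it is there
cat /sys/module/zfs/parameters/zfs_arc_meta_adjust_restarts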
@behlendorf
Contributor

Refreshed patches are available in #3202; they address @snajpa's issue and should help others as well.

behlendorf added a commit that referenced this issue Mar 20, 2015 (Fix arc_meta_max accounting)
behlendorf added a commit that referenced this issue Mar 20, 2015 (Restructure per-filesystem reclaim)
behlendorf added a commit that referenced this issue Mar 20, 2015 (Fix arc_adjust_meta() behavior)
@behlendorf
Contributor

These changes have been merged to master to address this issue. If you're experiencing something similar, please update to zfs-0.6.3-245 or newer.

bc88866 Fix arc_adjust_meta() behavior
2cbb06b Restructure per-filesystem reclaim
596a893 Fix arc_meta_max accounting
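
To confirm which build you actually ended up on after updating, a quick sketch (standard ZFS-on-Linux locations):

cat /sys/module/zfs/version             # version of the loaded module
dmesg | grep 'ZFS: Loaded module'       # version logged when the module loaded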

kernelOfTruth pushed commits to kernelOfTruth/zfs that referenced this issue on Mar 21 and Mar 22, 2015 (cherry-picks of the commits above; messages identical)
DeHackEd pushed commits to DeHackEd/zfs that referenced this issue on Mar 22, 2015 (cherry-picks of the commits above; messages identical)
@yarikoptic
Author

FWIW -- I'm running master as of bc88866. It seems to have been stable for the past few days, and there are no crazy ARC consumption jumps as before.

kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this issue Mar 24, 2015
in accordance with

40749aa "Use MUTEX_FSTRANS on l2arc_buflist_mtx"

Original commit message below:

Use MUTEX_FSTRANS on l2arc_buflist_mtx to prevent the following deadlock
scenario:
1. arc_release() -> hash_lock -> l2arc_buflist_mtx
2. l2arc_write_buffers() -> l2arc_buflist_mtx -> (direct reclaim) ->
   arc_buf_remove_ref() -> hash_lock

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue openzfs#3160
@behlendorf
Contributor

@yarikoptic Good news, thanks for the update.

DeHackEd pushed commits to DeHackEd/zfs that referenced this issue on Apr 4 and Apr 5, 2015 (cherry-picks of the commits above; messages identical)