High load, spinning in shrink_slab #420
Comments
As to the spinning in `shrink_slab`: so long as the total number of objects freed is greater than 10, it keeps looping. With XFS, ZFS, etc. in your system, there are likely quite a few registered shrinkers and they're likely freeing something.

As to your original problem of large amounts of time being spent in `shrink_slab`: from the output above, it's hard for me to get a high-level overview of the memory situation on your system when the problem occurs. What do the various memory stats show at that point?

In summary, I suggest trying to figure out why the system is getting into the high-load thrashing state rather than attempting to work around it with `drop_caches`.
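For reference, a paraphrased sketch of the loop being described, modeled on the `drop_slab()` path in ~3.1x mainline kernels; this is illustrative rather than verbatim kernel source, and `drop_slab_sketch` is just the name used here:

```c
#include <linux/gfp.h>
#include <linux/shrinker.h>

/*
 * Paraphrased sketch of drop_slab() (mm/vmscan.c) in ~3.1x kernels;
 * details vary by version, not verbatim source. Each pass asks every
 * registered shrinker (ZFS/SPL, XFS, ext4, dentry/inode caches, ...)
 * to free objects, and the loop only stops once a whole pass frees
 * 10 objects or fewer -- hence "so long as more than 10 objects are
 * freed, it keeps looping".
 */
static void drop_slab_sketch(void)
{
	unsigned long nr_objects;
	struct shrink_control shrink = {
		.gfp_mask = GFP_KERNEL,
	};

	do {
		/* walk all registered shrinkers, summing what they freed */
		nr_objects = shrink_slab(&shrink, 1000, 1000);
	} while (nr_objects > 10);
}
```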
Ouch! Yes, I can see how that loop could keep things spinning. And yes, I'd much rather solve the underlying problem than attempt to paper over it with an ugly `drop_caches` workaround.

There is swap on the system, but it's not being used.

I'll look at the stats as suggested when the problem reoccurs, and report back.
OK, I stopped issuing `drop_caches`. An extract of the various stats as suggested (logged in a custom format to increase density), covering the period of the load spike:
During this period perf shows (sampled 07:49:28 - 07:50:40):
I.e. whilst we're spending 30.87% of our time in `shrink_slab`, most of that doesn't appear to be ZFS's doing.

If that interpretation is correct, the "spinning" I'm seeing in `shrink_slab` is being driven by the other registered shrinkers rather than by ZFS itself. I.e. false alarm as far as ZOL is concerned?
@chrisrd After looking this over, I came across another issue caused by the 3.12 split shrinker callbacks. Check out openzfs/zfs#2975. I've still not analyzed your new data thoroughly but there's a chance your problem might be helped by this patch. Does your system use any non-zfs filesystems?

I noticed openzfs/zfs#2975 come in and wondered if it's related: it's definitely in the area of concern. Yes, independent of ZFS I have a small number (8) of XFS file systems and an ext4.

Now running w/ openzfs/zfs#2975, will see if it helps with the load spikes.
[dweeezil] The split count/scan shrinker callbacks introduced in 3.12 broke the test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers. This patch re-enables the per-superblock shrinkers when the split shrinker callbacks have been detected. openzfs#2975 openzfs/spl#420
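For context, a minimal sketch of the kernel API change the commit refers to: in Linux 3.12 the single `->shrink()` callback was split into `->count_objects()` / `->scan_objects()`. The struct member names below are from mainline Linux, but the example shrinker itself is hypothetical:

```c
#include <linux/shrinker.h>

/*
 * Hypothetical shrinker illustrating the Linux 3.12 split callback API.
 * Configure-time checks (such as the HAVE_SHRINK test mentioned above)
 * that only probed for the old single ->shrink() callback stopped
 * matching on 3.12+, which effectively disabled the per-superblock
 * shrinkers until the commit above fixed the detection.
 */
static unsigned long example_count_objects(struct shrinker *s,
					   struct shrink_control *sc)
{
	/* report how many freeable objects this cache currently holds */
	return 0;
}

static unsigned long example_scan_objects(struct shrinker *s,
					  struct shrink_control *sc)
{
	/* try to free up to sc->nr_to_scan objects; return the number
	 * freed, or SHRINK_STOP if nothing can be done right now */
	return SHRINK_STOP;
}

static struct shrinker example_shrinker = {
	.count_objects	= example_count_objects,
	.scan_objects	= example_scan_objects,
	.seeks		= DEFAULT_SEEKS,
};
/* registered during module init with register_shrinker(&example_shrinker) */
```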
@dweeezil With 2 weeks of running openzfs/zfs#2975 it has definitely helped my load spikes. Thanks!

@chrisrd Thanks for the feedback, this is something we'll be looking to get merged soon. It's good to get confirmation it helps.

openzfs/zfs#2975 has been merged.

Right you are, closing.
I'm trying to chase down large load spikes, from a baseline steady-state load average of 4-6, spiking to up to 30-50 and occasionally >100. Perf tells me these spikes are associated with large amounts of time (e.g. 62%) spent in `shrink_slab`.

My ZFS is the destination for rsync backups. The pool has about 1700 filesystems on it and around 13600 snapshots, with the filesystems receiving backups one at a time. Some filesystems have large numbers (millions) of small files. The box also runs `ceph-osd`, on top of XFS rather than ZFS.

In trying to alleviate this issue I've been experimenting with `echo {1,2,3} > /proc/sys/vm/drop_caches` when the load starts climbing, having noted others recommending that `drop_caches` had worked for them. This has succeeded a number of times (i.e. nothing untoward has happened), although I've yet to establish if it's definitely helping (the smaller load spikes generally look after themselves anyway, although the larger ones can cause havoc).

However yesterday, on reaching a load average of 26, I issued a `sync; time echo 3 > /proc/sys/vm/drop_caches`. The load reduced after 2 minutes (which may or may not be related to the `drop_caches`), however one hour later my `echo` still hadn't returned, and `kill -9` on the bash process didn't work. An rsync writing to ZFS was also hung and not responding to `kill -9`.

Whilst this was occurring perf was showing `shrink_slab` at 16.77% of the load, running out of the bash instance doing the `drop_caches`:

I eventually hard reset the box after another 45 minutes of poking around, with the `echo` still hung.

I.e. unfortunately it looks like commit c1aef26944 was premature with its "...occasional reports of spinning in shrink_slabs(). Those issues have been resolved and can no longer be reproduced."
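For reference on where that hang sits, a paraphrased sketch of what writing 3 to `/proc/sys/vm/drop_caches` does in kernels of this era (based on `fs/drop_caches.c`, not verbatim source; the function name `drop_caches_sketch` is just for illustration):

```c
#include <linux/fs.h>
#include <linux/mm.h>

/*
 * Paraphrased sketch of the drop_caches sysctl path (fs/drop_caches.c)
 * in ~3.1x kernels; not verbatim source. "echo 3" sets both bits, so
 * the writing process first drops clean page cache and then calls
 * drop_slab(), which loops over every registered shrinker -- this is
 * where the hung echo was accumulating its time in shrink_slab.
 */
static void drop_caches_sketch(int sysctl_drop_caches)
{
	if (sysctl_drop_caches & 1)	/* bit 0: page cache */
		iterate_supers(drop_pagecache_sb, NULL);
	if (sysctl_drop_caches & 2)	/* bit 1: slab objects (dentries, inodes, ...) */
		drop_slab();
}
```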
...oops! And as I've been writing this, it's done it again: my automated process for doing the `drop_caches` on high load per above kicked in at a load of 11 nearly 2 hours ago, the `echo` still hasn't returned, the bash is unkillable, and `ps` shows it building up TIME at around 1s per s. Perf shows the same call chain, except this time it's causing about 25% of the load instead of 16%. Some memory stats:

So it looks like I can reasonably readily reproduce this issue.

What else should I be looking at?