This repository has been archived by the owner on Feb 26, 2020. It is now read-only.

High load, spinning in shrink_slab #420

Closed
chrisrd opened this issue Dec 9, 2014 · 10 comments

@chrisrd
Contributor

chrisrd commented Dec 9, 2014

I'm trying to chase down large load spikes: from a baseline steady-state load average of 4-6, the load spikes up to 30-50 and occasionally >100. Perf tells me these spikes are associated with large amounts of time (e.g. 62%) spent in shrink_slab.
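
(For reference, system-wide call-graph profiles of this kind can be captured with something along the following lines; the exact perf invocation isn't shown in this report and the 30-second sampling window is arbitrary.)

# sample all CPUs with call graphs for 30 seconds, then summarise
perf record -a -g -- sleep 30
perf report --stdio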

linux-3.14.23, zfs-11662bf, spl-7f118e8 (i.e. both current HEAD)
80GB RAM, 2 x Xeon E5620 (16 cores total)
l2arc_noprefetch = 0
zfs_arc_max=17179869184 zfs_arc_meta_limit=17179869184
# zfs get primarycache,secondarycache,dedup pool
NAME  PROPERTY        VALUE           SOURCE
pool  primarycache    metadata        local
pool  secondarycache  metadata        local
pool  dedup           off             default

My ZFS is the destination for rsync backups. The pool has about 1700 filesystems on it and around 13600 snapshots, with the filesystems receiving backups one at a time. Some filesystems have large numbers (millions) of small files. The box also runs ceph-osd, on top of XFS rather than ZFS.

In trying to alleviate this issue I've been experimenting with echo {1,2,3} > /proc/sys/vm/drop_caches when the load starts climbing, having noted others reporting that drop_caches had worked for them. This has succeeded a number of times (i.e. nothing untoward has happened), although I've yet to establish whether it's definitely helping (the smaller load spikes generally look after themselves anyway, although the larger ones can cause havoc).

However yesterday, on reaching a load average of 26, I issued a sync; time echo 3 > /proc/sys/vm/drop_caches. The load reduced after 2 minutes (which may or may not be related to the drop_caches); however, one hour later my echo still hadn't returned, and kill -9 on the bash process didn't work. An rsync writing to ZFS was also hung and not responding to kill -9.

Whilst this was occurring, perf showed shrink_slab at 16.77% of the load, with the call chain originating in the bash instance doing the drop_caches:

bash @ 16.79% 
  + write @ 16.77%
  + system_call_fastpath @ 16.77%
  + sys_write @ 16.77%
  + vfs_write @ 16.77%
  + proc_sys_write @ 16.77%
  + proc_sys_call_handler @ 16.77%
  + drop_caches_sysctl_handler @ 16.77%
  + shrink_slab @ 16.77%
  + shrink_slab_node @ 16.71%
  + spl_kmem_cache_generic_shrinker_scan_objects @ 16.07%
  + __spl_kmem_cache_generic_shrinker.isra.5 @ 16.07%
  + spl_kmem_cache_reap_now @ 16.04%
  + kmem_cache_shrink @ 15.93%

I eventually hard reset the box after another 45 minutes of poking around, with the echo still hung.

I.e. unfortunately it looks like commit c1aef26944 was premature with its "...occasional reports of spinning in shrink_slabs(). Those issues have been resolved and can no longer can be reproduced."

...oops! And as I've been writing this, it's done it again: my automated process for doing the drop_caches on high load (per above) kicked in at a load of 11 nearly 2 hours ago. The echo still hasn't returned, the bash is unkillable, and ps shows it accumulating TIME at around 1s per second. Perf shows the same call chain, except this time it's causing about 25% of the load instead of 16%. Some memory stats:

# egrep '^(size|meta_size|other_size)' /proc/spl/kstat/zfs/arcstats
size                            4    2158658392       
meta_size                       4    1608075264
other_size                      4    438915760

# awk '$3 != 0 || $4 != 0 { print; }' /proc/spl/kmem/slab
--------------------- cache -------------------------------------------------------  ----- slab ------  ---- object -----  --- emergency ---
name                                    flags      size     alloc slabsize  objsize  total alloc   max  total alloc   max  dlock alloc   max
ddt_cache                             0x00040    524288    499680   524288    24984      1     1     2     20    20    40      0     0     0
zio_vdev_cache                        0x00040 390070272 357870240  4194304   131088     93    93   172   2790  2730  5138      0     0     0
zio_buf_20480                         0x00040    524288     20480   524288    20480      1     1    24     21     1   504      0     0     0
zio_buf_32768                         0x00040   5242880   2129920  1048576    32768      5     5    17    140    65   476      0     0     0
zio_buf_36864                         0x00040   1048576     36864  1048576    36864      1     1    20     25     1   500      0     0     0
zio_buf_61440                         0x00040   8388608   5836800  2097152    61440      4     4    22    124    95   682      0     0     0
zio_data_buf_61440                    0x00040   2097152   1843200  2097152    61440      1     1     2     31    30    62      0     0     0
zio_buf_65536                         0x00040   4194304    131072  2097152    65536      2     2    22     60     2   660      0     0     0
zio_data_buf_65536                    0x00040   2097152   1966080  2097152    65536      1     1     4     30    30   120      0     0     0
zio_buf_110592                        0x00040   6291456   3428352  2097152   110592      3     3    38     54    31   684      0     0     0
zio_buf_131072                        0x00040  20971520  14286848  4194304   131072      5     5    12    155   109   365      0     0     0
zio_data_buf_131072                   0x00040  20971520  16121856  4194304   131072      5     5   818    155   123 25343      0     0     0
zfs_znode_cache                       0x00020  65798144  61201344    65536     2208   1004  1004 20356  29116 27718 590324      0     0     0

# slabtop -sc -o | head -30
 Active / Total Objects (% used)    : 2770308 / 3655176 (75.8%)
 Active / Total Slabs (% used)      : 141195 / 141195 (100.0%)
 Active / Total Caches (% used)     : 88 / 125 (70.4%)
 Active / Total Size (% used)       : 2639563.84K / 3121544.20K (84.6%)
 Minimum / Average / Maximum Object : 0.01K / 0.85K / 16.00K

   OBJS  ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 109546   96574  88%   16.00K  54773        2   1752736K zio_buf_16384
 159056  151228  95%    1.92K   9941       16    318112K dnode_t
 239520  152662  63%    0.79K   5988       40    191616K dmu_buf_impl_t
  41504   40551  97%    4.00K   5188        8    166016K kmalloc-4096
 214304  106333  49%    0.50K   6697       32    107152K kmalloc-512
 133320  107948  80%    0.80K   3333       40    106656K arc_buf_hdr_t   
 995072  670360  67%    0.06K  15548       64     62192K kmalloc-64
  28000   27056  96%    2.00K   1750       16     56000K kmalloc-2048
  54528   36777  67%    0.98K   1704       32     54528K inode_cache
 187479  109602  58%    0.21K   5067       37     40536K blkdev_ioc
 116350   72317  62%    0.31K   4654       25     37232K dentry
  33344   31788  95%    1.00K   1042       32     33344K kmalloc-1024
  51100   17106  33%    0.55K   1825       28     29200K radix_tree_node
 152736  152736 100%    0.16K   6364       24     25456K kernfs_node_cache
   4284    3960  92%    5.05K    714        6     22848K task_struct
  11610    4330  37%    1.69K    645       18     20640K xfs_inode

# cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      0      2      1      1      0      1      1      3
Node 0, zone    DMA32    521   1740   1610   1513   1189   1598   1274    936    548    252     93
Node 0, zone   Normal   7060   7400  12075  14092  12187  13900  10267   7369   4499   2632   1614
Node 1, zone   Normal   8397  13228  13411  19810  13492  10612   8253   6035   4079   2488   1961

So it looks like I can reasonably readily reproduce this issue.

What else should I be looking at?

@dweeezil
Contributor

dweeezil commented Dec 9, 2014

As to drop_caches, on a sufficiently busy system, it is certainly possible for it to hang or take a very long time. Its worker function is drop_slab() from fs/drop_caches.c and it does this:

        do {
                nr_objects = shrink_slab(&shrink, 1000, 1000);
        } while (nr_objects > 10);

So long as the total number of objects freed is greater than 10, it keeps looping. With XFS, ZFS, etc. in your system, there are likely quite a few registered shrinkers and they're likely freeing something.

As to your original problem of large amounts of time being spent in shrink_slab and high load, I'd suggest trying to figure out why the shrinkers are being called so much rather than using drop_caches in an attempt to work around it. Given the implementation of drop_slab() an automated drop_caches seems pretty dangerous (although it's been frequently mentioned on this list as a good tonic for various ills).

From the output above, it's hard for me to get a high-level overview of the memory situation on your system when the problem occurs. What does free -[h|m] show when the system is in this state? I also like to keep an eye on memory_[in]direct_count from grep memory /proc/spl/kstat/zfs/arcstats, which shows how much the ARC shrinker is being called and in what context (direct versus indirect reclaim). Do you have any swap on the system? If so, is any being used? It might be interesting to check grep buffer_head /proc/slabinfo if free shows a lot of buffers in use. Also, double-check vmalloc use with grep Vmalloc /proc/meminfo, which should mostly be used by ZFS.
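
For example (just a rough sketch; the 60-second interval and the log path are arbitrary), all of those could be snapshotted periodically with something like:

# rough sketch: snapshot the suggested stats once a minute
# (interval and log path are arbitrary)
while sleep 60; do
    {
        date
        free -m
        grep memory_ /proc/spl/kstat/zfs/arcstats
        grep buffer_head /proc/slabinfo
        grep Vmalloc /proc/meminfo
    } >> /var/tmp/memstats.log
done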

In summary, I suggest trying to figure out why the system is getting into the high-load thrashing state rather than attempting to work around it with drop_caches.

@chrisrd
Contributor Author

chrisrd commented Dec 9, 2014

Ouch! Yes, I can see how that drop_slab loop could easily be problematic on a busy system. It certainly looks like it could be better implemented, but that's an issue for LKML.

And yes, I'd much rather solve the underlying problem than attempt to paper over it with an ugly drop_caches hack. (My "automated" system was a one-shot: it monitored the load in a 60-second loop, dropped out to do the drop_caches once only, and sent me an alert to look at the system - more an attempt to characterise the problem than to fix it.)
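
For reference, the one-shot was roughly of this shape (the load threshold and the alert mechanism here are illustrative placeholders, not the exact ones used):

# rough shape of the one-shot monitor described above;
# the threshold (20) and the alert command are placeholders
while sleep 60; do
    load=$(awk '{ print int($1) }' /proc/loadavg)
    if [ "$load" -ge 20 ]; then
        logger "load $load - dropping caches"   # placeholder alert
        sync
        echo 3 > /proc/sys/vm/drop_caches
        break   # one-shot: act once, then exit
    fi
done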

There is swap on the system, but it's not being used (according to vmstat) when facing the high load issue.

I'll look at the stats as suggested when the problem reoccurs, and report back.

@chrisrd
Contributor Author

chrisrd commented Dec 18, 2014

OK, I stopped issuing drop_caches, and the system has been through a number of load spikes up to 30-40 or so and recovered. I had another load spike to over 100 (which the system also recovered from)...

An extract of the various stats as suggested (logged in a custom format to increase density), covering the period of the load spike - loadavg, some arcstats including memory_[in]direct_count, and free (including the Vmalloc fields from /proc/meminfo):

VmallocTotal: 33554431, MemTotal: 64456, SwapTotal: 20463

                    |    loadavg       | arcstats (MiB)     Mem (Count) | free (MiB)
                        5m   10m   15m  Size  Meta Other MemDir  MemInd  VMUsed  VMChunk  MemUsed  MemFree  Buffers   Cached   BCUsed   BCFree   SwUsed   SwFree
2014-12-18 07:40:29   3.19  5.44  6.87  5313  1998  2992  26635 1988160    1759 33519375    63540      916      135    44720    18684    45771       74    20389
2014-12-18 07:41:29   2.48  4.80  6.55  5315  1999  2992  26635 1988160    1790 33519375    63586      870      135    44734    18716    45740       74    20389
2014-12-18 07:42:29   2.10  4.26  6.26  5317  2001  2992  26635 1988160    1811 33519375    63613      843      135    44740    18737    45718       74    20389
2014-12-18 07:43:29   2.07  3.82  5.98  5306  1997  2981  26635 1988235    1819 33519375    63580      876      135    44727    18717    45739       74    20389
2014-12-18 07:44:29   6.14  4.79  6.18  4745  1932  2487  26635 1989276    1859 33519375    63375     1081      135    44863    18376    46080       74    20389
2014-12-18 07:45:29  21.75  9.06  7.54  5373  1907  2378  27117 1990055    1886 33518939    63807      649      135    45661    18010    46446       74    20389
2014-12-18 07:46:30  52.30 20.99 11.79  5373  1907  2378  27168 1990055    1866 33518939    63814      642      135    45686    17992    46464       74    20389
2014-12-18 07:47:30  66.26 30.65 15.68  5387  1910  2388  27286 1990214    1866 33518939    63814      642      135    45705    17973    46482       74    20389
2014-12-18 07:48:30 101.87 46.48 22.00  5387  1911  2388  27334 1990214    1870 33518939    63716      740      135    45607    17973    46483       74    20389
2014-12-18 07:49:30  96.88 55.33 26.59  5404  1912  2403  27416 1990418    1878 33518939    63538      917      135    45326    18076    46380       64    20399
2014-12-18 07:50:31  75.74 55.49 28.36  5410  1913  2408  27486 1990753    1866 33518939    63765      691      135    45516    18113    46343       61    20402
2014-12-18 07:51:31  38.08 49.06 27.89  5423  1911  2405  27493 1990761    1930 33518939    63068     1388      137    44906    18024    46432       59    20403
2014-12-18 07:52:31  16.72 40.98 26.44  5426  1918  2401  27493 1991044    1883 33518939    62861     1595      137    44782    17941    46514       59    20403
2014-12-18 07:53:31  13.64 35.58 25.49  5050  1952  2405  27493 1991044    1895 33518939    62922     1534      137    44791    17994    46462       59    20403
2014-12-18 07:54:31  10.20 30.82 24.50  4731  1980  2408  27493 1991044    1907 33518939    62964     1492      137    44789    18037    46419       59    20403
2014-12-18 07:55:31   4.62 25.46 23.05  4725  1976  2407  27493 1991044    1907 33518939    62963     1493      137    44789    18037    46419       59    20403
2014-12-18 07:56:31   2.60 21.09 21.70  4718  1970  2406  27493 1991044    1907 33518939    62966     1489      137    44792    18037    46419       59    20403
2014-12-18 07:57:31   2.03 17.56 20.45  4713  1966  2405  27493 1991044    1907 33518939    62980     1476      137    44805    18038    46418       59    20403
2014-12-18 07:58:31   3.31 15.12 19.43  4718  1971  2405  27493 1991044    1907 33518939    62980     1476      137    44803    18040    46416       59    20403
2014-12-18 07:59:31   1.82 12.54 18.27  4718  1971  2405  27493 1991044    1907 33518939    62991     1465      137    44813    18041    46415       59    20403

And slabinfo's buffer_head during the same period:

                    # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
2014-12-18 07:40:13 buffer_head       4806248 6812559    104   39    1 : tunables    0    0    0 : slabdata 174681 174681      0
2014-12-18 07:41:13 buffer_head       4807128 6812559    104   39    1 : tunables    0    0    0 : slabdata 174681 174681      0
2014-12-18 07:42:13 buffer_head       4808704 6812559    104   39    1 : tunables    0    0    0 : slabdata 174681 174681      0
2014-12-18 07:43:13 buffer_head       4806611 6812559    104   39    1 : tunables    0    0    0 : slabdata 174681 174681      0
2014-12-18 07:44:13 buffer_head       4803184 6812559    104   39    1 : tunables    0    0    0 : slabdata 174681 174681      0
2014-12-18 07:45:16 buffer_head       4801261 6812559    104   39    1 : tunables    0    0    0 : slabdata 174681 174681      0
2014-12-18 07:46:41 buffer_head       4800286 6812559    104   39    1 : tunables    0    0    0 : slabdata 174681 174681      0
2014-12-18 07:48:48 buffer_head       4817948 6812442    104   39    1 : tunables    0    0    0 : slabdata 174678 174678      0
2014-12-18 07:50:39 buffer_head       4907317 6810726    104   39    1 : tunables    0    0    0 : slabdata 174634 174634      0
2014-12-18 07:51:39 buffer_head       4982631 6807489    104   39    1 : tunables    0    0    0 : slabdata 174551 174551      0
2014-12-18 07:52:39 buffer_head       5047687 6805149    104   39    1 : tunables    0    0    0 : slabdata 174491 174491      0
2014-12-18 07:53:39 buffer_head       5051432 6805110    104   39    1 : tunables    0    0    0 : slabdata 174490 174490      0
2014-12-18 07:54:39 buffer_head       5049710 6805110    104   39    1 : tunables    0    0    0 : slabdata 174490 174490      0
2014-12-18 07:55:39 buffer_head       5048452 6805110    104   39    1 : tunables    0    0    0 : slabdata 174490 174490      0
2014-12-18 07:56:39 buffer_head       5049556 6805110    104   39    1 : tunables    0    0    0 : slabdata 174490 174490      0
2014-12-18 07:57:39 buffer_head       5051583 6805110    104   39    1 : tunables    0    0    0 : slabdata 174490 174490      0
2014-12-18 07:58:39 buffer_head       5048563 6805110    104   39    1 : tunables    0    0    0 : slabdata 174490 174490      0
2014-12-18 07:59:39 buffer_head       5051784 6805032    104   39    1 : tunables    0    0    0 : slabdata 174488 174488      0

During this period perf shows (sampled 07:49:28 - 07:50:40):

  --30.90%-- try_to_free_pages
            |
            |--30.90%-- do_try_to_free_pages
                       |
                       |--30.87%-- shrink_slab
                                  |
                                  |--30.80%-- shrink_slab_node
                                             |
                                             |--30.65%-- super_cache_count
                                                        |
                                                        |--17.55%-- drop_super
                                                        |
                                                        |--11.96%-- grab_super_passive

I.e. whilst we're spending 30.87% of our time in shrink_slab, the vast bulk of that time is in super_cache_count. Probably due to the number of mounted filesystems?

# wc -l /etc/mtab
12593 /etc/mtab
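
(A quick way to see how those mounts break down by filesystem type, for anyone curious:)

# count mounts per filesystem type
awk '{ print $3 }' /proc/mounts | sort | uniq -c | sort -rn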

If that interpretation is correct, the "spinning" I'm seeing in shrink_slab is down to the number of mounted filesystems and has nothing to do with ZFS per se.

I.e. false alarm as far as ZOL is concerned?

@dweeezil
Contributor

@chrisrd After looking this over, I came across another issue caused by the 3.12 split shrinker callbacks. Check out openzfs/zfs#2975. I've still not analyzed your new data thoroughly but there's a chance your problem might be helped by this patch.

Does your system use any non-zfs filesystems?

@chrisrd
Contributor Author

chrisrd commented Dec 18, 2014

I noticed openzfs/zfs#2975 come in and wondered if it's related: it's definitely in the area of concern.

Yes, independent of ZFS I have a small number (8) of XFS file systems and an ext4.

@chrisrd
Contributor Author

chrisrd commented Dec 19, 2014

Now running w/ openzfs/zfs#2975, will see if it helps with the load spikes.
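
(For anyone wanting to test the same change, one way to build against that pull request is to fetch GitHub's pull-request ref; the local branch name below is arbitrary.)

# fetch and check out openzfs/zfs pull request 2975 for testing
git clone https://github.com/openzfs/zfs.git
cd zfs
git fetch origin pull/2975/head:pr-2975   # local branch name is arbitrary
git checkout pr-2975
# then rebuild and reinstall the zfs modules as usual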

kernelOfTruth added a commit to kernelOfTruth/zfs that referenced this issue Dec 26, 2014
[dweeezil]

The split count/scan shrinker callbacks introduced in 3.12 broke the
test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers.

This patch re-enables the per-superblock shrinkers when the split shrinker
callbacks have been detected.

openzfs#2975

openzfs/spl#420
@chrisrd
Contributor Author

chrisrd commented Jan 4, 2015

@dweeezil After 2 weeks of running openzfs/zfs#2975, it has definitely helped with my load spikes. Thanks!

@behlendorf
Contributor

@chrisrd Thanks for the feedback; this is something we'll be looking to get merged soon. It's good to get confirmation that it helps.

@behlendorf behlendorf added this to the 0.6.4 milestone Jan 6, 2015
@behlendorf behlendorf added the Bug label Jan 6, 2015
@ryao
Contributor

ryao commented Mar 19, 2015

openzfs/zfs#2975 has been merged.

@behlendorf
Contributor

Right you are, closing.
