ZFS 0.6.5.3 servers hang trying to get mutexes #4106

Closed
jdavidlists opened this issue Dec 16, 2015 · 88 comments

@jdavidlists

After upgrading several of our NFS fileservers to zfs 0.6.5.3-1~trusty on Ubuntu 14.04.3 LTS with kernel 3.13.0-71-generic, a combination which did fine in all testing, we are suddenly seeing hangs in the ZFS code that cascade down to NFS and render the whole server unresponsive.

Once this happens, any access to the pool -- read or write -- will hang. iostat shows no disk activity of any kind on the zpool devices.

This has happened on multiple servers. All hardware is happy and healthy and tests OK, and prior to this update, everything was glassy smooth for months.

This problem most frequently happens while there's a "zfs send" in progress, but it can happen at other times as well, and not every send provokes it. Overall, the busier a server is, the more likely it is to happen.

Here are some stack traces for stuck processes in dmesg:

A z_wr_int:

[132001.296225] INFO: task z_wr_int_4:1230 blocked for more than 120 seconds.
[132001.296847]       Tainted: P           OX 3.13.0-71-generic #114-Ubuntu
[132001.297415] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[132001.298094] z_wr_int_4      D ffff88043fd13180     0  1230      2 0x00000000
[132001.298098]  ffff880035dbbc00 0000000000000046 ffff880035db1800 ffff880035dbbfd8
[132001.298101]  0000000000013180 0000000000013180 ffff880035db1800 ffffffffa025b3b0
[132001.298103]  ffffffffa025b3b4 ffff880035db1800 00000000ffffffff ffffffffa025b3b8
[132001.298106] Call Trace:
[132001.298120]  [<ffffffff81729189>] schedule_preempt_disabled+0x29/0x70
[132001.298123]  [<ffffffff8172aff5>] __mutex_lock_slowpath+0x135/0x1b0
[132001.298126]  [<ffffffff8172b08f>] mutex_lock+0x1f/0x2f
[132001.298201]  [<ffffffffa00ed4ef>] buf_hash_insert+0x7f/0x190 [zfs]
[132001.298215]  [<ffffffffa00f0b41>] arc_write_done+0x91/0x3e0 [zfs]
[132001.298274]  [<ffffffffa01a20d9>] zio_done+0x2c9/0xe60 [zfs]
[132001.298279]  [<ffffffff811a291d>] ? kfree+0xfd/0x140
[132001.298295]  [<ffffffffa004075a>] ? spl_kmem_free+0x2a/0x40 [spl]
[132001.298326]  [<ffffffffa0161de0>] ? vdev_mirror_map_free+0x20/0x30 [zfs]
[132001.298359]  [<ffffffffa01a269a>] zio_done+0x88a/0xe60 [zfs]
[132001.298396]  [<ffffffffa019dfe8>] zio_execute+0xc8/0x180 [zfs]
[132001.298402]  [<ffffffffa0043fad>] taskq_thread+0x20d/0x410 [spl]
[132001.298408]  [<ffffffff8109ace0>] ? wake_up_state+0x20/0x20
[132001.298413]  [<ffffffffa0043da0>] ? taskq_cancel_id+0x120/0x120 [spl]
[132001.298416]  [<ffffffff8108b822>] kthread+0xd2/0xf0
[132001.298419]  [<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[132001.298422]  [<ffffffff817353a8>] ret_from_fork+0x58/0x90
[132001.298424]  [<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

A txg_sync:

[132001.298533] INFO: task txg_sync:1403 blocked for more than 120 seconds.
[132001.299120]       Tainted: P           OX 3.13.0-71-generic #114-Ubuntu
[132001.299695] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[132001.300427] txg_sync        D ffff88043fc13180     0  1403      2 0x00000000
[132001.300432]  ffff880423f3fbc8 0000000000000046 ffff880426486000 ffff880423f3ffd8
[132001.300435]  0000000000013180 0000000000013180 ffff880426486000 ffff88043fc13a18
[132001.300447]  ffff8804283cc328 ffff8804283cc368 ffff8804283cc350 0000000000000001
[132001.300461] Call Trace:
[132001.300474]  [<ffffffff81728f8d>] io_schedule+0x9d/0x140
[132001.300492]  [<ffffffffa0047b0f>] cv_wait_common+0x9f/0x120 [spl]
[132001.300499]  [<ffffffff810ab3e0>] ? prepare_to_wait_event+0x100/0x100
[132001.300506]  [<ffffffffa0047be8>] __cv_wait_io+0x18/0x20 [spl]
[132001.300551]  [<ffffffffa01a0213>] zio_wait+0x123/0x210 [zfs]
[132001.300589]  [<ffffffffa012a7c1>] dsl_pool_sync+0xb1/0x470 [zfs]
[132001.300624]  [<ffffffffa01452b5>] spa_sync+0x365/0xb20 [zfs]
[132001.300633]  [<ffffffff810ab3f2>] ? autoremove_wake_function+0x12/0x40
[132001.300640]  [<ffffffff8104f2bf>] ? kvm_clock_get_cycles+0x1f/0x30
[132001.300677]  [<ffffffffa0157639>] txg_sync_thread+0x3b9/0x620 [zfs]
[132001.300715]  [<ffffffffa0157280>] ? txg_quiesce_thread+0x3f0/0x3f0 [zfs]
[132001.300733]  [<ffffffffa0042e81>] thread_generic_wrapper+0x71/0x80 [spl]
[132001.300740]  [<ffffffffa0042e10>] ? __thread_exit+0x20/0x20 [spl]
[132001.300744]  [<ffffffff8108b822>] kthread+0xd2/0xf0
[132001.300748]  [<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[132001.300752]  [<ffffffff817353a8>] ret_from_fork+0x58/0x90
[132001.300756]  [<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

An nfsd process (these form the bulk of any hang as that's what these servers do):

[132001.300765] INFO: task nfsd:2308 blocked for more than 120 seconds.
[132001.301464]       Tainted: P           OX 3.13.0-71-generic #114-Ubuntu
[132001.302073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[132001.302766] nfsd            D ffff88043fd13180     0  2308      2 0x00000000
[132001.302769]  ffff8804256499a0 0000000000000046 ffff8804282de000 ffff880425649fd8
[132001.302772]  0000000000013180 0000000000013180 ffff8804282de000 ffff88042088e808
[132001.302774]  ffff88042088e80c ffff8804282de000 00000000ffffffff ffff88042088e810
[132001.302776] Call Trace:
[132001.302781]  [<ffffffff81729189>] schedule_preempt_disabled+0x29/0x70
[132001.302783]  [<ffffffff8172aff5>] __mutex_lock_slowpath+0x135/0x1b0
[132001.302786]  [<ffffffff8172b08f>] mutex_lock+0x1f/0x2f
[132001.302820]  [<ffffffffa019649b>] zfs_zget+0x17b/0x320 [zfs]
[132001.302852]  [<ffffffffa017556a>] zfs_dirent_lock+0x49a/0x580 [zfs]
[132001.302885]  [<ffffffffa01756d1>] zfs_dirlook+0x81/0x310 [zfs]
[132001.302921]  [<ffffffffa018a6e5>] zfs_lookup+0x2c5/0x310 [zfs]
[132001.302961]  [<ffffffffa01a6cce>] zpl_lookup+0x8e/0x110 [zfs]
[132001.302966]  [<ffffffff811c817d>] lookup_real+0x1d/0x50
[132001.302968]  [<ffffffff811c8ae3>] __lookup_hash+0x33/0x40
[132001.302971]  [<ffffffff811c98b6>] lookup_one_len+0xe6/0x140
[132001.302988]  [<ffffffffa04627de>] nfsd_lookup_dentry+0x11e/0x490 [nfsd]
[132001.302992]  [<ffffffff810cd390>] ? getboottime+0x30/0x40
[132001.302997]  [<ffffffffa0462bb9>] nfsd_lookup+0x69/0x130 [nfsd]
[132001.303004]  [<ffffffffa046c1b1>] nfsd3_proc_lookup+0xe1/0x1a0 [nfsd]
[132001.303008]  [<ffffffffa045dd3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[132001.303023]  [<ffffffffa03a163d>] svc_process_common+0x46d/0x6d0 [sunrpc]
[132001.303032]  [<ffffffffa03a19a7>] svc_process+0x107/0x170 [sunrpc]
[132001.303037]  [<ffffffffa045d71f>] nfsd+0xbf/0x130 [nfsd]
[132001.303041]  [<ffffffffa045d660>] ? nfsd_destroy+0x80/0x80 [nfsd]
[132001.303044]  [<ffffffff8108b822>] kthread+0xd2/0xf0
[132001.303047]  [<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[132001.303049]  [<ffffffff817353a8>] ret_from_fork+0x58/0x90
[132001.303052]  [<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

And, from a different incident, kswapd:

[55921.404023] INFO: task kswapd0:36 blocked for more than 120 seconds.
[55921.404972]       Tainted: P           OX 3.13.0-71-generic #114-Ubuntu
[55921.405786] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[55921.406778] kswapd0         D ffff88053fd13180     0    36      2 0x00000000
[55921.406783]  ffff880522157ab0 0000000000000046 ffff880522159800 ffff880522157fd8
[55921.406785]  0000000000013180 0000000000013180 ffff880522159800 ffff88050fff4a10
[55921.406788]  ffff88050fff4a14 ffff880522159800 00000000ffffffff ffff88050fff4a18
[55921.406790] Call Trace:
[55921.406799]  [<ffffffff81729189>] schedule_preempt_disabled+0x29/0x70
[55921.406802]  [<ffffffff8172aff5>] __mutex_lock_slowpath+0x135/0x1b0
[55921.406804]  [<ffffffff8172b08f>] mutex_lock+0x1f/0x2f
[55921.406848]  [<ffffffffa019ce32>] zfs_zinactive+0x52/0x180 [zfs]
[55921.406881]  [<ffffffffa0196857>] zfs_inactive+0x67/0x240 [zfs]
[55921.406885]  [<ffffffff8115f0b9>] ? truncate_pagecache+0x59/0x60
[55921.406918]  [<ffffffffa01ad6a3>] zpl_evict_inode+0x43/0x60 [zfs]
[55921.406922]  [<ffffffff811d9880>] evict+0xb0/0x1b0
[55921.406925]  [<ffffffff811d99b9>] dispose_list+0x39/0x50
[55921.406927]  [<ffffffff811da8e7>] prune_icache_sb+0x47/0x60
[55921.406930]  [<ffffffff811c1c15>] super_cache_scan+0x105/0x170
[55921.406933]  [<ffffffff811617b7>] shrink_slab+0x1c7/0x370
[55921.406935]  [<ffffffff81165328>] balance_pgdat+0x3e8/0x610
[55921.406938]  [<ffffffff811656ab>] kswapd+0x15b/0x3f0
[55921.406942]  [<ffffffff810ab3e0>] ? prepare_to_wait_event+0x100/0x100
[55921.406944]  [<ffffffff81165550>] ? balance_pgdat+0x610/0x610
[55921.406948]  [<ffffffff8108b822>] kthread+0xd2/0xf0
[55921.406950]  [<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[55921.406954]  [<ffffffff817353a8>] ret_from_fork+0x58/0x90
[55921.406956]  [<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

What these traces seem to have in common is that they're almost always blocked on a mutex (the txg_sync trace above being a rare exception). So far, we haven't been able to identify which process is holding the mutex and failing to release it.

What can we do to help resolve this?

@behlendorf
Contributor

@jdavidlists you're definitely on the right track; we need to identify why the mutex isn't being released. If you could post backtraces from all the processes, we may be able to spot the deadlock by looking at the stacks. If the system is still responsive to sysrq, echo t >/proc/sysrq-trigger and grab the console log.
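For example, something along these lines should capture it (the log file name is arbitrary; if the dump gets truncated you may need a larger kernel log buffer or a serial/net console):

# Allow the task-dump sysrq, trigger it, then save the kernel log.
echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger
dmesg > /tmp/sysrq-t.$(date +%s).log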

@jdavidlists
Author

OK, that worked, but it seems like it didn't grab the stacks of all processes. There are 626 processes listed under the CPUs' "runnable tasks" list but only 142 stack traces:

 118 nfsd
  14 kworker
   3 bash
   2 sshd
   1 sudo
   1 rpc.mountd
   1 login
   1 kauditd
   1 getty

That's not a good ratio, but we'll see what we get.
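For reference, this is roughly how the per-command counts were produced, plus the filter used for the trimming described below (the file name is just an example, and the task-header pattern is a guess at the sysrq output format):

# Per-command counts of task headers in the captured sysrq-t output:
awk '/ [RSD] .*[0-9]+ +[0-9]+ +0x[0-9a-f]+$/ {print $1}' sysrq-t.log | sort | uniq -c | sort -rn

# Keep only the blank-line-separated traces that mention the zfs or spl modules:
awk -v RS='' '/\[(zfs|spl)\]/' sysrq-t.log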

That's still a lot of data, so rather than spam a huge amount of unformatted info, it might be helpful to thin it down. It seems safe to ignore the 113 nfsd's sitting in schedule_timeout(). Next, it seems safe to omit any stack trace that doesn't touch either the zfs or spl modules. That gets us down to just 4 nfsd processes of potential interest. Here they are:

nfsd            R  running task        0  2320      2 0x00000000
ffff8804f71010d8 0000000000000046 ffff8804f8e13000 ffff8804f7101fd8
0000000000013180 0000000000013180 ffff8804f8e13000 0000000000000080
000000000000013d ffff8804f7101218 ffffffffa0494ce0 0000000000000000
Call Trace:
[<ffffffff81729059>] _cond_resched+0x29/0x40
[<ffffffff811617a4>] shrink_slab+0x1b4/0x370
[<ffffffff8116491d>] do_try_to_free_pages+0x3ed/0x540
[<ffffffff81164b5c>] try_to_free_pages+0xec/0x180
[<ffffffff811597d5>] __alloc_pages_nodemask+0x7d5/0xb80
[<ffffffff81197fd3>] alloc_pages_current+0xa3/0x160
[<ffffffff811a0a55>] new_slab+0x295/0x320
[<ffffffff81720c1b>] __slab_alloc+0x2a8/0x459
[<ffffffffa0047912>] ? spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffff8172b082>] ? mutex_lock+0x12/0x2f
[<ffffffff811a35cb>] kmem_cache_alloc+0x18b/0x1e0
[<ffffffffa0047912>] ? spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffffa0047912>] spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffff811a3475>] ? kmem_cache_alloc+0x35/0x1e0
[<ffffffffa0047912>] ? spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffffa00479a4>] ? spl_kmem_cache_alloc+0x134/0x730 [spl]
[<ffffffff81729059>] ? _cond_resched+0x29/0x40
[<ffffffffa01a22a7>] zio_buf_alloc+0x57/0x60 [zfs]
[<ffffffffa00f5989>] arc_get_data_buf.isra.21+0x279/0x3f0 [zfs]
[<ffffffffa00f9802>] arc_read+0x312/0xa80 [zfs]
[<ffffffffa010246e>] dbuf_prefetch+0x1be/0x2c0 [zfs]
[<ffffffff811a3475>] ? kmem_cache_alloc+0x35/0x1e0
[<ffffffffa0119644>] dmu_zfetch_dofetch.isra.6+0x114/0x180 [zfs]
[<ffffffffa0119dcf>] dmu_zfetch+0x4cf/0xe80 [zfs]
[<ffffffffa01017af>] dbuf_read+0x3ef/0x950 [zfs]
[<ffffffffa011b8fa>] dnode_hold_impl+0xca/0x600 [zfs]
[<ffffffffa01711ca>] ? mze_compare+0xa/0x40 [zfs]
[<ffffffffa011be49>] dnode_hold+0x19/0x20 [zfs]
[<ffffffffa01098d7>] dmu_bonus_hold+0x27/0x290 [zfs]
[<ffffffffa0142f5e>] sa_buf_hold+0xe/0x10 [zfs]
[<ffffffffa019c4c1>] zfs_zget+0x1a1/0x320 [zfs]
[<ffffffffa017b56a>] zfs_dirent_lock+0x49a/0x580 [zfs]
[<ffffffffa017b6d1>] zfs_dirlook+0x81/0x310 [zfs]
[<ffffffffa01906e5>] zfs_lookup+0x2c5/0x310 [zfs]
[<ffffffffa01accce>] zpl_lookup+0x8e/0x110 [zfs]
[<ffffffff811c817d>] lookup_real+0x1d/0x50
[<ffffffff811c8ae3>] __lookup_hash+0x33/0x40
[<ffffffff811c98b6>] lookup_one_len+0xe6/0x140
[<ffffffffa04637de>] nfsd_lookup_dentry+0x11e/0x490 [nfsd]
[<ffffffffa0463bb9>] nfsd_lookup+0x69/0x130 [nfsd]
[<ffffffffa046d1b1>] nfsd3_proc_lookup+0xe1/0x1a0 [nfsd]
[<ffffffffa045ed3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[<ffffffffa03b863d>] svc_process_common+0x46d/0x6d0 [sunrpc]
[<ffffffffa03b89a7>] svc_process+0x107/0x170 [sunrpc]
[<ffffffffa045e71f>] nfsd+0xbf/0x130 [nfsd]
[<ffffffffa045e660>] ? nfsd_destroy+0x80/0x80 [nfsd]
[<ffffffff8108b822>] kthread+0xd2/0xf0
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[<ffffffff817353a8>] ret_from_fork+0x58/0x90
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

nfsd            R  running task        0  2372      2 0x00000000
ffff8804f4713038 0000000000000046 ffff8804f45f9800 ffff8804f4713fd8
0000000000013180 0000000000013180 ffff8804f45f9800 ffff8804f47133f8
ffffea0013e439e0 ffff8804f47131d8 ffffea0013e439c0 0000000000000000
Call Trace:
[<ffffffff81729059>] _cond_resched+0x29/0x40
[<ffffffff8116222d>] shrink_page_list+0x11d/0xa90
[<ffffffff81163222>] shrink_inactive_list+0x1c2/0x530
[<ffffffff81163d15>] shrink_lruvec+0x355/0x6f0
[<ffffffff81164116>] shrink_zone+0x66/0x1a0
[<ffffffff81164610>] do_try_to_free_pages+0xe0/0x540
[<ffffffff81164b5c>] try_to_free_pages+0xec/0x180
[<ffffffff811597d5>] __alloc_pages_nodemask+0x7d5/0xb80
[<ffffffff81197fd3>] alloc_pages_current+0xa3/0x160
[<ffffffff811a0a55>] new_slab+0x295/0x320
[<ffffffff81720c1b>] __slab_alloc+0x2a8/0x459
[<ffffffffa0047912>] ? spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffffa0047912>] ? spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffff811a35cb>] kmem_cache_alloc+0x18b/0x1e0
[<ffffffffa0047912>] ? spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffffa0047912>] spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffff811a3475>] ? kmem_cache_alloc+0x35/0x1e0
[<ffffffffa0047912>] ? spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffffa00479a4>] ? spl_kmem_cache_alloc+0x134/0x730 [spl]
[<ffffffffa01a22a7>] zio_buf_alloc+0x57/0x60 [zfs]
[<ffffffffa00f5989>] arc_get_data_buf.isra.21+0x279/0x3f0 [zfs]
[<ffffffffa00f9802>] arc_read+0x312/0xa80 [zfs]
[<ffffffffa01a5441>] ? zio_null+0x61/0x70 [zfs]
[<ffffffffa00ffb90>] ? dbuf_rele_and_unlock+0x3e0/0x3e0 [zfs]
[<ffffffffa0101668>] dbuf_read+0x2a8/0x950 [zfs]
[<ffffffffa011b8fa>] dnode_hold_impl+0xca/0x600 [zfs]
[<ffffffffa010bcf7>] ? dmu_object_info_from_db+0x37/0x40 [zfs]
[<ffffffff8172b2b2>] ? down_read+0x12/0x30
[<ffffffffa011be49>] dnode_hold+0x19/0x20 [zfs]
[<ffffffffa01098d7>] dmu_bonus_hold+0x27/0x290 [zfs]
[<ffffffffa0142f5e>] sa_buf_hold+0xe/0x10 [zfs]
[<ffffffffa019c4c1>] zfs_zget+0x1a1/0x320 [zfs]
[<ffffffffa017b56a>] zfs_dirent_lock+0x49a/0x580 [zfs]
[<ffffffffa017b6d1>] zfs_dirlook+0x81/0x310 [zfs]
[<ffffffffa01906e5>] zfs_lookup+0x2c5/0x310 [zfs]
[<ffffffffa01accce>] zpl_lookup+0x8e/0x110 [zfs]
[<ffffffff811c817d>] lookup_real+0x1d/0x50
[<ffffffff811c8ae3>] __lookup_hash+0x33/0x40
[<ffffffff811c98b6>] lookup_one_len+0xe6/0x140
[<ffffffffa04637de>] nfsd_lookup_dentry+0x11e/0x490 [nfsd]
[<ffffffffa0463bb9>] nfsd_lookup+0x69/0x130 [nfsd]
[<ffffffffa046d1b1>] nfsd3_proc_lookup+0xe1/0x1a0 [nfsd]
[<ffffffffa045ed3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[<ffffffffa03b863d>] svc_process_common+0x46d/0x6d0 [sunrpc]
[<ffffffffa03b89a7>] svc_process+0x107/0x170 [sunrpc]
[<ffffffffa045e71f>] nfsd+0xbf/0x130 [nfsd]
[<ffffffffa045e660>] ? nfsd_destroy+0x80/0x80 [nfsd]
[<ffffffff8108b822>] kthread+0xd2/0xf0
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[<ffffffff817353a8>] ret_from_fork+0x58/0x90
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

nfsd            D ffff88053fd13180     0  2373      2 0x00000000
ffff8804f400da40 0000000000000046 ffff8804f45fb000 ffff8804f400dfd8
0000000000013180 0000000000013180 ffff8804f45fb000 ffff8800b70215a8
ffff8800b70215ac ffff8804f45fb000 00000000ffffffff ffff8800b70215b0
Call Trace:
[<ffffffff81729189>] schedule_preempt_disabled+0x29/0x70
[<ffffffff8172aff5>] __mutex_lock_slowpath+0x135/0x1b0
[<ffffffff8172b08f>] mutex_lock+0x1f/0x2f
[<ffffffffa019c49b>] zfs_zget+0x17b/0x320 [zfs]
[<ffffffffa004675a>] ? spl_kmem_free+0x2a/0x40 [spl]
[<ffffffffa017ae85>] ? __dprintf+0xc5/0xd0 [zfs]
[<ffffffffa0141991>] ? rrw_enter_read_impl+0xc1/0x170 [zfs]
[<ffffffffa018d6b0>] zfs_vget+0x230/0x3e0 [zfs]
[<ffffffffa01ab832>] zpl_fh_to_dentry+0x72/0xb0 [zfs]
[<ffffffffa04618d0>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd]
[<ffffffff812ad1c2>] exportfs_decode_fh+0x72/0x2e0
[<ffffffff810cd390>] ? getboottime+0x30/0x40
[<ffffffffa03c26b7>] ? cache_check+0xf7/0x3d0 [sunrpc]
[<ffffffffa0467e8c>] ? exp_find+0xfc/0x1b0 [nfsd]
[<ffffffff811a37e6>] ? kmem_cache_alloc_trace+0x1c6/0x1f0
[<ffffffff8130cec7>] ? aa_alloc_task_context+0x27/0x40
[<ffffffffa04625f6>] fh_verify+0x2f6/0x5e0 [nfsd]
[<ffffffffa046ce19>] nfsd3_proc_getattr+0x69/0xe0 [nfsd]
[<ffffffffa045ed3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[<ffffffffa03b863d>] svc_process_common+0x46d/0x6d0 [sunrpc]
[<ffffffffa03b89a7>] svc_process+0x107/0x170 [sunrpc]
[<ffffffffa045e71f>] nfsd+0xbf/0x130 [nfsd]
[<ffffffffa045e660>] ? nfsd_destroy+0x80/0x80 [nfsd]
[<ffffffff8108b822>] kthread+0xd2/0xf0
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[<ffffffff817353a8>] ret_from_fork+0x58/0x90
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

nfsd            R  running task        0  2385      2 0x00000000
ffff8804f3a0af48 0000000000000046 ffff8804f3df6000 ffff8804f3a0bfd8
0000000000013180 0000000000013180 ffff8804f3df6000 ffff8804f3a0b308
ffffea0013e4e8a0 ffff8804f3a0b0e8 ffffea0013e4e880 0000000000000000
Call Trace:
[<ffffffff81729059>] _cond_resched+0x29/0x40
[<ffffffff8116222d>] shrink_page_list+0x11d/0xa90
[<ffffffff81163222>] shrink_inactive_list+0x1c2/0x530
[<ffffffff81163d15>] shrink_lruvec+0x355/0x6f0
[<ffffffff81164116>] shrink_zone+0x66/0x1a0
[<ffffffff81164610>] do_try_to_free_pages+0xe0/0x540
[<ffffffff81164b5c>] try_to_free_pages+0xec/0x180
[<ffffffff811597d5>] __alloc_pages_nodemask+0x7d5/0xb80
[<ffffffff81197fd3>] alloc_pages_current+0xa3/0x160
[<ffffffff811a0a55>] new_slab+0x295/0x320
[<ffffffff81720c1b>] __slab_alloc+0x2a8/0x459
[<ffffffffa0047912>] ? spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffff8104f28f>] ? kvm_clock_read+0x1f/0x30
[<ffffffff811a35cb>] kmem_cache_alloc+0x18b/0x1e0
[<ffffffffa0047912>] ? spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffffa0047912>] spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffff811a3475>] ? kmem_cache_alloc+0x35/0x1e0
[<ffffffffa0047912>] ? spl_kmem_cache_alloc+0xa2/0x730 [spl]
[<ffffffffa00479a4>] ? spl_kmem_cache_alloc+0x134/0x730 [spl]
[<ffffffffa01a22a7>] zio_buf_alloc+0x57/0x60 [zfs]
[<ffffffffa00f5989>] arc_get_data_buf.isra.21+0x279/0x3f0 [zfs]
[<ffffffffa00f9802>] arc_read+0x312/0xa80 [zfs]
[<ffffffffa01a5441>] ? zio_null+0x61/0x70 [zfs]
[<ffffffffa00ffb90>] ? dbuf_rele_and_unlock+0x3e0/0x3e0 [zfs]
[<ffffffffa0101668>] dbuf_read+0x2a8/0x950 [zfs]
[<ffffffffa0102188>] __dbuf_hold_impl+0x478/0x500 [zfs]
[<ffffffffa0102283>] dbuf_hold_impl+0x73/0xa0 [zfs]
[<ffffffffa0102590>] dbuf_hold+0x20/0x40 [zfs]
[<ffffffffa0109c3d>] dmu_buf_hold_array_by_dnode+0xfd/0x4d0 [zfs]
[<ffffffffa00469a7>] ? spl_kmem_alloc+0xc7/0x170 [spl]
[<ffffffffa010a853>] dmu_read_uio_dnode+0x43/0xe0 [zfs]
[<ffffffffa010b827>] dmu_read_uio_dbuf+0x47/0x60 [zfs]
[<ffffffffa018ff7d>] zfs_read+0x13d/0x400 [zfs]
[<ffffffffa01085ae>] ? dmu_object_size_from_db+0x5e/0x80 [zfs]
[<ffffffffa01ac4b1>] zpl_read_common_iovec.constprop.9+0x71/0xc0 [zfs]
[<ffffffffa01ac56f>] zpl_aio_read+0x6f/0x90 [zfs]
[<ffffffff811bdd6c>] do_sync_readv_writev+0x4c/0x80
[<ffffffff811bf230>] do_readv_writev+0xb0/0x220
[<ffffffffa018fc72>] ? zfs_open+0x82/0x100 [zfs]
[<ffffffff81321ec3>] ? ima_get_action+0x23/0x30
[<ffffffff813211a2>] ? process_measurement+0x82/0x2c0
[<ffffffff811bf3cd>] vfs_readv+0x2d/0x50
[<ffffffffa0462aae>] nfsd_vfs_read.isra.12+0x6e/0x160 [nfsd]
[<ffffffffa0463da9>] ? nfsd_open+0xb9/0x190 [nfsd]
[<ffffffffa0464076>] nfsd_read+0x1e6/0x2c0 [nfsd]
[<ffffffffa046de4c>] nfsd3_proc_read+0xcc/0x170 [nfsd]
[<ffffffffa045ed3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[<ffffffffa03b863d>] svc_process_common+0x46d/0x6d0 [sunrpc]
[<ffffffffa03b89a7>] svc_process+0x107/0x170 [sunrpc]
[<ffffffffa045e71f>] nfsd+0xbf/0x130 [nfsd]
[<ffffffffa045e660>] ? nfsd_destroy+0x80/0x80 [nfsd]
[<ffffffff8108b822>] kthread+0xd2/0xf0
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[<ffffffff817353a8>] ret_from_fork+0x58/0x90
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

One of these, pid 2373 (in zfs_zget), is not like the others (in spl_kmem_cache_alloc).

The first process to throw a "120 seconds" warning was kswapd again, with the exact same stack trace as my original message, followed by txg_sync (also same trace as original message), and 8 nfsd's. 6 of the nfsd's have the same stack trace (varying pid):

nfsd            D ffff88053fd13180     0  2313      2 0x00000000
ffff8800b5a67bf8 0000000000000046 ffff8800b58a4800 ffff8800b5a67fd8
0000000000013180 0000000000013180 ffff8800b58a4800 ffff8800bbb0d180
ffff8800bbb0d000 ffff8800bbb0d188 ffff8800bbb0d028 0000000000000000
Call Trace:
[<ffffffff81728c69>] schedule+0x29/0x70
[<ffffffffa004db55>] cv_wait_common+0xe5/0x120 [spl]
[<ffffffff810ab3e0>] ? prepare_to_wait_event+0x100/0x100
[<ffffffffa004dba5>] __cv_wait+0x15/0x20 [spl]
[<ffffffffa019fb5b>] zil_commit.part.11+0x7b/0x7e0 [zfs]
[<ffffffffa004ea3f>] ? tsd_hash_search.isra.1+0x9f/0xd0 [spl]
[<ffffffff8172b082>] ? mutex_lock+0x12/0x2f
[<ffffffffa01a02d7>] zil_commit+0x17/0x20 [zfs]
[<ffffffffa019645a>] zfs_fsync+0x7a/0xf0 [zfs]
[<ffffffffa01abb95>] zpl_fsync+0x65/0x90 [zfs]
[<ffffffff811ee398>] vfs_fsync_range+0x18/0x30
[<ffffffffa04641f1>] nfsd_commit+0xa1/0xc0 [nfsd]
[<ffffffffa03bca7e>] ? svcauth_unix_set_client+0x33e/0x580 [sunrpc]
[<ffffffffa046d7b6>] nfsd3_proc_commit+0x96/0xf0 [nfsd]
[<ffffffffa045ed3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[<ffffffffa03b863d>] svc_process_common+0x46d/0x6d0 [sunrpc]
[<ffffffffa03b89a7>] svc_process+0x107/0x170 [sunrpc]
[<ffffffffa045e71f>] nfsd+0xbf/0x130 [nfsd]
[<ffffffffa045e660>] ? nfsd_destroy+0x80/0x80 [nfsd]
[<ffffffff8108b822>] kthread+0xd2/0xf0
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[<ffffffff817353a8>] ret_from_fork+0x58/0x90
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

Note that one of the 6 processes that reported "120 seconds blocked" with the above trace is 2320, which appears in the (subsequent) sysrq output with a different trace. So it did eventually make at least some progress.

The other two nfsd traces for "blocked" processes were:

nfsd            D ffff88053fd13180     0  2310      2 0x00000000
ffff8804f79577f0 0000000000000046 ffff8804f8df8000 ffff8804f7957fd8
0000000000013180 0000000000013180 ffff8804f8df8000 ffff880412d249b0
ffff880412d249b4 ffff8804f8df8000 00000000ffffffff ffff880412d249b8
Call Trace:
[<ffffffff81729189>] schedule_preempt_disabled+0x29/0x70
[<ffffffff8172aff5>] __mutex_lock_slowpath+0x135/0x1b0
[<ffffffff8172b08f>] mutex_lock+0x1f/0x2f
[<ffffffffa0119ab0>] dmu_zfetch+0x1b0/0xe80 [zfs]
[<ffffffffa0101b9d>] dbuf_read+0x7dd/0x950 [zfs]
[<ffffffffa011b8fa>] dnode_hold_impl+0xca/0x600 [zfs]
[<ffffffffa011be49>] dnode_hold+0x19/0x20 [zfs]
[<ffffffffa01098d7>] dmu_bonus_hold+0x27/0x290 [zfs]
[<ffffffff811a291d>] ? kfree+0xfd/0x140
[<ffffffffa0142f5e>] sa_buf_hold+0xe/0x10 [zfs]
[<ffffffffa019c4c1>] zfs_zget+0x1a1/0x320 [zfs]
[<ffffffffa004675a>] ? spl_kmem_free+0x2a/0x40 [spl]
[<ffffffffa017ae85>] ? __dprintf+0xc5/0xd0 [zfs]
[<ffffffffa0141991>] ? rrw_enter_read_impl+0xc1/0x170 [zfs]
[<ffffffffa018d6b0>] zfs_vget+0x230/0x3e0 [zfs]
[<ffffffffa01ab832>] zpl_fh_to_dentry+0x72/0xb0 [zfs]
[<ffffffffa04618d0>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd]
[<ffffffff812ad1c2>] exportfs_decode_fh+0x72/0x2e0
[<ffffffff810cd390>] ? getboottime+0x30/0x40
[<ffffffffa03c26b7>] ? cache_check+0xf7/0x3d0 [sunrpc]
[<ffffffffa0467e8c>] ? exp_find+0xfc/0x1b0 [nfsd]
[<ffffffff810a43cc>] ? check_preempt_wakeup+0x17c/0x280
[<ffffffff811a37e6>] ? kmem_cache_alloc_trace+0x1c6/0x1f0
[<ffffffff8130cec7>] ? aa_alloc_task_context+0x27/0x40
[<ffffffffa04625f6>] fh_verify+0x2f6/0x5e0 [nfsd]
[<ffffffffa046ce19>] nfsd3_proc_getattr+0x69/0xe0 [nfsd]
[<ffffffffa045ed3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[<ffffffffa03b863d>] svc_process_common+0x46d/0x6d0 [sunrpc]
[<ffffffffa03b89a7>] svc_process+0x107/0x170 [sunrpc]
[<ffffffffa045e71f>] nfsd+0xbf/0x130 [nfsd]
[<ffffffffa045e660>] ? nfsd_destroy+0x80/0x80 [nfsd]
[<ffffffff8108b822>] kthread+0xd2/0xf0
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[<ffffffff817353a8>] ret_from_fork+0x58/0x90
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

and this one, which is similar to but not the same as the other six in zil_commit:

nfsd            D ffff88053fd13180     0  2311      2 0x00000000
ffff8804f7a59bb0 0000000000000046 ffff8800b58a3000 ffff8804f7a59fd8
0000000000013180 0000000000013180 ffff8800b58a3000 ffff88053fd13a18
ffff880167987328 ffff880167987368 ffff880167987350 0000000000000001
Call Trace:
[<ffffffff81728f8d>] io_schedule+0x9d/0x140
[<ffffffffa004db0f>] cv_wait_common+0x9f/0x120 [spl]
[<ffffffff810ab3e0>] ? prepare_to_wait_event+0x100/0x100
[<ffffffffa004dbe8>] __cv_wait_io+0x18/0x20 [spl]
[<ffffffffa01a6213>] zio_wait+0x123/0x210 [zfs]
[<ffffffffa019ff31>] zil_commit.part.11+0x451/0x7e0 [zfs]
[<ffffffffa01a02d7>] zil_commit+0x17/0x20 [zfs]
[<ffffffffa019645a>] zfs_fsync+0x7a/0xf0 [zfs]
[<ffffffffa01ab645>] zpl_commit_metadata+0x55/0x80 [zfs]
[<ffffffffa0462a28>] commit_metadata.isra.11+0x28/0x40 [nfsd]
[<ffffffffa0464fde>] nfsd_unlink+0x1ee/0x240 [nfsd]
[<ffffffffa046cf0c>] nfsd3_proc_remove+0x7c/0x120 [nfsd]
[<ffffffffa045ed3b>] nfsd_dispatch+0xbb/0x200 [nfsd]
[<ffffffffa03b863d>] svc_process_common+0x46d/0x6d0 [sunrpc]
[<ffffffffa03b89a7>] svc_process+0x107/0x170 [sunrpc]
[<ffffffffa045e71f>] nfsd+0xbf/0x130 [nfsd]
[<ffffffffa045e660>] ? nfsd_destroy+0x80/0x80 [nfsd]
[<ffffffff8108b822>] kthread+0xd2/0xf0
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[<ffffffff817353a8>] ret_from_fork+0x58/0x90
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

Perhaps all of these things have a particular mutex (or condition variable) in common?

The ZFS debug log was also enabled, although not too useful as it wrapped around and overwrote itself before we detected the problem and saved it, but it is from the "dead" time. It's 59147 lines, 32 of which are not "getting." They are:

1 0 0x01 -1 0 2888416667 41037111748797
timestamp    message 
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310551   os=ffff88050660f000 obj=7075984 txg=7350251 blocksize=512 ibs=14
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310551   os=ffff8800b8c39000 obj=64322 txg=7350251 blocksize=512 ibs=14
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310551   os=ffff880518039800 obj=864715 txg=7350251 blocksize=512 ibs=14
1450310551   hdr_recl called
1450310551   hdr_recl called
1450310552   hdr_recl called
1450310552   hdr_recl called

So, my understanding is very limited, but the situation is something like: somebody took a mutex, then asked for memory and didn't get it and got put on hold, and everybody else piled up behind them waiting for the mutex? The machine doesn't seem low on memory, but that would match the total lack of any visible disk activity. (I.e., if it were waiting for I/O, I would expect to see horrifying things in iostat -x or zpool iostat, but all that's there in this state are zeroes.)

Is any of this helpful? The raw data is still available if this is over-trimmed, but it's a giant wall of text that doesn't seem to have much content. If there's another relevant process, it's likely the sysrq didn't catch it.

Where do we go from here? Whatever we can do to get more information, we'll do, starting with gathering more detail about memory usage the next time this crops up.
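Concretely, this is the kind of snapshot we're planning to grab when it next happens (all standard Linux/ZoL locations, nothing exotic):

# One-shot memory/ARC snapshot, appended to a log with a timestamp:
{ date; free -m; cat /proc/meminfo /proc/spl/kstat/zfs/arcstats; head -40 /proc/slabinfo; } >> /var/tmp/zfs-mem.log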

Thanks!

@jdavidlists
Author

Just for the sake of completeness, the 113 nfsd processes I left out all look like this:

nfsd            S ffff88053fd13180     0  2321      2 0x00000000
ffff8804f71fbd30 0000000000000046 ffff8804f8e14800 ffff8804f71fbfd8
0000000000013180 0000000000013180 ffff8804f8e14800 ffff880522e88000
ffff8804f71fbd60 0000000100a89dda ffff880522e88000 000000000000d158
Call Trace:
[<ffffffff81728c69>] schedule+0x29/0x70
[<ffffffff81727df5>] schedule_timeout+0x175/0x2d0
[<ffffffff81074510>] ? ftrace_raw_output_tick_stop+0x70/0x70
[<ffffffffa03c7d59>] svc_get_next_xprt+0x169/0x310 [sunrpc]
[<ffffffff8109ace0>] ? wake_up_state+0x20/0x20
[<ffffffffa03c8072>] svc_recv+0xc2/0x5c0 [sunrpc]
[<ffffffffa045e70d>] nfsd+0xad/0x130 [nfsd]
[<ffffffffa045e660>] ? nfsd_destroy+0x80/0x80 [nfsd]
[<ffffffff8108b822>] kthread+0xd2/0xf0
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0
[<ffffffff817353a8>] ret_from_fork+0x58/0x90
[<ffffffff8108b750>] ? kthread_create_on_node+0x1c0/0x1c0

So they appear to be idle, which is why they got left out.

@siebenmann
Contributor

Looking at the code involved here, it at least looks like there is a significant possibility of a mutex deadlock between zfs_zinactive() and anything else that uses ZFS_OBJ_HOLD_ENTER(). zfs_zinactive() is clearly being called essentially asynchronously by the virtual memory system as part of reclaiming memory (and comments in zfs_vnops.c note that it may kick off I/O of its own), but at the same time zfs_zget() and other users of ZFS_OBJ_HOLD_ENTER() do I/O themselves via e.g. sa_buf_hold(), which can allocate memory, create memory pressure of its own, and wind up waiting on it (which is what is seen in the stack traces).

(This doesn't explain all of the mutex waits in the traces, eg dmu_zfetch().)

It strikes me as potentially significant that NFS is one of the few cases where you can have requests for an inode show up essentially out of the blue. For local IO, active inodes are tied to open files, directory lookups, and so on, and so will presumably always be held busy and non-evictable. However, NFS requests are purely directed against inodes and there is nothing to hold those active inodes busy on the server between requests. Is it possible that there's some sort of scenario where zfs_zget() is trying to revive an inode for NFS just as it's being evicted for memory pressure?

@perfinion
Contributor

I have a lockup on my laptop that seems related to this. After rebooting I retrieved several things from pstore. Here are a few of the stack traces that seem related to locking (the others are mostly watchdog output). I was doing a lot of disk I/O at the time, but this machine is not an NFS server (it is a client, though it was not actively using the mounts). I have also been getting several lockups on my server, but have not managed to get any stack traces from it.

Panic#1 Part2
<4>[82891.072586]  [<ffffffffc027bad0>] ? dmu_ot+0xcd0/0x2b40 [zfs]
<4>[82891.072588]  [<ffffffffc01c54dc>] dbuf_hold+0x2c/0x50 [zfs]
<4>[82891.072590]  [<ffffffffc01de9e3>] dnode_hold_impl+0x93/0x5a0 [zfs]
<4>[82891.072592]  [<ffffffffa7bc8be9>] ? _raw_spin_unlock+0x19/0x40
<4>[82891.072594]  [<ffffffffc027acbb>] ? ddt_zap_ops+0x33b/0x3c0 [zfs]
<4>[82891.072596]  [<ffffffffc01def1d>] dnode_hold+0x2d/0x40 [zfs]
<4>[82891.072598]  [<ffffffffc01cccbe>] dmu_bonus_hold+0x2e/0x280 [zfs]
<4>[82891.072600]  [<ffffffffa70f7c54>] ? get_parent_ip+0x14/0x50
<4>[82891.072601]  [<ffffffffc0204e88>] sa_buf_hold+0x28/0x40 [zfs]
<4>[82891.072603]  [<ffffffffc025fe69>] zfs_zget+0x189/0x2e0 [zfs]
<4>[82891.072606]  [<ffffffffc0234465>] ? zap_unlockdir+0x35/0xa0 [zfs]
<4>[82891.072608]  [<ffffffffc023d4c8>] zfs_dirent_lock+0x508/0x6a0 [zfs]
<4>[82891.072610]  [<ffffffffc023d6da>] zfs_dirlook+0x7a/0x2b0 [zfs]
<4>[82891.072611]  [<ffffffffa7bc8be9>] ? _raw_spin_unlock+0x19/0x40
<4>[82891.072613]  [<ffffffffc0254586>] zfs_lookup+0x286/0x2d0 [zfs]
<4>[82891.072615]  [<ffffffffc0270b99>] zpl_follow_link+0x5c9/0x740 [zfs]
<4>[82891.072617]  [<ffffffffa723259b>] lookup_real+0x3b/0x80
<4>[82891.072619]  [<ffffffffa7232a3b>] __lookup_hash+0x3b/0x50
<4>[82891.072622]  [<ffffffffa72358e9>] walk_component+0x1d9/0x450
<4>[82891.072623]  [<ffffffffa7234859>] ? path_init+0x219/0x3e0
<4>[82891.072625]  [<ffffffffa72361b3>] path_lookupat+0x63/0x140
<4>[82891.072627]  [<ffffffffa7238c40>] filename_lookup+0xa0/0x160
<4>[82891.072629]  [<ffffffffa7bc8be9>] ? _raw_spin_unlock+0x19/0x40
<4>[82891.072631]  [<ffffffffa7164e14>] ? __audit_getname+0xa4/0xb0
<4>[82891.072633]  [<ffffffffa7238def>] user_path_at_empty+0x3f/0x60
<4>[82891.072635]  [<ffffffffa722b360>] vfs_fstatat+0x50/0xa0
Panic#1 Part3
<4>[82891.072540] Hardware name: LENOVO 20AQ006HUS/20AQ006HUS, BIOS GJET80WW (2.30 ) 10/20/2014
<4>[82891.072542] task: ffff8802f17d9400 ti: ffff8802f17d9dd0 task.ti: ffff8802f17d9dd0
<4>[82891.072544] RIP: 0010:[<ffffffffc01c1b01>]  [<ffffffffc01c1b01>] dbuf_find+0x71/0x4c0 [zfs]
<4>[82891.072546] RSP: 0018:ffffc9001877b698  EFLAGS: 00000293
<4>[82891.072548] RAX: ffff8802f17d9400 RBX: 000000000003a1f8 RCX: 00000000000000e2
<4>[82891.072550] RDX: 005824d03529b32d RSI: 0000000000000000 RDI: ffffffffc044a038
<4>[82891.072552] RBP: ffffc9001877b6f8 R08: 0000000000000047 R09: ffff8802e6164000
<4>[82891.072555] R10: 0000000000000000 R11: ffff8800c9ff1f38 R12: 0000000000000000
<4>[82891.072557] R13: ffff8800c9e72000 R14: 000000000000109b R15: 000000000008709b
<4>[82891.072559] FS:  0000000000000000(0000) GS:ffff88031e380000(0063) knlGS:00000000db962b40
<4>[82891.072561] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
<4>[82891.072563] CR2: 0000078d4c993000 CR3: 0000000027be3000 CR4: 00000000001406f0
<4>[82891.072564] Stack:
<4>[82891.072566]  0000000000000780 00000000ffffffff ffffc9001877b708 0000000000005337
<4>[82891.072568]  00ffc9001877b6d8 ffffffffc044a038 ffffffffc01240a5 0000008000000000
<4>[82891.072570]  ffff8802f17d9400 ffffffffc027bad0 ffff8802e6164000 0000000000005337
<4>[82891.072573] Call Trace:
<4>[82891.072575]  [<ffffffffc01240a5>] ? spl_kmem_zalloc+0xa5/0x190 [spl]
<4>[82891.072576]  [<ffffffffc027bad0>] ? dmu_ot+0xcd0/0x2b40 [zfs]
<4>[82891.072578]  [<ffffffffc01c4d16>] dbuf_read+0x936/0xd90 [zfs]
<4>[82891.072580]  [<ffffffffc01240a5>] ? spl_kmem_zalloc+0xa5/0x190 [spl]
<4>[82891.072582]  [<ffffffffc027bad0>] ? dmu_ot+0xcd0/0x2b40 [zfs]
<4>[82891.072584]  [<ffffffffc01c51dd>] dbuf_hold_impl+0x6d/0xa0 [zfs]
Panic#1 Part4
<4>[82891.072492]  [<ffffffffc025fe69>] zfs_zget+0x189/0x2e0 [zfs]
<4>[82891.072494]  [<ffffffffc0234465>] ? zap_unlockdir+0x35/0xa0 [zfs]
<4>[82891.072496]  [<ffffffffc023d4c8>] zfs_dirent_lock+0x508/0x6a0 [zfs]
<4>[82891.072498]  [<ffffffffc023d6da>] zfs_dirlook+0x7a/0x2b0 [zfs]
<4>[82891.072500]  [<ffffffffa7bc8be9>] ? _raw_spin_unlock+0x19/0x40
<4>[82891.072502]  [<ffffffffc0254586>] zfs_lookup+0x286/0x2d0 [zfs]
<4>[82891.072504]  [<ffffffffc0270b99>] zpl_follow_link+0x5c9/0x740 [zfs]
<4>[82891.072506]  [<ffffffffa723259b>] lookup_real+0x3b/0x80
<4>[82891.072508]  [<ffffffffa7232a3b>] __lookup_hash+0x3b/0x50
<4>[82891.072510]  [<ffffffffa72358e9>] walk_component+0x1d9/0x450
<4>[82891.072512]  [<ffffffffa72347c0>] ? path_init+0x180/0x3e0
<4>[82891.072514]  [<ffffffffa72361b3>] path_lookupat+0x63/0x140
<4>[82891.072516]  [<ffffffffa7238c40>] filename_lookup+0xa0/0x160
<4>[82891.072518]  [<ffffffffa70f7c54>] ? get_parent_ip+0x14/0x50
<4>[82891.072519]  [<ffffffffa7bc8be9>] ? _raw_spin_unlock+0x19/0x40
<4>[82891.072522]  [<ffffffffa7164e14>] ? __audit_getname+0xa4/0xb0
<4>[82891.072524]  [<ffffffffa7238def>] user_path_at_empty+0x3f/0x60
<4>[82891.072526]  [<ffffffffa722b360>] vfs_fstatat+0x50/0xa0
<4>[82891.072528]  [<ffffffffa722b84c>] SYSC_newlstat+0x2c/0x50
<4>[82891.072529]  [<ffffffffa722bd68>] SyS_newlstat+0x18/0x30
<4>[82891.072531]  [<ffffffffa7bc9699>] entry_SYSCALL_64_fastpath+0x12/0x83
<4>[82891.072534] Code: 89 e5 53 89 fb 48 83 ec 08 8b 15 f7 50 61 01 65 01 1d b0 1c f1 58 85 d2 75 11 65 8b 05 a5 1c f1 58 0f b6 c0 3d f4 00 00 00 7f 3c <65> 8b 05 94 1c f1 58 25 ff ff ff 7f 39 c3 74 0d 48 83 c4 08 5b 
<4>[82891.072536] NMI backtrace for cpu 3
Panic#1 Part5
<4>[82891.072448] NMI backtrace for cpu 2
<4>[82891.072450] CPU: 2 PID: 17857 Comm: git Tainted: P        W  O    4.2.6-hardened-r8 #1
<4>[82891.072452] Hardware name: LENOVO 20AQ006HUS/20AQ006HUS, BIOS GJET80WW (2.30 ) 10/20/2014
<4>[82891.072454] task: ffff8800ba452800 ti: ffff8800ba4531d0 task.ti: ffff8800ba4531d0
<4>[82891.072457] RIP: 0010:[<ffffffffa70f7cbd>]  [<ffffffffa70f7cbd>] preempt_count_add+0x2d/0xa0
<4>[82891.072459] RSP: 0018:ffffc90001f93888  EFLAGS: 00000093
<4>[82891.072461] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
<4>[82891.072463] RDX: 0000000000000000 RSI: ffffffffc027bad0 RDI: 0000000000000001
<4>[82891.072465] RBP: ffffc90001f93898 R08: ffffffffffffc000 R09: ffffc90001f936a8
<4>[82891.072467] R10: ffffffffc0123c00 R11: ffff8800c9ff1f38 R12: 0000000000000202
<4>[82891.072469] R13: ffff8802836194d0 R14: ffff8802836194e8 R15: ffff8801cf364480
<4>[82891.072471] FS:  0000039036e34700(0000) GS:ffff88031e300000(0000) knlGS:0000000000000000
<4>[82891.072473] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[82891.072475] CR2: 0000004ee46b7000 CR3: 0000000027be1000 CR4: 00000000001406f0
<4>[82891.072477] Stack:
<4>[82891.072479]  ffffc90001f938b8 ffff8802836194e8 ffffc90001f938b8 ffffffffa7bc8dd8
<4>[82891.072481]  ffff880270ef5970 ffffc90001f93968 ffffc90001f938f8 ffffffffc01ccd96
<4>[82891.072482]  ffffc90001f938e8 ffff8802836194d0 ffff8800c9e72000 00000000002a08b3
<4>[82891.072484] Call Trace:
<4>[82891.072486]  [<ffffffffa7bc8dd8>] _raw_spin_lock_irqsave+0x18/0x50
<4>[82891.072489]  [<ffffffffc01ccd96>] dmu_bonus_hold+0x106/0x280 [zfs]
<4>[82891.072490]  [<ffffffffc0204e88>] sa_buf_hold+0x28/0x40 [zfs]
Panic#1 Part6
<4>[82891.072401]  [<ffffffffc027acbb>] ? ddt_zap_ops+0x33b/0x3c0 [zfs]
<4>[82891.072402]  [<ffffffffc01def1d>] dnode_hold+0x2d/0x40 [zfs]
<4>[82891.072405]  [<ffffffffc01cccbe>] dmu_bonus_hold+0x2e/0x280 [zfs]
<4>[82891.072407]  [<ffffffffa70f7c54>] ? get_parent_ip+0x14/0x50
<4>[82891.072409]  [<ffffffffc0204e88>] sa_buf_hold+0x28/0x40 [zfs]
<4>[82891.072411]  [<ffffffffc025fe69>] zfs_zget+0x189/0x2e0 [zfs]
<4>[82891.072413]  [<ffffffffc0234465>] ? zap_unlockdir+0x35/0xa0 [zfs]
<4>[82891.072414]  [<ffffffffc023d4c8>] zfs_dirent_lock+0x508/0x6a0 [zfs]
<4>[82891.072416]  [<ffffffffc023d6da>] zfs_dirlook+0x7a/0x2b0 [zfs]
<4>[82891.072418]  [<ffffffffa7bc8be9>] ? _raw_spin_unlock+0x19/0x40
<4>[82891.072420]  [<ffffffffc0254586>] zfs_lookup+0x286/0x2d0 [zfs]
<4>[82891.072422]  [<ffffffffc0270b99>] zpl_follow_link+0x5c9/0x740 [zfs]
<4>[82891.072425]  [<ffffffffa723259b>] lookup_real+0x3b/0x80
<4>[82891.072426]  [<ffffffffa7232a3b>] __lookup_hash+0x3b/0x50
<4>[82891.072428]  [<ffffffffa72358e9>] walk_component+0x1d9/0x450
<4>[82891.072430]  [<ffffffffa7234859>] ? path_init+0x219/0x3e0
<4>[82891.072432]  [<ffffffffa72361b3>] path_lookupat+0x63/0x140
<4>[82891.072434]  [<ffffffffa7238c40>] filename_lookup+0xa0/0x160
<4>[82891.072436]  [<ffffffffa7bc8be9>] ? _raw_spin_unlock+0x19/0x40
<4>[82891.072438]  [<ffffffffa7164e14>] ? __audit_getname+0xa4/0xb0
<4>[82891.072440]  [<ffffffffa7238def>] user_path_at_empty+0x3f/0x60
<4>[82891.072441]  [<ffffffffa7223603>] SyS_access+0xa3/0x220
<4>[82891.072444]  [<ffffffffa7bc9699>] entry_SYSCALL_64_fastpath+0x12/0x83
Panic#1 Part7
<4>[82891.072353] RSP: 0018:ffffc9000c58b6e0  EFLAGS: 00010246
<4>[82891.072355] RAX: 0000000000000000 RBX: ffff8802af503000 RCX: 0000000000000500
<4>[82891.072357] RDX: 0000000000000800 RSI: 0000000000000000 RDI: ffff8802af501300
<4>[82891.072359] RBP: ffffc9000c58b738 R08: ffff88031e2955a0 R09: ffff8802af501000
<4>[82891.072361] R10: ffffffffc0123c00 R11: ffff8800c9ff1f38 R12: 00000000ffffffff
<4>[82891.072363] R13: 000000000000c210 R14: ffff88031dc01400 R15: ffff8802af501000
<4>[82891.072365] FS:  00000324e175e840(0000) GS:ffff88031e280000(0000) knlGS:0000000000000000
<4>[82891.072368] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[82891.072370] CR2: 0000036d80097000 CR3: 0000000027bdf000 CR4: 00000000001406f0
<4>[82891.072372] Stack:
<4>[82891.072374]  ffffffffa720c159 ffffc9000c58b708 ffffffffa70f7c54 ffffffffc01240a5
<4>[82891.072376]  0000000000000780 ffffc9000c58b728 000000000000c210 0000000000000780
<4>[82891.072378]  0000000000000000 0000000000000000 0000000000000000 ffffc9000c58b788
<4>[82891.072379] Call Trace:
<4>[82891.072381]  [<ffffffffa720c159>] ? __kmalloc_node+0x129/0x270
<4>[82891.072383]  [<ffffffffa70f7c54>] ? get_parent_ip+0x14/0x50
<4>[82891.072385]  [<ffffffffc01240a5>] ? spl_kmem_zalloc+0xa5/0x190 [spl]
<4>[82891.072388]  [<ffffffffc01240a5>] spl_kmem_zalloc+0xa5/0x190 [spl]
<4>[82891.072389]  [<ffffffffc027bad0>] ? dmu_ot+0xcd0/0x2b40 [zfs]
<4>[82891.072391]  [<ffffffffc01c51ac>] dbuf_hold_impl+0x3c/0xa0 [zfs]
<4>[82891.072393]  [<ffffffffc027bad0>] ? dmu_ot+0xcd0/0x2b40 [zfs]
<4>[82891.072395]  [<ffffffffc01c54dc>] dbuf_hold+0x2c/0x50 [zfs]
<4>[82891.072397]  [<ffffffffc01de9e3>] dnode_hold_impl+0x93/0x5a0 [zfs]
<4>[82891.072399]  [<ffffffffa7bc8be9>] ? _raw_spin_unlock+0x19/0x40
Panic#1 Part9
<4>[82891.071179]  [<ffffffffa7bc65c6>] mutex_lock+0x26/0x40
<4>[82891.071183]  [<ffffffffa72358da>] walk_component+0x1ca/0x450
<4>[82891.071186]  [<ffffffffa7232f06>] ? __inode_permission+0x56/0xd0
<4>[82891.071190]  [<ffffffffa7235ccd>] link_path_walk+0x16d/0x560
<4>[82891.071193]  [<ffffffffa7234859>] ? path_init+0x219/0x3e0
<4>[82891.071196]  [<ffffffffa72361d5>] path_lookupat+0x85/0x140
<4>[82891.071200]  [<ffffffffa7238c40>] filename_lookup+0xa0/0x160
<4>[82891.071203]  [<ffffffffa7bc8be9>] ? _raw_spin_unlock+0x19/0x40
<4>[82891.071207]  [<ffffffffa7164e14>] ? __audit_getname+0xa4/0xb0
<4>[82891.071211]  [<ffffffffa7238def>] user_path_at_empty+0x3f/0x60
<4>[82891.071215]  [<ffffffffa722b360>] vfs_fstatat+0x50/0xa0
<4>[82891.071219]  [<ffffffffa722b7f9>] SYSC_newstat+0x29/0x50
<4>[82891.071223]  [<ffffffffa722bd38>] SyS_newstat+0x18/0x30
<4>[82891.071226]  [<ffffffffa7bc9699>] entry_SYSCALL_64_fastpath+0x12/0x83
<6>[82891.071230] sending NMI to all CPUs:
<4>[82891.072288] NMI backtrace for cpu 0
<4>[82891.072291] CPU: 0 PID: 43 Comm: khungtaskd Tainted: P        W  O    4.2.6-hardened-r8 #1
<4>[82891.072293] Hardware name: LENOVO 20AQ006HUS/20AQ006HUS, BIOS GJET80WW (2.30 ) 10/20/2014
<4>[82891.072295] task: ffff88030ff02800 ti: ffff88030ff031d0 task.ti: ffff88030ff031d0
<4>[82891.072297] RIP: 0010:[<ffffffffa74ef82c>]  [<ffffffffa74ef82c>] __const_udelay+0x1c/0x50
<4>[82891.072299] RSP: 0000:ffffc90001a2bdd8  EFLAGS: 00000296
<4>[82891.072301] RAX: 0000000001062560 RBX: 0000000000418958 RCX: 0000000000000000
<4>[82891.072303] RDX: 0000000000291b3c RSI: 0000000000000002 RDI: 0000000000418958
<4>[82891.072305] RBP: ffffc90001a2bde8 R08: 000000000000000a R09: 0000000000000006
<4>[82891.072307] R10: 0000000000000927 R11: 0000000000000000 R12: 0000000000000001

@behlendorf
Contributor

@jdavidlists thank you for the stacks and the detailed analysis. I'm fairly sure @siebenmann is on the right track here with a ZFS_OBJ_HOLD_ENTER() deadlock, although the backtraces don't appear to show this explicitly; they're just blocked waiting for memory. But this exact issue is something both Linux and FreeBSD have struggled with due to differences in the VM and VFS subsystems across platforms.

We should investigate discarding the current locking scheme we inherited from illumos and replacing it with something optimized for Linux. Besides being a place which can result in deadlocks such as this, it's a needless potential point of serialization which can negatively impact metadata performance.

@jdavidlists you may be able to significantly reduce the likelihood of hitting this issue, and help us verify that this is the root cause, by increasing the number of locks in the array. Can you please make the following small change to ZFS? As I said, it's not a proper fix, but it should help. I suspect you're more likely to encounter this issue than most because the large number of nfsd threads you're running makes it far more likely.

diff --git a/include/sys/zfs_vfsops.h b/include/sys/zfs_vfsops.h
index 28407c6..dcd5311 100644
--- a/include/sys/zfs_vfsops.h
+++ b/include/sys/zfs_vfsops.h
@@ -112,7 +112,7 @@ typedef struct zfs_sb {
        uint64_t        z_groupquota_obj;
        uint64_t        z_replay_eof;   /* New end of file - replay only */
        sa_attr_type_t  *z_attr_table;  /* SA attr mapping->id */
-#define        ZFS_OBJ_MTX_SZ  256
+#define        ZFS_OBJ_MTX_SZ  4096
        kmutex_t        *z_hold_mtx;    /* znode hold locks */
 } zfs_sb_t;

@jdavidlists
Author

As much as it hurts to disappoint you, I'm using the Launchpad packages. Building from source should be doable, but it'll take a while to figure out how to do that.

Also, on the mailing list there has been some discussion and investigation that suggests that the source of the memory pressure may be the sudden allocation of many millions of znode and dnode structures, if that information is in any way helpful.

@siebenmann
Contributor

@behlendorf if I'm reading the code and traces correctly, kswapd has called zfs_zinactive(), which is probably trying to get the mutex, while zfs_zget() and company probably hold it and are trying to allocate memory. However, now that I look at the mm code it doesn't look like they're explicitly blocked waiting for kswapd; instead they're down in shrinker code flailing away and apparently not making much progress.

(I think @jdavidlists has mentioned on the mailing list that kswapd et al have used a surprisingly large amount of CPU time, which might be a symptom that they're spinning without being able to make much forward progress.)
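(One cheap way to check that the next time it happens would be to sample kswapd's kernel stack directly, assuming /proc/<pid>/stack is available on that kernel:

# Sample kswapd0's kernel stack a few times to see whether it is parked on
# the mutex or churning through the shrinkers:
pid=$(pgrep -x kswapd0)
for i in 1 2 3 4 5; do cat /proc/$pid/stack; echo ----; sleep 1; done
)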

@behlendorf
Contributor

@siebenmann that would make a lot of sense. To help avoid potential deadlocks like this, PF_FSTRANS is set on all the nfsd threads as they handle requests, thereby effectively disabling direct filesystem reclaim. That means we're heavily dependent on kswapd, or another user process, to perform any filesystem reclaim. If kswapd or a different process can't make progress because it ends up blocked on the mutex in zfs_zinactive(), that would go a long way towards explaining this. Plus, the more znodes we need to evict, the more elements in the hash array will need to be locked, which makes this more likely. And I'm going to go out on a limb and guess that this system's sole job is being an NFS server, so there's very little else in memory which can be discarded.

Another possible easy mitigation might be to increase zfs_arc_sys_free on the system. If you instruct the ARC to always leave a few gigabytes of memory free it should be far less likely for it to enter a situation like this. Easy to try and doesn't require rebuilding from source.

       zfs_arc_sys_free (ulong)
                   The  target  number  of  bytes the ARC should leave as free
                   memory on the system.  Defaults to the larger  of  1/64  of
                   physical memory or 512K.  Setting this option to a non-zero
                   value will override the default.

                   Default value: 0.
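For example, to have the ARC try to keep roughly 2 GiB free (the value is only an example), it can be changed at runtime and made persistent:

# Runtime (takes effect immediately):
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_sys_free

# Persistent across reboots (append so existing options aren't clobbered):
echo "options zfs zfs_arc_sys_free=2147483648" >> /etc/modprobe.d/zfs.conf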

@jdavidlists
Author

All of these servers are indeed NFS and nothing else, with twice-daily "zfs send" backups. The memory requirements of NFS on its own are vanishingly small, so nearly all of the RAM is available to ZFS.

Unfortunately we tried zfs_arc_sys_free earlier today and it appeared to make the problem much worse. After setting zfs_arc_sys_free to 2G on these machines, this problem has happened several times.

In fact, while I was typing this, it happened to a 16G machine that still had zfs_arc_sys_free set to 2G. The machine showed 750M of RAM available and was rapidly dying. After setting zfs_arc_sys_free down to 512M, the system recovered for a little while.

With zfs_arc_sys_free set to 2G, machines were dying of apparent memory starvation with 1.5-1.8G of free RAM. That, plus the fact that shrinking zfs_arc_sys_free allowed a gasping machine to recover, does make it seem like the ARC is the limiting factor.

CPU usage looks like this: arc_reclaim is at the top by a huge margin (approaching 40% of all the CPU time used since boot after a while), with arc_prune (1 per CPU) and kswapd trailing distantly behind, but still well ahead of any other single process. (And, when the stack is viewed while kswapd is busy, it's in ARC functions.)

What it "feels like" is that 16-20G of RAM is so vastly inadequate for a 1TB pool that after a point it can't evict fast enough to stay afloat. Which seems wrong, and definitely was not the case on 0.6.3. If only we could figure out what changed!

@perfinion
Contributor

The symptoms you describe apply to me too. I'm pretty sure it is related to ARC shrinking. I have a pretty old server with only 4GB of RAM. Whenever I tried to emerge chromium (it needs ~3GB to link), it would deadlock and eventually the watchdog would reboot it.

A few times I have caught it happening with htop already open (once it starts, anything at all that hits the disk will die, so only things that need no disk keep working). The arc and txg processes are the top CPU users for me too.

@jdavidlists
Author

Just found one of the servers in the about-to-die state. It has zfs_arc_sys_free set to the default (256M on this 16G machine) and was just sitting with about 170-200M free. This machine happens to run Debian, not Ubuntu, and is using SLAB, not SLUB, so I was able to fire up slabtop and watch what was happening on a per-slab basis without any misattribution.

zio_buf_16384 (as a stand-in for the ARC on the list) was at the top of the list, where it belongs, but zfs_znode_cache and dnode_t were right behind it. Basically, zio_buf_16384 was giving up RAM to zfs_znode_cache and dnode_t; it would free memory trying to get back to 256M free, and they would slurp up the difference just as fast.

As far as I could tell, nothing was evicting dnodes or znodes. Every update, their numbers were bigger. If anything was evicting them, it wasn't visible as CPU time in top, and it wasn't evicting them fast enough to keep up with the allocation of new ones.
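For the record, this is roughly what we were watching (slab names as they appear on this box; SLUB merging may rename them elsewhere):

# Refresh the interesting slabs every few seconds:
watch -n 5 "grep -E '^(zio_buf_16384|zfs_znode_cache|dnode_t) ' /proc/slabinfo"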

As an experiment I tried "echo 3 | sudo tee /proc/sys/vm/drop_caches" (based on a mailing list suggestion from @siebenmann ) and wow, it dropped all the caches. 10 seconds later the machine had 15G of free RAM. Much more effect than I would have guessed.

So the next step was to wait for "the problem" to happen in earnest. It soon did, on an Ubuntu machine. On that machine, signal_cache (read: zfs_znode_cache) at about 3.5G had totally eclipsed zio_buf_16384 at 2.9G and dnode_t was right behind it at 2.7G.

What we have started taking as the red flag of a "hard" failure is txg_sync using consistent CPU and making no progress. At that point, ZFS writes hang and nfsd's start choking and it's time to reboot. This machine was in that state.

Interestingly, writing to /proc/sys/vm/drop_caches at that point also stalled, and no memory was freed. It would have been good to get the stack trace for that to see whether it was inside ARC code, but I panicked and rebooted. The reboot also stalled and the machine had to be reset.

So it seems like there may be two separate issues interacting:

(1) Unbounded growth / lack of eviction in znode/dnode slabs pushing out ARC until it's too small to function. (The rising CPU usage of reclaim processes being a desperate effort to recycle a smaller and smaller space faster and faster.)
(2) A deadlock issue in the memory subsystem.

So the next step is to try setting up a per-minute cron job on 2-3 of the machines that drops caches if there is <1G free and see if those machines become stable (and intermittently very slow). That should help establish (1). (2) may be trickier unless we can generate the magic stack trace that catches somebody red handed holding a mutex while they're sleeping on the job. Perhaps if we master building from source we'll be able to turn the traces into line numbers.
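The check itself is nothing fancy, something along these lines (the 1G threshold and log message are just what was described above; treat it as a sketch):

#!/bin/sh
# Run from cron every minute: drop caches if free RAM falls below ~1G.
free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
if [ "${free_kb:-0}" -lt 1048576 ]; then
    logger "low memory (${free_kb} kB free), dropping caches"
    echo 3 > /proc/sys/vm/drop_caches
fi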

Until we get the source to build to try @behlendorf 's suggestion, is this the right track?

@dweeezil
Contributor

I wish I could comment on a comment: @behlendorf I think ZFS_OBJ_MTX_SZ had better be a power of 2, because the hash uses a mask rather than a modulus:

#define ZFS_OBJ_HASH(obj_num)   ((obj_num) & (ZFS_OBJ_MTX_SZ - 1))

EDIT: This follow-up is really meant for whoever is likely to try using this as a "fix". You'll want to use 4096 instead of 4093.
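A quick way to see why (plain shell arithmetic, nothing ZFS-specific): with a power-of-two size the mask reaches every slot, while a value like 4093 silently collapses the index space and leaves most mutexes unused:

for sz in 4096 4093; do
  declare -A seen=()
  for ((obj = 0; obj < 8192; obj++)); do
    seen[$(( obj & (sz - 1) ))]=1
  done
  echo "ZFS_OBJ_MTX_SZ=$sz -> ${#seen[@]} distinct lock indexes"
done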

@siebenmann
Contributor

@jdavidlists I think your results from trying to drop all caches during a lockup are what you'd expect, since the mutex deadlock appears to involve locks used when dropping inodes (that's what zfs_zinactive() is trying to do). It's possible that dropping just the page cache (via 'echo 1 >/proc/sys/vm/drop_caches') wouldn't hit the deadlock and would free up enough memory to unlock things; it might be worth trying.

On the issue of the mutex deadlock, my view is that we have a strong idea of where it happens, although we don't know why zfs_zget() is stuck in the memory system. Since zfs_zget() et al do IO, it seems like a bad idea in general for them to lock out zfs_zinactive() and thus kswapd for the entire duration of this (potentially quite long) IO et al. My gut says that something has to give here, either zfs_zinactive() abandoning attempts to evict inodes that zfs_zget() is working on or zfs_zget() somehow not locking this stuff for so long and dealing with zfs_zinactive() disappearing stuff while it's running.

(The locking code is the same in upstream Illumos, but clearly the Linux and Illumos contexts are different in some way that causes issues in Linux and not in Illumos.)

@behlendorf
Contributor

@dweeezil whoops, you're right. I've updated my comment to use 4096.

behlendorf added a commit to behlendorf/zfs that referenced this issue Dec 18, 2015
Add a zfs_object_mutex_size module option to facilitate resizing the
the per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in openzfs#4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#4106
@behlendorf
Contributor

@siebenmann nice summary of the problem. I'm pretty much in agreement with your analysis, but I'd like to add that the reason this doesn't happen under illumos is that their memory reclaim behavior is very different. This really isn't a case which can occur for them. The recent ARC restructuring and improvements we pulled in from illumos are also the most likely reason this has become more common. Specifically, the code paths for how the ARC recycles its buffers have changed.

What I'd like to do is figure out a way to eliminate this lock entirely. Realistically that will require some careful thought and can't be done quickly. However, what we can do right now is make it relatively easy to increase ZFS_OBJ_MTX_SZ through a module option. I've proposed pull request #4124 to do this. While this isn't a proper fix, it may be a reasonable band-aid for the short term while we address this properly.

@jdavidlists if you do end up building from source, please apply this patch. Then you will be able to adjust the size of this array without rebuilding, by setting the zfs_object_mutex_size module option. Just make sure the module option is set to the desired value prior to mounting the filesystems. It will automatically round values down to the nearest power of two and will accept values from 1-1048576. I'd start with 4096 as suggested above, determine if it helps, and increase it if needed.
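For example (assuming the patched module is installed), setting it in modprobe.d ensures it is in place before the pools are imported at boot:

# Persistent setting, picked up when the zfs module is loaded:
echo "options zfs zfs_object_mutex_size=4096" >> /etc/modprobe.d/zfs.conf

# Or, when loading the module by hand before importing/mounting anything:
modprobe zfs zfs_object_mutex_size=4096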

@siebenmann @dweeezil I'd love to hear any thoughts you may have on how we might eliminate these locks entirely. Alternatively, I do have some thoughts on how to optimize the current data structure to prevent this, but frankly I'd rather just be rid of it if possible!

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Dec 19, 2015
Add a zfs_object_mutex_size module option to facilitate resizing the
per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in openzfs#4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#4106
@dweeezil
Copy link
Contributor

@behlendorf I'm reviewing #3225 and related commits including d6bd8ea to remind myself of all the issues related to z_hold_mtx. As to eliminating it, nothing comes to mind right now, but it seems that if there were a way to do so, it could be done under illumos and the other implementations as well.

Assuming it stays, for now, it does seem it could be converted to a hash table or something else to prevent collisions. Also, I wonder if ZFS_OBJ_HOLD_TRYENTER(), which ZoL currently doesn't use, could be useful, maybe in the reclaim path.

@jdavidlists
Copy link
Author

OK, here's an update.

First things first, the per-minute check-and-drop_caches script seems very effective. No crashes on any of the five affected machines since deploying it.

It really wasn't too hard to download and build the ZFS & SPL sources on a test Ubuntu VM, including the patch referenced by @behlendorf (which did apply to the 0.6.5-release branch, as I'm not brave enough to touch master).

That built and installed and loaded and let me create a zpool, all on the test machine. Of course, the problem only happens on production machines.

Is there a relatively straightforward way to switch back and forth on a production machine, so I can swap in the new modules, reboot, try it, and if it blows up, revert to the official packages?

Can I just swap these kernel modules:

$ lsmod | fgrep zfs
zfs 2764151 6
zunicode 331170 1 zfs
zcommon 55411 1 zfs
znvpair 89086 2 zfs,zcommon
spl 96378 3 zfs,zcommon,znvpair
zavl 15236 1 zfs

and reboot?

The userland tools should be the same, right?

The other question is that the kernel is different on the production machine:

Linux fs6 3.13.0-73-generic #116-Ubuntu SMP Fri Dec 4 15:31:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

versus the test machine:

Linux ubu 3.19.0-42-generic #48~14.04.1-Ubuntu SMP Fri Dec 18 10:24:49 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

They are both Ubuntu 14.04.3, though. If I remember right, installing ZFS from the official packages changed the kernel?

Either way, does this mean the build must happen on the production server itself (ugh!) to make sure everything is exactly the same? Or will doing the "standard" Ubuntu ZFS setup on the test machine put the same kernel in place, so I can build the modules offline and move them over?

Tricky waters, but I'm making headway!

Thanks!

@jdavidlists
Copy link
Author

Well, a quick follow-up to my own question: installing the ubuntu-zfs packages on the test machine does not change the kernel. So I'm confused as to why the kernels are not the same on the same version of Ubuntu LTS, but it looks like the test machine somehow pulled in a Vivid backport all by itself. That was easy enough to deal with, so the remaining question is:

Now that the right kernel is there, is building an alternate set of kernel modules on the test server and swapping them in & out (with reboot) on the production server enough to pick up the changes in the patch this is meant to test?

Thanks!

@behlendorf
Copy link
Contributor

@dweeezil as I understand it, the locking is there to serialize access to a specific object number during create and destroy. Normally this would be done with a lock in the znode, but in the create and destroy case the object doesn't yet exist, so it needs some external mechanism. My first thought was to leverage the Linux VFS, but these same functions are used in ZIL replay, which happens without the aid of the VFS, so that might be a problem. Ideas welcome.

@jdavidlists yes, swapping in the new set of kernel modules should be all you need to do.

@dweeezil
Copy link
Contributor

@behlendorf That was my impression as well; can't lock something when the "thing" doesn't yet exist. I'm kind of surprised illumos is only using a 64-entry array for this. The entries must be locked for very very short periods of time.

@jdavidlists
Copy link
Author

@behlendorf OK, your patch is up and running on one server with zfs_object_mutex_size=4096. It sounds like the hope is this will result in lower CPU usage due to less mutex contention (because there are more of them), resulting in lower memory usage due to more effective eviction, resulting in a lack of hangs. Is that about right? Is there anything specific to watch for, other than lack of hangs?

@behlendorf
Copy link
Contributor

@jdavidlists yes, the basic idea is that by increasing the number of mutexes it's less likely that you'll run into contention. Mind you, it's still not impossible, just less likely, so deadlocks are still possible. But this may mitigate the issue until we can work up a patch to eliminate them entirely.

@dweeezil yes, I always found that surprising too. But again, this is probably far less likely on their platform because of how memory reclaim is handled. As for using ZFS_OBJ_HOLD_TRYENTER, I thought about that as well but couldn't think of a great way to leverage it. Unfortunately, we normally deadlock through the evict() path and there's no clean way for us to fail in this case. Once the kernel calls evict() we must succeed, so we can't just use trylock and fail on contention. Maybe we could handle this asynchronously though.

@dweeezil
Copy link
Contributor

@behlendorf Couldn't we use a ZFS_OBJ_HOLD_TRYENTER in zfs_zinactive() only in the context of a kswapd task? Isn't it OK to do nothing if called from kswapd?
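
A rough sketch of what I mean (a userspace analogue, not a patch: pthread_mutex_trylock() stands in for ZFS_OBJ_HOLD_TRYENTER() and the from_kswapd flag stands in for checking the current task for PF_KSWAPD):

#include <pthread.h>
#include <stdbool.h>

/* Returns true if the znode was torn down, false if we skipped it. */
static bool zinactive_sketch(pthread_mutex_t *hashed_lock, bool from_kswapd)
{
    if (from_kswapd) {
        /* kswapd must never sleep on a contended hash slot... */
        if (pthread_mutex_trylock(hashed_lock) != 0)
            return false;   /* ...so skip this inode and let it move on */
    } else {
        /* ordinary eviction can afford to wait */
        pthread_mutex_lock(hashed_lock);
    }

    /* ... tear down the znode / SA buffer here ... */

    pthread_mutex_unlock(hashed_lock);
    return true;
}

int main(void)
{
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    return zinactive_sketch(&lock, true) ? 0 : 1;
}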

@jdavidlists
Copy link
Author

@behlendorf Your patch does seem pretty effective. The one machine it's been applied to is making very efficient use of all available memory (down to 150M - 300M "free"), has not hung, and is using significantly less CPU on all fronts, especially in arc_reclaim. Its memory usage also shows the sort of give-and-take I would expect, with ARC usage going up and down, the data/metadata ratio fluctuating over time, and the znode/dnode slabs not always pegged at 100% and growing. We'll apply this patch to the other Ubuntu machine today then start trying to build it for the Debian servers over the next couple of days.

What is it about our workload that provokes the underlying issue so consistently compared to other uses? Is the large number of nfsd processes creating all the lock contention? (The sort of massive client-side parallelism that necessitates them can't be recreated in our test environment, which is probably why we only see this in production.) It'd be great if we could be more help in developing the real solution, but I'd probably need a year of reading the source code before I could follow most of the discussion happening now. :-P

In any case, thanks very much for your help and this patch; if anything changes with it or further data emerges as we try it on more machines, I will certainly update this issue. Alternatively, if you come up with a lockless approach or other "permanent" fix you would like us to test, please tag me and I'll see what we can do. These are production servers, but we can try to schedule something for one machine during off hours now that we know how to do source builds.

@dweeezil
Copy link
Contributor

@jdavidlists I think the summary from @siebenmann above pretty much tells the story. NFS effectively performs a "first open" on every IO request, and each of those requests uses the relatively small per-filesystem object lock table.

I think the more interesting question is why your particular NFS workload seems to be more prone to triggering deadlocks than anyone else's.

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Dec 22, 2015
Add a zfs_object_mutex_size module option to facilitate resizing the
per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in openzfs#4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#4106
@dweeezil
Copy link
Contributor

dweeezil commented Jan 6, 2016

The test rig is as follows:

  • Create a pool with a single filesystem and a lot of small files and a single snapshot. The current one happens to have 2.8M files.
  • Reduce BUF_LOCKS to maximize the collisions. I've got it set to 2 for the moment.
  • Load the zfs module with both zfs_arc_min=256M and zfs_arc_max=256M.
  • Run a bunch of parallel zfs send operations, let them run for about 10 seconds and then try to gracefully kill them.

Here's a high-level overview based on observing the system: the workload is almost 100% metadata because of all the small files. ARC size quickly overshoots due to metadata, and the arc reclaim thread starts trying to free stuff. Eventually, it seems to pick a header with waiters and lead to a deadlock. The reclaim thread can run for quite a while before this happens and, if the BUF_LOCKS table is reasonably sized, it typically will take a long time. I'm going to add a separate arcstat for the waiter skips to get a better handle on this.

This jibes with @jdavidlists' observations: he mentioned seeing the arc reclaim thread consume lots of CPU as a precursor to the deadlock.

Unfortunately, it also shows a pathological condition which could occur on any system given a sufficiently large workload. All it takes is enough un-evictable metadata to cause the whole (zfs) system to grind to a halt. In fact, even with the waiter skipping, the system can be live-locked (not everything is in D state), with the arc reclaim thread running perpetually trying to free metadata while other zfs tasks are blocked.

@jdavidlists
Copy link
Author

@behlendorf This change is staged on all five machines, so as soon as one of them reboots for any reason, it'll take effect.

@dweeezil The arcstat for waiter skips will be very interesting. Once implemented, in the livelock scenario, would that number just start skyrocketing?

@dweeezil
Copy link
Contributor

dweeezil commented Jan 7, 2016

I ran quite a few stress tests today with openzfs/spl@08fbc02 and without the waiter-skipping code, and it seems to be holding up very well. I've not been able to deadlock it a single time, although in this extreme test the reclaim thread can consume lots of CPU. AFAICT, this is a good fix.

kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Jan 8, 2016
Add a zfs_object_mutex_size module option to facilitate resizing the
per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in openzfs#4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#4106
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Jan 8, 2016
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed.  This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist.  Therefore the locking is handled externally
with an array of mutexes and AVL trees which contain per-object locks.

In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held.  In
zfs_znode_hold_exit() the process is reversed.  The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.

This scheme has two important properties:

1) No memory allocations are performed while holding one of the z_hold_locks.
   This ensures evict(), which can be called from direct memory reclaim, will
   never block waiting on a z_hold_locks which just happens to have hashed
   to the same index.

2) All locks used to serialize access to an object are per-object and never
   shared.  This minimizes lock contention without creating a large number
   of dedicated locks.

On the downside it does require znode_lock_t structures to be frequently
allocated and freed.  However, because these are backed by a kmem cache
and very short lived this cost is minimal.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#4106
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Jan 10, 2016
Add a zfs_object_mutex_size module option to facilitate resizing the
per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in openzfs#4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#4106
kernelOfTruth pushed a commit to kernelOfTruth/zfs that referenced this issue Jan 10, 2016
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed.  This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist.  Therefore the locking is handled externally
with an array of mutexes and AVL trees which contain per-object locks.

In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held.  In
zfs_znode_hold_exit() the process is reversed.  The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.

This scheme has two important properties:

1) No memory allocations are performed while holding one of the z_hold_locks.
   This ensures evict(), which can be called from direct memory reclaim, will
   never block waiting on a z_hold_locks which just happens to have hashed
   to the same index.

2) All locks used to serialize access to an object are per-object and never
   shared.  This minimizes lock contention without creating a large number
   of dedicated locks.

On the downside it does require znode_lock_t structures to be frequently
allocated and freed.  However, because these are backed by a kmem cache
and very short lived this cost is minimal.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#4106
behlendorf pushed a commit to openzfs/spl that referenced this issue Jan 12, 2016
If a thread is holding mutex when doing cv_destroy, it might end up waiting a
thread in cv_wait. The waiter would wake up trying to aquire the same mutex
and cause deadlock.

We solve this by move the mutex_enter to the bottom of cv_wait, so that
the waiter will release the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.

This would create race condition on the cv_mutex. We use xchg to set and check
it to ensure we won't be harmed by the race. This would result in the cv_mutex
debugging becomes best-effort.

Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads are exit before mutex_destroy.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue openzfs/zfs#4166
Issue openzfs/zfs#4106
behlendorf added a commit to behlendorf/zfs that referenced this issue Jan 15, 2016
Add a zfs_object_mutex_size module option to facilitate resizing the
per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in openzfs#4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue openzfs#4106
behlendorf added a commit to behlendorf/zfs that referenced this issue Jan 15, 2016
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed.  This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist.  Therefore the locking is handled externally
with an array of mutexes and AVL trees which contain per-object locks.

In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held.  In
zfs_znode_hold_exit() the process is reversed.  The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.

This scheme has two important properties:

1) No memory allocations are performed while holding one of the z_hold_locks.
   This ensures evict(), which can be called from direct memory reclaim, will
   never block waiting on a z_hold_locks which just happens to have hashed
   to the same index.

2) All locks used to serialize access to an object are per-object and never
   shared.  This minimizes lock contention without creating a large number
   of dedicated locks.

On the downside it does require znode_lock_t structures to be frequently
allocated and freed.  However, because these are backed by a kmem cache
and very short lived this cost is minimal.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#4106
@behlendorf
Copy link
Contributor

OK, the proposed patches to resolve this issue have been merged to the master branch.

The first two prevent a deadlock where zfs_zget() is stuck in the memory system holding the z_hold_mtx. It ends up waiting on zfs_zinactive(), which in turn can't make progress because its object happens to have hashed to the same value, so it's waiting on the same z_hold_mtx. This can't happen under illumos due to VM differences. The fix resolves the problem by ensuring kmem_alloc(), and thus inode eviction, always happens outside the hash lock.

c96c36f Fix zsb->z_hold_mtx deadlock
0720116 Add zfs_object_mutex_size module option
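
For anyone skimming the thread, here is a heavily simplified userspace sketch of the shape of that fix (a linked list standing in for the kernel AVL tree, made-up names; see c96c36f for the real code). The key point is that the allocation happens before the hashed lock is taken, and the lock held for the duration of create/destroy is private to the object:

#include <stdint.h>
#include <stdlib.h>
#include <pthread.h>

#define HOLD_LOCKS 64                          /* a small hashed table is fine now */

struct obj_hold {
    uint64_t        obj;                       /* object number this hold covers  */
    int             refs;
    pthread_mutex_t lock;                      /* per-object, never shared        */
    struct obj_hold *next;
};

static pthread_mutex_t hold_locks[HOLD_LOCKS]; /* held only for the list lookup   */
static struct obj_hold *hold_lists[HOLD_LOCKS];

static struct obj_hold *hold_enter(uint64_t obj)
{
    unsigned i = obj & (HOLD_LOCKS - 1);
    struct obj_hold *h, *mine;

    /* 1. Allocate with no locks held, so reclaim/evict can't deadlock on us. */
    mine = calloc(1, sizeof (*mine));
    mine->obj = obj;
    pthread_mutex_init(&mine->lock, NULL);

    /* 2. Take the hashed lock only long enough to find-or-insert the hold. */
    pthread_mutex_lock(&hold_locks[i]);
    for (h = hold_lists[i]; h != NULL && h->obj != obj; h = h->next)
        ;
    if (h == NULL) {
        h = mine;
        h->next = hold_lists[i];
        hold_lists[i] = h;
        mine = NULL;
    }
    h->refs++;
    pthread_mutex_unlock(&hold_locks[i]);

    if (mine != NULL) {                        /* lost the race; ours isn't needed */
        pthread_mutex_destroy(&mine->lock);
        free(mine);
    }

    /* 3. Sleep, if at all, only on this object's own lock. */
    pthread_mutex_lock(&h->lock);
    return h;                                  /* hold_exit() reverses all of this */
}

int main(void)
{
    for (int i = 0; i < HOLD_LOCKS; i++)
        pthread_mutex_init(&hold_locks[i], NULL);

    struct obj_hold *h = hold_enter(1234);
    pthread_mutex_unlock(&h->lock);
    return 0;
}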

The second deadlock uncovered in this issue was resolved by @tuxoko, who implemented @grwilson's suggestion. As on illumos, cv_wait() now does not re-take the mutex until it has released the cv. That means no direct zfs changes were needed. It's still possible we could have an issue lurking with pthreads condition variables, but we can worry about that if/when it's ever an issue.

openzfs/spl@e843553 Don't hold mutex until release cv in cv_wait

Thanks everyone!

@jdavidlists
Copy link
Author

Thanks for all your hard work everyone! It is much appreciated!

goulvenriou pushed a commit to Alyseo/zfs that referenced this issue Jan 17, 2016
Add a zfs_object_mutex_size module option to facilitate resizing the
per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in openzfs#4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue openzfs#4106
goulvenriou pushed a commit to Alyseo/zfs that referenced this issue Jan 17, 2016
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed.  This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist.  Therefore the locking is handled externally
with an array of mutexes and AVL trees which contain per-object locks.

In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held.  In
zfs_znode_hold_exit() the process is reversed.  The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.

This scheme has two important properties:

1) No memory allocations are performed while holding one of the z_hold_locks.
   This ensures evict(), which can be called from direct memory reclaim, will
   never block waiting on a z_hold_locks which just happens to have hashed
   to the same index.

2) All locks used to serialize access to an object are per-object and never
   shared.  This minimizes lock contention without creating a large number
   of dedicated locks.

On the downside it does require znode_lock_t structures to be frequently
allocated and freed.  However, because these are backed by a kmem cache
and very short lived this cost is minimal.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#4106
behlendorf pushed a commit to openzfs/spl that referenced this issue Jan 20, 2016
If a thread is holding mutex when doing cv_destroy, it might end up waiting a
thread in cv_wait. The waiter would wake up trying to aquire the same mutex
and cause deadlock.

We solve this by move the mutex_enter to the bottom of cv_wait, so that
the waiter will release the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.

This would create race condition on the cv_mutex. We use xchg to set and check
it to ensure we won't be harmed by the race. This would result in the cv_mutex
debugging becomes best-effort.

Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads are exit before mutex_destroy.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue openzfs/zfs#4166
Issue openzfs/zfs#4106
behlendorf added a commit that referenced this issue Jan 29, 2016
Add a zfs_object_mutex_size module option to facilitate resizing the
per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in #4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #4106
behlendorf added a commit that referenced this issue Jan 29, 2016
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed.  This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist.  Therefore the locking is handled externally
with an array of mutexes and AVL trees which contain per-object locks.

In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held.  In
zfs_znode_hold_exit() the process is reversed.  The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.

This scheme has two important properties:

1) No memory allocations are performed while holding one of the z_hold_locks.
   This ensures evict(), which can be called from direct memory reclaim, will
   never block waiting on a z_hold_locks which just happens to have hashed
   to the same index.

2) All locks used to serialize access to an object are per-object and never
   shared.  This minimizes lock contention without creating a large number
   of dedicated locks.

On the downside it does require znode_lock_t structures to be frequently
allocated and freed.  However, because these are backed by a kmem cache
and very short lived this cost is minimal.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4106
tuxoko added a commit to tuxoko/spl that referenced this issue Feb 8, 2016
If a thread is holding mutex when doing cv_destroy, it might end up waiting a
thread in cv_wait. The waiter would wake up trying to aquire the same mutex
and cause deadlock.

We solve this by move the mutex_enter to the bottom of cv_wait, so that
the waiter will release the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.

This would create race condition on the cv_mutex. We use xchg to set and check
it to ensure we won't be harmed by the race. This would result in the cv_mutex
debugging becomes best-effort.

Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads are exit before mutex_destroy.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue openzfs/zfs#4166
Issue openzfs/zfs#4106
tuxoko pushed a commit to tuxoko/zfs that referenced this issue Feb 8, 2016
Add a zfs_object_mutex_size module option to facilitate resizing the
per-dataset znode mutex array.  Increasing this value may help
make the deadlock described in openzfs#4106 less common, but this is not a
proper fix.  This patch is primarily to aid debugging and analysis.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue openzfs#4106
tuxoko pushed a commit to tuxoko/zfs that referenced this issue Feb 8, 2016
The zfs_znode_hold_enter() / zfs_znode_hold_exit() functions are used to
serialize access to a znode and its SA buffer while the object is being
created or destroyed.  This kind of locking would normally reside in the
znode itself but in this case that's impossible because the znode and SA
buffer may not yet exist.  Therefore the locking is handled externally
with an array of mutexes and AVL trees which contain per-object locks.

In zfs_znode_hold_enter() a per-object lock is created as needed, inserted
in to the correct AVL tree and finally the per-object lock is held.  In
zfs_znode_hold_exit() the process is reversed.  The per-object lock is
released, removed from the AVL tree and destroyed if there are no waiters.

This scheme has two important properties:

1) No memory allocations are performed while holding one of the z_hold_locks.
   This ensures evict(), which can be called from direct memory reclaim, will
   never block waiting on a z_hold_locks which just happens to have hashed
   to the same index.

2) All locks used to serialize access to an object are per-object and never
   shared.  This minimizes lock contention without creating a large number
   of dedicated locks.

On the downside it does require znode_lock_t structures to be frequently
allocated and freed.  However, because these are backed by a kmem cache
and very short lived this cost is minimal.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#4106
@behlendorf behlendorf added this to the 0.6.5.5 milestone Mar 23, 2016