Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VERIFY(zio->io_children[c][w] == 0) failed #5918

Closed
DeHackEd opened this issue Mar 24, 2017 · 27 comments
Closed

VERIFY(zio->io_children[c][w] == 0) failed #5918

DeHackEd opened this issue Mar 24, 2017 · 27 comments

Comments

@DeHackEd
Copy link
Contributor

DeHackEd commented Mar 24, 2017

(Edit: title was changed)

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7
Linux Kernel 4.9.11-1.el7.elrepo.x86_64
Architecture x86_64
ZFS Version 0.7.0-rc3_119_g912e2ba
SPL Version 0.7.0-rc3_3_g8d5feec

Describe the problem you're observing

Been experiencing kernel crashes intermittently lately. Panic, hard reboot required. Stack trace is below.

Describe how to reproduce the problem

The only consistency is a pool that's just over 80% full. Usually purging a large amount of space will make it stable for an extended period.

Pool receives nearly continuous writes of around 150-200 megabytes/second. Files written are loosely from 1 to 4 megabytes large each on average.

/etc/modprobe.d/zfs.conf:

options zfs zfs_arc_meta_limit=40000000000 zfs_arc_max=90000000000 l2arc_headroom=10 l2arc_write_max=67108864 zfs_dirty_data_sync=2147483648 zfs_max_recordsize=4194304
options spl spl_taskq_thread_dynamic=0

Pool layout:

# zpool iostat -v
                 capacity     operations     bandwidth 
pool           alloc   free   read  write   read  write
-------------  -----  -----  -----  -----  -----  -----
poolname        115T  30.3T    208  1.67K  1.03M   245M
  raidz2       24.4T  4.57T     36    340   235K  49.1M
    A5             -      -      4     42  29.2K  6.13M
    A3             -      -      4     42  29.3K  6.13M
    A4             -      -      4     42  29.5K  6.13M
    A2             -      -      4     42  29.4K  6.13M
    A1             -      -      4     42  29.4K  6.13M
    A9             -      -      4     42  29.4K  6.13M
    A18            -      -      4     42  29.3K  6.13M
    A6             -      -      4     42  29.3K  6.13M
  raidz2       24.9T  4.11T     34    353   233K  51.0M
    A14            -      -      4     42  29.2K  6.38M
    A24+5          -      -      4     46  29.0K  6.37M
    A24+0          -      -      4     46  28.9K  6.37M
    A16            -      -      4     42  29.0K  6.38M
    A21            -      -      4     43  29.1K  6.38M
    A23            -      -      4     42  29.1K  6.38M
    A24+1          -      -      4     46  29.0K  6.37M
    A10            -      -      4     42  29.2K  6.38M
  raidz2       24.7T  4.28T     36    348   239K  50.0M
    A20            -      -      4     43  29.9K  6.25M
    A11            -      -      4     43  29.8K  6.25M
    A22            -      -      4     43  29.7K  6.25M
    A8             -      -      4     43  29.7K  6.25M
    A7             -      -      4     43  29.8K  6.25M
    A17            -      -      4     43  29.9K  6.25M
    A24+3          -      -      4     46  29.9K  6.24M
    A12            -      -      4     43  29.8K  6.25M
  raidz2       24.8T  4.24T     35    353   233K  51.9M
    A24+7          -      -      4     45  29.1K  6.48M
    A24+2          -      -      4     45  29.1K  6.48M
    A19            -      -      4     42  29.3K  6.48M
    A24+4          -      -      4     45  29.2K  6.48M
    A0             -      -      4     42  29.2K  6.48M
    A24+6          -      -      4     45  29.2K  6.48M
    A13            -      -      4     42  29.1K  6.48M
    A15            -      -      4     42  29.1K  6.48M
  raidz2       15.9T  13.1T     66    318   114K  43.2M
    A24+13         -      -      8     39  14.3K  5.40M
    A24+14         -      -      8     39  14.1K  5.40M
    A24+15         -      -      8     39  14.2K  5.40M
    A24+16         -      -      8     40  14.3K  5.40M
    A24+17         -      -      8     39  14.2K  5.40M
    A24+18         -      -      8     40  14.2K  5.40M
    A24+19         -      -      8     39  14.2K  5.40M
    A24+20         -      -      8     39  14.2K  5.40M
cache              -      -      -      -      -      -
  zfsL2arc     1.92G  98.1G     78      8   510K   854K

Include any warning/errors/backtraces from the system logs

[42315.274149] general protection fault: 0000 [#1] SMP
[42315.274332] Modules linked in: coretemp xt_conntrack kvm_intel kvm netconsole 8021q garp mrp xt_state sch_fq tcp_bbr
ixgbevf nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter zfs(PO) zunicode(PO) zavl(PO) icp(PO) iTCO_wdt iTCO_vendor_support zcommon(PO)
znvpair(PO) spl(O) sb_edac edac_core x86_pkg_temp_thermal irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw
gf128mul glue_helper ablk_helper cryptd intel_cstate intel_rapl_perf pcspkr i2c_i801 i2c_smbus ses enclosure mei_me lpc_ich input_leds
mfd_core ipmi_devintf mei sg ioatdma ipmi_si shpchp wmi ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd
grace sunrpc ip_tables ext4 jbd2 mbcache raid1 sd_mod crc32c_intel i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt
fb_sys_fops ttm ahci ixgbe libahci mdio ptp drm libata mpt3sas pps_core raid_class dca scsi_transport_sas fjes
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: coretemp]
[42315.277020] CPU: 14 PID: 25432 Comm: z_rd_int_2 Tainted: P           O    4.9.11-1.el7.elrepo.x86_64 #1
[42315.277351] Hardware name: Supermicro Super Server/X10DRD-LTP, BIOS 2.0 02/26/2016
[42315.277677] task: ffff88202c2a8000 task.stack: ffffc9003c64c000
[42315.277848] RIP: 0010:[<ffffffff81381d0f>] [<ffffffff81381d0f>] __list_add+0xf/0xc0
[42315.278183] RSP: 0018:ffffc9003c64fa20  EFLAGS: 00010282
[42315.278351] RAX: 0000000000000030 RBX: ffff88127aba04b0 RCX: ffff881c26db1900
[42315.278524] RDX: dead000000000100 RSI: ffff881f0014e840 RDI: ffff881c26db1920
[42315.278697] RBP: ffffc9003c64fa38 R08: 000000000001cd20 R09: 0000000000000e00
[42315.278869] R10: ffff881d52686440 R11: ffff881d52686440 R12: dead000000000100
[42315.279042] R13: ffff881f0014e840 R14: ffff88127aba0830 R15: ffff881f0014eaa0
[42315.279215] FS:  0000000000000000(0000) GS:ffff88203f380000(0000) knlGS:0000000000000000
[42315.279542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[42315.279711] CR2: 00007ffb3a527000 CR3: 0000001fa87fc000 CR4: 00000000003406e0
[42315.279883] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[42315.280072] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[42315.280244] Stack:
[42315.280405]  ffff88127aba04b0 ffff881f0014e720 ffff881c26db1900 ffffc9003c64fa70
[42315.280746]  ffffffffa105e1cf ffff88127aba04b0 ffff881d6af44d80 ffff881f0014e720
[42315.281085]  ffff881894b3a008 ffff8807389e1a50 ffffc9003c64fac0 ffffffffa105e649
[42315.281425] Call Trace:
[42315.281647]  [<ffffffffa105e1cf>] zio_add_child+0x9f/0x120 [zfs]
[42315.281850]  [<ffffffffa105e649>] zio_create+0x3f9/0x4e0 [zfs]
[42315.282051]  [<ffffffffa105ec21>] zio_read+0xc1/0xe0 [zfs]
[42315.282241]  [<ffffffffa0fa83e0>] ? arc_read+0xa00/0xa00 [zfs]
[42315.282431]  [<ffffffffa0fa83e0>] ? arc_read+0xa00/0xa00 [zfs]
[42315.282620]  [<ffffffffa0fa7fb1>] arc_read+0x5d1/0xa00 [zfs]
[42315.282811]  [<ffffffffa0fafb6e>] dbuf_issue_final_prefetch+0x7e/0x90 [zfs]
[42315.283004]  [<ffffffffa0fb3c90>] dbuf_prefetch_indirect_done+0x100/0x1b0 [zfs]
[42315.283336]  [<ffffffffa047bdda>] ? spl_kmem_free+0x2a/0x40 [spl]
[42315.283510]  [<ffffffff81204d21>] ? kfree+0x121/0x170
[42315.283697]  [<ffffffffa0fa8556>] arc_read_done+0x176/0x2d0 [zfs]
[42315.283887]  [<ffffffffa0fa8a32>] l2arc_read_done+0x382/0x3f0 [zfs]
[42315.284090]  [<ffffffffa10137c9>] ? vdev_stat_update+0x1e9/0x4b0 [zfs]
[42315.284292]  [<ffffffffa1061941>] zio_done+0x311/0xc00 [zfs]
[42315.284492]  [<ffffffffa105c392>] ? zio_wait_for_children+0x62/0x80 [zfs]
[42315.284692]  [<ffffffffa105c2cf>] zio_execute+0x8f/0xf0 [zfs]
[42315.284865]  [<ffffffffa047f0a8>] taskq_thread+0x248/0x460 [spl]
[42315.285040]  [<ffffffff810af070>] ? wake_up_q+0x80/0x80
[42315.285212]  [<ffffffffa047ee60>] ? taskq_thread_spawn+0x50/0x50 [spl]
[42315.285385]  [<ffffffff810a27e9>] kthread+0xd9/0xf0
[42315.285557]  [<ffffffff810a2710>] ? kthread_park+0x60/0x60
[42315.285729]  [<ffffffff81757e55>] ret_from_fork+0x25/0x30
[42315.285898] Code: 
ff  b8  f4  ff  ff  ff  e9  3b  ff ff ff b8 f4 ff ff ff e9 31 ff ff ff 0f 1f 80 
00 00 00 00 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 8b 42 08 48 89 fb 
49 39 f0 75 2a 4d 8b 45 00 4d 39 c4 75 68 
[42315.286764] RIP [<ffffffff81381d0f>] __list_add+0xf/0xc0
[42315.286938]  RSP <ffffc9003c64fa20>
[42315.287605] ---[ end trace 36e75d4dd125f048 ]---
[42315.291752] Kernel panic - not syncing: Fatal exception
[42315.291999] Kernel Offset: disabled
[42315.296170] ---[ end Kernel panic - not syncing: Fatal exception

Output was reassembled from a netconsole capture so the wrapping had to be manually re-done. It's not exactly perfect.

@dweeezil
Copy link
Contributor

Possibly related to or similar in root cause as #5429? Gang block handling issue?

@DeHackEd
Copy link
Contributor Author

Not sure. The space usage thing is anecdotal at best and I don't see anything related to pool space allocation going on here.

@DeHackEd
Copy link
Contributor Author

I've had another crash. Stack trace is nigh-identical and below. This one is flowed properly from a kernel crash dump.

Is there anything I can do to help debug? I'm building a debug version of ZFS to install but don't know what else to do.

[603767.293140] general protection fault: 0000 [#1] SMP
[603767.293317] Modules linked in: nfnetlink_queue nfnetlink_log nfnetlink bluetooth rfkill tcp_diag udp_diag inet_diag unix_diag 8021q garp mrp xt_state rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache sch_fq tcp_bbr ixgbevf nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter iTCO_wdt iTCO_vendor_support sb_edac edac_core x86_pkg_temp_thermal coretemp zfs(PO) zunicode(PO) zavl(PO) kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd icp(PO) intel_cstate intel_rapl_perf zcommon(PO) znvpair(PO) spl(O) pcspkr ses enclosure ipmi_devintf i2c_i801 i2c_smbus mei_me lpc_ich sg joydev input_leds mfd_core mei ioatdma shpchp wmi ipmi_si ipmi_msghandler acpi_power_meter binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace
[603767.295318]  sunrpc ip_tables ext4 jbd2 mbcache raid1 sd_mod crc32c_intel i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci ixgbe libahci mdio drm ptp libata mpt3sas pps_core raid_class dca scsi_transport_sas fjes dm_mirror dm_region_hash dm_log dm_mod [last unloaded: dccp]
[603767.296114] CPU: 11 PID: 27487 Comm: z_rd_int_0 Tainted: P           O    4.9.11-1.el7.elrepo.x86_64 #1
[603767.296440] Hardware name: Supermicro Super Server/X10DRD-LTP, BIOS 2.0 02/26/2016
[603767.296762] task: ffff88202c185b00 task.stack: ffffc90043b64000
[603767.296931] RIP: 0010:[<ffffffff81381d0f>]  [<ffffffff81381d0f>] __list_add+0xf/0xc0
[603767.297260] RSP: 0018:ffffc90043b67a20  EFLAGS: 00010282
[603767.297427] RAX: 0000000000000030 RBX: ffff881ce2cef080 RCX: ffff8802bc857c90
[603767.297746] RDX: dead000000000100 RSI: ffff881b9861be10 RDI: ffff8802bc857cb0
[603767.298067] RBP: ffffc90043b67a38 R08: 000000000001cd20 R09: 0000000000000e00
[603767.298386] R10: ffff881e4a642580 R11: ffff880f61d13ae0 R12: dead000000000100
[603767.298707] R13: ffff881b9861be10 R14: ffff881ce2cef400 R15: ffff881b9861c070
[603767.299027] FS:  0000000000000000(0000) GS:ffff88203f2c0000(0000) knlGS:0000000000000000
[603767.299350] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[603767.299516] CR2: 00007fbeb123e000 CR3: 0000000ce2484000 CR4: 00000000003406e0
[603767.299837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[603767.300157] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[603767.300477] Stack:
[603767.300636]  ffff881ce2cef080 ffff881b9861bcf0 ffff8802bc857c90 ffffc90043b67a70
[603767.300971]  ffffffffa18101cf ffff881ce2cef080 ffff88135e59cd00 ffff881b9861bcf0
[603767.301304]  ffff88171d1727e8 ffff881ba6267f00 ffffc90043b67ac0 ffffffffa1810649
[603767.301637] Call Trace:
[603767.301859]  [<ffffffffa18101cf>] zio_add_child+0x9f/0x120 [zfs]
[603767.302059]  [<ffffffffa1810649>] zio_create+0x3f9/0x4e0 [zfs]
[603767.302257]  [<ffffffffa1810c21>] zio_read+0xc1/0xe0 [zfs]
[603767.302444]  [<ffffffffa175a3e0>] ? arc_read+0xa00/0xa00 [zfs]
[603767.302631]  [<ffffffffa175a3e0>] ? arc_read+0xa00/0xa00 [zfs]
[603767.302818]  [<ffffffffa1759fb1>] arc_read+0x5d1/0xa00 [zfs]
[603767.303023]  [<ffffffffa1761b6e>] dbuf_issue_final_prefetch+0x7e/0x90 [zfs]
[603767.303222]  [<ffffffffa1765c90>] dbuf_prefetch_indirect_done+0x100/0x1b0 [zfs]
[603767.303573]  [<ffffffffa1811f19>] ? zio_nowait+0xb9/0x150 [zfs]
[603767.303761]  [<ffffffffa175a556>] arc_read_done+0x176/0x2d0 [zfs]
[603767.303949]  [<ffffffffa175aa32>] l2arc_read_done+0x382/0x3f0 [zfs]
[603767.304150]  [<ffffffffa17c57c9>] ? vdev_stat_update+0x1e9/0x4b0 [zfs]
[603767.304349]  [<ffffffffa1813941>] zio_done+0x311/0xc00 [zfs]
[603767.304546]  [<ffffffffa180e392>] ? zio_wait_for_children+0x62/0x80 [zfs]
[603767.304745]  [<ffffffffa180e2cf>] zio_execute+0x8f/0xf0 [zfs]
[603767.304919]  [<ffffffffa05e80a8>] taskq_thread+0x248/0x460 [spl]
[603767.305092]  [<ffffffff810af070>] ? wake_up_q+0x80/0x80
[603767.305261]  [<ffffffffa05e7e60>] ? taskq_thread_spawn+0x50/0x50 [spl]
[603767.305431]  [<ffffffff810a27e9>] kthread+0xd9/0xf0
[603767.305597]  [<ffffffff810a2710>] ? kthread_park+0x60/0x60
[603767.305767]  [<ffffffff810a2710>] ? kthread_park+0x60/0x60
[603767.305936]  [<ffffffff81757e55>] ret_from_fork+0x25/0x30
[603767.306103] Code: ff b8 f4 ff ff ff e9 3b ff ff ff b8 f4 ff ff ff e9 31 ff ff ff 0f 1f 80 00 00 00 00 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 <4c> 8b 42 08 48 89 fb 49 39 f0 75 2a 4d 8b 45 00 4d 39 c4 75 68 
[603767.306829] RIP  [<ffffffff81381d0f>] __list_add+0xf/0xc0
[603767.307001]  RSP <ffffc90043b67a20>

@ryao
Copy link
Contributor

ryao commented Apr 13, 2017

Argument 3 of __list_add is LIST_POISON1. That means head->next is an invalid pointer. This is either from list_link_init(), which the zio code does not call, or from list_del(), which we can call indirectly through list_destroy(), list_remove(), list_remove_head() or list_remove_tail(). The only of those functions in the zio code are the first two. They are only called in zio_remove_child() and zio_destroy(). How we can call either of those is unclear to me.

@DeHackEd It might be helpful if you could install the debuginfo for your kernel. Then you could run gdb /usr/lib/debug/lib/modules/$(uname -r)/vmlinux and disassemble __list_add. The output from that should be helpful in understanding more of the registers.

@dweeezil I suspect that this is unrelated to #5429 because of the LIST_POISON1.

@DeHackEd
Copy link
Contributor Author

elrepo doesn't make debuginfo builds. I'll have to run a custom build. But I'll do that.

@ryao
Copy link
Contributor

ryao commented Apr 13, 2017

@DeHackEd In hindsight, no need. Just run it on your actual kernel binary. Hopefully we'll get disassembly.

Edit: That won't work due to compression. Use this:

https://blog.packagecloud.io/eng/2016/03/08/how-to-extract-and-disassmble-a-linux-kernel-image-vmlinuz/

@DeHackEd
Copy link
Contributor Author

Disassembly of affected functions (reposting from IRC):

__list_add: https://pastebin.com/raw/M0tSULEV
zio_add_child: https://pastebin.com/raw/qtBHkS8y

Note that I don't have the exact same .ko that produced the stack trace above so the zio_add_child is a manual recompile and disassembly after the fact.

@DeHackEd
Copy link
Contributor Author

Debug build non-fatal crash dump:

[464792.580697] VERIFY(zio->io_children[c][w] == 0) failed
[464792.580893] PANIC at zio.c:3759:zio_done()
[464792.581099] Showing stack for process 31381
[464792.581281] CPU: 14 PID: 31381 Comm: z_wr_iss Tainted: P           O    4.9.21 #1
[464792.581616] Hardware name: Supermicro Super Server/X10DRD-LTP, BIOS 2.0 02/26/2016
[464792.581956]  ffffc90026c4b6e0 ffffffff81315dec ffffffffa0930f83 0000000000000eaf
[464792.582317]  ffffc90026c4b6f0 ffffffffa0607fc4 ffffc90026c4b878 ffffffffa060808f
[464792.582669]  0000000000000001 ffffc90000000028 ffffc90026c4b888 ffffc90026c4b828
[464792.583013] Call Trace:
[464792.583205]  [<ffffffff81315dec>] dump_stack+0x63/0x87
[464792.583405]  [<ffffffffa0607fc4>] spl_dumpstack+0x44/0x50 [spl]
[464792.583588]  [<ffffffffa060808f>] spl_panic+0xbf/0xf0 [spl]
[464792.583777]  [<ffffffff816f0364>] ? __schedule+0x224/0x6a0
[464792.583965]  [<ffffffff816f0a0a>] ? preempt_schedule_common+0x18/0x2e
[464792.584149]  [<ffffffff816f26d2>] ? mutex_lock+0x12/0x2f
[464792.584564]  [<ffffffffa083edc4>] zio_done+0x1344/0x1840 [zfs]
[464792.584849]  [<ffffffffa083f51d>] ? zio_ready+0x25d/0x630 [zfs]
[464792.585091]  [<ffffffffa083a9d1>] zio_nowait+0x121/0x320 [zfs]
[464792.585408]  [<ffffffffa0759fae>] dbuf_prefetch+0x45e/0x5f0 [zfs]
[464792.585745]  [<ffffffffa0764bcf>] dmu_prefetch+0x23f/0x260 [zfs]
[464792.586081]  [<ffffffffa07d1fb5>] space_map_load+0xf5/0x650 [zfs]
[464792.586500]  [<ffffffffa07abc49>] metaslab_load+0x69/0x1d0 [zfs]
[464792.586756]  [<ffffffffa07abfe9>] metaslab_activate+0x89/0x160 [zfs]
[464792.587007]  [<ffffffffa07aca31>] metaslab_alloc_dva.isra.9+0x531/0x10a0 [zfs]
[464792.587413]  [<ffffffffa07af6c4>] metaslab_alloc+0x124/0x460 [zfs]
[464792.587672]  [<ffffffffa083ae9e>] zio_dva_allocate+0x17e/0xa10 [zfs]
[464792.587915]  [<ffffffffa07b2e86>] ? zfs_refcount_add+0x16/0x20 [zfs]
[464792.588148]  [<ffffffffa07af449>] ? metaslab_class_throttle_reserve+0xb9/0x130 [zfs]
[464792.588536]  [<ffffffffa060ac66>] ? tsd_hash_search.isra.0+0x46/0x90 [spl]
[464792.588729]  [<ffffffffa060ad2e>] ? tsd_get_by_thread+0x2e/0x50 [spl]
[464792.588910]  [<ffffffffa0605238>] ? taskq_member+0x18/0x30 [spl]
[464792.589205]  [<ffffffffa0834285>] zio_execute+0xe5/0x2f0 [zfs]
[464792.589389]  [<ffffffffa06060a8>] taskq_thread+0x248/0x460 [spl]
[464792.589580]  [<ffffffff810ac090>] ? wake_up_q+0x80/0x80
[464792.589774]  [<ffffffffa0605e60>] ? taskq_thread_spawn+0x50/0x50 [spl]
[464792.589971]  [<ffffffff8109f739>] kthread+0xd9/0xf0
[464792.590156]  [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464792.590349]  [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464792.590550]  [<ffffffff816f5195>] ret_from_fork+0x25/0x30
[464968.688233] INFO: task z_wr_iss:31381 blocked for more than 120 seconds.
[464968.688433]       Tainted: P           O    4.9.21 #1
[464968.688605] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[464968.688926] z_wr_iss        D    0 31381      2 0x00000080
[464968.689105]  ffff8820355187c0 0000000000000000 ffff880f62b2c440 ffff882026e88000
[464968.689441]  ffff88203f399240 ffffc90026c4b6d8 ffffffff816f035c 0000000000000286
[464968.689772]  ffffffffffffff10 ffffffff81315e00 ffff880f62b2c440 0000000000000eaf
[464968.690111] Call Trace:
[464968.690283]  [<ffffffff816f035c>] ? __schedule+0x21c/0x6a0
[464968.690453]  [<ffffffff81315e00>] ? dump_stack+0x77/0x87
[464968.690623]  [<ffffffff816f0816>] schedule+0x36/0x80
[464968.690798]  [<ffffffffa06080b5>] spl_panic+0xe5/0xf0 [spl]
[464968.690966]  [<ffffffff816f0364>] ? __schedule+0x224/0x6a0
[464968.691139]  [<ffffffff816f0a0a>] ? preempt_schedule_common+0x18/0x2e
[464968.691310]  [<ffffffff816f26d2>] ? mutex_lock+0x12/0x2f
[464968.691565]  [<ffffffffa083edc4>] zio_done+0x1344/0x1840 [zfs]
[464968.691772]  [<ffffffffa083f51d>] ? zio_ready+0x25d/0x630 [zfs]
[464968.691977]  [<ffffffffa083a9d1>] zio_nowait+0x121/0x320 [zfs]
[464968.692197]  [<ffffffffa0759fae>] dbuf_prefetch+0x45e/0x5f0 [zfs]
[464968.692396]  [<ffffffffa0764bcf>] dmu_prefetch+0x23f/0x260 [zfs]
[464968.692603]  [<ffffffffa07d1fb5>] space_map_load+0xf5/0x650 [zfs]
[464968.692808]  [<ffffffffa07abc49>] metaslab_load+0x69/0x1d0 [zfs]
[464968.693010]  [<ffffffffa07abfe9>] metaslab_activate+0x89/0x160 [zfs]
[464968.693230]  [<ffffffffa07aca31>] metaslab_alloc_dva.isra.9+0x531/0x10a0 [zfs]
[464968.693588]  [<ffffffffa07af6c4>] metaslab_alloc+0x124/0x460 [zfs]
[464968.693795]  [<ffffffffa083ae9e>] zio_dva_allocate+0x17e/0xa10 [zfs]
[464968.694000]  [<ffffffffa07b2e86>] ? zfs_refcount_add+0x16/0x20 [zfs]
[464968.694221]  [<ffffffffa07af449>] ? metaslab_class_throttle_reserve+0xb9/0x130 [zfs]
[464968.694548]  [<ffffffffa060ac66>] ? tsd_hash_search.isra.0+0x46/0x90 [spl]
[464968.694720]  [<ffffffffa060ad2e>] ? tsd_get_by_thread+0x2e/0x50 [spl]
[464968.694891]  [<ffffffffa0605238>] ? taskq_member+0x18/0x30 [spl]
[464968.695117]  [<ffffffffa0834285>] zio_execute+0xe5/0x2f0 [zfs]
[464968.695291]  [<ffffffffa06060a8>] taskq_thread+0x248/0x460 [spl]
[464968.695463]  [<ffffffff810ac090>] ? wake_up_q+0x80/0x80
[464968.695631]  [<ffffffffa0605e60>] ? taskq_thread_spawn+0x50/0x50 [spl]
[464968.695804]  [<ffffffff8109f739>] kthread+0xd9/0xf0
[464968.695969]  [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464968.696140]  [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464968.696310]  [<ffffffff816f5195>] ret_from_fork+0x25/0x30
[464968.696531] INFO: task txg_sync:32375 blocked for more than 120 seconds.
[464968.696699]       Tainted: P           O    4.9.21 #1
[464968.696863] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[464968.697188] txg_sync        D    0 32375      2 0x00000080
[464968.697360]  ffff8810108ff440 ffff880704da5540 ffff8810321c5b00 ffff8809f083ad80
[464968.697692]  ffff88103fb99240 ffffc90021f57ac0 ffffffff816f035c ffff8810321c5b00
[464968.698023]  00000000ecdc0887 ffff881948b1f9e0 ffff8810321c5b00 0000000000000000
[464968.698360] Call Trace:
[464968.698522]  [<ffffffff816f035c>] ? __schedule+0x21c/0x6a0
[464968.698689]  [<ffffffff816f0816>] schedule+0x36/0x80
[464968.698853]  [<ffffffff816f3a3c>] schedule_timeout+0x21c/0x3a0
[464968.699058]  [<ffffffffa0832d12>] ? zio_issue_async+0x12/0x20 [zfs]
[464968.699275]  [<ffffffffa083a9d1>] ? zio_nowait+0x121/0x320 [zfs]
[464968.699446]  [<ffffffff810f973c>] ? ktime_get+0x3c/0xb0
[464968.699613]  [<ffffffff816f00d6>] io_schedule_timeout+0xa6/0x110
[464968.699784]  [<ffffffffa060a133>] cv_wait_common+0xb3/0x130 [spl]
[464968.699955]  [<ffffffff810c63d0>] ? prepare_to_wait_event+0xf0/0xf0
[464968.700134]  [<ffffffffa060a208>] __cv_wait_io+0x18/0x20 [spl]
[464968.700357]  [<ffffffffa083a0b3>] zio_wait+0x1e3/0x3e0 [zfs]
[464968.700562]  [<ffffffffa079b6ab>] dsl_pool_sync+0x11b/0x610 [zfs]
[464968.700785]  [<ffffffffa07bf0d6>] spa_sync+0x456/0x1220 [zfs]
[464968.700952]  [<ffffffff810ac0a2>] ? default_wake_function+0x12/0x20
[464968.701179]  [<ffffffffa07d4f6c>] txg_sync_thread+0x2ec/0x570 [zfs]
[464968.701387]  [<ffffffffa07d4c80>] ? txg_init+0x2c0/0x2c0 [zfs]
[464968.701557]  [<ffffffffa0604f42>] thread_generic_wrapper+0x72/0x80 [spl]
[464968.701727]  [<ffffffffa0604ed0>] ? __thread_exit+0x20/0x20 [spl]
[464968.701895]  [<ffffffff8109f739>] kthread+0xd9/0xf0
[464968.702066]  [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464968.702235]  [<ffffffff8109f660>] ? kthread_park+0x60/0x60
[464968.702401]  [<ffffffff816f5195>] ret_from_fork+0x25/0x30

System is currently running but hung. I can't leave it like this long.

Kernel is also a semi-debug build. It was built with frame pointers.

@DeHackEd DeHackEd changed the title Kernel segfault in zio_add_child VERIFY(zio->io_children[c][w] == 0) failed Apr 18, 2017
@DeHackEd
Copy link
Contributor Author

I have a crash dump from a kernel with this thread hung in the VERIFY() clause but don't know how to extract any useful data out of the backtrace. crash is a pretty poor substitute for gdb.

@ryao
Copy link
Contributor

ryao commented Apr 22, 2017

After talking with @DeHackEd in IRC, it turns out that he is using an Intel E5-2620 v4 and /proc/cpuinfo supports hle/rtm, which are Intel's TSX-NI. That is known to be broken (from what I had been told a year or two ago in #illumos) and likely is disabled in the latest BIOS update. I suggested DeHacked update his BIOS. The CentOS 7 kernel likely supports TSX-NI. I think it is possible that we are looking at the result of a broken Intel TSX-NI implementation. We'll see if this goes away after his BIOS update presumably disables TSX-NI.

Edit: I cannot find TSX-NI support code inside the kernel itself. I still suspect that a CPU errata might be causing this, so a newer BIOS and the latest microcode could fix it.

@kernelOfTruth
Copy link
Contributor

kernelOfTruth commented Apr 23, 2017

microcode update via early_ucode should strip the CPU off its support, I ran into issues however after doing so & had to re-compile my entire system (Gentoo),

so if anything relies on it (march=native, etc.) - you might end up with a broken system - so: caveat emptor !

I also have a Haswell cpu and with a microcode update

cat /proc/cpuinfo | grep -i hle

and

cat /proc/cpuinfo | grep -i rtm

turn out empty

@DeHackEd
Copy link
Contributor Author

DeHackEd commented Apr 23, 2017

After the BIOS update I have a new CPU flag - ida which appears to be a performance/turbo boost thing. No other changes. Microcode base is unchanged on the CPUs.

I'll look into a specific microcode update for Linux to apply as well.

Edit: microcode acquired, date is after BIOS update. After installing the flags didn't change though. I'll just let this run.

@DeHackEd
Copy link
Contributor Author

Crashed again anyway. Firmware updates didn't help.

@DeHackEd
Copy link
Contributor Author

kernel: VERIFY3(zio->io_children[c][w] == 0) failed (1 == 0)

Does this mean anything?

@DeHackEd
Copy link
Contributor Author

DeHackEd commented May 3, 2017

Had a crash. This time I grabbed a lot more debugging information, including a kernel core dump with a debug-enabled kernel + debug-enabled ZFS.

What's odd is that the assertion that fails (see above) turns out to be true after the fact. Possibly a race condition?

Gist with the object dump of the zio_t being examined at crash time, and some info on what I did to get the dump (because crash is inferior to GDB in spite of using GDB internally). https://gist.github.com/DeHackEd/e09f447fbdcdfa3d014f1236490d3715

@DeHackEd
Copy link
Contributor Author

DeHackEd commented May 4, 2017

Going over my crash dump history, all crashes were either from L2ARC or from the metaslab code. Broadly speaking the L2ARC code crashes when --enable-debug is disabled and the metaslab code when --enable-debug is active. I'm pulling my L2ARC out for now and seeing if this helps.

@DeHackEd
Copy link
Contributor Author

DeHackEd commented May 8, 2017

Commenting that removal of the L2ARC seems to have stabilized the system. I want to give it at least 10 days total (~6 days to go) before I jump for joy though.

@DeHackEd
Copy link
Contributor Author

Okay, I'm calling that removal of the SSD fixed the problem. Now, what could cause a zio crash/ASSERT failure when L2ARC is involved? More specifically, in the metaslab/spacemap layer.

@DeHackEd
Copy link
Contributor Author

Just going over my old issues. This has completely gone away since the removal of the L2ARC device. I'm using metadata allocation classes instead.

@tjikkun
Copy link

tjikkun commented Jan 9, 2018

I am seeing what I think is the same bug after switching from a EL6-ish kernel to an EL7-ish one:

[35271.152080] general protection fault: 0000 [#1] SMP
[35271.154765] Modules linked in: cls_fw sch_sfq sch_htb netconsole arc4 md4 nls_utf8 cifs dns_resolver ip6t_REJECT nf_reject_ipv6 ip_set_hash_ip xt_set ip_set_hash_net nf_conntrack_ipv6 nf_defrag_ipv6 nf_log_ipv6 nf_log_ipv4 nf_log_common ip_set nfnetlink ip6table_filter xt_owner nf_nat_ftp xt_REDIRECT nf_nat_redirect xt_conntrack nf_conntrack_ftp ipt_REJECT nf_reject_ipv4 xt_LOG xt_limit iptable_filter xt_multiport iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack 8021q mrp garp stp llc bonding kmodlve(O) vzdev ip6table_mangle ip6_tables iptable_mangle xt_mark ip_tables binfmt_misc iTCO_wdt iTCO_vendor_support zfs(PO) zcommon(PO) znvpair(PO) icp(PO) spl(O) zlib_deflate zavl(PO) zunicode(PO) pcspkr sb_edac edac_core sg i2c_i801 lpc_ich ipmi_devintf ipmi_si ipmi_msghandler mei_me mei ioatdma ixgbe dca ptp pps_core mdio fjes ext4 jbd2 mbcache raid1 sd_mod crc_t10dif crct10dif_common mpt3sas raid_class ahci libahci isci libsas libata scsi_transport_sas wmi mgag200 ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
[35271.172427] CPU: 12 PID: 7262 Comm: z_rd_int_4 ve: 0 Tainted: P           O   ------------   3.10.0-714.10.2.lve1.4.79.el6h.x86_64 #1 29.2
[35271.176501] Hardware name: Supermicro X9DRT-PT/X9DRT-PT, BIOS 3.0a 01/15/2014
[35271.178063] task: ffff881836fe4fb0 ti: ffff88184a598000 task.ti: ffff88184a598000
[35271.179617] RIP: 0010:[<ffffffff8133c9ef>] [<ffffffff8133c9ef>] __list_add+0xf/0xc0
[35271.181353] RSP: 0018:ffff88184a59ba30  EFLAGS: 00010282
[35271.182920] RAX: 0000000000000000 RBX: ffff881334de8000 RCX: ffff88184a59bfd8
[35271.184460] RDX: dead000000000100 RSI: ffff881334dea260 RDI: ffff881429438d70
[35271.186068] RBP: ffff88184a59ba48 R08: 0000000000026a10 R09: 0000000000001000
[35271.187558] R10: ffff8818359a0000 R11: ffffea002ee4a480 R12: dead000000000100
[35271.189376] R13: ffff881334dea260 R14: ffff881334de8380 R15: ffff881334dea4c0
[35271.191018] FS:  0000000000000000(0000) GS:ffff880c4fb80000(0000) knlGS:0000000000000000
[35271.192463] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[35271.193965] CR2: ffffffffff600400 CR3: 00000017cd67c000 CR4: 00000000001607e0
[35271.195362] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[35271.196757] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[35271.198408] Call Trace:
[35271.199818]  [<ffffffffa05932b3>] zio_add_child+0xc3/0x150 [zfs]
[35271.201410]  [<ffffffffa0593747>] zio_create+0x407/0x510 [zfs]
[35271.203093]  [<ffffffffa0594496>] zio_read_phys+0x76/0x80 [zfs]
[35271.204477]  [<ffffffffa04c9ea0>] ? arc_read_done+0x310/0x310 [zfs]
[35271.205778]  [<ffffffffa04c9aea>] arc_read+0xa1a/0xac0 [zfs]
[35271.207072]  [<ffffffffa04c9ea0>] ? arc_read_done+0x310/0x310 [zfs]
[35271.208395]  [<ffffffffa04d32c5>] dbuf_issue_final_prefetch+0x85/0x90 [zfs]
[35271.209679]  [<ffffffffa04d6b93>] dbuf_prefetch_indirect_done+0x103/0x1b0 [zfs]
[35271.210912]  [<ffffffffa04c8a58>] ? arc_buf_alloc_impl.isra.29+0x268/0x2e0 [zfs]
[35271.212126]  [<ffffffffa04c9d18>] arc_read_done+0x188/0x310 [zfs]
[35271.213320]  [<ffffffffa04ca306>] l2arc_read_done+0x466/0x550 [zfs]
[35271.214698]  [<ffffffffa0540bd2>] ? vdev_stat_update+0x212/0x510 [zfs]
[35271.216365]  [<ffffffffa0596c61>] zio_done+0x321/0xce0 [zfs]
[35271.217663]  [<ffffffff816a0202>] ? mutex_lock+0x12/0x2f
[35271.218868]  [<ffffffffa05907e5>] ? zio_wait_for_children+0x85/0xa0 [zfs]
[35271.219988]  [<ffffffffa059192c>] zio_execute+0x9c/0x100 [zfs]
[35271.221062]  [<ffffffffa0416f46>] taskq_thread+0x256/0x480 [spl]
[35271.222111]  [<ffffffff810cb6f0>] ? wake_up_state+0x20/0x20
[35271.223265]  [<ffffffffa0416cf0>] ? taskq_thread_spawn+0x60/0x60 [spl]
[35271.224290]  [<ffffffff810b4a3f>] kthread+0xcf/0xe0
[35271.225280]  [<ffffffff810b4970>] ? create_kthread+0x60/0x60
[35271.226250]  [<ffffffff816ad9d8>] ret_from_fork+0x58/0x90
[35271.227265]  [<ffffffff810b4970>] ? create_kthread+0x60/0x60
[35271.228187] Code: 48 89 df e8 34 70 eb ff b8 f4 ff ff ff e9 3b ff ff ff b8 f4 ff ff ff e9 31 ff ff ff 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 8b 42 08 48 89 fb 49 39 f0 75 2a 4d 8b 45 00 4d 39 c4 75 68
[35271.230551] RIP [<ffffffff8133c9ef>] __list_add+0xf/0xc0
[35271.231459]  RSP <ffff88184a59ba30>

System information

Type Version/Name
Distribution Name CloudLinux
Distribution Version 6.9
Linux Kernel 3.10.0-714.10.2.lve1.4.79.el6h.x86_64
Architecture x86_64
ZFS Version 0.7.3
SPL Version 0.7.3

Disaabling L2ARC is not really an option for us. Any suggestions on what to do?

@DeHackEd
Copy link
Contributor Author

The kernel version thing is interesting. It suggests it's a bisectable kernel issue. If I can reproduce this consistently I could do a bisect.

Is your L2ARC hit rate really high enough that dropping it will ruin performance that badly?

@loli10K
Copy link
Contributor

loli10K commented Jan 13, 2018

I don't currently have the hardware in place to test (nor reproduce) the issue but given this seems to have something to do with l2arc/prefetch code it could be explained by "Illumos 8857 - zio_remove_child() panic due to already destroyed parent zio" (Fix PR: openzfs/openzfs#505)

@DeHackEd
Copy link
Contributor Author

That looks very promising. And if that gets accepted, it should be added to our stable branch(es?) as well.

behlendorf pushed a commit that referenced this issue Feb 14, 2018
…nt zio

PROBLEM
=======
It's possible for a parent zio to complete even though it has children
which have not completed. This can result in the following panic:
    > $C
    ffffff01809128c0 vpanic()
    ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
    ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
    ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
    ffffff3373370908)
    ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
    ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
    ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
    ffffff0180912b40 thread_start+8()
    > ::status
    debugging crash dump vmcore.2 (64-bit) from batfs0390
    operating system: 5.11 joyent_20170911T171900Z (i86pc)
    image uuid: (not set)
    panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80
    owner=ffffff3c59b39480 thread=ffffff0180912c40
    dump content: kernel pages only
The problem is that dbuf_prefetch along with l2arc can create a zio tree
which confuses the parent zio and allows it to complete with while children
still exist. Here's the scenario:
    zio tree:
        pio
         |--- lio
The parent zio, pio, has entered the zio_done stage and begins to check its
children to see there are still some that have not completed. In zio_done(),
the children are checked in the following order:
    zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE)
If pio, finds any child which has not completed then it stops executing and
goes to sleep. Each call to zio_wait_for_children() will grab the io_lock
while checking the particular child.
In this scenario, the pio has completed the first call to
zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since
the only zio in the zio tree right now is the logical zio, lio, then it
completes that call and prepares to check the next child type.
In the meantime, the lio completes and in its callback creates a child vdev
zio, cio. The zio tree looks like this:
    zio tree:
        pio
         |--- lio
         |--- cio
The lio then grabs the parent's io_lock and removes itself.
    zio tree:
        pio
         |--- cio
The pio continues to run but has already completed its check for ZIO_CHILD_VDEV
and will erroneously complete. When the child zio, cio, completes it will panic
the system trying to reference the parent zio which has been destroyed.
SOLUTION
========
The fix is to rework the zio_wait_for_children() logic to accept a bitfield
for all the children types that it's interested in checking. The
io_lock will is held the entire time we check all the children types. Since
the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided
to allow for the conversion between a ZIO_CHILD type and the bitfield used by
the zio_wiat_for_children logic.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8857
OpenZFS-commit: openzfs/openzfs@862ff6d99c
Issue #5918
Closes #7168
@behlendorf
Copy link
Contributor

The OpenZFS fix for this issue has been merged. If anyone has been able to consistently reproduce this issue it would be great if you could verify it's now resolved in master. This fix has been added to the list of proposed changes for the next 0.7 release.

@DeHackEd
Copy link
Contributor Author

Unfortunately not me. That system is now running metadata allocation classes instead so there's no SSD space left for L2ARC.

@DeHackEd
Copy link
Contributor Author

Slight edit, another machine is exhibiting the same symptoms with an L2ARC. Maybe I can.

(I absentmindedly removed the L2ARC as a fix before I remembered this issue was still outstanding)...

But this will likely take some time. It's around a month between crashes. And I'll have to get approval from above before I leave a machine in a potentially crash-prone state.

tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Mar 7, 2018
…nt zio

PROBLEM
=======
It's possible for a parent zio to complete even though it has children
which have not completed. This can result in the following panic:
    > $C
    ffffff01809128c0 vpanic()
    ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
    ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
    ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
    ffffff3373370908)
    ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
    ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
    ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
    ffffff0180912b40 thread_start+8()
    > ::status
    debugging crash dump vmcore.2 (64-bit) from batfs0390
    operating system: 5.11 joyent_20170911T171900Z (i86pc)
    image uuid: (not set)
    panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80
    owner=ffffff3c59b39480 thread=ffffff0180912c40
    dump content: kernel pages only
The problem is that dbuf_prefetch along with l2arc can create a zio tree
which confuses the parent zio and allows it to complete with while children
still exist. Here's the scenario:
    zio tree:
        pio
         |--- lio
The parent zio, pio, has entered the zio_done stage and begins to check its
children to see there are still some that have not completed. In zio_done(),
the children are checked in the following order:
    zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE)
If pio, finds any child which has not completed then it stops executing and
goes to sleep. Each call to zio_wait_for_children() will grab the io_lock
while checking the particular child.
In this scenario, the pio has completed the first call to
zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since
the only zio in the zio tree right now is the logical zio, lio, then it
completes that call and prepares to check the next child type.
In the meantime, the lio completes and in its callback creates a child vdev
zio, cio. The zio tree looks like this:
    zio tree:
        pio
         |--- lio
         |--- cio
The lio then grabs the parent's io_lock and removes itself.
    zio tree:
        pio
         |--- cio
The pio continues to run but has already completed its check for ZIO_CHILD_VDEV
and will erroneously complete. When the child zio, cio, completes it will panic
the system trying to reference the parent zio which has been destroyed.
SOLUTION
========
The fix is to rework the zio_wait_for_children() logic to accept a bitfield
for all the children types that it's interested in checking. The
io_lock will is held the entire time we check all the children types. Since
the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided
to allow for the conversion between a ZIO_CHILD type and the bitfield used by
the zio_wiat_for_children logic.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8857
OpenZFS-commit: openzfs/openzfs@862ff6d99c
Issue openzfs#5918
Closes openzfs#7168
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Mar 12, 2018
…nt zio

PROBLEM
=======
It's possible for a parent zio to complete even though it has children
which have not completed. This can result in the following panic:
    > $C
    ffffff01809128c0 vpanic()
    ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
    ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
    ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
    ffffff3373370908)
    ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
    ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
    ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
    ffffff0180912b40 thread_start+8()
    > ::status
    debugging crash dump vmcore.2 (64-bit) from batfs0390
    operating system: 5.11 joyent_20170911T171900Z (i86pc)
    image uuid: (not set)
    panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80
    owner=ffffff3c59b39480 thread=ffffff0180912c40
    dump content: kernel pages only
The problem is that dbuf_prefetch along with l2arc can create a zio tree
which confuses the parent zio and allows it to complete with while children
still exist. Here's the scenario:
    zio tree:
        pio
         |--- lio
The parent zio, pio, has entered the zio_done stage and begins to check its
children to see there are still some that have not completed. In zio_done(),
the children are checked in the following order:
    zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE)
If pio, finds any child which has not completed then it stops executing and
goes to sleep. Each call to zio_wait_for_children() will grab the io_lock
while checking the particular child.
In this scenario, the pio has completed the first call to
zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since
the only zio in the zio tree right now is the logical zio, lio, then it
completes that call and prepares to check the next child type.
In the meantime, the lio completes and in its callback creates a child vdev
zio, cio. The zio tree looks like this:
    zio tree:
        pio
         |--- lio
         |--- cio
The lio then grabs the parent's io_lock and removes itself.
    zio tree:
        pio
         |--- cio
The pio continues to run but has already completed its check for ZIO_CHILD_VDEV
and will erroneously complete. When the child zio, cio, completes it will panic
the system trying to reference the parent zio which has been destroyed.
SOLUTION
========
The fix is to rework the zio_wait_for_children() logic to accept a bitfield
for all the children types that it's interested in checking. The
io_lock will is held the entire time we check all the children types. Since
the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided
to allow for the conversion between a ZIO_CHILD type and the bitfield used by
the zio_wiat_for_children logic.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8857
OpenZFS-commit: openzfs/openzfs@862ff6d99c
Issue openzfs#5918
Closes openzfs#7168
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Mar 13, 2018
…nt zio

PROBLEM
=======
It's possible for a parent zio to complete even though it has children
which have not completed. This can result in the following panic:
    > $C
    ffffff01809128c0 vpanic()
    ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
    ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
    ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
    ffffff3373370908)
    ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
    ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
    ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
    ffffff0180912b40 thread_start+8()
    > ::status
    debugging crash dump vmcore.2 (64-bit) from batfs0390
    operating system: 5.11 joyent_20170911T171900Z (i86pc)
    image uuid: (not set)
    panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80
    owner=ffffff3c59b39480 thread=ffffff0180912c40
    dump content: kernel pages only
The problem is that dbuf_prefetch along with l2arc can create a zio tree
which confuses the parent zio and allows it to complete with while children
still exist. Here's the scenario:
    zio tree:
        pio
         |--- lio
The parent zio, pio, has entered the zio_done stage and begins to check its
children to see there are still some that have not completed. In zio_done(),
the children are checked in the following order:
    zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE)
If pio, finds any child which has not completed then it stops executing and
goes to sleep. Each call to zio_wait_for_children() will grab the io_lock
while checking the particular child.
In this scenario, the pio has completed the first call to
zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since
the only zio in the zio tree right now is the logical zio, lio, then it
completes that call and prepares to check the next child type.
In the meantime, the lio completes and in its callback creates a child vdev
zio, cio. The zio tree looks like this:
    zio tree:
        pio
         |--- lio
         |--- cio
The lio then grabs the parent's io_lock and removes itself.
    zio tree:
        pio
         |--- cio
The pio continues to run but has already completed its check for ZIO_CHILD_VDEV
and will erroneously complete. When the child zio, cio, completes it will panic
the system trying to reference the parent zio which has been destroyed.
SOLUTION
========
The fix is to rework the zio_wait_for_children() logic to accept a bitfield
for all the children types that it's interested in checking. The
io_lock will is held the entire time we check all the children types. Since
the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided
to allow for the conversion between a ZIO_CHILD type and the bitfield used by
the zio_wiat_for_children logic.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8857
OpenZFS-commit: openzfs/openzfs@862ff6d99c
Issue openzfs#5918
Closes openzfs#7168
tonyhutter pushed a commit that referenced this issue Mar 19, 2018
…nt zio

PROBLEM
=======
It's possible for a parent zio to complete even though it has children
which have not completed. This can result in the following panic:
    > $C
    ffffff01809128c0 vpanic()
    ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
    ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
    ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
    ffffff3373370908)
    ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
    ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
    ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
    ffffff0180912b40 thread_start+8()
    > ::status
    debugging crash dump vmcore.2 (64-bit) from batfs0390
    operating system: 5.11 joyent_20170911T171900Z (i86pc)
    image uuid: (not set)
    panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80
    owner=ffffff3c59b39480 thread=ffffff0180912c40
    dump content: kernel pages only
The problem is that dbuf_prefetch along with l2arc can create a zio tree
which confuses the parent zio and allows it to complete with while children
still exist. Here's the scenario:
    zio tree:
        pio
         |--- lio
The parent zio, pio, has entered the zio_done stage and begins to check its
children to see there are still some that have not completed. In zio_done(),
the children are checked in the following order:
    zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE)
If pio, finds any child which has not completed then it stops executing and
goes to sleep. Each call to zio_wait_for_children() will grab the io_lock
while checking the particular child.
In this scenario, the pio has completed the first call to
zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since
the only zio in the zio tree right now is the logical zio, lio, then it
completes that call and prepares to check the next child type.
In the meantime, the lio completes and in its callback creates a child vdev
zio, cio. The zio tree looks like this:
    zio tree:
        pio
         |--- lio
         |--- cio
The lio then grabs the parent's io_lock and removes itself.
    zio tree:
        pio
         |--- cio
The pio continues to run but has already completed its check for ZIO_CHILD_VDEV
and will erroneously complete. When the child zio, cio, completes it will panic
the system trying to reference the parent zio which has been destroyed.
SOLUTION
========
The fix is to rework the zio_wait_for_children() logic to accept a bitfield
for all the children types that it's interested in checking. The
io_lock will is held the entire time we check all the children types. Since
the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided
to allow for the conversion between a ZIO_CHILD type and the bitfield used by
the zio_wiat_for_children logic.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8857
OpenZFS-commit: openzfs/openzfs@862ff6d99c
Issue #5918
Closes #7168
@DeHackEd
Copy link
Contributor Author

Reporting general stability after 24 days. I'm willing to give this a "probably fixed" at this point. Will re-open if another crash happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants