Multiple parallel rm get deadlocked #4895

Closed
tuomari opened this issue Jul 28, 2016 · 10 comments

@tuomari

tuomari commented Jul 28, 2016

I have multiple rm processes running in parallel on directories (~1 GB and ~6000 files per directory). After a day or so, some of the rm processes get stuck with "INFO: task rm:14138 blocked for more than 120 seconds." The stuck processes cannot be killed and have not finished after ~6 hours. This behaviour has now manifested itself 4-5 times. The filesystem can be written to and deleted from, except for the directories on which the rm processes are stuck.

I am running ZFS HEAD 25458cb with kernel 4.5.5. The files being removed are all on a single filesystem, and the pool layout is as follows:

config:

    NAME                                                       STATE     READ WRITE CKSUM
    tank                                                       ONLINE       0     0     0
      raidz1-0                                                 ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F38DJ3-part4                  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F38MTB-part4                  ONLINE       0     0     0
        ata-ST3000DM001-1ER166_Z5009QEX                        ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N0HRLESN-part4         ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N2SED25L-part4         ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F4CKB1                        ONLINE       0     0     0
      raidz1-1                                                 ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F4REQ1-part4                  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F4S6CX-part4                  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F3XC4M-part4                  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F4S822-part4                  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F4S6D9-part4                  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F4S7Z6-part4                  ONLINE       0     0     0
    logs
      mirror-2                                                 ONLINE       0     0     0
        ata-KINGSTON_SH103S3120G_50026B7334000234-part4        ONLINE       0     0     0
        ata-Kingston_SSDNow_V_Series_128GB_06P970009771-part4  ONLINE       0     0     0
    cache
      ata-KINGSTON_SH103S3120G_50026B7334000234-part5          ONLINE       0     0     0
      ata-Kingston_SSDNow_V_Series_128GB_06P970009771-part5    ONLINE       0     0     0
    spares
      ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3ELL01X                 AVAIL   
      ata-ST3000DM001-9YN166_W1F3AEZG                          AVAIL   

errors: No known data errors

Here are the stack traces of the D-state processes.

Jul 28 21:40:31 helvi kernel: [152906.932673]   task                        PC stack   pid father
Jul 28 21:40:31 helvi kernel: [152906.932818] rm              D ffff880385f1f918     0 14138      1 0x00000004
Jul 28 21:40:31 helvi kernel: [152906.932876]  ffff880385f1f918 ffff88082d92c500 ffff880429510000 ffff880385f20000
Jul 28 21:40:31 helvi kernel: [152906.932963]  0000000000000000 7fffffffffffffff ffff88028f980378 0000000000000001
Jul 28 21:40:31 helvi kernel: [152906.933051]  ffff880385f1f930 ffffffff81679d40 ffff88042fd55300 ffff880385f1f9d0
Jul 28 21:40:31 helvi kernel: [152906.933138] Call Trace:
Jul 28 21:40:31 helvi kernel: [152906.933183]  [<ffffffff81679d40>] schedule+0x30/0x80
Jul 28 21:40:31 helvi kernel: [152906.933232]  [<ffffffff8167d464>] schedule_timeout+0x224/0x2a0
Jul 28 21:40:31 helvi kernel: [152906.933335]  [<ffffffffa0ac68ee>] ? zio_create+0x3be/0x460 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.933387]  [<ffffffff810f6a97>] ? ktime_get+0x37/0xa0
Jul 28 21:40:31 helvi kernel: [152906.933436]  [<ffffffff81679411>] io_schedule_timeout+0xa1/0x110
Jul 28 21:40:31 helvi kernel: [152906.933493]  [<ffffffffa04b2156>] cv_wait_common+0xa6/0x120 [spl]
Jul 28 21:40:31 helvi kernel: [152906.933547]  [<ffffffff810d60b0>] ? prepare_to_wait_event+0xf0/0xf0
Jul 28 21:40:31 helvi kernel: [152906.933602]  [<ffffffffa04b2223>] __cv_wait_io+0x13/0x20 [spl]
Jul 28 21:40:31 helvi kernel: [152906.933688]  [<ffffffffa0ac7948>] zio_wait+0xc8/0x160 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.933758]  [<ffffffffa0a204c7>] dbuf_read+0x637/0x8d0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.933828]  [<ffffffffa0a21e55>] ? dbuf_hold_impl+0x85/0xa0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.933905]  [<ffffffffa0a3a608>] dnode_hold_impl+0xe8/0x710 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.933992]  [<ffffffffa0aba907>] ? zfs_znode_hold_enter+0x117/0x160 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.934066]  [<ffffffffa0a1f711>] ? dbuf_rele+0x31/0x40 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.934141]  [<ffffffffa0a3ac46>] dnode_hold+0x16/0x20 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.934212]  [<ffffffffa0a27c5e>] dmu_bonus_hold+0x1e/0x180 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.934294]  [<ffffffffa0a60b19>] sa_buf_hold+0x9/0x10 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.934379]  [<ffffffffa0abe50e>] zfs_zget+0x11e/0x210 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.934463]  [<ffffffffa0a9dc75>] zfs_dirent_lock+0x535/0x620 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.934550]  [<ffffffffa0a9ddd6>] zfs_dirlook+0x76/0x260 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.934635]  [<ffffffffa0ab1ea6>] zfs_lookup+0x296/0x2e0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.934722]  [<ffffffffa0ace04f>] zpl_lookup+0xaf/0x1b0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.934772]  [<ffffffff8167dd39>] ? _raw_spin_unlock+0x9/0x10
Jul 28 21:40:31 helvi kernel: [152906.934823]  [<ffffffff811dd3de>] ? d_alloc+0x6e/0x80
Jul 28 21:40:31 helvi kernel: [152906.934872]  [<ffffffff811cef08>] lookup_real+0x18/0x50
Jul 28 21:40:31 helvi kernel: [152906.934919]  [<ffffffff811cf30e>] __lookup_hash+0x2e/0x40
Jul 28 21:40:31 helvi kernel: [152906.934968]  [<ffffffff811d4d3d>] do_unlinkat+0x11d/0x2a0
Jul 28 21:40:31 helvi kernel: [152906.935017]  [<ffffffff811d57d6>] SyS_unlinkat+0x16/0x30
Jul 28 21:40:31 helvi kernel: [152906.935065]  [<ffffffff8167e117>] entry_SYSCALL_64_fastpath+0x12/0x6a
Jul 28 21:40:31 helvi kernel: [152906.935118] rm              D ffff88024d023a10     0 31328      1 0x00000004
Jul 28 21:40:31 helvi kernel: [152906.935175]  ffff88024d023a10 ffff88042ddc7300 ffff8804191ac500 ffff88024d024000
Jul 28 21:40:31 helvi kernel: [152906.935262]  ffff8802d2fc25b8 ffff8802d2fc2638 ffff8802d2fc25f8 0000000000000000
Jul 28 21:40:31 helvi kernel: [152906.935349]  ffff88024d023a28 ffffffff81679d40 ffff8802d2fc2630 ffff88024d023a88
Jul 28 21:40:31 helvi kernel: [152906.935436] Call Trace:
Jul 28 21:40:31 helvi kernel: [152906.935475]  [<ffffffff81679d40>] schedule+0x30/0x80
Jul 28 21:40:31 helvi kernel: [152906.935525]  [<ffffffffa04b2196>] cv_wait_common+0xe6/0x120 [spl]
Jul 28 21:40:31 helvi kernel: [152906.935577]  [<ffffffff810d60b0>] ? prepare_to_wait_event+0xf0/0xf0
Jul 28 21:40:31 helvi kernel: [152906.935632]  [<ffffffffa04b21e0>] __cv_wait+0x10/0x20 [spl]
Jul 28 21:40:31 helvi kernel: [152906.935701]  [<ffffffffa0a20562>] dbuf_read+0x6d2/0x8d0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.935754]  [<ffffffffa04aac95>] ? spl_kmem_free+0x25/0x30 [spl]
Jul 28 21:40:31 helvi kernel: [152906.935825]  [<ffffffffa0a21e55>] ? dbuf_hold_impl+0x85/0xa0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.935902]  [<ffffffffa0a3a608>] dnode_hold_impl+0xe8/0x710 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.935989]  [<ffffffffa0aba907>] ? zfs_znode_hold_enter+0x117/0x160 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.936063]  [<ffffffffa0a1f711>] ? dbuf_rele+0x31/0x40 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.936138]  [<ffffffffa0a3ac46>] dnode_hold+0x16/0x20 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.936209]  [<ffffffffa0a27c5e>] dmu_bonus_hold+0x1e/0x180 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.936291]  [<ffffffffa0a60b19>] sa_buf_hold+0x9/0x10 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.936376]  [<ffffffffa0abe50e>] zfs_zget+0x11e/0x210 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.936461]  [<ffffffffa0a9dc75>] zfs_dirent_lock+0x535/0x620 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.936547]  [<ffffffffa0a9ddd6>] zfs_dirlook+0x76/0x260 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.936633]  [<ffffffffa0ab1ea6>] zfs_lookup+0x296/0x2e0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.936728]  [<ffffffffa0ace04f>] zpl_lookup+0xaf/0x1b0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.936780]  [<ffffffff8167dd39>] ? _raw_spin_unlock+0x9/0x10
Jul 28 21:40:31 helvi kernel: [152906.936829]  [<ffffffff811dd3de>] ? d_alloc+0x6e/0x80
Jul 28 21:40:31 helvi kernel: [152906.936876]  [<ffffffff811cef08>] lookup_real+0x18/0x50
Jul 28 21:40:31 helvi kernel: [152906.936924]  [<ffffffff811cf30e>] __lookup_hash+0x2e/0x40
Jul 28 21:40:31 helvi kernel: [152906.939432]  [<ffffffff811d4d3d>] do_unlinkat+0x11d/0x2a0
Jul 28 21:40:31 helvi kernel: [152906.939481]  [<ffffffff811d57d6>] SyS_unlinkat+0x16/0x30
Jul 28 21:40:31 helvi kernel: [152906.939528]  [<ffffffff8167e117>] entry_SYSCALL_64_fastpath+0x12/0x6a
Jul 28 21:40:31 helvi kernel: [152906.939580] rm              D ffff88038afe3a10     0 17025      1 0x00000004
Jul 28 21:40:31 helvi kernel: [152906.939637]  ffff88038afe3a10 ffff88082d928000 ffff8808263e3980 ffff88038afe4000
Jul 28 21:40:31 helvi kernel: [152906.939724]  ffff8802d2fc25b8 ffff8802d2fc2638 ffff8802d2fc25f8 0000000000000000
Jul 28 21:40:31 helvi kernel: [152906.939811]  ffff88038afe3a28 ffffffff81679d40 ffff8802d2fc2630 ffff88038afe3a88
Jul 28 21:40:31 helvi kernel: [152906.939898] Call Trace:
Jul 28 21:40:31 helvi kernel: [152906.939937]  [<ffffffff81679d40>] schedule+0x30/0x80
Jul 28 21:40:31 helvi kernel: [152906.939988]  [<ffffffffa04b2196>] cv_wait_common+0xe6/0x120 [spl]
Jul 28 21:40:31 helvi kernel: [152906.940039]  [<ffffffff810d60b0>] ? prepare_to_wait_event+0xf0/0xf0
Jul 28 21:40:31 helvi kernel: [152906.940094]  [<ffffffffa04b21e0>] __cv_wait+0x10/0x20 [spl]
Jul 28 21:40:31 helvi kernel: [152906.940164]  [<ffffffffa0a20562>] dbuf_read+0x6d2/0x8d0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.940216]  [<ffffffffa04aac95>] ? spl_kmem_free+0x25/0x30 [spl]
Jul 28 21:40:31 helvi kernel: [152906.940287]  [<ffffffffa0a21e55>] ? dbuf_hold_impl+0x85/0xa0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.940364]  [<ffffffffa0a3a608>] dnode_hold_impl+0xe8/0x710 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.940452]  [<ffffffffa0aba907>] ? zfs_znode_hold_enter+0x117/0x160 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.940526]  [<ffffffffa0a1f711>] ? dbuf_rele+0x31/0x40 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.940602]  [<ffffffffa0a3ac46>] dnode_hold+0x16/0x20 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.940673]  [<ffffffffa0a27c5e>] dmu_bonus_hold+0x1e/0x180 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.940795]  [<ffffffffa0a60b19>] sa_buf_hold+0x9/0x10 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.940882]  [<ffffffffa0abe50e>] zfs_zget+0x11e/0x210 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.940967]  [<ffffffffa0a9dc75>] zfs_dirent_lock+0x535/0x620 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.941055]  [<ffffffffa0a9ddd6>] zfs_dirlook+0x76/0x260 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.941141]  [<ffffffffa0ab1ea6>] zfs_lookup+0x296/0x2e0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.941227]  [<ffffffffa0ace04f>] zpl_lookup+0xaf/0x1b0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.941278]  [<ffffffff8167dd39>] ? _raw_spin_unlock+0x9/0x10
Jul 28 21:40:31 helvi kernel: [152906.941327]  [<ffffffff811dd3de>] ? d_alloc+0x6e/0x80
Jul 28 21:40:31 helvi kernel: [152906.941374]  [<ffffffff811cef08>] lookup_real+0x18/0x50
Jul 28 21:40:31 helvi kernel: [152906.941421]  [<ffffffff811cf30e>] __lookup_hash+0x2e/0x40
Jul 28 21:40:31 helvi kernel: [152906.941470]  [<ffffffff811d4d3d>] do_unlinkat+0x11d/0x2a0
Jul 28 21:40:31 helvi kernel: [152906.941519]  [<ffffffff811d57d6>] SyS_unlinkat+0x16/0x30
Jul 28 21:40:31 helvi kernel: [152906.941566]  [<ffffffff8167e117>] entry_SYSCALL_64_fastpath+0x12/0x6a
Jul 28 21:40:31 helvi kernel: [152906.941618] rm              D ffff8802b7913a10     0   985      1 0x00000004
Jul 28 21:40:31 helvi kernel: [152906.941675]  ffff8802b7913a10 ffff88082aec4500 ffff88042d9ed080 ffff8802b7914000
Jul 28 21:40:31 helvi kernel: [152906.941762]  ffff8802d2fc25b8 ffff8802d2fc2638 ffff8802d2fc25f8 0000000000000000
Jul 28 21:40:31 helvi kernel: [152906.941849]  ffff8802b7913a28 ffffffff81679d40 ffff8802d2fc2630 ffff8802b7913a88
Jul 28 21:40:31 helvi kernel: [152906.941937] Call Trace:
Jul 28 21:40:31 helvi kernel: [152906.941975]  [<ffffffff81679d40>] schedule+0x30/0x80
Jul 28 21:40:31 helvi kernel: [152906.942026]  [<ffffffffa04b2196>] cv_wait_common+0xe6/0x120 [spl]
Jul 28 21:40:31 helvi kernel: [152906.942077]  [<ffffffff810d60b0>] ? prepare_to_wait_event+0xf0/0xf0
Jul 28 21:40:31 helvi kernel: [152906.942132]  [<ffffffffa04b21e0>] __cv_wait+0x10/0x20 [spl]
Jul 28 21:40:31 helvi kernel: [152906.942202]  [<ffffffffa0a20562>] dbuf_read+0x6d2/0x8d0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.942255]  [<ffffffffa04aac95>] ? spl_kmem_free+0x25/0x30 [spl]
Jul 28 21:40:31 helvi kernel: [152906.942326]  [<ffffffffa0a21e55>] ? dbuf_hold_impl+0x85/0xa0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.942403]  [<ffffffffa0a3a608>] dnode_hold_impl+0xe8/0x710 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.942490]  [<ffffffffa0aba907>] ? zfs_znode_hold_enter+0x117/0x160 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.942565]  [<ffffffffa0a1f711>] ? dbuf_rele+0x31/0x40 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.942640]  [<ffffffffa0a3ac46>] dnode_hold+0x16/0x20 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.942712]  [<ffffffffa0a27c5e>] dmu_bonus_hold+0x1e/0x180 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.942794]  [<ffffffffa0a60b19>] sa_buf_hold+0x9/0x10 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.942879]  [<ffffffffa0abe50e>] zfs_zget+0x11e/0x210 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.942964]  [<ffffffffa0a9dc75>] zfs_dirent_lock+0x535/0x620 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.943050]  [<ffffffffa0a9ddd6>] zfs_dirlook+0x76/0x260 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.943136]  [<ffffffffa0ab1ea6>] zfs_lookup+0x296/0x2e0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.943223]  [<ffffffffa0ace04f>] zpl_lookup+0xaf/0x1b0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.943272]  [<ffffffff8167dd39>] ? _raw_spin_unlock+0x9/0x10
Jul 28 21:40:31 helvi kernel: [152906.943322]  [<ffffffff811dd3de>] ? d_alloc+0x6e/0x80
Jul 28 21:40:31 helvi kernel: [152906.943369]  [<ffffffff811cef08>] lookup_real+0x18/0x50
Jul 28 21:40:31 helvi kernel: [152906.943416]  [<ffffffff811cf30e>] __lookup_hash+0x2e/0x40
Jul 28 21:40:31 helvi kernel: [152906.943464]  [<ffffffff811d4d3d>] do_unlinkat+0x11d/0x2a0
Jul 28 21:40:31 helvi kernel: [152906.943513]  [<ffffffff811d57d6>] SyS_unlinkat+0x16/0x30
Jul 28 21:40:31 helvi kernel: [152906.943560]  [<ffffffff8167e117>] entry_SYSCALL_64_fastpath+0x12/0x6a
Jul 28 21:40:31 helvi kernel: [152906.943612] rm              D ffff8801eca73918     0  9480      1 0x00000004
Jul 28 21:40:31 helvi kernel: [152906.943669]  ffff8801eca73918 ffff88042ddc7300 ffff8802567d8000 ffff8801eca74000
Jul 28 21:40:31 helvi kernel: [152906.943756]  0000000000000000 7fffffffffffffff ffff88039728cc98 0000000000000001
Jul 28 21:40:31 helvi kernel: [152906.943843]  ffff8801eca73930 ffffffff81679d40 ffff88042fc95300 ffff8801eca739d0
Jul 28 21:40:31 helvi kernel: [152906.943930] Call Trace:
Jul 28 21:40:31 helvi kernel: [152906.943968]  [<ffffffff81679d40>] schedule+0x30/0x80
Jul 28 21:40:31 helvi kernel: [152906.944016]  [<ffffffff8167d464>] schedule_timeout+0x224/0x2a0
Jul 28 21:40:31 helvi kernel: [152906.944102]  [<ffffffffa0ac68ee>] ? zio_create+0x3be/0x460 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.944153]  [<ffffffff810f6a97>] ? ktime_get+0x37/0xa0
Jul 28 21:40:31 helvi kernel: [152906.944201]  [<ffffffff81679411>] io_schedule_timeout+0xa1/0x110
Jul 28 21:40:31 helvi kernel: [152906.944255]  [<ffffffffa04b2156>] cv_wait_common+0xa6/0x120 [spl]
Jul 28 21:40:31 helvi kernel: [152906.944306]  [<ffffffff810d60b0>] ? prepare_to_wait_event+0xf0/0xf0
Jul 28 21:40:31 helvi kernel: [152906.944361]  [<ffffffffa04b2223>] __cv_wait_io+0x13/0x20 [spl]
Jul 28 21:40:31 helvi kernel: [152906.944447]  [<ffffffffa0ac7948>] zio_wait+0xc8/0x160 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.944517]  [<ffffffffa0a204c7>] dbuf_read+0x637/0x8d0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.944588]  [<ffffffffa0a21e55>] ? dbuf_hold_impl+0x85/0xa0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.944664]  [<ffffffffa0a3a608>] dnode_hold_impl+0xe8/0x710 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.944769]  [<ffffffffa0aba907>] ? zfs_znode_hold_enter+0x117/0x160 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.944846]  [<ffffffffa0a1f711>] ? dbuf_rele+0x31/0x40 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.944922]  [<ffffffffa0a3ac46>] dnode_hold+0x16/0x20 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.944993]  [<ffffffffa0a27c5e>] dmu_bonus_hold+0x1e/0x180 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.945076]  [<ffffffffa0a60b19>] sa_buf_hold+0x9/0x10 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.945161]  [<ffffffffa0abe50e>] zfs_zget+0x11e/0x210 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.945245]  [<ffffffffa0a9dc75>] zfs_dirent_lock+0x535/0x620 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.945332]  [<ffffffffa0a9ddd6>] zfs_dirlook+0x76/0x260 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.945418]  [<ffffffffa0ab1ea6>] zfs_lookup+0x296/0x2e0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.945504]  [<ffffffffa0ace04f>] zpl_lookup+0xaf/0x1b0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.945554]  [<ffffffff8167dd39>] ? _raw_spin_unlock+0x9/0x10
Jul 28 21:40:31 helvi kernel: [152906.945604]  [<ffffffff811dd3de>] ? d_alloc+0x6e/0x80
Jul 28 21:40:31 helvi kernel: [152906.945651]  [<ffffffff811cef08>] lookup_real+0x18/0x50
Jul 28 21:40:31 helvi kernel: [152906.945698]  [<ffffffff811cf30e>] __lookup_hash+0x2e/0x40
Jul 28 21:40:31 helvi kernel: [152906.945746]  [<ffffffff811d4d3d>] do_unlinkat+0x11d/0x2a0
Jul 28 21:40:31 helvi kernel: [152906.945795]  [<ffffffff811d57d6>] SyS_unlinkat+0x16/0x30
Jul 28 21:40:31 helvi kernel: [152906.945842]  [<ffffffff8167e117>] entry_SYSCALL_64_fastpath+0x12/0x6a
Jul 28 21:40:31 helvi kernel: [152906.945897] rm              D ffff880030bd7918     0  8683      1 0x00000004
Jul 28 21:40:31 helvi kernel: [152906.945954]  ffff880030bd7918 ffff88082d928000 ffff88082c3e6780 ffff880030bd8000
Jul 28 21:40:31 helvi kernel: [152906.946041]  0000000000000000 7fffffffffffffff ffff88012784af58 0000000000000001
Jul 28 21:40:31 helvi kernel: [152906.946129]  ffff880030bd7930 ffffffff81679d40 ffff88042fcd5300 ffff880030bd79d0
Jul 28 21:40:31 helvi kernel: [152906.946216] Call Trace:
Jul 28 21:40:31 helvi kernel: [152906.946255]  [<ffffffff81679d40>] schedule+0x30/0x80
Jul 28 21:40:31 helvi kernel: [152906.946302]  [<ffffffff8167d464>] schedule_timeout+0x224/0x2a0
Jul 28 21:40:31 helvi kernel: [152906.946388]  [<ffffffffa0ac68ee>] ? zio_create+0x3be/0x460 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.946439]  [<ffffffff810f6a97>] ? ktime_get+0x37/0xa0
Jul 28 21:40:31 helvi kernel: [152906.946488]  [<ffffffff81679411>] io_schedule_timeout+0xa1/0x110
Jul 28 21:40:31 helvi kernel: [152906.946542]  [<ffffffffa04b2156>] cv_wait_common+0xa6/0x120 [spl]
Jul 28 21:40:31 helvi kernel: [152906.946593]  [<ffffffff810d60b0>] ? prepare_to_wait_event+0xf0/0xf0
Jul 28 21:40:31 helvi kernel: [152906.946648]  [<ffffffffa04b2223>] __cv_wait_io+0x13/0x20 [spl]
Jul 28 21:40:31 helvi kernel: [152906.946734]  [<ffffffffa0ac7948>] zio_wait+0xc8/0x160 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.946804]  [<ffffffffa0a204c7>] dbuf_read+0x637/0x8d0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.946875]  [<ffffffffa0a21e55>] ? dbuf_hold_impl+0x85/0xa0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.946952]  [<ffffffffa0a3a608>] dnode_hold_impl+0xe8/0x710 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.947039]  [<ffffffffa0aba907>] ? zfs_znode_hold_enter+0x117/0x160 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.947114]  [<ffffffffa0a1f711>] ? dbuf_rele+0x31/0x40 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.947189]  [<ffffffffa0a3ac46>] dnode_hold+0x16/0x20 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.947261]  [<ffffffffa0a27c5e>] dmu_bonus_hold+0x1e/0x180 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.947343]  [<ffffffffa0a60b19>] sa_buf_hold+0x9/0x10 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.947428]  [<ffffffffa0abe50e>] zfs_zget+0x11e/0x210 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.947513]  [<ffffffffa0a9dc75>] zfs_dirent_lock+0x535/0x620 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.947599]  [<ffffffffa0a9ddd6>] zfs_dirlook+0x76/0x260 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.947685]  [<ffffffffa0ab1ea6>] zfs_lookup+0x296/0x2e0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.947772]  [<ffffffffa0ace04f>] zpl_lookup+0xaf/0x1b0 [zfs]
Jul 28 21:40:31 helvi kernel: [152906.947821]  [<ffffffff8167dd39>] ? _raw_spin_unlock+0x9/0x10
Jul 28 21:40:31 helvi kernel: [152906.947871]  [<ffffffff811dd3de>] ? d_alloc+0x6e/0x80
Jul 28 21:40:31 helvi kernel: [152906.947918]  [<ffffffff811cef08>] lookup_real+0x18/0x50
Jul 28 21:40:31 helvi kernel: [152906.947965]  [<ffffffff811cf30e>] __lookup_hash+0x2e/0x40
Jul 28 21:40:31 helvi kernel: [152906.948013]  [<ffffffff811d4d3d>] do_unlinkat+0x11d/0x2a0
Jul 28 21:40:31 helvi kernel: [152906.948061]  [<ffffffff811d57d6>] SyS_unlinkat+0x16/0x30
Jul 28 21:40:31 helvi kernel: [152906.948109]  [<ffffffff8167e117>] entry_SYSCALL_64_fastpath+0x12/0x6a
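
For reference, stack traces like these for all uninterruptible (D-state) tasks can usually be dumped on demand via the kernel's sysrq facility, without waiting for the 120-second hung-task watchdog to fire (requires root, and sysrq must be enabled via the kernel.sysrq sysctl):

    # Dump blocked-task stacks into the kernel ring buffer, then read them back
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200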
@dweeezil
Contributor

@tuomari I'd start by checking zpool events for any delayed IO and then follow up with zpool events -v if there are any.
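
A minimal sketch of that check, assuming the pool name tank from the layout above:

    # List recent events and look for delayed-IO reports
    zpool events | grep -i delay

    # If any show up, dump the full event details
    zpool events -v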

@tuomari
Author

tuomari commented Jul 29, 2016

Unfortunately, I have already rebooted the system. I will report back when the problem comes up again.

Also, I noticed that the deadlocked directories were created at roughly the same time. The video surveillance software creates a directory per camera for every 10 minutes of data, in the format camid/yy/mm/dd/hh/MM. The stuck processes were removing directories for different cameras, but from the same 10-minute slot:

root     14138  0.0  0.0   7556  1316 tty1     D    Jul28   0:00 /bin/rm -rf 13/16/07/06/12/30/00
root     31328  0.0  0.0   7160  1248 tty1     D    Jul28   0:00 /bin/rm -rf 18/16/07/06/12/30/00
root     17025  0.0  0.0   7556  1300 ?        D    Jul28   0:00 /bin/rm -rf 14/16/07/06/12/30/00
root       985  0.0  0.0   7160  1296 ?        D    Jul28   0:00 /bin/rm -rf 17/16/07/06/12/30/00
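
In other words, the workload amounts to something like this (a reconstruction using the camera IDs and slot from the listing above, not the actual cleanup script):

    # One rm per camera, all targeting the same 10-minute slot, run in parallel
    for cam in 13 14 17 18; do
        /bin/rm -rf "$cam/16/07/06/12/30/00" &
    done
    wait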

@dweeezil
Contributor

@tuomari If you're running zed, it should record some basic information about events in your syslog. Maybe something interesting wound up there.
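
A quick way to check both (a sketch; the unit name and log path vary by distro):

    systemctl status zfs-zed       # is the ZFS event daemon running?
    grep -i zed /var/log/syslog    # anything it logged about events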

The problem you're seeing may be exacerbated by, but shouldn't be caused by, the new dnode shedding logic. Some of the rm instances are waiting for an IO to complete, while others are waiting for a signal that will be delivered via the dbuf_read_done() callback once an IO completes (likely the same IOs the first group is waiting on). So long as the IOs complete, I don't see any reason why this should deadlock, which is why I suggested looking for delayed IO operations.
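
While the rm processes are stuck, per-vdev statistics can help spot an IO that never completes (a suggestion beyond the advice in the thread; assumes the pool name tank):

    # Per-vdev IO statistics every 5 seconds; a vdev with queued operations
    # but no throughput for long stretches is a likely culprit
    zpool iostat -v tank 5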

@tuomari
Author

tuomari commented Jul 29, 2016

Unfortunately, I was not running zed at the time. I will report back as soon as I have another incident, which may take a few days.

@tuomari
Author

tuomari commented Jul 30, 2016

Hmm... I tried to reproduce the problem by removing an hour's worth of directories at once (88 in total), and got this. I'm not sure whether it's related:

[163017.870395] BUG: unable to handle kernel paging request at 00000000000ff88e
[163017.870527] IP: [<ffffffffa05f9ed1>] dmu_objset_pool+0x1/0x30 [zfs]
[163017.870710] PGD 0 
[163017.870811] Oops: 0000 [#1] SMP 
[163017.870895] Modules linked in: binfmt_misc target_core_mod configfs 8021q garp stp llc xt_multiport iptable_filter ip_tables x_tables openvswitch nf_defrag_ipv6 nf_conntrack sit tunnel4 ip_tunnel nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc deflate ctr twofish_generic twofish_x86_64_3way xts twofish_x86_64 twofish_common camellia_generic serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cbc xcbc rmd160 sha512_generic sha1_ssse3 sha1_generic af_key xfrm_algo gpio_ich iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 ablk_helper cryptd lrw gf128mul glue_helper pcspkr serio_raw zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) ast bttv
[163017.872509]  ttm tveeprom videobuf_dma_sg drm_kms_helper tea575x videobuf_core syscopyarea rc_core sysfillrect sysimgblt v4l2_common fb_sys_fops videodev media i2c_i801 drm agpgart lpc_ich mfd_core btrfs snd_hda_intel snd_hda_codec snd_hda_core snd_pcm snd_timer snd soundcore ioatdma i7core_edac edac_core i5500_temp dca ipmi_si ipmi_msghandler evdev shpchp acpi_cpufreq tpm_tis tpm w83795 w83627ehf hwmon_vid coretemp loop autofs4 hid_generic usbhid hid crc32c_intel psmouse uhci_hcd ehci_pci ehci_hcd usbcore usb_common sg sd_mod netxen_nic
[163017.873683] CPU: 13 PID: 4361 Comm: z_wr_int_1 Tainted: P           O    4.5.5.remod.kvm.ovs.1 #13
[163017.873934] Hardware name: System manufacturer System Product Name/Z8NA-D6(C), BIOS 1303    05/10/2012
[163017.874172] task: ffff88082a8a3980 ti: ffff8808245ac000 task.ti: ffff8808245ac000
[163017.874413] RIP: 0010:[<ffffffffa05f9ed1>]  [<ffffffffa05f9ed1>] dmu_objset_pool+0x1/0x30 [zfs]
[163017.874676] RSP: 0018:ffff8808245afc08  EFLAGS: 00010286
[163017.874819] RAX: ffff880808f26000 RBX: ffff8804e6d36660 RCX: ffffffffa05eb9f0
[163017.875047] RDX: ffff880808f266c0 RSI: ffff8807fa794cc0 RDI: 00000000000ff88e
[163017.875292] RBP: ffff8808245afc28 R08: 0000000000000000 R09: ffff88047082da30
[163017.875521] R10: 0000000000000000 R11: ffff880606c1c6b2 R12: ffff880808f266c0
[163017.875767] R13: 0000000000000100 R14: ffff8806c3d752e0 R15: ffff88041358e008
[163017.876004] FS:  0000000000000000(0000) GS:ffff88083fd40000(0000) knlGS:0000000000000000
[163017.876236] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[163017.876380] CR2: 00000000000ff88e CR3: 00000003fa43a000 CR4: 00000000000006e0
[163017.876597] Stack:
[163017.876701]  ffff8808245afc28 ffffffffa05eba06 ffff8806c3d752e0 ffff88041358c000
[163017.876908]  ffff8808245afc38 ffffffffa05e01bd ffff8808245afc60 ffffffffa0691527
[163017.877172]  0000000000080000 00000000ffffffff 00000000000a0000 ffff8808245afc90
[163017.877410] Call Trace:
[163017.877548]  [<ffffffffa05eba06>] ? dbuf_write_physdone+0x16/0x50 [zfs]
[163017.877709]  [<ffffffffa05e01bd>] arc_write_physdone+0x1d/0x20 [zfs]
[163017.877907]  [<ffffffffa0691527>] zio_vdev_io_assess+0x97/0x1c0 [zfs]
[163017.878101]  [<ffffffffa06929b0>] zio_execute+0x90/0xf0 [zfs]
[163017.878282]  [<ffffffffa065256a>] vdev_queue_io_to_issue+0x2aa/0x8b0 [zfs]
[163017.878461]  [<ffffffffa065308a>] vdev_queue_io_done+0x17a/0x220 [zfs]
[163017.878662]  [<ffffffffa06916f1>] zio_vdev_io_done+0xa1/0x190 [zfs]
[163017.878849]  [<ffffffffa06929b0>] zio_execute+0x90/0xf0 [zfs]
[163017.879004]  [<ffffffffa02be41d>] taskq_thread+0x25d/0x490 [spl]
[163017.879152]  [<ffffffff810c49c0>] ? wake_up_q+0x70/0x70
[163017.879266]  [<ffffffffa02be1c0>] ? taskq_cancel_id+0x110/0x110 [spl]
[163017.879427]  [<ffffffff810bb0d6>] kthread+0xd6/0xf0
[163017.879565]  [<ffffffff811c7999>] ? ____fput+0x9/0x10
[163017.879683]  [<ffffffff810bb000>] ? kthread_create_on_node+0x190/0x190
[163017.879837]  [<ffffffff8167e45f>] ret_from_fork+0x3f/0x70
[163017.879962]  [<ffffffff810bb000>] ? kthread_create_on_node+0x190/0x190
[163017.880098] Code: 7d d8 48 c7 c6 80 c4 6d a0 89 c3 e8 5a 60 04 00 89 d8 48 83 c4 08 5b 41 5c 41 5d 41 5e 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 55 <48> 8b 07 48 89 e5 48 85 c0 74 15 48 8b 80 d8 00 00 00 48 85 c0 
[163017.880676] RIP  [<ffffffffa05f9ed1>] dmu_objset_pool+0x1/0x30 [zfs]
[163017.880827]  RSP <ffff8808245afc08>
[163017.880950] CR2: 00000000000ff88e
[163017.881484] ---[ end trace 50357d60d4e40176 ]---
[163018.314825] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[163018.315097] IP: [<ffffffffa05e01a7>] arc_write_physdone+0x7/0x20 [zfs]
[163018.315314] PGD 0 
[163018.315463] Oops: 0000 [#2] SMP 
[163018.315651] Modules linked in: binfmt_misc target_core_mod configfs 8021q garp stp llc xt_multiport iptable_filter ip_tables x_tables openvswitch nf_defrag_ipv6 nf_conntrack sit tunnel4 ip_tunnel nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc deflate ctr twofish_generic twofish_x86_64_3way xts twofish_x86_64 twofish_common camellia_generic serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cbc xcbc rmd160 sha512_generic sha1_ssse3 sha1_generic af_key xfrm_algo gpio_ich iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 ablk_helper cryptd lrw gf128mul glue_helper pcspkr serio_raw zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) ast bttv
[163018.319790]  ttm tveeprom videobuf_dma_sg drm_kms_helper tea575x videobuf_core syscopyarea rc_core sysfillrect sysimgblt v4l2_common fb_sys_fops videodev media i2c_i801 drm agpgart lpc_ich mfd_core btrfs snd_hda_intel snd_hda_codec snd_hda_core snd_pcm snd_timer snd soundcore ioatdma i7core_edac edac_core i5500_temp dca ipmi_si ipmi_msghandler evdev shpchp acpi_cpufreq tpm_tis tpm w83795 w83627ehf hwmon_vid coretemp loop autofs4 hid_generic usbhid hid crc32c_intel psmouse uhci_hcd ehci_pci ehci_hcd usbcore usb_common sg sd_mod netxen_nic
[163018.322809] CPU: 7 PID: 4367 Comm: z_wr_int_7 Tainted: P      D    O    4.5.5.remod.kvm.ovs.1 #13
[163018.322962] Hardware name: System manufacturer System Product Name/Z8NA-D6(C), BIOS 1303    05/10/2012
[163018.323098] task: ffff88082a8a5c00 ti: ffff880826218000 task.ti: ffff880826218000
[163018.323250] RIP: 0010:[<ffffffffa05e01a7>]  [<ffffffffa05e01a7>] arc_write_physdone+0x7/0x20 [zfs]
[163018.323453] RSP: 0018:ffff88082621bdb0  EFLAGS: 00010286
[163018.323554] RAX: 0000000000000000 RBX: ffff8806c3d72be0 RCX: 0000000000000000
[163018.323691] RDX: ffff8806c3d72f38 RSI: 0000000000000000 RDI: ffff8804e6d36660
[163018.323831] RBP: ffff88082621bdd0 R08: 0000000000000000 R09: ffff880416801f00
[163018.323971] R10: 0000000000000000 R11: 00000000000ffe8c R12: ffff88041366c000
[163018.324120] R13: 0000000000000100 R14: ffff8806c3d72be0 R15: ffff880427ab1ea0
[163018.324244] FS:  0000000000000000(0000) GS:ffff88083fcc0000(0000) knlGS:0000000000000000
[163018.324399] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[163018.324503] CR2: 0000000000000018 CR3: 0000000001c0a000 CR4: 00000000000006e0
[163018.324624] Stack:
[163018.324743]  ffffffffa0691527 0000000000080000 00000000ffffffff 00000000000a0000
[163018.325031]  ffff88082621be00 ffffffffa06929b0 ffff88082a8a5c00 0000000000000000
[163018.325324]  0000000000000246 ffff880413614a00 ffff88082621bea8 ffffffffa02be41d
[163018.325577] Call Trace:
[163018.325691]  [<ffffffffa0691527>] ? zio_vdev_io_assess+0x97/0x1c0 [zfs]
[163018.325813]  [<ffffffffa06929b0>] zio_execute+0x90/0xf0 [zfs]
[163018.325905]  [<ffffffffa02be41d>] taskq_thread+0x25d/0x490 [spl]
[163018.325991]  [<ffffffff810c49c0>] ? wake_up_q+0x70/0x70
[163018.326074]  [<ffffffffa02be1c0>] ? taskq_cancel_id+0x110/0x110 [spl]
[163018.326160]  [<ffffffff810bb0d6>] kthread+0xd6/0xf0
[163018.326238]  [<ffffffff810bb000>] ? kthread_create_on_node+0x190/0x190
[163018.326323]  [<ffffffff8167e45f>] ret_from_fork+0x3f/0x70
[163018.326404]  [<ffffffff810bb000>] ? kthread_create_on_node+0x190/0x190
[163018.326487] Code: 1f 44 00 00 48 8b 87 68 01 00 00 55 48 89 e5 48 8b 70 28 48 8b 10 ff 50 10 5d c3 66 0f 1f 84 00 00 00 00 00 48 8b 87 68 01 00 00 <48> 8b 48 18 48 85 c9 74 0e 55 48 8b 70 28 48 89 e5 48 8b 10 ff 
[163018.328917] RIP  [<ffffffffa05e01a7>] arc_write_physdone+0x7/0x20 [zfs]
[163018.329055]  RSP <ffff88082621bdb0>
[163018.329128] CR2: 0000000000000018
[163018.329202] ---[ end trace 50357d60d4e40177 ]---
[163018.359096] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[163018.361785] IP: [<ffffffffa05e01a7>] arc_write_physdone+0x7/0x20 [zfs]
[163018.361958] PGD 0 
[163018.362118] Oops: 0000 [#3] SMP 
[163018.362260] Modules linked in: binfmt_misc target_core_mod configfs 8021q garp stp llc xt_multiport iptable_filter ip_tables x_tables openvswitch nf_defrag_ipv6 nf_conntrack sit tunnel4 ip_tunnel nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc deflate ctr twofish_generic twofish_x86_64_3way xts twofish_x86_64 twofish_common camellia_generic serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cbc xcbc rmd160 sha512_generic sha1_ssse3 sha1_generic af_key xfrm_algo gpio_ich iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 ablk_helper cryptd lrw gf128mul glue_helper pcspkr serio_raw zfs(PO) zunicode(PO) zavl(PO) zcommon(PO) znvpair(PO) spl(O) ast bttv
[163018.365753]  ttm tveeprom videobuf_dma_sg drm_kms_helper tea575x videobuf_core syscopyarea rc_core sysfillrect sysimgblt v4l2_common fb_sys_fops videodev media i2c_i801 drm agpgart lpc_ich mfd_core btrfs snd_hda_intel snd_hda_codec snd_hda_core snd_pcm snd_timer snd soundcore ioatdma i7core_edac edac_core i5500_temp dca ipmi_si ipmi_msghandler evdev shpchp acpi_cpufreq tpm_tis tpm w83795 w83627ehf hwmon_vid coretemp loop autofs4 hid_generic usbhid hid crc32c_intel psmouse uhci_hcd ehci_pci ehci_hcd usbcore usb_common sg sd_mod netxen_nic
[163018.368477] CPU: 15 PID: 4366 Comm: z_wr_int_6 Tainted: P      D    O    4.5.5.remod.kvm.ovs.1 #13
[163018.368627] Hardware name: System manufacturer System Product Name/Z8NA-D6(C), BIOS 1303    05/10/2012
[163018.368782] task: ffff88082a8a1700 ti: ffff880824594000 task.ti: ffff880824594000
[163018.368916] RIP: 0010:[<ffffffffa05e01a7>]  [<ffffffffa05e01a7>] arc_write_physdone+0x7/0x20 [zfs]
[163018.369150] RSP: 0018:ffff880824597db0  EFLAGS: 00010286
[163018.369233] RAX: 0000000000000000 RBX: ffff8804f6d1c920 RCX: 0000000000000000
[163018.369380] RDX: ffff8804f6d1cc78 RSI: 0000000000000000 RDI: ffff8804e6d36660
[163018.369523] RBP: ffff880824597dd0 R08: 0000000000000000 R09: ffff88082cecbd40
[163018.369644] R10: 0000000000000000 R11: 00000000000ffb48 R12: ffff880413564000
[163018.369803] R13: 0000000000000100 R14: ffff8804f6d1c920 R15: ffff880427ab1e40
[163018.369945] FS:  0000000000000000(0000) GS:ffff88083fdc0000(0000) knlGS:0000000000000000
[163018.370084] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[163018.370190] CR2: 0000000000000018 CR3: 0000000001c0a000 CR4: 00000000000006e0
[163018.370336] Stack:
[163018.370413]  ffffffffa0691527 0000000000080000 00000000ffffffff 00000000000a0000
[163018.370719]  ffff880824597e00 ffffffffa06929b0 ffff88082a8a1700 0000000000000000
[163018.371009]  0000000000000246 ffff880413614800 ffff880824597ea8 ffffffffa02be41d
[163018.371304] Call Trace:
[163018.371443]  [<ffffffffa0691527>] ? zio_vdev_io_assess+0x97/0x1c0 [zfs]
[163018.371591]  [<ffffffffa06929b0>] zio_execute+0x90/0xf0 [zfs]
[163018.371711]  [<ffffffffa02be41d>] taskq_thread+0x25d/0x490 [spl]
[163018.371821]  [<ffffffff810c49c0>] ? wake_up_q+0x70/0x70
[163018.371908]  [<ffffffffa02be1c0>] ? taskq_cancel_id+0x110/0x110 [spl]
[163018.372016]  [<ffffffff810bb0d6>] kthread+0xd6/0xf0
[163018.372123]  [<ffffffff810bb000>] ? kthread_create_on_node+0x190/0x190
[163018.372226]  [<ffffffff8167e45f>] ret_from_fork+0x3f/0x70
[163018.372311]  [<ffffffff810bb000>] ? kthread_create_on_node+0x190/0x190
[163018.372417] Code: 1f 44 00 00 48 8b 87 68 01 00 00 55 48 89 e5 48 8b 70 28 48 8b 10 ff 50 10 5d c3 66 0f 1f 84 00 00 00 00 00 48 8b 87 68 01 00 00 <48> 8b 48 18 48 85 c9 74 0e 55 48 8b 70 28 48 89 e5 48 8b 10 ff 
[163018.375138] RIP  [<ffffffffa05e01a7>] arc_write_physdone+0x7/0x20 [zfs]
[163018.375319]  RSP <ffff880824597db0>
[163018.375401] CR2: 0000000000000018
[163018.375489] ---[ end trace 50357d60d4e40178 ]---

@tuxoko
Contributor

tuxoko commented Aug 1, 2016

The new stack trace looks similar to #4737
Note that the symbol names are all wrong in #4737.

@behlendorf behlendorf added this to the 0.7.0 milestone Aug 1, 2016
@tuomari
Author

tuomari commented Aug 9, 2016

@dweeezil You were right. There were a lot of delayed IOs in the pool. Am I correct to assume that the pool gets stuck because of vdev 0x6de336ca30fa5d9d? Is there any reason why ZFS does not kick it out of the pool after some reasonable timeout period?

  class = "ereport.fs.zfs.delay"
        ena = 0x5766eab353d00401
        detector = (embedded nvlist)
                version = 0x0
                scheme = "zfs"
                pool = 0x87f6a0b739f3d163
                vdev = 0x6de336ca30fa5d9d
        (end detector)
        pool = "tank"
        pool_guid = 0x87f6a0b739f3d163
        pool_context = 0x0
        pool_failmode = "wait"
        vdev_guid = 0x6de336ca30fa5d9d
        vdev_type = "raidz"
        vdev_ashift = 0xc
        vdev_complete_ts = 0x0
        vdev_delta_ts = 0x0
        vdev_read_errors = 0x0
        vdev_write_errors = 0x0
        vdev_cksum_errors = 0x0
        parent_guid = 0x87f6a0b739f3d163
        parent_type = "root"
        vdev_spare_paths = "/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3ELL01X-part1" "/dev/disk/by-id/ata-ST3000DM001-9YN166_W1F3AEZG-part1" 
        vdev_spare_guids = 0xfc002bcbdaede94f 0x732207f2561da521 
        zio_err = 0x0
        zio_flags = 0x80080
        zio_stage = 0x200000
        zio_pipeline = 0x2e0000
        zio_delay = 0x972265071
        zio_timestamp = 0x0
        zio_delta = 0x0
        zio_offset = 0xa5ffffa2000
        zio_size = 0x600
        zio_objset = 0x3e
        zio_object = 0x0
        zio_level = 0x0
        zio_blkid = 0x1d2c7b
        time = 0x57a779e9 0x25cef5ba 
        eid = 0x5a9567
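
If useful, the vdev_guid from such an event can be matched to a vdev with zpool status -g, which prints GUIDs in place of device names (flag availability depends on the ZFS version, and zpool prints GUIDs in decimal while events show hex):

    # Show the pool layout with vdev GUIDs instead of device names
    zpool status -g tank

    # Convert the hex GUID from the event for comparison
    printf '%d\n' 0x6de336ca30fa5d9d

As for the timeout question: to my understanding, the "deadman" logic only reports slow IOs by default rather than faulting the device; its tunables can be inspected via the module parameters (paths assume ZFS on Linux):

    cat /sys/module/zfs/parameters/zfs_deadman_enabled
    cat /sys/module/zfs/parameters/zfs_deadman_synctime_ms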

@tuomari
Author

tuomari commented Aug 9, 2016

After a day or so of being stuck, the system hit a general protection fault. The time of the GPF seems to coincide with when the system started receiving nightly backups from other machines.

[Mon Aug  8 00:13:59 2016] general protection fault: 0000 [#1] SMP 
[Mon Aug  8 00:13:59 2016] Modules linked in: 8021q garp stp llc target_core_mod configfs binfmt_misc xt_multiport iptable_filter ip_tables x_tables openvswitch nf_defrag_ipv6 nf_conntrack sit tunnel4 ip_tunnel nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace sunrpc deflate ctr twofish_generic twofish_x86_64_3way xts twofish_x86_64 twofish_common camellia_generic serpent_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_generic cast_common des_generic cbc xcbc rmd160 sha512_generic sha1_ssse3 sha1_generic af_key xfrm_algo iTCO_wdt gpio_ich iTCO_vendor_support zfs(PO) zunicode(PO) zavl(PO) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_generic hmac drbg ansi_cprng aesni_intel aes_x86_64 ablk_helper cryptd lrw gf128mul glue_helper serio_raw pcspkr i2c_i801 zcommon(PO) znvpair(PO) spl(O)
[Mon Aug  8 00:13:59 2016]  bttv tveeprom videobuf_dma_sg ast tea575x ttm videobuf_core drm_kms_helper syscopyarea sysfillrect rc_core v4l2_common sysimgblt fb_sys_fops videodev media drm agpgart lpc_ich mfd_core snd_hda_intel btrfs snd_hda_codec snd_hda_core snd_pcm snd_timer snd soundcore ioatdma dca i7core_edac i5500_temp edac_core ipmi_si ipmi_msghandler shpchp evdev acpi_cpufreq tpm_tis tpm w83795 w83627ehf hwmon_vid coretemp loop autofs4 hid_generic usbhid hid crc32c_intel psmouse ehci_pci uhci_hcd ehci_hcd usbcore usb_common sg sd_mod netxen_nic
[Mon Aug  8 00:13:59 2016] CPU: 1 PID: 2466 Comm: z_rd_int_5 Tainted: P           O    4.5.5.remod.kvm.ovs.1 #13
[Mon Aug  8 00:13:59 2016] Hardware name: System manufacturer System Product Name/Z8NA-D6(C), BIOS 1303    05/10/2012
[Mon Aug  8 00:13:59 2016] task: ffff88042950ae00 ti: ffff88042abf4000 task.ti: ffff88042abf4000
[Mon Aug  8 00:13:59 2016] RIP: 0010:[<ffffffffa0e0446e>]  [<ffffffffa0e0446e>] zio_add_child+0xae/0x170 [zfs]
[Mon Aug  8 00:13:59 2016] RSP: 0018:ffff88042abf7a68  EFLAGS: 00010286
[Mon Aug  8 00:13:59 2016] RAX: ffff8802cbcd4830 RBX: ffff8801fb9af020 RCX: ffff8802cbcd4810
[Mon Aug  8 00:13:59 2016] RDX: dead000000000100 RSI: 0000000002404200 RDI: ffff88023ddbc778
[Mon Aug  8 00:13:59 2016] RBP: ffff88042abf7a90 R08: 0000000000000000 R09: ffffffffa055d9ae
[Mon Aug  8 00:13:59 2016] R10: 0000000000000000 R11: ffff8808268d2000 R12: ffff88023ddbc440
[Mon Aug  8 00:13:59 2016] R13: ffff8802cbcd4810 R14: ffff8801fb9af358 R15: ffff88023ddbc778
[Mon Aug  8 00:13:59 2016] FS:  0000000000000000(0000) GS:ffff88042fc40000(0000) knlGS:0000000000000000
[Mon Aug  8 00:13:59 2016] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[Mon Aug  8 00:13:59 2016] CR2: 00007f78b6060000 CR3: 0000000001c0a000 CR4: 00000000000006e0
[Mon Aug  8 00:13:59 2016] Stack:
[Mon Aug  8 00:13:59 2016]  ffff8801fb9af020 0000000000000000 0000000000000000 0000000000000001
[Mon Aug  8 00:13:59 2016]  ffff88023ddbc440 ffff88042abf7ae8 ffffffffa0e048e2 ffff8808268d2000
[Mon Aug  8 00:13:59 2016]  0000000000000000 ffff880146028000 0000000000001600 0000000000000002
[Mon Aug  8 00:13:59 2016] Call Trace:
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0e048e2>] zio_create+0x3b2/0x460 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0e054ac>] zio_read_phys+0x6c/0x80 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0d542e0>] ? arc_read_done+0x2f0/0x2f0 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0d57ce7>] arc_read+0x9c7/0xaa0 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0d542e0>] ? arc_read_done+0x2f0/0x2f0 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0d5dd2a>] dbuf_issue_final_prefetch+0x5a/0x60 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0d5de09>] dbuf_prefetch_indirect_done+0xd9/0x110 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0d51711>] ? buf_hash_find+0xa1/0x150 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0d5418b>] arc_read_done+0x19b/0x2f0 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0d546a5>] l2arc_read_done+0x3c5/0x470 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0e0753c>] zio_done+0x30c/0xd90 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0e01f02>] ? zio_wait_for_children+0x82/0x90 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa0e039b0>] zio_execute+0x90/0xf0 [zfs]
[Mon Aug  8 00:13:59 2016]  [<ffffffffa055f41d>] taskq_thread+0x25d/0x490 [spl]
[Mon Aug  8 00:13:59 2016]  [<ffffffff810c49c0>] ? wake_up_q+0x70/0x70
[Mon Aug  8 00:13:59 2016]  [<ffffffffa055f1c0>] ? taskq_cancel_id+0x110/0x110 [spl]
[Mon Aug  8 00:13:59 2016]  [<ffffffff810bb0d6>] kthread+0xd6/0xf0
[Mon Aug  8 00:13:59 2016]  [<ffffffff810bb000>] ? kthread_create_on_node+0x190/0x190
[Mon Aug  8 00:13:59 2016]  [<ffffffff8167e45f>] ret_from_fork+0x3f/0x70
[Mon Aug  8 00:13:59 2016]  [<ffffffff810bb000>] ? kthread_create_on_node+0x190/0x190
[Mon Aug  8 00:13:59 2016] Code: 31 d2 8b 43 40 48 c1 e0 04 80 7b 4e 00 0f 94 c2 49 01 94 04 c0 02 00 00 4c 89 e8 49 8b 94 24 20 01 00 00 49 03 84 24 18 01 00 00 <48> 89 42 08 48 89 10 49 8d 94 24 20 01 00 00 48 89 50 08 49 89 
[Mon Aug  8 00:13:59 2016] RIP  [<ffffffffa0e0446e>] zio_add_child+0xae/0x170 [zfs]
[Mon Aug  8 00:13:59 2016]  RSP <ffff88042abf7a68>
[Mon Aug  8 00:13:59 2016] ---[ end trace 43e1820b3d76dce4 ]---
[Tue Aug  9 00:00:02 2016] Process accounting resumed

@behlendorf
Contributor

@tuomari did things improve after you identified a problematic drive?

@tuomari
Author

tuomari commented Jun 20, 2017

I have not hit this in a while. Closing, as this is most likely fixed.

@tuomari tuomari closed this as completed Jun 20, 2017