Suspended pool causes utilities to hang #1857
Similar to #1928, although in this case the pool was suspended.
Same here: [132338.152613] SPLError: 1588:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'griseldaremote' has encountered an uncorrectable I/O failure and has been suspended. The pool is built on dm-crypt on nbd, whose devices sometimes disappear and reappear.
The following deadlock was possible when a pool was suspended:
1) txg_sync() blocked in zio_wait() on the suspended pool
2) 'zpool export' holds the spa_namespace lock in txg_wait_synced()
3) 'zpool status/clear' all deadlock on the spa_namespace lock
To resolve this the POOL_CHECK_SUSPENDED flag is specified when registering the export and destroy handlers. This causes the 'zpool export' command to cleanly fail when run against a suspended pool.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#1857
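For reference, a minimal sketch of the mechanism the commit describes: the ioctl dispatch path checks a per-handler pool-check flag before running the handler, so an export/destroy registered with POOL_CHECK_SUSPENDED fails fast instead of reaching txg_wait_synced(). The symbol names (spa_open, spa_suspended, POOL_CHECK_SUSPENDED) follow the ZFS sources, but the exact function layout here is an assumption, not the actual patch:

    /*
     * Illustrative sketch only: gate a registered ioctl handler on the
     * pool's suspension state.  Returning EAGAIN makes the zpool command
     * fail cleanly rather than block in the sync path.
     */
    static int
    pool_status_check(const char *name, zfs_ioc_poolcheck_t check)
    {
            spa_t *spa;
            int error;

            if (check & POOL_CHECK_NONE)
                    return (0);

            error = spa_open(name, &spa, FTAG);
            if (error == 0) {
                    /* Handlers registered with POOL_CHECK_SUSPENDED
                     * refuse to run against a suspended pool. */
                    if ((check & POOL_CHECK_SUSPENDED) && spa_suspended(spa))
                            error = EAGAIN;
                    spa_close(spa, FTAG);
            }
            return (error);
    }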
The spa_get_stats() function should never issue IO to the pool when it's suspended. When the pool is suspended the IO will never complete by definition and the caller will block indefinitely.
TODO: A cleaner solution for this issue would be to cache the features in the spa.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#1857
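A short sketch of the guard that commit describes, assuming the feature statistics are gathered through a spa_add_feature_stats()-style helper that reads from the MOS (the wrapper name below is hypothetical; only spa_suspended() and spa_add_feature_stats() are taken from the ZFS sources):

    /*
     * Illustrative sketch only: skip the on-disk feature lookups while
     * the pool is suspended, since any read issued from the stats path
     * could never complete and would block the caller indefinitely.
     */
    static void
    spa_config_add_feature_stats(spa_t *spa, nvlist_t *config)
    {
            /* Hypothetical wrapper; real code performs this check
             * directly in spa_get_stats(). */
            if (spa_suspended(spa))
                    return;

            spa_add_feature_stats(spa, config);
    }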
I met a similar issue during zpool import. It looked like the global mutex spa_namespace_lock was held during the import, which hung on an IO error, which then blocked all other zpool commands (even those on perfectly healthy pools). The code says the mutex must be "Held for the duration of create/destroy/import/export". I don't see why that's necessary. But it's very undesirable if zpool commands on healthy pools have to be blocked indefinitely because a previous zpool command on a bad pool was stuck.
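To make that contention concrete, a condensed illustration of the locking pattern (spa_namespace_lock, spa_import() and spa_open_common() are real symbols from spa.c/spa_misc.c; the bodies are heavily abbreviated and only show the lock hand-off, not the actual source):

    /* Condensed illustration, not actual source. */
    int
    spa_import(char *pool, nvlist_t *config, nvlist_t *props, uint64_t flags)
    {
            mutex_enter(&spa_namespace_lock);  /* one global lock for all pools */
            /* device opens and label reads happen here; if a device hangs,
             * the lock is held until that IO returns */
            mutex_exit(&spa_namespace_lock);
            return (0);
    }

    static int
    spa_open_common(const char *pool, spa_t **spapp, void *tag,
        nvlist_t *nvpolicy, nvlist_t **config)
    {
            /* every 'zpool status/list/get' lookup also needs this lock,
             * even for perfectly healthy pools */
            mutex_enter(&spa_namespace_lock);
            /* look up the pool, bump its reference count */
            mutex_exit(&spa_namespace_lock);
            return (0);
    }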
This has been at least partly resolved in master by never issuing IO to a pool we know is suspended due to an issue with the devices.
I have iSCSI disks backing certain zpools that sometimes go missing. When they do, rather than failing gracefully and the pool going to a faulted status, all of the ZFS management processes lock up. /var/log/messages below:
Nov 11 21:30:37 storage1 kernel: SPLError: 3699:0:(spl-err.c:67:vcmn_err()) WARNING: Pool 'avp1' has encountered an uncorrectable I/O failure and has been suspended.
Nov 11 21:30:37 storage1 kernel:
Nov 11 21:33:58 storage1 kernel: INFO: task txg_sync:3795 blocked for more than 120 seconds.
Nov 11 21:33:58 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 11 21:33:58 storage1 kernel: txg_sync D 0000000000000006 0 3795 2 0x00000080
Nov 11 21:33:58 storage1 kernel: ffff88085037bb60 0000000000000046 0000000000000001 ffff88086a8a3d30
Nov 11 21:33:58 storage1 kernel: 0000000000000000 0000000000000000 ffff88085037bae0 ffffffff810639a2
Nov 11 21:33:58 storage1 kernel: ffff88085a5cfaf8 ffff88085037bfd8 000000000000fb88 ffff88085a5cfaf8
Nov 11 21:33:58 storage1 kernel: Call Trace:
Nov 11 21:33:58 storage1 kernel: [] ? default_wake_function+0x12/0x20
Nov 11 21:33:58 storage1 kernel: [] ? ktime_get_ts+0xb1/0xf0
Nov 11 21:33:58 storage1 kernel: [] io_schedule+0x73/0xc0
Nov 11 21:33:58 storage1 kernel: [] cv_wait_common+0xac/0x1c0 [spl]
Nov 11 21:33:58 storage1 kernel: [] ? dmu_objset_write_ready+0x0/0x50 [zfs]
Nov 11 21:33:58 storage1 kernel: [] ? zio_execute+0x0/0x140 [zfs]
Nov 11 21:33:58 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 11 21:33:58 storage1 kernel: [] __cv_wait_io+0x18/0x20 [spl]
Nov 11 21:33:58 storage1 kernel: [] zio_wait+0xfb/0x1b0 [zfs]
Nov 11 21:33:58 storage1 kernel: [] dsl_pool_sync+0x2f5/0x540 [zfs]
Nov 11 21:33:58 storage1 kernel: [] spa_sync+0x40e/0xad0 [zfs]
Nov 11 21:33:58 storage1 kernel: [] txg_sync_thread+0x33f/0x5d0 [zfs]
Nov 11 21:33:58 storage1 kernel: [] ? set_user_nice+0xc9/0x130
Nov 11 21:33:58 storage1 kernel: [] ? txg_sync_thread+0x0/0x5d0 [zfs]
Nov 11 21:33:58 storage1 kernel: [] thread_generic_wrapper+0x68/0x80 [spl]
Nov 11 21:33:58 storage1 kernel: [] ? thread_generic_wrapper+0x0/0x80 [spl]
Nov 11 21:33:58 storage1 kernel: [] kthread+0x96/0xa0
Nov 11 21:33:58 storage1 kernel: [] child_rip+0xa/0x20
Nov 11 21:33:58 storage1 kernel: [] ? kthread+0x0/0xa0
Nov 11 21:33:58 storage1 kernel: [] ? child_rip+0x0/0x20
Nov 11 21:33:58 storage1 kernel: INFO: task zpool:4303 blocked for more than 120 seconds.
Nov 11 21:33:58 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 11 21:33:58 storage1 kernel: zpool D 0000000000000007 0 4303 3529 0x00000084
Nov 11 21:33:58 storage1 kernel: ffff880568b0db48 0000000000000082 0000000000000000 ffff88087fc21180
Nov 11 21:33:58 storage1 kernel: ffff88087fead000 ffff88087fc21180 ffff880568b0db58 ffff88051519d180
Nov 11 21:33:58 storage1 kernel: ffff88086b8a1ab8 ffff880568b0dfd8 000000000000fb88 ffff88086b8a1ab8
Nov 11 21:33:58 storage1 kernel: Call Trace:
Nov 11 21:33:58 storage1 kernel: [] __mutex_lock_slowpath+0x13e/0x180
Nov 11 21:33:58 storage1 kernel: [] mutex_lock+0x2b/0x50
Nov 11 21:33:58 storage1 kernel: [] spa_open_common+0x69/0x3b0 [zfs]
Nov 11 21:33:58 storage1 kernel: [] ? __kmalloc+0x20c/0x220
Nov 11 21:33:58 storage1 kernel: [] spa_get_stats+0x49/0x280 [zfs]
Nov 11 21:33:58 storage1 kernel: [] zfs_ioc_pool_stats+0x31/0x70 [zfs]
Nov 11 21:33:58 storage1 kernel: [] zfsdev_ioctl+0x48e/0x500 [zfs]
Nov 11 21:33:58 storage1 kernel: [] ? __switch_to+0x26e/0x320
Nov 11 21:33:58 storage1 kernel: [] ? thread_return+0x4e/0x76e
Nov 11 21:33:58 storage1 kernel: [] ? __hrtimer_start_range_ns+0x1a5/0x470
Nov 11 21:33:58 storage1 kernel: [] ? lock_hrtimer_base+0x31/0x60
Nov 11 21:33:58 storage1 kernel: [] ? hrtimer_try_to_cancel+0x3f/0xd0
Nov 11 21:33:58 storage1 kernel: [] vfs_ioctl+0x22/0xa0
Nov 11 21:33:58 storage1 kernel: [] ? do_nanosleep+0x93/0xc0
Nov 11 21:33:58 storage1 kernel: [] do_vfs_ioctl+0x84/0x580
Nov 11 21:33:58 storage1 kernel: [] ? hrtimer_wakeup+0x0/0x30
Nov 11 21:33:58 storage1 kernel: [] ? hrtimer_start_range_ns+0x14/0x20
Nov 11 21:33:58 storage1 kernel: [] sys_ioctl+0x81/0xa0
Nov 11 21:33:58 storage1 kernel: [] ? __audit_syscall_exit+0x265/0x290
Nov 11 21:33:58 storage1 kernel: [] system_call_fastpath+0x16/0x1b
Nov 11 21:33:58 storage1 kernel: INFO: task zpool:18253 blocked for more than 120 seconds.
Nov 11 21:33:58 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 11 21:33:58 storage1 kernel: zpool D 0000000000000000 0 18253 2268 0x00000080
Nov 11 21:33:58 storage1 kernel: ffff880519749b48 0000000000000086 0000000000000000 ffff880500000006
Nov 11 21:33:58 storage1 kernel: 0000000600000046 0000000000000001 ffff880519749be8 0000000000000082
Nov 11 21:33:58 storage1 kernel: ffff88051c6e1af8 ffff880519749fd8 000000000000fb88 ffff88051c6e1af8
Nov 11 21:33:58 storage1 kernel: Call Trace:
Nov 11 21:33:58 storage1 kernel: [] cv_wait_common+0x105/0x1c0 [spl]
Nov 11 21:33:58 storage1 kernel: [] ? autoremove_wake_function+0x0/0x40
Nov 11 21:33:58 storage1 kernel: [] __cv_wait+0x15/0x20 [spl]
Nov 11 21:33:58 storage1 kernel: [] txg_wait_synced+0xb3/0x190 [zfs]
Nov 11 21:33:58 storage1 kernel: [] spa_export_common+0x15e/0x310 [zfs]
Nov 11 21:33:58 storage1 kernel: [] spa_destroy+0x1a/0x20 [zfs]
Nov 11 21:33:58 storage1 kernel: [] zfs_ioc_pool_destroy+0x1e/0x40 [zfs]
Nov 11 21:33:58 storage1 kernel: [] zfsdev_ioctl+0x48e/0x500 [zfs]
Nov 11 21:33:58 storage1 kernel: [] ? call_rcu_sched+0x15/0x20
Nov 11 21:33:58 storage1 kernel: [] ? call_rcu+0xe/0x10
Nov 11 21:33:58 storage1 kernel: [] ? put_filp+0x52/0x60
Nov 11 21:33:58 storage1 kernel: [] ? release_open_intent+0x25/0x30
Nov 11 21:33:58 storage1 kernel: [] ? do_filp_open+0x68e/0xdc0
Nov 11 21:33:58 storage1 kernel: [] ? __do_page_fault+0x1ec/0x480
Nov 11 21:33:58 storage1 kernel: [] vfs_ioctl+0x22/0xa0
Nov 11 21:33:58 storage1 kernel: [] ? unmap_region+0x110/0x130
Nov 11 21:33:58 storage1 kernel: [] do_vfs_ioctl+0x84/0x580
Nov 11 21:33:58 storage1 kernel: [] ? alloc_fd+0x92/0x160
Nov 11 21:33:58 storage1 kernel: [] sys_ioctl+0x81/0xa0
Nov 11 21:33:58 storage1 kernel: [] ? __audit_syscall_exit+0x265/0x290
Nov 11 21:33:58 storage1 kernel: [] system_call_fastpath+0x16/0x1b
Nov 11 21:33:58 storage1 kernel: INFO: task zpool:18256 blocked for more than 120 seconds.
Nov 11 21:33:58 storage1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 11 21:33:58 storage1 kernel: zpool D 0000000000000008 0 18256 3516 0x00000080
Nov 11 21:33:58 storage1 kernel: ffff88086e5abbe8 0000000000000086 0000000000000000 0000000000000230
Nov 11 21:33:58 storage1 kernel: 0000000000000018 0000000000000230 ffff880514964ae0 0000000000000000
Nov 11 21:33:58 storage1 kernel: ffff880514965098 ffff88086e5abfd8 000000000000fb88 ffff880514965098
Nov 11 21:33:58 storage1 kernel: Call Trace:
Nov 11 21:33:58 storage1 kernel: [] __mutex_lock_slowpath+0x13e/0x180
Nov 11 21:33:58 storage1 kernel: [] ? nv_mem_zalloc+0x38/0x50 [znvpair]
Nov 11 21:33:58 storage1 kernel: [] mutex_lock+0x2b/0x50
Nov 11 21:33:58 storage1 kernel: [] spa_all_configs+0x52/0x120 [zfs]
Nov 11 21:33:58 storage1 kernel: [] zfs_ioc_pool_configs+0x2e/0x60 [zfs]
Nov 11 21:33:58 storage1 kernel: [] zfsdev_ioctl+0x48e/0x500 [zfs]
Nov 11 21:33:58 storage1 kernel: [] ? handle_mm_fault+0x23a/0x310
Nov 11 21:33:58 storage1 kernel: [] ? __do_page_fault+0x1ec/0x480
Nov 11 21:33:58 storage1 kernel: [] vfs_ioctl+0x22/0xa0
Nov 11 21:33:58 storage1 kernel: [] ? unmap_region+0xcd/0x130
Nov 11 21:33:58 storage1 kernel: [] do_vfs_ioctl+0x84/0x580
Nov 11 21:33:58 storage1 kernel: [] ? remove_vma+0x6e/0x90
Nov 11 21:33:58 storage1 kernel: [] sys_ioctl+0x81/0xa0
Nov 11 21:33:58 storage1 kernel: [] system_call_fastpath+0x16/0x1b