zpool clear hang when resuming suspended pool #6709

Closed
bjquinn opened this issue Oct 2, 2017 · 16 comments
Labels
Status: Inactive (not being actively updated) · Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

@bjquinn

bjquinn commented Oct 2, 2017

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 6.3
Linux Kernel 4.0.1 (4.0.1-1.el6.elrepo.x86_64)
Architecture amd64
ZFS Version 0.7.1-1 (zfs-dkms-0.7.1-1.el6.noarch)
SPL Version 0.7.1-1 (spl-0.7.1-1.el6.x86_64)

Describe the problem you're observing

Similar to #3256. A flaky USB enclosure (I assume) causes the USB drive to temporarily disconnect and reconnect (in this specific case, it disconnects as sdd and reconnects as sde). The pool is built with by-id names. After making sure the drive is reconnected, issuing a zpool clear just hangs, even if the drive is left disconnected long enough that it comes back up as sdd again.

Opening a new bug, since that was what was suggested in #3256 if this issue recurred with 0.7.1. I would really prefer a working zpool export -F for suspended pools, but zpool clear would be sufficient if it worked.

EDIT: I should add that the current state is still an improvement over 0.6.5.x, since a system in this state can run zpool/zfs commands on other pools, as well as zpool list or zpool status against the suspended pool, which was not always the case when this happened previously. Also, for the first time, I got a clean, normal reboot that didn't hang; previously a hard reboot was always required because shutdown hung while trying to unmount the suspended pool.

Describe how to reproduce the problem

I have a backup script that runs fairly heavy zfs send and rsync jobs to an external SATA drive in a USB enclosure. Once every month or two, the drive drops out during the backup process and gets into the state described above. When this happens, you can see something like the following logged at the console --

blk_update_request: I/O error, dev sdd, sector 25528
sd 12:0:0:0: [sde] No Caching mode page found
sd 12:0:0:0: [sde] Assuming drive cache: write through

... which is an indication that the drive dropped, but came back.
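For what it's worth, a quick way to confirm how the drive came back after one of these drops (a minimal sketch; the grep patterns are just placeholders for whatever the enclosure actually reports):

$ ls -l /dev/disk/by-id/ | grep -i usb             # which /dev/sdX node the by-id links point at now
$ dmesg | grep -iE 'usb|sd[a-z]' | tail -n 20      # the recent disconnect/reconnect messages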

Include any warning/errors/backtraces from the system logs

Sep 29 17:32:28 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Sep 29 17:32:28 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Sep 29 17:32:28 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 0
Sep 29 17:32:28 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Sep 29 17:32:28 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 11610466796109647632
Sep 29 17:32:28 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Sep 29 17:32:28 server zed[8335]: Diagnosis Engine: opening case for vdev 0 due to 'ereport.fs.zfs.io_failure'
Sep 29 17:32:28 server zed[8335]: Diagnosis Engine: error event 'ereport.fs.zfs.io_failure'
Sep 29 17:32:28 server zed[8335]: Diagnosis Engine: solving fault 'fault.fs.zfs.io_failure_wait'
Sep 29 17:32:28 server zed[8335]: #011class: fault.fs.zfs.io_failure_wait
Sep 29 17:32:28 server zed[8335]: #011scheme: zfs
Sep 29 17:32:28 server zed[8335]: Retire Agent: zfs_retire_recv: 'list.suspect'
Sep 29 17:32:48 server zed[8335]: zfs_slm_event: EC_dev_add.disk
Sep 29 17:32:48 server zed[8335]: zfs_deliver_add: adding usb-Generic_External_57442D575834314432354C31-0:0-part1 (pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0) (is_slice 1)
Sep 29 17:32:48 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by devid)
Sep 29 17:32:48 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapool (by devid)
Sep 29 17:32:48 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapoolssd (by devid)
Sep 29 17:32:48 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by phys_path)
Sep 29 17:32:48 server zed[8335]:   zfs_iter_vdev: matched phys_path on pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0
Sep 29 17:32:48 server zed[8335]: zfs_process_add: /dev/disk/by-path/pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0-part1 is already healthy, skip it.
Sep 29 17:32:48 server zed[8335]: zfs_slm_event: EC_dev_add.disk
Sep 29 17:32:48 server zed[8335]: zfs_deliver_add: adding usb-Generic_External_57442D575834314432354C31-0:0-part9 (pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0) (is_slice 1)
Sep 29 17:32:48 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by devid)
Sep 29 17:32:48 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapool (by devid)
Sep 29 17:32:48 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapoolssd (by devid)
Sep 29 17:32:48 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by phys_path)
Sep 29 17:32:48 server zed[8335]:   zfs_iter_vdev: matched phys_path on pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0
Sep 29 17:32:48 server zed[8335]: zfs_process_add: /dev/disk/by-path/pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0-part1 is already healthy, skip it.
Oct  2 18:32:26 server zed[8335]: Diagnosis Engine: opening case for vdev 11610466796109647632 due to 'ereport.fs.zfs.vdev.too_small'
Oct  2 18:32:26 server zed[8335]: Diagnosis Engine: error event 'ereport.fs.zfs.vdev.too_small'
Oct  2 18:32:26 server zed[8335]: Diagnosis Engine: solving fault 'fault.fs.zfs.device'
Oct  2 18:32:26 server zed[8335]: #011class: fault.fs.zfs.device
Oct  2 18:32:26 server zed[8335]: #011scheme: zfs
Oct  2 18:32:26 server zed[8335]: Retire Agent: zfs_retire_recv: 'list.suspect'
Oct  2 18:32:26 server zed[8335]: Diagnosis Engine: resource event 'resource.fs.zfs.statechange'
Oct  2 18:32:26 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:32:26 server zed[8335]: Diagnosis Engine: resource event 'resource.fs.zfs.statechange'
Oct  2 18:32:26 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:32:26 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:32:26 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:35:04 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Oct  2 18:35:04 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Oct  2 18:35:04 server zed[8335]: Diagnosis Engine: resource event 'resource.fs.zfs.removed'
Oct  2 18:35:04 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Oct  2 18:35:04 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 1955849306711511642
Oct  2 18:35:04 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Oct  2 18:35:25 server zed[8335]: zfs_slm_event: EC_dev_add.disk
Oct  2 18:35:25 server zed[8335]: zfs_deliver_add: adding usb-Generic_External_57442D575834314432354C31-0:0-part1 (pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0) (is_slice 1)
Oct  2 18:35:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by devid)
Oct  2 18:35:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapool (by devid)
Oct  2 18:35:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapoolssd (by devid)
Oct  2 18:35:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by phys_path)
Oct  2 18:35:25 server zed[8335]:   zfs_iter_vdev: matched phys_path on pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0
Oct  2 18:35:25 server zed[8335]: zfs_process_add: pool 'backup' vdev '/dev/disk/by-path/pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0-part1', phys 'pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0' wholedisk 1, dm 0 (11610466796109647632)
Oct  2 18:35:25 server zed[8335]: zfs_slm_event: EC_dev_add.disk
Oct  2 18:35:25 server zed[8335]: zfs_deliver_add: adding usb-Generic_External_57442D575834314432354C31-0:0-part9 (pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0) (is_slice 1)
Oct  2 18:35:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by devid)
Oct  2 18:35:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapool (by devid)
Oct  2 18:35:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapoolssd (by devid)
Oct  2 18:35:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by phys_path)
Oct  2 18:35:25 server zed[8335]:   zfs_iter_vdev: matched phys_path on pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0
Oct  2 18:35:25 server zed[8335]: zfs_process_add: pool 'backup' vdev '/dev/disk/by-path/pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0-part1', phys 'pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0' wholedisk 1, dm 0 (11610466796109647632)
Oct  2 18:35:45 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:35:45 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:35:45 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:35:45 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:35:45 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:35:45 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:35:45 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:35:45 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:42:38 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Oct  2 18:42:38 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Oct  2 18:42:38 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 1955849306711511642
Oct  2 18:42:38 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Oct  2 18:42:38 server zed[8335]: Diagnosis Engine: resource event 'resource.fs.zfs.removed'
Oct  2 18:42:38 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Oct  2 18:42:58 server zed[8335]: zfs_slm_event: EC_dev_add.disk
Oct  2 18:42:58 server zed[8335]: zfs_deliver_add: adding usb-Generic_External_323733384B344B3046353644-0:0-part1 (pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0) (is_slice 1)
Oct  2 18:42:58 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by devid)
Oct  2 18:42:58 server zed[8335]:   zfs_iter_vdev: matched devid on usb-Generic_External_323733384B344B3046353644-0:0-part1
Oct  2 18:42:58 server zed[8335]: zfs_process_add: pool 'backup' vdev '/dev/disk/by-path/pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0-part1', phys 'pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0' wholedisk 1, dm 0 (11610466796109647632)
Oct  2 18:42:58 server zed[8335]: zfs_slm_event: EC_dev_add.disk
Oct  2 18:42:58 server zed[8335]: zfs_deliver_add: adding usb-Generic_External_323733384B344B3046353644-0:0-part9 (pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0) (is_slice 1)
Oct  2 18:42:58 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by devid)
Oct  2 18:42:58 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapool (by devid)
Oct  2 18:42:58 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapoolssd (by devid)
Oct  2 18:42:58 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by phys_path)
Oct  2 18:42:58 server zed[8335]:   zfs_iter_vdev: matched phys_path on pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0
Oct  2 18:42:58 server zed[8335]: zfs_process_add: pool 'backup' vdev '/dev/disk/by-path/pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0-part1', phys 'pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0' wholedisk 1, dm 0 (11610466796109647632)
Oct  2 18:43:11 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:43:11 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:43:11 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:43:11 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:43:11 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:43:11 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:43:11 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:43:11 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:43:52 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Oct  2 18:43:52 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 11610466796109647632
Oct  2 18:43:52 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Oct  2 18:43:52 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Oct  2 18:43:52 server zed[8335]: Diagnosis Engine: resource event 'resource.fs.zfs.removed'
Oct  2 18:43:52 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Oct  2 18:44:29 server zed[8335]: zfs_slm_event: EC_dev_add.disk
Oct  2 18:44:29 server zed[8335]: zfs_deliver_add: adding usb-ASMT_2105_AC0000000001-0:0-part1 (pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0) (is_slice 1)
Oct  2 18:44:29 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by devid)
Oct  2 18:44:29 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapool (by devid)
Oct  2 18:44:29 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapoolssd (by devid)
Oct  2 18:44:29 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by phys_path)
Oct  2 18:44:29 server zed[8335]:   zfs_iter_vdev: matched phys_path on pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0
Oct  2 18:44:29 server zed[8335]: zfs_process_add: pool 'backup' vdev '/dev/disk/by-path/pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0-part1', phys 'pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0' wholedisk 1, dm 0 (11610466796109647632)
Oct  2 18:44:59 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:44:59 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:44:59 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:44:59 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:44:59 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:44:59 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:44:59 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:44:59 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:51:23 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Oct  2 18:51:23 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Oct  2 18:51:23 server zed[8335]: Diagnosis Engine: resource event 'resource.fs.zfs.removed'
Oct  2 18:51:23 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Oct  2 18:51:23 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 6792593964022512890
Oct  2 18:51:23 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Oct  2 18:51:46 server zed[8335]: zfs_slm_event: EC_dev_add.disk
Oct  2 18:51:46 server zed[8335]: zfs_deliver_add: adding usb-Generic_External_323733384B344B3046353644-0:0-part1 (pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0) (is_slice 1)
Oct  2 18:51:46 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by devid)
Oct  2 18:51:46 server zed[8335]:   zfs_iter_vdev: matched devid on usb-Generic_External_323733384B344B3046353644-0:0-part1
Oct  2 18:51:46 server zed[8335]: zfs_process_add: pool 'backup' vdev '/dev/disk/by-path/pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0-part1', phys 'pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0' wholedisk 1, dm 0 (11610466796109647632)
Oct  2 18:51:46 server zed[8335]: zfs_slm_event: EC_dev_add.disk
Oct  2 18:51:46 server zed[8335]: zfs_deliver_add: adding usb-Generic_External_323733384B344B3046353644-0:0-part9 (pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0) (is_slice 1)
Oct  2 18:51:46 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by devid)
Oct  2 18:51:46 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapool (by devid)
Oct  2 18:51:46 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapoolssd (by devid)
Oct  2 18:51:46 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by phys_path)
Oct  2 18:51:46 server zed[8335]:   zfs_iter_vdev: matched phys_path on pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0
Oct  2 18:51:46 server zed[8335]: zfs_process_add: pool 'backup' vdev '/dev/disk/by-path/pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0-part1', phys 'pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0' wholedisk 1, dm 0 (11610466796109647632)
Oct  2 18:56:03 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Oct  2 18:56:03 server zed[8335]: agent post event: mapping 'EC_dev_remove' to 'resource.fs.zfs.removed'
Oct  2 18:56:03 server zed[8335]: Diagnosis Engine: resource event 'resource.fs.zfs.removed'
Oct  2 18:56:03 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Oct  2 18:56:03 server zed[8335]: Diagnosis Engine: discarding 'resource.fs.zfs.removed for vdev 11610466796109647632
Oct  2 18:56:03 server zed[8335]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.removed'
Oct  2 18:56:25 server zed[8335]: zfs_slm_event: EC_dev_add.disk
Oct  2 18:56:25 server zed[8335]: zfs_deliver_add: adding usb-Generic_External_57442D575834314432354C31-0:0-part1 (pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0) (is_slice 1)
Oct  2 18:56:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by devid)
Oct  2 18:56:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapool (by devid)
Oct  2 18:56:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapoolssd (by devid)
Oct  2 18:56:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by phys_path)
Oct  2 18:56:25 server zed[8335]:   zfs_iter_vdev: matched phys_path on pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0
Oct  2 18:56:25 server zed[8335]: zfs_process_add: pool 'backup' vdev '/dev/disk/by-path/pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0-part1', phys 'pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0' wholedisk 1, dm 0 (11610466796109647632)
Oct  2 18:56:25 server zed[8335]: zfs_slm_event: EC_dev_add.disk
Oct  2 18:56:25 server zed[8335]: zfs_deliver_add: adding usb-Generic_External_57442D575834314432354C31-0:0-part9 (pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0) (is_slice 1)
Oct  2 18:56:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by devid)
Oct  2 18:56:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapool (by devid)
Oct  2 18:56:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on datapoolssd (by devid)
Oct  2 18:56:25 server zed[8335]: zfs_iter_pool: evaluating vdevs on backup (by phys_path)
Oct  2 18:56:25 server zed[8335]:   zfs_iter_vdev: matched phys_path on pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0
Oct  2 18:56:25 server zed[8335]: zfs_process_add: pool 'backup' vdev '/dev/disk/by-path/pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0-part1', phys 'pci-0000:00:14.0-usb-0:8:1.0-scsi-0:0:0:0' wholedisk 1, dm 0 (11610466796109647632)
Oct  2 18:59:14 server zed[7824]: zfs_unavail_pool: examining 'datapool' (state 7)
Oct  2 18:59:14 server zed[7824]: zfs_unavail_pool: examining 'datapoolssd' (state 7)
Oct  2 18:59:14 server zed[7824]: Diagnosis Engine: discarding 'resource.fs.zfs.statechange for vdev 11610466796109647632
Oct  2 18:59:14 server zed[7824]: Retire Agent: zfs_retire_recv: 'resource.fs.zfs.statechange'
Oct  2 18:59:14 server zed[7824]: Diagnosis Engine: ignoring 'ereport.fs.zfs.zpool' during import
Oct  2 19:16:58 server kernel: [<ffffffffa04e07c2>] zvol_create_minors_impl+0xe2/0x220 [zfs]
Oct  2 19:16:58 server kernel: [<ffffffffa04e1bac>] zvol_task_cb+0x10c/0x150 [zfs]
Oct  2 19:24:58 server kernel: [<ffffffffa04e07c2>] zvol_create_minors_impl+0xe2/0x220 [zfs]
Oct  2 19:24:58 server kernel: [<ffffffffa04e1bac>] zvol_task_cb+0x10c/0x150 [zfs]
@loli10K
Contributor

loli10K commented Oct 7, 2017

Please don't remove the issue template.

Making sure drive is reconnected and issuing a zpool clear just hangs, even if the drive is disconnected long enough to bring it back up as sdd again.

This should be fixed by 3f759c0; trying to zpool clear a suspended pool doesn't hang on my box:

root@linux:~# zpool create testpool /dev/sdc
[  418.107528]  sdc: sdc1 sdc9
[  418.189504]  sdc: sdc1 sdc9
root@linux:~# echo 1 > /sys/block/sdc/device/delete
[  424.689862] sd 8:0:0:0: [sdc] Synchronizing SCSI cache
root@linux:~# (zpool scrub testpool) &
[1] 2243
root@linux:~# [  429.550517] WARNING: Pool 'testpool' has encountered an uncorrectable I/O failure and has been suspended.
[  429.550517] 

root@linux:~# zpool status
  pool: testpool
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: scrub in progress since Sat Oct  7 06:47:19 2017
	79.5K scanned out of 97.5K at 9.94K/s, 0h0m to go
	0B repaired, 81.54% done
config:

	NAME        STATE     READ WRITE CKSUM
	testpool    UNAVAIL      0     0     0  insufficient replicas
	  sdc       UNAVAIL      0     0     0  corrupted data
errors: List of errors unavailable: pool I/O is currently suspended

errors: 39 data errors, use '-v' for a list
root@linux:~# time zpool clear testpool
[  443.227400] WARNING: Pool 'testpool' has encountered an uncorrectable I/O failure and has been suspended.
[  443.227400] 
cannot clear errors for testpool: I/O error

real	0m1.602s
user	0m0.000s
sys	0m0.004s
root@linux:~# 

If the disk comes back online, I can reopen the pool.

@bjquinn
Author

bjquinn commented Oct 7, 2017

@loli10K apologies, I hadn't posted an issue in a while and didn't read the template instructions carefully enough. I've attempted to resolve that in my original post.

Thanks for the response. #6399 seemed to indicate that the fix you referenced was in 0.7.1, but the commit does appear to be tagged 0.7.2. If that's correct, then I'll just upgrade to 0.7.2, see what happens, and we can close the issue.
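For reference, here's how I can double-check which ZFS build is actually loaded on this box (a minimal sketch; the package names assume the CentOS DKMS install listed in the system information above):

$ cat /sys/module/zfs/version    # version of the loaded kernel module
$ rpm -q zfs-dkms spl            # installed package versions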

@loli10K
Contributor

loli10K commented Oct 7, 2017

3f759c0 landed in zfs-0.7.0, so you should already have that commit.

When zpool clear hangs, get its kernel stack (via /proc/$pid/stack) and post it here.
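For example, something like this should capture it (a rough sketch; it assumes only one zpool process is running, so adjust the pid lookup otherwise):

$ pidof zpool                       # pid of the hung zpool clear
$ cat /proc/$(pidof zpool)/stack    # its kernel stack
$ echo w > /proc/sysrq-trigger      # optionally dump all blocked tasks to dmesg, if sysrq is enabled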

@bjquinn
Author

bjquinn commented Oct 7, 2017

Got it, I'll do that.

@bjquinn
Author

bjquinn commented Dec 29, 2017

Ok, when it happened this time, a "zpool clear backup" just gave me "cannot clear errors for backup: I/O error". Is that right? I thought we'd be able to clear and remove pools with this issue.

@loli10K
Contributor

loli10K commented Dec 30, 2017

@bjquinn that's expected: zpool clear should not block but instead return an explicit I/O error, unless the device has been reattached, in which case zpool clear will work.

Removing a suspended pool from the system without reattaching the USB device is not supported at this time; see #5242.

root@linux:~# zpool create testpool /dev/sdb -f
root@linux:~# echo 1 > /sys/block/sdb/device/delete
root@linux:~# (zpool scrub testpool) &
[1] 2442
root@linux:~# zpool status
  pool: testpool
 state: UNAVAIL
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: http://zfsonlinux.org/msg/ZFS-8000-HC
  scan: scrub in progress since Sat Dec 30 09:17:19 2017
	82.5K scanned out of 656K at 20.6K/s, 0h0m to go
	0B repaired, 12.59% done
config:

	NAME        STATE     READ WRITE CKSUM
	testpool    UNAVAIL      0     0     0  insufficient replicas
	  sdb       UNAVAIL      0     0     0  corrupted data
errors: List of errors unavailable: pool I/O is currently suspended

errors: 42 data errors, use '-v' for a list
root@linux:~# dmesg -c
[  193.809806]  sdb: sdb1 sdb9
[  193.840895]  sdb: sdb1 sdb9
[  202.294131] sd 8:0:0:0: [sdb] Synchronizing SCSI cache
[  209.347455] WARNING: Pool 'testpool' has encountered an uncorrectable I/O failure and has been suspended.

root@linux:~# echo '- - -' > /sys/class/scsi_host/host8/scan 
root@linux:~# time zpool clear testpool

real	0m0.074s
user	0m0.000s
sys	0m0.004s
root@linux:~# dmesg -c
[  228.301102] scsi 8:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    1.6. PQ: 0 ANSI: 5
[  228.302885] sd 8:0:0:0: Attached scsi generic sg2 type 0
[  228.306383] sd 8:0:0:0: [sdb] 10485760 512-byte logical blocks: (5.37 GB/5.00 GiB)
[  228.309404] sd 8:0:0:0: [sdb] Write Protect is off
[  228.309410] sd 8:0:0:0: [sdb] Mode Sense: 63 00 00 08
[  228.310950] sd 8:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[  228.341669]  sdb: sdb1 sdb9
[  228.349276] sd 8:0:0:0: [sdb] Attached SCSI disk
[1]+  Done                    ( zpool scrub testpool )
root@linux:~# zpool status
  pool: testpool
 state: ONLINE
  scan: scrub repaired 0B in 0h0m with 0 errors on Sat Dec 30 09:17:47 2017
config:

	NAME        STATE     READ WRITE CKSUM
	testpool    ONLINE       0     0     0
	  sdb       ONLINE       0     0     0

errors: No known data errors
root@linux:~# 
root@linux:~# cat /sys/module/zfs/version 
0.7.0-1
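If you need to bring the device back by hand as in the example above and aren't sure which SCSI host to rescan, something like this can help (a sketch only; hostN is a placeholder and lsscsi may not be installed):

root@linux:~# ls /sys/class/scsi_host/                      # list the hostN entries
root@linux:~# lsscsi                                        # map each host to its attached devices
root@linux:~# echo '- - -' > /sys/class/scsi_host/hostN/scan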

@bjquinn
Author

bjquinn commented Dec 30, 2017

Ok thanks. In this case, I did have the same device reattached, and still got the I/O error. It took a reboot to clear up the issue.

@mailinglists35

EDIT: I should add that the current state is still an improvement over 0.6.5.x

Not for me; using 0.7.5:

I've just encountered a case where, after the kernel message "WARNING: Pool 'name' has encountered an uncorrectable I/O failure and has been suspended", zpool status hangs forever.

Setup: three-way mirror on top of VeraCrypt volumes.

  1. Attempt zpool export, but forget that a dataset is still mounted by Samba. Stop Samba, but after that, forget to export the pool.
  2. On VeraCrypt: dismount all volumes (however, "dmsetup info" still shows ACTIVE and "Open count: 1").
  3. On the system: hot-unplug the drives (echo 1 > /sys/block/sdh/device/delete).
  4. zpool export -af (oops, too late! it hangs...)

Result:
5. zpool status hangs forever. Rebooting the machine is the only way to resume using any ZFS feature.

@bjquinn
Author

bjquinn commented Feb 3, 2018

Interestingly, this happened to me again this week. Initially, a zpool clear immediately recovered the pool, which at that point reported some errors and started a scrub. That's the first time I've seen a pool recover from a suspended state like that. I don't know what the result of the scrub would have been, since it ran for a couple of hours and then the zpool commands started hanging again, and I had to do a hard reboot. I'm still on 0.7.1.

@loli10K
Contributor

loli10K commented Feb 3, 2018

I've just encountered a case where, after the kernel message "WARNING: Pool 'name' has encountered an uncorrectable I/O failure and has been suspended", zpool status hangs forever.

Setup: three-way mirror on top of VeraCrypt volumes.

  1. Attempt zpool export, but forget that a dataset is still mounted by Samba. Stop Samba, but after that, forget to export the pool.
  2. On VeraCrypt: dismount all volumes (however, "dmsetup info" still shows ACTIVE and "Open count: 1").
  3. On the system: hot-unplug the drives (echo 1 > /sys/block/sdh/device/delete).
  4. zpool export -af (oops, too late! it hangs...)

Result:
5. zpool status hangs forever. Rebooting the machine is the only way to resume using any ZFS feature.

I believe this has already been reported and is being tracked in #6649.

@zviratko

zviratko commented Mar 7, 2018

I believe this only makes zpool clear not block, but doesn't fix the real issue: when the device is reattached, the pool's I/O cannot be resumed.

That is most likely because the reattached device ends up with a different minor number (e.g. sdb instead of sda). How does it get "reattached" to the pool then? There needs to be a way to either replace the device with the new instance, or to export and import the pool without rebooting (preferably both).
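To illustrate, these are the kinds of commands one would expect to work once the disk re-enumerates (a sketch only; 'backup' and the angle-bracket arguments are placeholders, and in practice these can still fail or hang while the pool is suspended):

$ zpool clear backup                                  # resume pool I/O if the original vdev is usable again
$ zpool online backup <by-id-path-of-the-vdev>        # explicitly bring the vdev back online
$ zpool replace backup <old-vdev> <new-by-id-path>    # only if the disk really is a different device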

@bjquinn
Author

bjquinn commented Mar 7, 2018

@zviratko Yes, I believe you're right that this doesn't fix the real issue. Additionally, I have definitely run into the case where zpool clear no longer hangs, but the device still won't reattach even though I'm sure it came back up under the same device name (sdc/sdd, etc.), and in any case I'm obviously not using /dev/sdX device names to create the pool.

@mailinglists35

mailinglists35 commented Mar 7, 2018

@zviratko the root cause is that the ZFS code lacks the ability to drop a suspended pool from memory on demand, and in my understanding that is tricky to accomplish.

You need to follow this issue, then; only when that issue is solved will all the other suspended/unavailable-pool related issues be solved.

@sneak

sneak commented Oct 10, 2018

root@las1:~# ps auxww | grep zpool
root      1177  0.0  0.0  37424  4124 pts/3    D+   19:18   0:00 zpool clear testing
root      3131  0.0  0.0  37396  3948 pts/4    D+   19:19   0:00 zpool status
root      3944  0.0  0.0  37488  2144 pts/0    D+   19:19   0:00 zpool status
root      5282  0.0  0.0  37488  2132 pts/5    D+   19:20   0:00 zpool import -d /dev/disk/by-id/usb-SanDisk_Ultra_USB_3.0_4C530001150825122224-0:0 /dev/disk/by-id/usb-SanDisk_Ultra_USB_3.0_4C530001180825122185-0:0 /dev/disk/by-id/usb-SanDisk_Ultra_USB_3.0_4C530001210825122232-0:0
root      6638  0.0  0.0  13136  1064 pts/6    S+   19:21   0:00 grep --color=auto zpool
root@las1:~# cd /proc/1177
root@las1:/proc/1177# ls
attr        cmdline          environ  io         mem         ns             pagemap      sched      smaps_rollup  syscall        wchan
autogroup   comm             exe      limits     mountinfo   numa_maps      patch_state  schedstat  stack         task
auxv        coredump_filter  fd       loginuid   mounts      oom_adj        personality  sessionid  stat          timers
cgroup      cpuset           fdinfo   map_files  mountstats  oom_score      projid_map   setgroups  statm         timerslack_ns
clear_refs  cwd              gid_map  maps       net         oom_score_adj  root         smaps      status        uid_map
root@las1:/proc/1177# cat stack
[<0>] cv_wait_common+0x11e/0x140 [spl]
[<0>] __cv_wait+0x15/0x20 [spl]
[<0>] txg_wait_synced+0xdd/0x130 [zfs]
[<0>] spa_vdev_state_exit+0x8a/0x160 [zfs]
[<0>] zfs_ioc_clear+0x181/0x300 [zfs]
[<0>] zfsdev_ioctl+0x1e0/0x610 [zfs]
[<0>] do_vfs_ioctl+0xa8/0x630
[<0>] SyS_ioctl+0x79/0x90
[<0>] do_syscall_64+0x73/0x130
[<0>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[<0>] 0xffffffffffffffff
root@las1:/proc/1177# 

@sneak

sneak commented Oct 10, 2018

Mine's whatever ships with Ubuntu 18.04.1 on 4.15.0-36-generic.

libzfs2linux/bionic-updates,now 0.7.5-1ubuntu16.4 amd64 [installed,automatic]
zfs-zed/bionic-updates,now 0.7.5-1ubuntu16.4 amd64 [installed,automatic]
zfsutils-linux/bionic-updates,now 0.7.5-1ubuntu16.4 amd64 [installed]

@glibg10b
Contributor

$ sudo cat /proc/$(pidof zpool)/stack
[<0>] cv_wait_common+0xab/0x140 [spl]
[<0>] __cv_wait_io+0x18/0x30 [spl]
[<0>] txg_wait_synced_impl+0x105/0x150 [zfs]
[<0>] txg_wait_synced+0x10/0x60 [zfs]
[<0>] dmu_tx_wait+0x1da/0x1e0 [zfs]
[<0>] dmu_tx_assign+0x48/0x80 [zfs]
[<0>] spa_history_log_nvl+0x94/0x1f0 [zfs]
[<0>] spa_history_log+0x39/0x60 [zfs]
[<0>] zfs_ioc_log_history+0xca/0x150 [zfs]
[<0>] zfsdev_ioctl_common+0x334/0x760 [zfs]
[<0>] zfsdev_ioctl+0x57/0xf0 [zfs]
[<0>] __x64_sys_ioctl+0xa3/0xf0
[<0>] x64_sys_call+0x1d12/0x21c0
[<0>] do_syscall_64+0x56/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x73/0xdd
$ zpool --version
zfs-2.2.0-0ubuntu1~23.10.3
zfs-kmod-2.2.0-0ubuntu1~23.10.2
