deadlock running zpool attach or zpool replace #9256

Closed
gyakovlev opened this issue Aug 30, 2019 · 4 comments

Labels
Type: Defect (Incorrect behavior, e.g. crash, hang)

Comments


gyakovlev commented Aug 30, 2019

System information

Type                 | Version/Name
---------------------|---------------------
Distribution Name    | Gentoo
Distribution Version | ~ppc64le
Linux Kernel         | 5.2.10-gentoo
Architecture         | ppc64le
ZFS Version          | 0.8.0-217_ge6cebbf8
SPL Version          | 0.8.0-217_ge6cebbf8

Describe the problem you're observing

The system is ppc64le, running with 64k pages and 512G of memory.

I have a mirrored root pool consisting of two SATA SSDs, and a couple of NVMe drives I'd like to attach to the pool.
After attaching the NVMe devices, the plan is to remove the SATA devices from the pool:

zpool attach zroot /dev/sda3 /dev/nvme0n1p3
zpool attach zroot /dev/sdb3 /dev/nvme1n1p3
zpool remove zroot /dev/sda3
zpool remove zroot /dev/sdb3

But as soon as I run zpool attach zroot /dev/sda3 /dev/nvme0n1p3 (not the actual command; I use by-id links, simplified for readability), all I/O on the system hangs.
I can still write to non-ZFS filesystems, but eventually every read hangs, because the system root is on that pool.

Describe how to reproduce the problem

Running zpool attach zroot /dev/sda3 /dev/nvme0n1p3 or
zpool replace zroot /dev/old /dev/new is enough to completely deadlock the system.

Include any warning/errors/backtraces from the system logs

Nothing ends up in zpool history.

NAME                                                       USED  AVAIL     REFER  MOUNTPOINT
zroot                                                     18.5G   105G       96K  none

The pool is at the 0.8.1 feature set; nothing newer is enabled.

NAME   PROPERTY                       VALUE                          SOURCE
zroot  size                           127G                           -
zroot  capacity                       14%                            -
zroot  altroot                        -                              default
zroot  health                         ONLINE                         -
zroot  guid                           14075524484036841250           -
zroot  version                        -                              default
zroot  bootfs                         -                              default
zroot  delegation                     on                             default
zroot  autoreplace                    off                            default
zroot  cachefile                      -                              default
zroot  failmode                       wait                           default
zroot  listsnapshots                  off                            default
zroot  autoexpand                     off                            default
zroot  dedupratio                     1.00x                          -
zroot  free                           108G                           -
zroot  allocated                      18.5G                          -
zroot  readonly                       off                            -
zroot  ashift                         12                             local
zroot  comment                        -                              default
zroot  expandsize                     -                              -
zroot  freeing                        0                              -
zroot  fragmentation                  14%                            -
zroot  leaked                         0                              -
zroot  multihost                      off                            default
zroot  checkpoint                     -                              -
zroot  load_guid                      18211364347063391026           -
zroot  autotrim                       on                             local
zroot  feature@async_destroy          enabled                        local
zroot  feature@empty_bpobj            active                         local
zroot  feature@lz4_compress           active                         local
zroot  feature@multi_vdev_crash_dump  enabled                        local
zroot  feature@spacemap_histogram     active                         local
zroot  feature@enabled_txg            active                         local
zroot  feature@hole_birth             active                         local
zroot  feature@extensible_dataset     active                         local
zroot  feature@embedded_data          active                         local
zroot  feature@bookmarks              enabled                        local
zroot  feature@filesystem_limits      enabled                        local
zroot  feature@large_blocks           enabled                        local
zroot  feature@large_dnode            active                         local
zroot  feature@sha512                 enabled                        local
zroot  feature@skein                  enabled                        local
zroot  feature@edonr                  enabled                        local
zroot  feature@userobj_accounting     active                         local
zroot  feature@encryption             enabled                        local
zroot  feature@project_quota          active                         local
zroot  feature@device_removal         enabled                        local
zroot  feature@obsolete_counts        enabled                        local
zroot  feature@zpool_checkpoint       enabled                        local
zroot  feature@spacemap_v2            active                         local
zroot  feature@allocation_classes     enabled                        local
zroot  feature@resilver_defer         enabled                        local
zroot  feature@bookmark_v2            enabled                        local
zroot  feature@redaction_bookmarks    disabled                       local
zroot  feature@redacted_datasets      disabled                       local
zroot  feature@bookmark_written       disabled                       local
zroot  feature@log_spacemap           disabled                       local
zroot  feature@livelist               disabled                       local
INFO: task txg_sync:2541 blocked for more than 122 seconds.
      Tainted: P           O    T 5.2.10-gentoo #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
txg_sync        D    0  2540      2 0x00000808
Call Trace:
[c000003fce40b5c0] [c000003fce40b610] 0xc000003fce40b610 (unreliable)
[c000003fce40b7a0] [c00000000001ec9c] __switch_to+0x2ec/0x460
[c000003fce40b800] [c0000000007850f0] __schedule+0x230/0x640
[c000003fce40b8c0] [c00000000078553c] schedule+0x3c/0x100
[c000003fce40b8f0] [c00800000dd90694] cv_wait_common+0x23c/0x450 [spl]
[c000003fce40b9c0] [c008000013430480] spa_config_enter+0x1e8/0x350 [zfs]
[c000003fce40ba80] [c00800001343a138] spa_txg_history_fini_io+0x70/0x348 [zfs]
[c000003fce40bb80] [c00800001344090c] txg_sync_thread+0x484/0x670 [zfs]
[c000003fce40bd20] [c00800000dd9f748] thread_generic_wrapper+0xb0/0x130 [spl]
[c000003fce40bdb0] [c0000000000e94ac] kthread+0x18c/0x1a0
[c000003fce40be20] [c00000000000bc94] ret_from_kernel_thread+0x5c/0x68
INFO: task mmp:2541 blocked for more than 122 seconds.
      Tainted: P           O    T 5.2.10-gentoo #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mmp             D    0  2541      2 0x00000800
Call Trace:
[c000003fce41b870] [c00000000001ec9c] __switch_to+0x2ec/0x460
[c000003fce41b8d0] [c0000000007850f0] __schedule+0x230/0x640
[c000003fce41b990] [c00000000078553c] schedule+0x3c/0x100
[c000003fce41b9c0] [c00800000dd90694] cv_wait_common+0x23c/0x450 [spl]
[c000003fce41ba90] [c008000013430480] spa_config_enter+0x1e8/0x350 [zfs]
[c000003fce41bb50] [c008000013445490] vdev_count_leaves+0x38/0x80 [zfs]
[c000003fce41bb90] [c0080000133fb398] mmp_thread+0x370/0xab0 [zfs]
[c000003fce41bd20] [c00800000dd9f748] thread_generic_wrapper+0xb0/0x130 [spl]
[c000003fce41bdb0] [c0000000000e94ac] kthread+0x18c/0x1a0
[c000003fce41be20] [c00000000000bc94] ret_from_kernel_thread+0x5c/0x68
INFO: task zed:7338 blocked for more than 122 seconds.
      Tainted: P           O    T 5.2.10-gentoo #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zed             D    0  7338      1 0x00040000
Call Trace:
[c000003f741878d0] [c00000000001ec9c] __switch_to+0x2ec/0x460
[c000003f74187930] [c0000000007850f0] __schedule+0x230/0x640
[c000003f741879f0] [c00000000078553c] schedule+0x3c/0x100
[c000003f74187a20] [c000000000785ae8] schedule_preempt_disabled+0x18/0x30
[c000003f74187a40] [c000000000787d6c] __mutex_lock.isra.0+0x2dc/0x710
[c000003f74187ae0] [c008000013428674] spa_all_configs+0x7c/0x260 [zfs]
[c000003f74187b80] [c0080000134b4030] zfs_ioc_pool_configs+0x28/0xd0 [zfs]
[c000003f74187bb0] [c0080000134bcef4] zfsdev_ioctl+0xb9c/0xf90 [zfs]
[c000003f74187d00] [c0000000002ee53c] do_vfs_ioctl+0x9ac/0xc60
[c000003f74187db0] [c0000000002ee8a4] ksys_ioctl+0xb4/0x100
[c000003f74187e00] [c0000000002ee910] sys_ioctl+0x20/0x80
[c000003f74187e20] [c00000000000b8ac] system_call+0x5c/0x70
INFO: task zpool:128406 blocked for more than 122 seconds.
      Tainted: P           O    T 5.2.10-gentoo #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zpool           D    0 128406 126688 0x00040008
Call Trace:
[c000002e0b5e6730] [c00000000001ec9c] __switch_to+0x2ec/0x460
[c000002e0b5e6790] [c0000000007850f0] __schedule+0x230/0x640
[c000002e0b5e6850] [c00000000078553c] schedule+0x3c/0x100
[c000002e0b5e6880] [c00800000dd90694] cv_wait_common+0x23c/0x450 [spl]
[c000002e0b5e6950] [c008000013430480] spa_config_enter+0x1e8/0x350 [zfs]
[c000002e0b5e6a10] [c0080000134f07e4] zfs_blkptr_verify+0x33c/0x4f0 [zfs]
[c000002e0b5e6ab0] [c0080000134f8fc4] zio_read+0x6c/0x140 [zfs]
[c000002e0b5e6ba0] [c008000013338688] arc_read+0x8d0/0x20b0 [zfs]
[c000002e0b5e6d00] [c008000013353c84] dbuf_read_impl.constprop.0+0x2dc/0xea0 [zfs]
[c000002e0b5e6e80] [c008000013354afc] dbuf_read+0x2b4/0x7f0 [zfs]
[c000002e0b5e6f60] [c008000013367aac] dmu_buf_hold_array_by_dnode+0x194/0x710 [zfs]
[c000002e0b5e7050] [c00800001336a158] dmu_read_uio_dnode+0x70/0x1c0 [zfs]
[c000002e0b5e7110] [c00800001336a324] dmu_read_uio_dbuf+0x7c/0xc0 [zfs]
[c000002e0b5e7150] [c0080000134d9b28] zfs_read+0x170/0x5e0 [zfs]
[c000002e0b5e7240] [c00800001350e964] zpl_read_common_iovec+0xac/0x1d0 [zfs]
[c000002e0b5e7320] [c00800001350eb9c] zpl_iter_read+0x114/0x1e0 [zfs]
[c000002e0b5e7400] [c0000000002cc6a4] new_sync_read+0x164/0x1f0
[c000002e0b5e74b0] [c0000000002cf66c] vfs_read+0xfc/0x1e0
[c000002e0b5e7500] [c0000000002cf7a0] kernel_read+0x50/0x90
[c000002e0b5e7530] [c00800000dda2934] vn_rdwr+0x10c/0x210 [spl]
[c000002e0b5e75d0] [c00800000dd97448] kobj_read_file+0x60/0xc0 [spl]
[c000002e0b5e7660] [c00800000dd921ac] zone_get_hostid+0x104/0x180 [spl]
[c000002e0b5e76f0] [c008000013438844] spa_get_hostid+0x1c/0x38 [zfs]
[c000002e0b5e7710] [c0080000134289f8] spa_config_generate+0x1a0/0x610 [zfs]
[c000002e0b5e77e0] [c008000013464210] vdev_label_init+0x1b8/0xc80 [zfs]
[c000002e0b5e7910] [c0080000134640f8] vdev_label_init+0xa0/0xc80 [zfs]
[c000002e0b5e7a40] [c008000013452490] vdev_create+0x98/0xe0 [zfs]
[c000002e0b5e7a80] [c0080000134247e4] spa_vdev_attach+0x14c/0xb40 [zfs]
[c000002e0b5e7b50] [c0080000134b0404] zfs_ioc_vdev_attach+0xec/0x120 [zfs]
[c000002e0b5e7bb0] [c0080000134bcef4] zfsdev_ioctl+0xb9c/0xf90 [zfs]
[c000002e0b5e7d00] [c0000000002ee53c] do_vfs_ioctl+0x9ac/0xc60
[c000002e0b5e7db0] [c0000000002ee8a4] ksys_ioctl+0xb4/0x100
[c000002e0b5e7e00] [c0000000002ee910] sys_ioctl+0x20/0x80
[c000002e0b5e7e20] [c00000000000b8ac] system_call+0x5c/0x70
@gyakovlev

Just to add some info: at the time of the attempt the system is 99.999% idle, with no serious I/O beyond the usual background daemons (syslog, cron, etc.).
The system has 44 cores, 176 threads.

The error triggers every time; it is very reproducible.
If I create a pool on the NVMe drives, it operates normally.
I will try attaching devices to a non-root pool and see if it hangs.


loli10K commented Sep 1, 2019

It seems this was accidentally introduced with dc04a8c: we take a read lock in zfs_blkptr_verify() while already holding a write lock taken in spa_vdev_attach() -> spa_vdev_enter() -> spa_vdev_config_enter():

INFO: task zpool:128406 blocked for more than 122 seconds.
      Tainted: P           O    T 5.2.10-gentoo #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
zpool           D    0 128406 126688 0x00040008
Call Trace:
[c000002e0b5e6730] [c00000000001ec9c] __switch_to+0x2ec/0x460
[c000002e0b5e6790] [c0000000007850f0] __schedule+0x230/0x640
[c000002e0b5e6850] [c00000000078553c] schedule+0x3c/0x100
[c000002e0b5e6880] [c00800000dd90694] cv_wait_common+0x23c/0x450 [spl]
[c000002e0b5e6950] [c008000013430480] spa_config_enter+0x1e8/0x350 [zfs]
[c000002e0b5e6a10] [c0080000134f07e4] zfs_blkptr_verify+0x33c/0x4f0 [zfs]   <--- trying read lock
[c000002e0b5e6ab0] [c0080000134f8fc4] zio_read+0x6c/0x140 [zfs]
[c000002e0b5e6ba0] [c008000013338688] arc_read+0x8d0/0x20b0 [zfs]
[c000002e0b5e6d00] [c008000013353c84] dbuf_read_impl.constprop.0+0x2dc/0xea0 [zfs]
[c000002e0b5e6e80] [c008000013354afc] dbuf_read+0x2b4/0x7f0 [zfs]
[c000002e0b5e6f60] [c008000013367aac] dmu_buf_hold_array_by_dnode+0x194/0x710 [zfs]
[c000002e0b5e7050] [c00800001336a158] dmu_read_uio_dnode+0x70/0x1c0 [zfs]
[c000002e0b5e7110] [c00800001336a324] dmu_read_uio_dbuf+0x7c/0xc0 [zfs]
[c000002e0b5e7150] [c0080000134d9b28] zfs_read+0x170/0x5e0 [zfs]
[c000002e0b5e7240] [c00800001350e964] zpl_read_common_iovec+0xac/0x1d0 [zfs]
[c000002e0b5e7320] [c00800001350eb9c] zpl_iter_read+0x114/0x1e0 [zfs]
[c000002e0b5e7400] [c0000000002cc6a4] new_sync_read+0x164/0x1f0
[c000002e0b5e74b0] [c0000000002cf66c] vfs_read+0xfc/0x1e0
[c000002e0b5e7500] [c0000000002cf7a0] kernel_read+0x50/0x90
[c000002e0b5e7530] [c00800000dda2934] vn_rdwr+0x10c/0x210 [spl]
[c000002e0b5e75d0] [c00800000dd97448] kobj_read_file+0x60/0xc0 [spl]
[c000002e0b5e7660] [c00800000dd921ac] zone_get_hostid+0x104/0x180 [spl]
[c000002e0b5e76f0] [c008000013438844] spa_get_hostid+0x1c/0x38 [zfs]
[c000002e0b5e7710] [c0080000134289f8] spa_config_generate+0x1a0/0x610 [zfs]
[c000002e0b5e77e0] [c008000013464210] vdev_label_init+0x1b8/0xc80 [zfs]
[c000002e0b5e7910] [c0080000134640f8] vdev_label_init+0xa0/0xc80 [zfs]
[c000002e0b5e7a40] [c008000013452490] vdev_create+0x98/0xe0 [zfs]
[c000002e0b5e7a80] [c0080000134247e4] spa_vdev_attach+0x14c/0xb40 [zfs]    <--- grabbed write lock
[c000002e0b5e7b50] [c0080000134b0404] zfs_ioc_vdev_attach+0xec/0x120 [zfs]
[c000002e0b5e7bb0] [c0080000134bcef4] zfsdev_ioctl+0xb9c/0xf90 [zfs]
[c000002e0b5e7d00] [c0000000002ee53c] do_vfs_ioctl+0x9ac/0xc60
[c000002e0b5e7db0] [c0000000002ee8a4] ksys_ioctl+0xb4/0x100
[c000002e0b5e7e00] [c0000000002ee910] sys_ioctl+0x20/0x80
[c000002e0b5e7e20] [c00000000000b8ac] system_call+0x5c/0x70
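
In other words, the `zpool attach` thread already holds the pool config lock as a writer, then ends up reading /etc/hostid from a dataset in that same pool while generating the vdev label; the read path requests the same lock as a reader and waits forever. Below is a minimal C sketch of that lock behavior; the type and function names are simplified stand-ins, not the actual OpenZFS spa_config_lock_t code.

```c
/*
 * Simplified illustration of the self-deadlock; names and fields are
 * stand-ins, not the real OpenZFS spa_config_lock_t implementation.
 */
#include <pthread.h>
#include <stdbool.h>

typedef struct {
	pthread_mutex_t mtx;
	pthread_cond_t  cv;
	bool            writer_held;   /* write side taken (e.g. by the attach path) */
	pthread_t       writer;
	int             readers;
} cfg_lock_t;

/* Write side, as taken via spa_vdev_enter() before attaching the new vdev. */
void
cfg_enter_writer(cfg_lock_t *cl)
{
	pthread_mutex_lock(&cl->mtx);
	while (cl->writer_held || cl->readers > 0)
		pthread_cond_wait(&cl->cv, &cl->mtx);
	cl->writer_held = true;
	cl->writer = pthread_self();
	pthread_mutex_unlock(&cl->mtx);
}

/*
 * Read side, as requested by zfs_blkptr_verify() when the hostid lookup
 * turns into a ZFS read on the same pool.  The wait condition only asks
 * "is any writer present", not "am I that writer", so the attach thread
 * blocks on itself and never wakes up.
 */
void
cfg_enter_reader(cfg_lock_t *cl)
{
	pthread_mutex_lock(&cl->mtx);
	while (cl->writer_held)         /* true even when writer == pthread_self() */
		pthread_cond_wait(&cl->cv, &cl->mtx);
	cl->readers++;
	pthread_mutex_unlock(&cl->mtx);
}
```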

@gyakovlev you should be able to work around this temporarily by booting with spl.spl_hostid=<your-hostid> added to the kernel cmdline.


gyakovlev commented Sep 1, 2019

@loli10K I confirm the workaround works; I was able to attach the partition normally after booting with spl.spl_hostid=0x<id>.
Thanks!

behlendorf added the Type: Defect label Sep 3, 2019
@behlendorf

@loli10K thanks for looking into this. I've proposed a fix for the issue in #9285. @gyakovlev if it's not too much to ask, it would be great if you could verify it resolves the issue without the need for the suggested workaround.

mattmacy pushed a commit to zfsonfreebsd/ZoF that referenced this issue Sep 10, 2019
Accidentally introduced by dc04a8c which now takes the SCL_VDEV lock
as a reader in zfs_blkptr_verify().  A deadlock can occur if the
/etc/hostid file resides on a dataset in the same pool.  This is
because reading the /etc/hostid file may occur while the caller is
holding the SCL_VDEV lock as a writer.  For example, to perform a
`zpool attach` as shown in the abbreviated stack below.

To resolve the issue we cache the system's hostid when initializing
the spa_t, or when modifying the multihost property.  The cached
value is then relied upon for subsequent accesses.

Call Trace:
    spa_config_enter+0x1e8/0x350 [zfs]
    zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock
    zio_read+0x6c/0x140 [zfs]
    ...
    vfs_read+0xfc/0x1e0
    kernel_read+0x50/0x90
    ...
    spa_get_hostid+0x1c/0x38 [zfs]
    spa_config_generate+0x1a0/0x610 [zfs]
    vdev_label_init+0xa0/0xc80 [zfs]
    vdev_create+0x98/0xe0 [zfs]
    spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#9256 
Closes openzfs#9285
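
The commit above caches the hostid on the spa_t so that later lookups never have to read /etc/hostid while the config lock is held. A rough sketch of that approach follows; field and function names are illustrative, not necessarily those used in PR #9285.

```c
/*
 * Rough sketch of the caching approach described in the commit message.
 * Names are illustrative; the real change is in openzfs/zfs PR #9285.
 */
#include <stdint.h>

/* SPL helper visible in the stack trace above; it may read /etc/hostid. */
extern uint32_t zone_get_hostid(void *zone);

typedef struct spa {
	/* ... existing pool state ... */
	uint32_t spa_hostid;    /* cached copy of the system hostid */
} spa_t;

/*
 * Called with no config lock held, when the spa_t is set up or when the
 * multihost property is changed, so the file read cannot deadlock.
 */
static void
spa_cache_hostid(spa_t *spa)
{
	spa->spa_hostid = zone_get_hostid(NULL);
}

/*
 * Later callers such as spa_config_generate() and the MMP thread use the
 * cached value instead of re-reading /etc/hostid under the config lock.
 */
uint32_t
spa_get_hostid(spa_t *spa)
{
	return (spa->spa_hostid);
}
```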
tonyhutter pushed commits to tonyhutter/zfs that referenced this issue on Sep 17, Sep 18, Sep 19, and Sep 23, 2019 (same commit message as above).
tonyhutter pushed a commit that referenced this issue Sep 26, 2019 (same commit message as above; Closes #9256, Closes #9285).