
PANIC at dmu_recv.c on receiving snapshot to encrypted file system #12732

Open
scratchings opened this issue Nov 8, 2021 · 12 comments
Labels: Component: Encryption ("native encryption" feature) · Component: Send/Recv ("zfs send/recv" feature) · Type: Defect (incorrect behavior, e.g. crash, hang)

System information

Type Version/Name
Distribution Name CentOS
Distribution Version 7.9.2009
Kernel Version 3.10.0-1160.42.2.el7.x86_64
Architecture x86_64
OpenZFS Version 2.0.6-1

Describe the problem you're observing

This issue seems very similar to #12270 and #12001 but has a different stack trace.

The system is a backup target and uses Sanoid/Syncoid to manage and transfer snapshots from several systems; it has been in operation for several years. While running ZFS 0.8.6 it suffered regular OOM-killer events on the zpool command but was otherwise stable, so it was upgraded from 0.8.6 to 2.0.6 in mid-October in the hope that this would improve reliability. It hasn't!

The system has ca. 150 ZFS file systems that are either ZFS receive targets, SSH/rsync targets, or backing storage for a minio server (an experiment with object stores, currently unused). All local file systems are encrypted with aes-256-ccm. Remote sources are a mixture of unencrypted, aes-256-ccm, and aes-256-gcm (depending on host ZFS capabilities and age). In addition, an offsite ZFS target pulls these snapshots to provide for disaster recovery. That remote system is stable and runs on largely identical hardware, but it only pulls from this host (no transfers out).

zpool list
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
backup_pool   654T   388T   266T        -         -     6%    59%  1.00x    ONLINE  -

Host has 64GB RAM.

Snapshots are being created and destroyed on both the source and the destination throughout long transfers; several of those transfers are multi-TB in size and are failing to complete between hangs.

On hang:

  • System load continues to rise
  • zpool status reports permanent data errors although no disks show errors
  • zpool status -v hangs machine

Syncoid is called with:
--no-sync-snap --create-bookmark -r --skip-parent
(I have experimented with changing the mbuffer size, with little effect)
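For concreteness, a hypothetical full invocation (the source and target dataset names here are invented; only the flags above are from this setup). It is printed rather than executed so it is safe to run anywhere:

```shell
# Hypothetical syncoid pull; SRC/DST are placeholders, not real datasets.
SRC='root@fs-source:tank/data'
DST='backup_pool/fs-source/data'
echo syncoid --no-sync-snap --create-bookmark -r --skip-parent "$SRC" "$DST"
```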

At the time of the hang the system log contains:

Nov  6 20:05:07 fs3 kernel: VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
Nov  6 20:05:07 fs3 kernel: PANIC at dmu_recv.c:1811:receive_object()
Nov  6 20:05:07 fs3 kernel: Showing stack for process 12169
Nov  6 20:05:07 fs3 kernel: CPU: 15 PID: 12169 Comm: receive_writer Kdump: loaded Tainted: P           OE  ------------   3.10.0-1160.42.2.el7.x86_64 #1
Nov  6 20:05:07 fs3 kernel: Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 2.1 06/14/2018
Nov  6 20:05:07 fs3 kernel: Call Trace:
Nov  6 20:05:07 fs3 kernel: [<ffffffff85583539>] dump_stack+0x19/0x1b
Nov  6 20:05:07 fs3 kernel: [<ffffffffc025ac5b>] spl_dumpstack+0x2b/0x30 [spl]
Nov  6 20:05:07 fs3 kernel: [<ffffffffc025ad29>] spl_panic+0xc9/0x110 [spl]
Nov  6 20:05:07 fs3 kernel: [<ffffffffc0aa476e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov  6 20:05:07 fs3 kernel: [<ffffffff85588632>] ? down_read+0x12/0x40
Nov  6 20:05:07 fs3 kernel: [<ffffffffc0abd30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov  6 20:05:07 fs3 kernel: [<ffffffffc0ac45ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov  6 20:05:07 fs3 kernel: [<ffffffffc025c6b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov  6 20:05:07 fs3 kernel: [<ffffffffc0ad12da>] receive_object+0x64a/0xc80 [zfs]
Nov  6 20:05:07 fs3 kernel: [<ffffffffc0ae403b>] ? dnode_rele_and_unlock+0x6b/0xf0 [zfs]
Nov  6 20:05:07 fs3 kernel: [<ffffffffc0ac4d60>] ? dmu_free_long_range+0x370/0x510 [zfs]
Nov  6 20:05:07 fs3 kernel: [<ffffffffc025b7b5>] ? spl_kmem_free+0x35/0x40 [spl]
Nov  6 20:05:07 fs3 kernel: [<ffffffff851a67fd>] ? list_del+0xd/0x30
Nov  6 20:05:07 fs3 kernel: [<ffffffffc0ad4904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov  6 20:05:07 fs3 kernel: [<ffffffff84ede305>] ? sched_clock_cpu+0x85/0xc0
Nov  6 20:05:07 fs3 kernel: [<ffffffffc0ad4420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  6 20:05:07 fs3 kernel: [<ffffffffc0ad4420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  6 20:05:07 fs3 kernel: [<ffffffffc0261873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  6 20:05:07 fs3 kernel: [<ffffffffc0261800>] ? __thread_exit+0x20/0x20 [spl]
Nov  6 20:05:07 fs3 kernel: [<ffffffff84ec5e61>] kthread+0xd1/0xe0
Nov  6 20:05:07 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov  6 20:05:07 fs3 kernel: [<ffffffff85595ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  6 20:05:07 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov  6 20:05:07 fs3 zed: eid=44379 class=authentication pool='backup_pool' bookmark=185192:0:0:0
Nov  6 20:06:01 fs3 systemd: Started Session 3231 of user root.
Nov  6 20:07:01 fs3 systemd: Started Session 3232 of user root.
Nov  6 20:08:01 fs3 systemd: Started Session 3233 of user root.
Nov  6 20:08:05 fs3 kernel: INFO: task txg_quiesce:7399 blocked for more than 120 seconds.
Nov  6 20:08:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 20:08:05 fs3 kernel: txg_quiesce     D ffff96bc71d09660     0  7399      2 0x00000000
Nov  6 20:08:05 fs3 kernel: Call Trace:
Nov  6 20:08:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov  6 20:08:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov  6 20:08:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  6 20:08:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b5493b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b54660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0261873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0261800>] ? __thread_exit+0x20/0x20 [spl]
Nov  6 20:08:05 fs3 kernel: [<ffffffff84ec5e61>] kthread+0xd1/0xe0
Nov  6 20:08:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov  6 20:08:05 fs3 kernel: [<ffffffff85595ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  6 20:08:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov  6 20:08:05 fs3 kernel: INFO: task receive_writer:12169 blocked for more than 120 seconds.
Nov  6 20:08:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 20:08:05 fs3 kernel: receive_writer  D ffff96add74f5280     0 12169      2 0x00000080
Nov  6 20:08:05 fs3 kernel: Call Trace:
Nov  6 20:08:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov  6 20:08:05 fs3 kernel: [<ffffffffc025ad55>] spl_panic+0xf5/0x110 [spl]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0aa476e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffff85588632>] ? down_read+0x12/0x40
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0abd30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0ac45ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc025c6b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0ad12da>] receive_object+0x64a/0xc80 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0ae403b>] ? dnode_rele_and_unlock+0x6b/0xf0 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0ac4d60>] ? dmu_free_long_range+0x370/0x510 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc025b7b5>] ? spl_kmem_free+0x35/0x40 [spl]
Nov  6 20:08:05 fs3 kernel: [<ffffffff851a67fd>] ? list_del+0xd/0x30
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0ad4904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffff84ede305>] ? sched_clock_cpu+0x85/0xc0
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0ad4420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0ad4420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0261873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0261800>] ? __thread_exit+0x20/0x20 [spl]
Nov  6 20:08:05 fs3 kernel: [<ffffffff84ec5e61>] kthread+0xd1/0xe0
Nov  6 20:08:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov  6 20:08:05 fs3 kernel: [<ffffffff85595ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  6 20:08:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov  6 20:08:05 fs3 kernel: INFO: task zfs:12640 blocked for more than 120 seconds.
Nov  6 20:08:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 20:08:05 fs3 kernel: zfs             D ffff96bc6b63a6e0     0 12640  12634 0x00000080
Nov  6 20:08:05 fs3 kernel: Call Trace:
Nov  6 20:08:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov  6 20:08:05 fs3 kernel: [<ffffffff85586e41>] schedule_timeout+0x221/0x2d0
Nov  6 20:08:05 fs3 kernel: [<ffffffff84edae02>] ? default_wake_function+0x12/0x20
Nov  6 20:08:05 fs3 kernel: [<ffffffff84ec6f7b>] ? autoremove_wake_function+0x2b/0x40
Nov  6 20:08:05 fs3 kernel: [<ffffffff84f06992>] ? ktime_get_ts64+0x52/0xf0
Nov  6 20:08:05 fs3 kernel: [<ffffffff85588a2d>] io_schedule_timeout+0xad/0x130
Nov  6 20:08:05 fs3 kernel: [<ffffffff84ec6ad6>] ? prepare_to_wait_exclusive+0x56/0x90
Nov  6 20:08:05 fs3 kernel: [<ffffffff85588ac8>] io_schedule+0x18/0x20
Nov  6 20:08:05 fs3 kernel: [<ffffffffc025a2b2>] cv_wait_common+0xb2/0x150 [spl]
Nov  6 20:08:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  6 20:08:05 fs3 kernel: [<ffffffffc025a388>] __cv_wait_io+0x18/0x20 [spl]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b54155>] txg_wait_synced_impl+0xe5/0x130 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b17d40>] ? dsl_dataset_user_hold_check_one+0x170/0x170 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b541b0>] txg_wait_synced+0x10/0x40 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b1683c>] dsl_sync_task_common+0xdc/0x2b0 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b17660>] ? dsl_onexit_hold_cleanup+0xd0/0xd0 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b17d40>] ? dsl_dataset_user_hold_check_one+0x170/0x170 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b17660>] ? dsl_onexit_hold_cleanup+0xd0/0xd0 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b16a36>] dsl_sync_task+0x26/0x30 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b18124>] dsl_dataset_user_hold+0x94/0xf0 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b95ba1>] zfs_ioc_hold+0xc1/0x1b0 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0b9d42b>] zfsdev_ioctl_common+0x51b/0x820 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffffc0bc8ff6>] zfsdev_ioctl+0x56/0xf0 [zfs]
Nov  6 20:08:05 fs3 kernel: [<ffffffff85063ab0>] do_vfs_ioctl+0x3a0/0x5b0
Nov  6 20:08:05 fs3 kernel: [<ffffffff85590678>] ? __do_page_fault+0x238/0x500
Nov  6 20:08:05 fs3 kernel: [<ffffffff85063d61>] SyS_ioctl+0xa1/0xc0
Nov  6 20:08:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov  6 20:09:01 fs3 systemd: Started Session 3234 of user root.
Nov  6 20:10:01 fs3 systemd: Started Session 3235 of user root.
Nov  6 20:10:01 fs3 systemd: Started Session 3236 of user root.
Nov  6 20:10:01 fs3 systemd: Started Session 3237 of user root.
Nov  6 20:10:02 fs3 systemd-logind: New session 3238 of user hbabk0.
Nov  6 20:10:02 fs3 systemd: Started Session 3238 of user hbabk0.
Nov  6 20:10:05 fs3 kernel: INFO: task txg_quiesce:7399 blocked for more than 120 seconds.
Nov  6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 20:10:05 fs3 kernel: txg_quiesce     D ffff96bc71d09660     0  7399      2 0x00000000
Nov  6 20:10:05 fs3 kernel: Call Trace:
Nov  6 20:10:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0b5493b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0b54660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0261873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0261800>] ? __thread_exit+0x20/0x20 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffff84ec5e61>] kthread+0xd1/0xe0
Nov  6 20:10:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov  6 20:10:05 fs3 kernel: [<ffffffff85595ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  6 20:10:05 fs3 kernel: [<ffffffff84ec5d90>] ? insert_kthread_work+0x40/0x40
Nov  6 20:10:05 fs3 kernel: INFO: task minio:9923 blocked for more than 120 seconds.
Nov  6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 20:10:05 fs3 kernel: minio           D ffff96bc96255860     0  9923      1 0x00000080
Nov  6 20:10:05 fs3 kernel: Call Trace:
Nov  6 20:10:05 fs3 kernel: [<ffffffff8558a099>] schedule_preempt_disabled+0x29/0x70
Nov  6 20:10:05 fs3 kernel: [<ffffffff855881e7>] __mutex_lock_slowpath+0xc7/0x1d0
Nov  6 20:10:05 fs3 kernel: [<ffffffff855875bf>] mutex_lock+0x1f/0x2f
Nov  6 20:10:05 fs3 kernel: [<ffffffff855800d6>] lookup_slow+0x33/0xa7
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505cd9f>] link_path_walk+0x80f/0x8b0
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505cfaa>] path_lookupat+0x7a/0x8d0
Nov  6 20:10:05 fs3 kernel: [<ffffffff84ede305>] ? sched_clock_cpu+0x85/0xc0
Nov  6 20:10:05 fs3 kernel: [<ffffffff85059137>] ? path_get+0x27/0x30
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505d82b>] filename_lookup+0x2b/0xc0
Nov  6 20:10:05 fs3 kernel: [<ffffffff85061537>] user_path_at_empty+0x67/0xc0
Nov  6 20:10:05 fs3 kernel: [<ffffffff850615a1>] user_path_at+0x11/0x20
Nov  6 20:10:05 fs3 kernel: [<ffffffff85053fe3>] vfs_fstatat+0x63/0xc0
Nov  6 20:10:05 fs3 kernel: [<ffffffff85054454>] SYSC_newfstatat+0x24/0x60
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505487e>] SyS_newfstatat+0xe/0x10
Nov  6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov  6 20:10:05 fs3 kernel: INFO: task minio:9924 blocked for more than 120 seconds.
Nov  6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 20:10:05 fs3 kernel: minio           D ffff96bc6ab4c7e0     0  9924      1 0x00000080
Nov  6 20:10:05 fs3 kernel: Call Trace:
Nov  6 20:10:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0adf90b>] dmu_tx_wait+0xbb/0x3b0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0adfc91>] dmu_tx_assign+0x91/0x490 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0ac4c8b>] dmu_free_long_range+0x29b/0x510 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0bd77e9>] zfs_trunc+0x79/0x210 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0bdc3a2>] zfs_freesp+0xa2/0x510 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0bc1693>] ? zfs_zaccess_aces_check+0x1b3/0x3e0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0bc1986>] ? zfs_zaccess_common+0xc6/0x230 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0bc2f36>] ? zfs_zaccess+0x106/0x3d0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0bd1daf>] zfs_setattr+0x92f/0x2360 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffff855875b2>] ? mutex_lock+0x12/0x2f
Nov  6 20:10:05 fs3 kernel: [<ffffffff8506cbec>] ? inode_add_lru+0x3c/0xa0
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0be3c1d>] ? zpl_xattr_get_dir+0x9d/0x1c0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffff855875b2>] ? mutex_lock+0x12/0x2f
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0b2d63b>] ? rrm_exit+0x4b/0x90 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffff85028f01>] ? __kmalloc_node+0x1d1/0x2b0
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0be432c>] ? zpl_xattr_get+0x12c/0x1d0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025ba84>] ? spl_kmem_zalloc+0xc4/0x140 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025ba84>] ? spl_kmem_zalloc+0xc4/0x140 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0be249b>] zpl_setattr+0x10b/0x1a0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffff8506e17c>] notify_change+0x30c/0x4d0
Nov  6 20:10:05 fs3 kernel: [<ffffffff8504c2a5>] do_truncate+0x75/0xc0
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505e637>] do_last+0x627/0x1340
Nov  6 20:10:05 fs3 kernel: [<ffffffff84ede305>] ? sched_clock_cpu+0x85/0xc0
Nov  6 20:10:05 fs3 kernel: [<ffffffff850293a6>] ? kmem_cache_alloc_trace+0x1d6/0x200
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505f41d>] path_openat+0xcd/0x5a0
Nov  6 20:10:05 fs3 kernel: [<ffffffff8506166d>] do_filp_open+0x4d/0xb0
Nov  6 20:10:05 fs3 kernel: [<ffffffff8506f767>] ? __alloc_fd+0x47/0x170
Nov  6 20:10:05 fs3 kernel: [<ffffffff8504d404>] do_sys_open+0x124/0x220
Nov  6 20:10:05 fs3 kernel: [<ffffffff8504d534>] SyS_openat+0x14/0x20
Nov  6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov  6 20:10:05 fs3 kernel: INFO: task minio:9926 blocked for more than 120 seconds.
Nov  6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 20:10:05 fs3 kernel: minio           D ffff96bc962547e0     0  9926      1 0x00000080
Nov  6 20:10:05 fs3 kernel: Call Trace:
Nov  6 20:10:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0adf90b>] dmu_tx_wait+0xbb/0x3b0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0bcecf9>] zfs_mkdir+0x229/0x720 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffff85124c0d>] ? security_transition_sid+0x2d/0x40
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0be2ff0>] zpl_mkdir+0xc0/0x1a0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505b80f>] vfs_mkdir+0x10f/0x1d0
Nov  6 20:10:05 fs3 kernel: [<ffffffff85061b0a>] SyS_mkdirat+0xca/0x100
Nov  6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov  6 20:10:05 fs3 kernel: INFO: task minio:9927 blocked for more than 120 seconds.
Nov  6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 20:10:05 fs3 kernel: minio           D ffff96bc6ab4d860     0  9927      1 0x00000080
Nov  6 20:10:05 fs3 kernel: Call Trace:
Nov  6 20:10:05 fs3 kernel: [<ffffffff8558a099>] schedule_preempt_disabled+0x29/0x70
Nov  6 20:10:05 fs3 kernel: [<ffffffff855881e7>] __mutex_lock_slowpath+0xc7/0x1d0
Nov  6 20:10:05 fs3 kernel: [<ffffffff855875bf>] mutex_lock+0x1f/0x2f
Nov  6 20:10:05 fs3 kernel: [<ffffffff855800d6>] lookup_slow+0x33/0xa7
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505cd9f>] link_path_walk+0x80f/0x8b0
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505cfaa>] path_lookupat+0x7a/0x8d0
Nov  6 20:10:05 fs3 kernel: [<ffffffff84ede305>] ? sched_clock_cpu+0x85/0xc0
Nov  6 20:10:05 fs3 kernel: [<ffffffff85059137>] ? path_get+0x27/0x30
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505d82b>] filename_lookup+0x2b/0xc0
Nov  6 20:10:05 fs3 kernel: [<ffffffff85061537>] user_path_at_empty+0x67/0xc0
Nov  6 20:10:05 fs3 kernel: [<ffffffff850615a1>] user_path_at+0x11/0x20
Nov  6 20:10:05 fs3 kernel: [<ffffffff85053fe3>] vfs_fstatat+0x63/0xc0
Nov  6 20:10:05 fs3 kernel: [<ffffffff85054454>] SYSC_newfstatat+0x24/0x60
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505487e>] SyS_newfstatat+0xe/0x10
Nov  6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov  6 20:10:05 fs3 kernel: INFO: task minio:10031 blocked for more than 120 seconds.
Nov  6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 20:10:05 fs3 kernel: minio           D ffff96bc66ca5860     0 10031      1 0x00000080
Nov  6 20:10:05 fs3 kernel: Call Trace:
Nov  6 20:10:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0adf90b>] dmu_tx_wait+0xbb/0x3b0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0bcecf9>] zfs_mkdir+0x229/0x720 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffff85124c0d>] ? security_transition_sid+0x2d/0x40
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0be2ff0>] zpl_mkdir+0xc0/0x1a0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505b80f>] vfs_mkdir+0x10f/0x1d0
Nov  6 20:10:05 fs3 kernel: [<ffffffff85061b0a>] SyS_mkdirat+0xca/0x100
Nov  6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a
Nov  6 20:10:05 fs3 kernel: INFO: task minio:10032 blocked for more than 120 seconds.
Nov  6 20:10:05 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  6 20:10:05 fs3 kernel: minio           D ffff96bc60da1660     0 10032      1 0x00000080
Nov  6 20:10:05 fs3 kernel: Call Trace:
Nov  6 20:10:05 fs3 kernel: [<ffffffff85589179>] schedule+0x29/0x70
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025a325>] cv_wait_common+0x125/0x150 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffff84ec6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  6 20:10:05 fs3 kernel: [<ffffffffc025a365>] __cv_wait+0x15/0x20 [spl]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0adf90b>] dmu_tx_wait+0xbb/0x3b0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0bd02a2>] zfs_rmdir+0x102/0x510 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffffc0be259d>] zpl_rmdir+0x6d/0xc0 [zfs]
Nov  6 20:10:05 fs3 kernel: [<ffffffff8505bcbc>] vfs_rmdir+0xdc/0x150
Nov  6 20:10:05 fs3 kernel: [<ffffffff85060941>] do_rmdir+0x1f1/0x220
Nov  6 20:10:05 fs3 kernel: [<ffffffff85061ba5>] SyS_unlinkat+0x25/0x40
Nov  6 20:10:05 fs3 kernel: [<ffffffff85595f92>] system_call_fastpath+0x25/0x2a

The minio hangs are, I believe, just a symptom of the underlying issue, not the cause.

Post reboot, the permanent errors are typically in snapshots (or files within snapshots), and mostly from snapshots that have since been deleted:

[root@fs3 cron.d]# zpool status -v
  pool: backup_pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 0B in 00:00:20 with 0 errors on Thu Nov  4 12:17:50 2021
config:

	NAME                   STATE     READ WRITE CKSUM
	backup_pool            ONLINE       0     0     0
	  raidz2-0             ONLINE       0     0     0
	    35000cca2537ad22c  ONLINE       0     0     0
	    35000cca25380af3c  ONLINE       0     0     0
	    35000cca2701aa128  ONLINE       0     0     0
	    35000cca2701b8930  ONLINE       0     0     0
	    35000cca2701bd878  ONLINE       0     0     0
	    35000cca2701f5750  ONLINE       0     0     0
	    35000cca2702028f4  ONLINE       0     0     0
	    35000cca270209964  ONLINE       0     0     0
	    35000cca27020d35c  ONLINE       0     0     0
	    35000cca27020d3c0  ONLINE       0     0     0
	    35000cca27020d558  ONLINE       0     0     0
	    35000cca270226748  ONLINE       0     0     0
	  raidz2-1             ONLINE       0     0     0
	    35000cca27022d244  ONLINE       0     0     0
	    35000cca270238244  ONLINE       0     0     0
	    35000cca2702477ac  ONLINE       0     0     0
	    35000cca27024830c  ONLINE       0     0     0
	    35000cca27024f468  ONLINE       0     0     0
	    35000cca2702597b4  ONLINE       0     0     0
	    35000cca27025ade8  ONLINE       0     0     0
	    35000cca27025b110  ONLINE       0     0     0
	    35000cca27025fc94  ONLINE       0     0     0
	    35000cca2702693c0  ONLINE       0     0     0
	    35000cca270269408  ONLINE       0     0     0
	    35000cca27026b290  ONLINE       0     0     0
	  raidz2-2             ONLINE       0     0     0
	    35000cca27026b320  ONLINE       0     0     0
	    35000cca27026b428  ONLINE       0     0     0
	    35000cca27026b56c  ONLINE       0     0     0
	    35000cca27026b5dc  ONLINE       0     0     0
	    35000cca27026b978  ONLINE       0     0     0
	    35000cca27026b9a8  ONLINE       0     0     0
	    35000cca27026ba40  ONLINE       0     0     0
	    35000cca270272e50  ONLINE       0     0     0
	    35000cca270273c2c  ONLINE       0     0     0
	    35000cca270273e98  ONLINE       0     0     0
	    35000cca270274174  ONLINE       0     0     0
	    35000cca27027452c  ONLINE       0     0     0
	  raidz2-3             ONLINE       0     0     0
	    35000cca2702745c4  ONLINE       0     0     0
	    35000cca270274b50  ONLINE       0     0     0
	    35000cca270275010  ONLINE       0     0     0
	    35000cca270275d14  ONLINE       0     0     0
	    35000cca2702765b8  ONLINE       0     0     0
	    35000cca2702767e0  ONLINE       0     0     0
	    35000cca270276d84  ONLINE       0     0     0
	    35000cca270278108  ONLINE       0     0     0
	    35000cca270278dfc  ONLINE       0     0     0
	    35000cca270278eb0  ONLINE       0     0     0
	    35000cca27027b57c  ONLINE       0     0     0
	    35000cca27027b584  ONLINE       0     0     0
	  raidz2-4             ONLINE       0     0     0
	    35000cca27027c418  ONLINE       0     0     0
	    35000cca270280e58  ONLINE       0     0     0
	    35000cca270281e58  ONLINE       0     0     0
	    35000cca270282c64  ONLINE       0     0     0
	    35000cca27028d27c  ONLINE       0     0     0
	    35000cca27028d748  ONLINE       0     0     0
	    35000cca27028d758  ONLINE       0     0     0
	    35000cca27028d8fc  ONLINE       0     0     0
	    35000cca27028e5d4  ONLINE       0     0     0
	    35000cca27028e758  ONLINE       0     0     0
	    35000cca27028e9dc  ONLINE       0     0     0
	    35000cca27028eba0  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x2d368>:<0x0>
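As an aside, `zpool status -v` prints the `<0x...>:<0x...>` form when it cannot resolve a dataset name; the first number appears to be the objset ID. Note that 0x2d368 is 185192, which matches the zed "class=authentication ... bookmark=185192:0:0:0" event in the first log above. If the dataset still exists, the name can be recovered via the `objsetid` property, a sketch (guarded so it only runs where the zfs CLI is present):

```shell
# 0x2d368 == 185192, the objset ID from the zed authentication event.
objset=$(printf '%d' 0x2d368)
echo "$objset"
# Map the objset ID back to a dataset name, if one still exists.
if command -v zfs >/dev/null 2>&1; then
    zfs list -r -o name,objsetid backup_pool | awk -v id="$objset" '$2 == id'
fi
```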

This happened again today (after a reboot this morning to recover from the above).

Nov  8 14:31:52 fs3 kernel: VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
Nov  8 14:31:52 fs3 kernel: PANIC at dmu_recv.c:1811:receive_object()
Nov  8 14:31:52 fs3 kernel: Showing stack for process 24727
Nov  8 14:31:52 fs3 kernel: CPU: 7 PID: 24727 Comm: receive_writer Kdump: loaded Tainted: P           OE  ------------   3.10.0-1160.42.2.el7.x86_64 #1
Nov  8 14:31:52 fs3 kernel: Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 2.1 06/14/2018
Nov  8 14:31:52 fs3 kernel: Call Trace:
Nov  8 14:31:52 fs3 kernel: [<ffffffff83983539>] dump_stack+0x19/0x1b
Nov  8 14:31:52 fs3 kernel: [<ffffffffc0521c5b>] spl_dumpstack+0x2b/0x30 [spl]
Nov  8 14:31:52 fs3 kernel: [<ffffffffc0521d29>] spl_panic+0xc9/0x110 [spl]
Nov  8 14:31:52 fs3 kernel: [<ffffffffc0b6276e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov  8 14:31:52 fs3 kernel: [<ffffffff83988632>] ? down_read+0x12/0x40
Nov  8 14:31:52 fs3 kernel: [<ffffffffc0b7b30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov  8 14:31:52 fs3 kernel: [<ffffffffc0b825ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov  8 14:31:52 fs3 kernel: [<ffffffffc05236b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov  8 14:31:52 fs3 kernel: [<ffffffffc0b8f2da>] receive_object+0x64a/0xc80 [zfs]
Nov  8 14:31:52 fs3 kernel: [<ffffffff832d31f1>] ? __wake_up_common_lock+0x91/0xc0
Nov  8 14:31:52 fs3 kernel: [<ffffffff832d2c60>] ? task_rq_unlock+0x20/0x20
Nov  8 14:31:52 fs3 kernel: [<ffffffff835a67fd>] ? list_del+0xd/0x30
Nov  8 14:31:52 fs3 kernel: [<ffffffffc0b92904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov  8 14:31:52 fs3 kernel: [<ffffffff832de305>] ? sched_clock_cpu+0x85/0xc0
Nov  8 14:31:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  8 14:31:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  8 14:31:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  8 14:31:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov  8 14:31:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov  8 14:31:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:31:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  8 14:31:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
<repeated many times>
Nov  8 14:31:54 fs3 zed: eid=33710 class=authentication pool='backup_pool' bookmark=176724:0:0:840
</repeated>
Nov  8 14:32:01 fs3 systemd: Started Session 1159 of user root.
Nov  8 14:33:01 fs3 systemd: Started Session 1160 of user root.
Nov  8 14:33:52 fs3 kernel: INFO: task txg_quiesce:8808 blocked for more than 120 seconds.
Nov  8 14:33:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 14:33:52 fs3 kernel: txg_quiesce     D ffff8d4fe8f8e8e0     0  8808      2 0x00000000
Nov  8 14:33:52 fs3 kernel: Call Trace:
Nov  8 14:33:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0521325>] cv_wait_common+0x125/0x150 [spl]
Nov  8 14:33:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0521365>] __cv_wait+0x15/0x20 [spl]
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0c1293b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0c12660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov  8 14:33:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov  8 14:33:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:33:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  8 14:33:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:33:52 fs3 kernel: INFO: task receive_writer:24727 blocked for more than 120 seconds.
Nov  8 14:33:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 14:33:52 fs3 kernel: receive_writer  D ffff8d501bafb760     0 24727      2 0x00000080
Nov  8 14:33:52 fs3 kernel: Call Trace:
Nov  8 14:33:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0521d55>] spl_panic+0xf5/0x110 [spl]
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0b6276e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov  8 14:33:52 fs3 kernel: [<ffffffff83988632>] ? down_read+0x12/0x40
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0b7b30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0b825ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov  8 14:33:52 fs3 kernel: [<ffffffffc05236b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0b8f2da>] receive_object+0x64a/0xc80 [zfs]
Nov  8 14:33:52 fs3 kernel: [<ffffffff832d31f1>] ? __wake_up_common_lock+0x91/0xc0
Nov  8 14:33:52 fs3 kernel: [<ffffffff832d2c60>] ? task_rq_unlock+0x20/0x20
Nov  8 14:33:52 fs3 kernel: [<ffffffff835a67fd>] ? list_del+0xd/0x30
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0b92904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov  8 14:33:52 fs3 kernel: [<ffffffff832de305>] ? sched_clock_cpu+0x85/0xc0
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  8 14:33:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov  8 14:33:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov  8 14:33:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:33:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  8 14:33:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:34:01 fs3 systemd: Started Session 1161 of user root.
Nov  8 14:35:01 fs3 systemd: Started Session 1162 of user root.
Nov  8 14:35:01 fs3 systemd: Started Session 1163 of user root.
Nov  8 14:35:52 fs3 kernel: INFO: task txg_quiesce:8808 blocked for more than 120 seconds.
Nov  8 14:35:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 14:35:52 fs3 kernel: txg_quiesce     D ffff8d4fe8f8e8e0     0  8808      2 0x00000000
Nov  8 14:35:52 fs3 kernel: Call Trace:
Nov  8 14:35:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0521325>] cv_wait_common+0x125/0x150 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0521365>] __cv_wait+0x15/0x20 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0c1293b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0c12660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov  8 14:35:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:35:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  8 14:35:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:35:52 fs3 kernel: INFO: task receive_writer:24727 blocked for more than 120 seconds.
Nov  8 14:35:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 14:35:52 fs3 kernel: receive_writer  D ffff8d501bafb760     0 24727      2 0x00000080
Nov  8 14:35:52 fs3 kernel: Call Trace:
Nov  8 14:35:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0521d55>] spl_panic+0xf5/0x110 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0b6276e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffff83988632>] ? down_read+0x12/0x40
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0b7b30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0b825ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc05236b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0b8f2da>] receive_object+0x64a/0xc80 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffff832d31f1>] ? __wake_up_common_lock+0x91/0xc0
Nov  8 14:35:52 fs3 kernel: [<ffffffff832d2c60>] ? task_rq_unlock+0x20/0x20
Nov  8 14:35:52 fs3 kernel: [<ffffffff835a67fd>] ? list_del+0xd/0x30
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0b92904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffff832de305>] ? sched_clock_cpu+0x85/0xc0
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov  8 14:35:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:35:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  8 14:35:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:35:52 fs3 kernel: INFO: task zfs:25014 blocked for more than 120 seconds.
Nov  8 14:35:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 14:35:52 fs3 kernel: zfs             D ffff8d502d519660     0 25014   7344 0x00000080
Nov  8 14:35:52 fs3 kernel: Call Trace:
Nov  8 14:35:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov  8 14:35:52 fs3 kernel: [<ffffffff83986e41>] schedule_timeout+0x221/0x2d0
Nov  8 14:35:52 fs3 kernel: [<ffffffffc05236cf>] ? spl_kmem_cache_free+0x19f/0x210 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc05236cf>] ? spl_kmem_cache_free+0x19f/0x210 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffff83988a2d>] io_schedule_timeout+0xad/0x130
Nov  8 14:35:52 fs3 kernel: [<ffffffff832c6ad6>] ? prepare_to_wait_exclusive+0x56/0x90
Nov  8 14:35:52 fs3 kernel: [<ffffffff83988ac8>] io_schedule+0x18/0x20
Nov  8 14:35:52 fs3 kernel: [<ffffffffc05212b2>] cv_wait_common+0xb2/0x150 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0521388>] __cv_wait_io+0x18/0x20 [spl]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0c12155>] txg_wait_synced_impl+0xe5/0x130 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0c121b0>] txg_wait_synced+0x10/0x40 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0c691dc>] zil_close+0x17c/0x240 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0c88b61>] zfsvfs_teardown+0xc1/0x2e0 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0c88d90>] zfs_suspend_fs+0x10/0x20 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0c58189>] zfs_ioc_rollback+0xd9/0x180 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0c5b42b>] zfsdev_ioctl_common+0x51b/0x820 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffffc0c86ff6>] zfsdev_ioctl+0x56/0xf0 [zfs]
Nov  8 14:35:52 fs3 kernel: [<ffffffff83463ab0>] do_vfs_ioctl+0x3a0/0x5b0
Nov  8 14:35:52 fs3 kernel: [<ffffffff833fc0c7>] ? do_munmap+0x327/0x480
Nov  8 14:35:52 fs3 kernel: [<ffffffff83463d61>] SyS_ioctl+0xa1/0xc0
Nov  8 14:35:52 fs3 kernel: [<ffffffff83995f92>] system_call_fastpath+0x25/0x2a
Nov  8 14:36:01 fs3 systemd: Started Session 1165 of user root.
Nov  8 14:37:01 fs3 systemd: Started Session 1166 of user root.
Nov  8 14:37:52 fs3 kernel: INFO: task txg_quiesce:8808 blocked for more than 120 seconds.
Nov  8 14:37:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 14:37:52 fs3 kernel: txg_quiesce     D ffff8d4fe8f8e8e0     0  8808      2 0x00000000
Nov  8 14:37:52 fs3 kernel: Call Trace:
Nov  8 14:37:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0521325>] cv_wait_common+0x125/0x150 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0521365>] __cv_wait+0x15/0x20 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0c1293b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0c12660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov  8 14:37:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:37:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  8 14:37:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:37:52 fs3 kernel: INFO: task receive_writer:24727 blocked for more than 120 seconds.
Nov  8 14:37:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 14:37:52 fs3 kernel: receive_writer  D ffff8d501bafb760     0 24727      2 0x00000080
Nov  8 14:37:52 fs3 kernel: Call Trace:
Nov  8 14:37:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0521d55>] spl_panic+0xf5/0x110 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0b6276e>] ? arc_space_consume+0x5e/0x110 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffff83988632>] ? down_read+0x12/0x40
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0b7b30f>] ? dbuf_read+0xff/0x570 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0b825ee>] ? dmu_bonus_hold_by_dnode+0xde/0x1b0 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc05236b5>] ? spl_kmem_cache_free+0x185/0x210 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0b8f2da>] receive_object+0x64a/0xc80 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffff832d31f1>] ? __wake_up_common_lock+0x91/0xc0
Nov  8 14:37:52 fs3 kernel: [<ffffffff832d2c60>] ? task_rq_unlock+0x20/0x20
Nov  8 14:37:52 fs3 kernel: [<ffffffff835a67fd>] ? list_del+0xd/0x30
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0b92904>] receive_writer_thread+0x4e4/0xa30 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffff832de305>] ? sched_clock_cpu+0x85/0xc0
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0b92420>] ? receive_process_write_record+0x180/0x180 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov  8 14:37:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:37:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  8 14:37:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:37:52 fs3 kernel: INFO: task zfs:25014 blocked for more than 120 seconds.
Nov  8 14:37:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 14:37:52 fs3 kernel: zfs             D ffff8d502d519660     0 25014   7344 0x00000080
Nov  8 14:37:52 fs3 kernel: Call Trace:
Nov  8 14:37:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov  8 14:37:52 fs3 kernel: [<ffffffff83986e41>] schedule_timeout+0x221/0x2d0
Nov  8 14:37:52 fs3 kernel: [<ffffffffc05236cf>] ? spl_kmem_cache_free+0x19f/0x210 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc05236cf>] ? spl_kmem_cache_free+0x19f/0x210 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffff83988a2d>] io_schedule_timeout+0xad/0x130
Nov  8 14:37:52 fs3 kernel: [<ffffffff832c6ad6>] ? prepare_to_wait_exclusive+0x56/0x90
Nov  8 14:37:52 fs3 kernel: [<ffffffff83988ac8>] io_schedule+0x18/0x20
Nov  8 14:37:52 fs3 kernel: [<ffffffffc05212b2>] cv_wait_common+0xb2/0x150 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0521388>] __cv_wait_io+0x18/0x20 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0c12155>] txg_wait_synced_impl+0xe5/0x130 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0c121b0>] txg_wait_synced+0x10/0x40 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0c691dc>] zil_close+0x17c/0x240 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0c88b61>] zfsvfs_teardown+0xc1/0x2e0 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0c88d90>] zfs_suspend_fs+0x10/0x20 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0c58189>] zfs_ioc_rollback+0xd9/0x180 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0c5b42b>] zfsdev_ioctl_common+0x51b/0x820 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0c86ff6>] zfsdev_ioctl+0x56/0xf0 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffff83463ab0>] do_vfs_ioctl+0x3a0/0x5b0
Nov  8 14:37:52 fs3 kernel: [<ffffffff833fc0c7>] ? do_munmap+0x327/0x480
Nov  8 14:37:52 fs3 kernel: [<ffffffff83463d61>] SyS_ioctl+0xa1/0xc0
Nov  8 14:37:52 fs3 kernel: [<ffffffff83995f92>] system_call_fastpath+0x25/0x2a
Nov  8 14:37:52 fs3 kernel: INFO: task sshd:221387 blocked for more than 120 seconds.
Nov  8 14:37:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 14:37:52 fs3 kernel: sshd            D ffff8d50007726e0     0 221387  11295 0x00000080
Nov  8 14:37:52 fs3 kernel: Call Trace:
Nov  8 14:37:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0521325>] cv_wait_common+0x125/0x150 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0521365>] __cv_wait+0x15/0x20 [spl]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0beaef3>] rrw_enter_read_impl+0x53/0x170 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0beb59f>] rrm_enter_read+0x3f/0x50 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffffc0ca1454>] zpl_show_devname+0x34/0x100 [zfs]
Nov  8 14:37:52 fs3 kernel: [<ffffffff83495f46>] show_mountinfo+0x1d6/0x2d0
Nov  8 14:37:52 fs3 kernel: [<ffffffff834707b9>] m_show+0x19/0x20
Nov  8 14:37:52 fs3 kernel: [<ffffffff83476cf0>] seq_read+0x130/0x450
Nov  8 14:37:52 fs3 kernel: [<ffffffff8344e3df>] vfs_read+0x9f/0x170
Nov  8 14:37:52 fs3 kernel: [<ffffffff8344f25f>] SyS_read+0x7f/0xf0
Nov  8 14:37:52 fs3 kernel: [<ffffffff83995f92>] system_call_fastpath+0x25/0x2a
Nov  8 14:38:01 fs3 systemd: Started Session 1167 of user root.
Nov  8 14:39:01 fs3 systemd: Started Session 1170 of user root.
Nov  8 14:39:52 fs3 kernel: INFO: task txg_quiesce:8808 blocked for more than 120 seconds.
Nov  8 14:39:52 fs3 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov  8 14:39:52 fs3 kernel: txg_quiesce     D ffff8d4fe8f8e8e0     0  8808      2 0x00000000
Nov  8 14:39:52 fs3 kernel: Call Trace:
Nov  8 14:39:52 fs3 kernel: [<ffffffff83989179>] schedule+0x29/0x70
Nov  8 14:39:52 fs3 kernel: [<ffffffffc0521325>] cv_wait_common+0x125/0x150 [spl]
Nov  8 14:39:52 fs3 kernel: [<ffffffff832c6f50>] ? wake_up_atomic_t+0x30/0x30
Nov  8 14:39:52 fs3 kernel: [<ffffffffc0521365>] __cv_wait+0x15/0x20 [spl]
Nov  8 14:39:52 fs3 kernel: [<ffffffffc0c1293b>] txg_quiesce_thread+0x2db/0x3e0 [zfs]
Nov  8 14:39:52 fs3 kernel: [<ffffffffc0c12660>] ? txg_init+0x2b0/0x2b0 [zfs]
Nov  8 14:39:52 fs3 kernel: [<ffffffffc0528873>] thread_generic_wrapper+0x73/0x80 [spl]
Nov  8 14:39:52 fs3 kernel: [<ffffffffc0528800>] ? __thread_exit+0x20/0x20 [spl]
Nov  8 14:39:52 fs3 kernel: [<ffffffff832c5e61>] kthread+0xd1/0xe0
Nov  8 14:39:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40
Nov  8 14:39:52 fs3 kernel: [<ffffffff83995ddd>] ret_from_fork_nospec_begin+0x7/0x21
Nov  8 14:39:52 fs3 kernel: [<ffffffff832c5d90>] ? insert_kthread_work+0x40/0x40

At this point, zfs sends still appear to be in operation (the remote target continues to receive data), but anything involving writes is hung.

# zpool iostat 1
               capacity     operations     bandwidth 
pool         alloc   free   read  write   read  write
-----------  -----  -----  -----  -----  -----  -----
backup_pool   388T   266T  6.02K      0   101M      0
backup_pool   388T   266T  6.18K      0  99.0M      0
backup_pool   388T   266T  6.94K      0   114M      0
backup_pool   388T   266T  5.42K      0  92.9M      0

Describe how to reproduce the problem

Use sanoid/syncoid to generate and transfer (push and pull) snapshots between systems. Wait...
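For context, the transfers are plain Syncoid replications of sanoid-managed snapshots. A representative invocation would be something like the following sketch (the host and dataset names are placeholders, not the actual configuration, and the command is guarded so it is a no-op on a machine without syncoid installed):

```shell
# Placeholder names; the real hosts/datasets are not shown in this issue.
SRC="root@source-host:tank/data"
DST="backup_pool/source-host/data"

if command -v syncoid >/dev/null 2>&1; then
    # --no-sync-snap replicates only the existing sanoid-managed snapshots
    # instead of creating a new snapshot just for the transfer.
    syncoid --no-sync-snap "$SRC" "$DST"
fi
```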

Include any warning/errors/backtraces from the system logs

@scratchings scratchings added the Type: Defect Incorrect behavior (e.g. crash, hang) label Nov 8, 2021
@rincebrain
Contributor

I think this is just another flavor of #11679, where a NULL winds up where a valid buffer is expected, and fireworks ensue.

@rincebrain rincebrain added Component: Encryption "native encryption" feature Component: Send/Recv "zfs send/recv" feature labels Nov 11, 2021
@scratchings
Author

Update on this: I'm now seeing this at least once a day, and it has also occurred on a host that receives from this one. As it stands, the backup host is effectively non-functional.

@scratchings
Author

Further update

After running two complete scrubs of the pool, the corrupted data errors went away and we were able to resume transfers. As a test, Sanoid was disabled so that no snapshot deletion would occur on a file system whilst it was being modified by a receive, and things were much more stable. However, after leaving a large (~20TB) transfer running over the weekend, I returned this morning to find it hung up again, but this time with a different error:

Nov 20 12:41:15 fs3 kernel: Modules linked in: 8021q garp mrp bonding ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack ip_set ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter drbg ansi_cprng dm_crypt iTCO_wdt iTCO_vendor_support skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr ses enclosure dm_service_time ipmi_ssif joydev lpc_ich sg i2c_i801 mei_me mei wmi ipmi_si ipmi_devintf
Nov 20 12:41:15 fs3 kernel: ipmi_msghandler acpi_power_meter acpi_pad dm_multipath dm_mod binfmt_misc ip_tables xfs libcrc32c zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mpt3sas drm crct10dif_pclmul ahci crct10dif_common i40e crc32c_intel libahci libata raid_class scsi_transport_sas ptp pps_core nfit drm_panel_orientation_quirks libnvdimm
Nov 20 12:41:15 fs3 kernel: CPU: 5 PID: 49866 Comm: zfs Kdump: loaded Tainted: P OE ------------ 3.10.0-1160.42.2.el7.x86_64 #1
Nov 20 12:41:15 fs3 kernel: Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 2.1 06/14/2018
Nov 20 12:41:15 fs3 kernel: task: ffff9209afde2100 ti: ffff9201d797c000 task.ti: ffff9201d797c000
Nov 20 12:41:15 fs3 kernel: RIP: 0010:[] [] dsl_pool_zrele_taskq+0xa/0x10 [zfs]
Nov 20 12:41:15 fs3 kernel: RSP: 0018:ffff9201d797f938 EFLAGS: 00000286
Nov 20 12:41:15 fs3 kernel: RAX: ffff920afbc34300 RBX: ffff9201d797f8c8 RCX: 0000000000000001
Nov 20 12:41:15 fs3 kernel: RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffff920af74d1800
Nov 20 12:41:15 fs3 kernel: RBP: ffff9201d797f970 R08: 0000000000000101 R09: 0000000000000001
Nov 20 12:41:15 fs3 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000010
Nov 20 12:41:15 fs3 kernel: R13: ffffffff8e78bafb R14: ffffffffffffff10 R15: ffff920afbc34300
Nov 20 12:41:15 fs3 kernel: FS: 00007fd2bdd167c0(0000) GS:ffff920afbd40000(0000) knlGS:0000000000000000
Nov 20 12:41:15 fs3 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 20 12:41:15 fs3 kernel: CR2: 00007f0d554b57f0 CR3: 00000009399cc000 CR4: 00000000007607e0
Nov 20 12:41:15 fs3 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 20 12:41:15 fs3 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 20 12:41:15 fs3 kernel: PKRU: 55555554
Nov 20 12:41:15 fs3 kernel: Call Trace:
Nov 20 12:41:15 fs3 kernel: [] ? zfsvfs_teardown+0x4f/0x2e0 [zfs]
Nov 20 12:41:15 fs3 kernel: [] zfs_umount+0x39/0x120 [zfs]
Nov 20 12:41:15 fs3 kernel: [] zfs_resume_fs+0x106/0x340 [zfs]
Nov 20 12:41:15 fs3 kernel: [] zfs_ioc_recv_impl+0xa57/0xfd0 [zfs]
Nov 20 12:41:15 fs3 kernel: [] ? dbuf_read+0x3d6/0x570 [zfs]
Nov 20 12:41:15 fs3 kernel: [] zfs_ioc_recv_new+0x2b4/0x330 [zfs]
Nov 20 12:41:15 fs3 kernel: [] ? spl_vmem_alloc+0x19/0x20 [spl]
Nov 20 12:41:15 fs3 kernel: [] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
Nov 20 12:41:15 fs3 kernel: [] ? nv_mem_zalloc.isra.13+0x30/0x40 [znvpair]
Nov 20 12:41:15 fs3 kernel: [] ? nvlist_xalloc.part.14+0x97/0x190 [znvpair]
Nov 20 12:41:15 fs3 kernel: [] zfsdev_ioctl_common+0x51b/0x820 [zfs]
Nov 20 12:41:15 fs3 kernel: [] zfsdev_ioctl+0x56/0xf0 [zfs]
Nov 20 12:41:15 fs3 kernel: [] do_vfs_ioctl+0x3a0/0x5b0
Nov 20 12:41:15 fs3 kernel: [] ? __do_page_fault+0x238/0x500
Nov 20 12:41:15 fs3 kernel: [] SyS_ioctl+0xa1/0xc0
Nov 20 12:41:15 fs3 kernel: [] system_call_fastpath+0x25/0x2a
Nov 20 12:41:15 fs3 kernel: Code: cf d0 c0 48 c7 c7 e8 ca d2 c0 31 c0 e8 40 0e 70 ff e9 50 fe ff ff 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 47 40 <48> 89 e5 5d c3 90 0f 1f 44 00 00 55 48 8b 47 48 48 89 e5 5d c3

Data errors are back (on different file systems), so I have some hope that scrubs will clear these.
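For reference, the scrub-and-check cycle described above was roughly the following (a sketch only: the pool name is the one from this report, the `zpool wait` calls assume OpenZFS 2.0 or later, and everything is guarded so it is a no-op on a machine without ZFS):

```shell
# "backup_pool" is the pool from this report.
POOL="backup_pool"

if command -v zpool >/dev/null 2>&1; then
    zpool scrub "$POOL"        # first scrub: errors may still be listed afterwards
    zpool wait -t scrub "$POOL" 2>/dev/null || true
    zpool scrub "$POOL"        # second scrub: stale error-log entries age out
    zpool wait -t scrub "$POOL" 2>/dev/null || true
    zpool status -v "$POOL"    # should no longer list the permanent errors
fi
```

Two full scrubs were needed before `zpool status` stopped reporting the errors, consistent with ZFS only dropping an error-log entry once it has not been re-seen across a completed scrub.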

@rincebrain
Contributor

Is there an error message above the "Modules linked in: ..."? Because I would usually expect one, I think.

@scratchings
Author

...and a bit later on in the logs...


Nov 20 17:47:31 fs3 kernel: NMI watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [zfs:49866]
Nov 20 17:47:31 fs3 kernel: Modules linked in: 8021q garp mrp bonding ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack ip_set ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter drbg ansi_cprng dm_crypt iTCO_wdt iTCO_vendor_support skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr ses enclosure dm_service_time ipmi_ssif joydev lpc_ich sg i2c_i801 mei_me mei wmi ipmi_si ipmi_devintf
Nov 20 17:47:31 fs3 kernel: ipmi_msghandler acpi_power_meter acpi_pad dm_multipath dm_mod binfmt_misc ip_tables xfs libcrc32c zfs(POE) zunicode(POE) zzstd(OE) zlua(OE) zcommon(POE) znvpair(POE) zavl(POE) icp(POE) spl(OE) raid1 sd_mod crc_t10dif crct10dif_generic ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm mpt3sas drm crct10dif_pclmul ahci crct10dif_common i40e crc32c_intel libahci libata raid_class scsi_transport_sas ptp pps_core nfit drm_panel_orientation_quirks libnvdimm
Nov 20 17:47:31 fs3 kernel: CPU: 1 PID: 49866 Comm: zfs Kdump: loaded Tainted: P           OEL ------------   3.10.0-1160.42.2.el7.x86_64 #1
Nov 20 17:47:31 fs3 kernel: Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 2.1 06/14/2018
Nov 20 17:47:31 fs3 kernel: task: ffff9209afde2100 ti: ffff9201d797c000 task.ti: ffff9201d797c000
Nov 20 17:47:31 fs3 kernel: RIP: 0010:[<ffffffff8e78b795>]  [<ffffffff8e78b795>] _raw_spin_unlock_irqrestore+0x15/0x20
Nov 20 17:47:31 fs3 kernel: RSP: 0018:ffff9201d797f8d8  EFLAGS: 00000246
Nov 20 17:47:31 fs3 kernel: RAX: 0000000000000246 RBX: 0000000000000001 RCX: dead000000000200
Nov 20 17:47:31 fs3 kernel: RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
Nov 20 17:47:31 fs3 kernel: RBP: ffff9201d797f8d8 R08: ffff9201d797f900 R09: 0000000000000001
Nov 20 17:47:31 fs3 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffff91fe822abe80
Nov 20 17:47:31 fs3 kernel: R13: ffff9201d797f8d8 R14: ffff91fe822abe80 R15: 0000000000000000
Nov 20 17:47:31 fs3 kernel: FS:  00007fd2bdd167c0(0000) GS:ffff920afbc40000(0000) knlGS:0000000000000000
Nov 20 17:47:31 fs3 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 20 17:47:31 fs3 kernel: CR2: 00007f16d72c2000 CR3: 00000009399cc000 CR4: 00000000007607e0
Nov 20 17:47:31 fs3 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 20 17:47:31 fs3 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Nov 20 17:47:31 fs3 kernel: PKRU: 55555554
Nov 20 17:47:31 fs3 kernel: Call Trace:
Nov 20 17:47:31 fs3 kernel: [<ffffffffc02ccc3d>] taskq_wait_outstanding+0x4d/0xf0 [spl]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c85af9>] zfsvfs_teardown+0x59/0x2e0 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c86419>] zfs_umount+0x39/0x120 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c88fd6>] zfs_resume_fs+0x106/0x340 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c562e7>] zfs_ioc_recv_impl+0xa57/0xfd0 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0b785e6>] ? dbuf_read+0x3d6/0x570 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c56dd4>] zfs_ioc_recv_new+0x2b4/0x330 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc02ceb59>] ? spl_vmem_alloc+0x19/0x20 [spl]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc02e933f>] ? nv_alloc_sleep_spl+0x1f/0x30 [znvpair]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc02e3760>] ? nv_mem_zalloc.isra.13+0x30/0x40 [znvpair]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc02e3a17>] ? nvlist_xalloc.part.14+0x97/0x190 [znvpair]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c5842b>] zfsdev_ioctl_common+0x51b/0x820 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffffc0c83ff6>] zfsdev_ioctl+0x56/0xf0 [zfs]
Nov 20 17:47:31 fs3 kernel: [<ffffffff8e263ab0>] do_vfs_ioctl+0x3a0/0x5b0
Nov 20 17:47:31 fs3 kernel: [<ffffffff8e790678>] ? __do_page_fault+0x238/0x500
Nov 20 17:47:31 fs3 kernel: [<ffffffff8e263d61>] SyS_ioctl+0xa1/0xc0
Nov 20 17:47:31 fs3 kernel: [<ffffffff8e795f92>] system_call_fastpath+0x25/0x2a
Nov 20 17:47:31 fs3 kernel: Code: 07 00 0f 1f 40 00 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 48 

In answer to your question: no, it just looks like the zfs command that was running at the time froze up. I had to trigger a hardware reset, as a shutdown -r now did very little.

@scratchings
Author

Correction: there was an NMI watchdog line; I must have failed to copy it the first time. It's for the same process ID as the second call trace.

@rincebrain
Contributor

Huh. I wonder what wrong state it's stuck on forever, because I don't think an ordinary "blocked taskq" would trigger one of those.

@stale

stale bot commented Nov 22, 2022

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Nov 22, 2022
@w23

w23 commented Nov 25, 2022

Bumping this as this is something I've been running into with exactly the same symptoms as above. This essentially makes native zfs encryption completely unusable for me.

Also, there's another somewhat similar issue: sending from an encrypted fs also leads to seemingly recoverable "permanent data errors" on intermediate from-snapshots. It doesn't cause any kernel errors, and recovery is simply a matter of deleting the offending snapshot.
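To make the recovery step concrete, the cleanup I'm describing looks roughly like this (dataset and snapshot names are placeholders, not my real layout, and the commands are guarded so this is a no-op without ZFS installed):

```shell
# Hypothetical names standing in for the affected dataset/snapshot.
POOL="tank"
BAD_SNAP="tank/data@syncoid_intermediate"

if command -v zpool >/dev/null 2>&1; then
    zpool status -v "$POOL"    # lists the snapshots flagged with permanent errors
    zfs destroy "$BAD_SNAP"    # delete the offending from-snapshot
    zpool scrub "$POOL"        # re-scrub so the error-log entries age out
fi
```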

@stale stale bot removed the Status: Stale No recent activity for issue label Nov 25, 2022
@rincebrain
Contributor

You might find the fix for #13709 useful for the latter, w23, because I don't know what else might break those.

@w23

w23 commented Nov 28, 2022

Thanks for the link. After reading it a bit, I'm not sure it's relevant to the issues I'm seeing: the pools and encrypted filesystems unlock and mount just fine. It's send/recv that has problems.

It feels like some sort of pool-wide race condition, as performing more operations on completely different filesystem subtrees (i.e. unrelated by nesting, and using different keys) severely exacerbates the problem. These operations include things like:

  • copying big files to a different filesystem
  • just zfs listing snapshots on an unrelated filesystem

Transfer speed also affects the probability of hitting it. It can be fine for weeks if transfers stay around a few hundred KiB/sec; if they reach a few dozen MiB/sec, it usually panics within a few minutes.

I contemplated making a standalone test, e.g. a minimal QEMU image with a script doing zfs send | zfs recv in a loop, to get a clear and consistent reproduction scenario, but so far I haven't found the time to prepare it.
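The loop I have in mind would be something along these lines (entirely a sketch: the pool and dataset names are placeholders, DST is assumed to be a natively encrypted dataset, the rate is adjustable via pv since transfer speed seems to matter, and it is bounded and guarded so it is safe to run on a machine without zfs/pv):

```shell
#!/bin/sh
# Sketch of the send|recv reproduction loop described above.
# "testpool" and the datasets are placeholders, not a real setup.
SRC="testpool/src"
DST="testpool/enc/dst"   # assumed to be a natively encrypted dataset
RATE="50m"               # faster transfers reportedly trigger the panic sooner
PREV=""
i=0

if command -v zfs >/dev/null 2>&1 && command -v pv >/dev/null 2>&1; then
    while [ "$i" -lt 100 ]; do
        i=$((i + 1))
        dd if=/dev/urandom of="/$SRC/file" bs=1M count=256 2>/dev/null
        zfs snapshot "$SRC@rep$i"
        if [ -z "$PREV" ]; then
            # initial full stream
            zfs send "$SRC@rep$i" | pv -qL "$RATE" | zfs recv "$DST"
        else
            # subsequent incremental streams
            zfs send -i "$PREV" "$SRC@rep$i" | pv -qL "$RATE" | zfs recv "$DST"
        fi
        PREV="$SRC@rep$i"
        zfs list -t snapshot >/dev/null   # unrelated pool activity, per the above
    done
fi
```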

@rincebrain
Contributor

I was talking about the intermediate snapshots failing to unlock, as I was remarking that it might be that problem.
