VERIFY3 PANIC during zfs receive #9741
Comments
We have used rollback incidentally in the past, but I don't believe that it has been run on this system since the VMs and pools were built. On another system I use ZFS zvols to provide backing for a WSE2016 VM. In that case I use zfs send/recv over a WAN, and I have had a few opportunities to fix WSE2016 blunders via zfs rollback. I should have noted that ZFS (and especially ZFS on Linux) is an awesome product, and I am so grateful for the excellent work that you all have done for the community. THANKS SO MUCH!
We have reports about a similar or the same issue at https://bugzilla.proxmox.com/show_bug.cgi?id=2546 (contains a stacktrace of the panic), observed both with ZFS 0.8.2 on kernel 5.0 (based on Ubuntu disco) and ZFS 0.8.3 on kernel 5.3 (eoan). The issue does not happen deterministically, but it recurs (I could not yet get a small reproducer). Glad to provide more information if needed, and also to test ideas for reproducers and/or potential fixes.
I am able to reproduce.
The migration between the hosts doesn't always trigger the failure, but I noticed that I get a syslog message before a later replication, initiated via the "Schedule now" button, never finishes. The symptom is always the same: "zfs recv -F" hangs in state "D" and will not finish even after days. Output:
The syslog message appeared on the active terminal; this is from before "zfs recv" hangs on the next run:
Same type of incident, but at another time; it is always on the receiving host:
Possible failing hardware?
This seems to be a prevalent issue on 0.8.3. I have seen it with zfs recv -sF; there is no usage of rollback in my environment.
This happened to us last month, so we did a full apt upgrade, but whatever it is, it's still present in 0.8.4. As others have noted, it always occurs on the receiving side of a zfs send operation; the receive process hangs in 'D' state and only a reboot will get things moving again. We do occasionally use rollback in our environments, but I don't think any were performed in the last month.
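For anyone else hitting this, a quick way to confirm the receive is stuck in uninterruptible sleep (the grep patterns here are just examples):

    # Show state and kernel wait channel of any zfs receive processes;
    # "D" in STAT plus a wchan like taskq_wait_id matches what we see.
    ps -eo pid,stat,wchan:32,args | grep '[z]fs recv'
    # The kernel's hung-task warnings usually show up here as well.
    dmesg | grep -i 'blocked for more than'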
Does it also occur when you delete the target first (NOT the volumes; actually there can be files for them even if they are normally volumes)?
We only use ZFS filesystems (i.e. no volumes / block devices), and it has only happened on incremental zfs send operations. Those operations always include -F on the receiving side to undo any local changes that might have happened since the last snapshot. I wasn't thinking of that in the initial report, so there is technically a rollback with every receive, though it would just be for attributes like atime that might have been automatically updated as tape backup software worked through the filesystem.
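For reference, the pipeline is the usual incremental send into a forced receive, roughly like this (pool, dataset, and snapshot names here are made up):

    # Incremental send from the previous snapshot to the new one; -F on the
    # receiving side rolls the target back to the last common snapshot
    # before the stream is applied.
    zfs send -i tank/data@2019-12-14 tank/data@2019-12-15 | \
        ssh backuphost zfs recv -F backup/data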
Just have a look at whether there is something there which shouldn't be; I remember having this once.
We also had this occur. We have a primary server (mirrored) that we back up every 3 hours using syncoid to a second machine (raidz3). The secondary server, which does nothing but serve as the backup, shows syncoid and the receive process hung, with a crash in dmesg. Details:
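For context, the job looks roughly like this (host names, dataset names, and the syncoid install path are assumptions, not our exact setup):

    # Cron entry on the backup host: pull from the primary every 3 hours.
    # syncoid drives incremental zfs send/recv under the hood.
    0 */3 * * * root /usr/sbin/syncoid --recursive root@primary:tank/data backuppool/data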
We experienced the same issue, replicating with sanoid/syncoid (both 2.0.3) between two Debian hosts (Stretch and Buster). Stretch host (srv1, panicking here): Target host: We last replicated from the Stretch host to the Buster host on Monday evening, then started taking snapshots on the Buster host, which we later replicated back (a new snapshot that only existed on the Buster host) to the Stretch host. Before replicating, we rolled back (zfs rollback -R ....) the Stretch host to the latest snapshot (the one from Monday evening). While replicating we got kernel panics:
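A rough sketch of the rollback-and-replicate-back sequence described above (dataset and snapshot names are hypothetical):

    # On srv1 (the Stretch host): discard everything after Monday's
    # snapshot, including any later snapshots, then pull the newer
    # snapshot back from the Buster host.
    zfs rollback -R tank/data@monday-evening
    ssh buster-host zfs send -i tank/data@monday-evening tank/data@tuesday | \
        zfs recv -F tank/data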
This also happens randomly with zfs 2.0.3 (from debian buster-backports), replicating a local SSD-based pool to a local HDD-based pool with rollbacks:
I was able to create a reproducer for this. The reason for the panic is that when you have a zfs recv or rollback, the code path zfs_suspend_fs -> zfsvfs_teardown is taken. In this path, we asynchronously release inodes via iput. If this async releasing happens to be slow, then inodes will still be extant while the zfs recv proceeds with processing the unlinked set. In this unlinked set processing, each znode is reconstituted so it can be freed (zfs_zget). If the znode / inode is present in the unlinked set AND still hanging around from zfsvfs_teardown -> iput, we can get this hash collision and panic.
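To make the sequence above concrete, here is a rough illustration of the kind of workload involved; this is not the actual reproducer (which, as noted below, also needs a code change to widen the race window), and all names are hypothetical:

    # Hold a deleted file open so the object sits in the dataset's unlinked
    # set, then apply a forced receive: zfs_suspend_fs queues the final iput
    # asynchronously, and the unlinked drain after zfs_resume_fs may
    # zfs_zget the same object while the old inode is still hashed.
    exec 3< /tank/fs/victim        # keep the file open via fd 3
    rm /tank/fs/victim             # now only the open fd references it
    zfs recv -F tank/fs < /tmp/incr.zstream
    exec 3<&-                      # drop the reference afterwards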
A code change is needed for the reproducer:
And the reproducer script:
In zfs_znode_alloc we always hash inodes. If the znode is unlinked, we do not need to hash it. This fixes the problem where zfs_suspend_fs is doing zrele (iput) in an async fashion, and zfs_resume_fs unlinked drain processing will try to hash an inode that could still be hashed, resulting in a panic.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alan Somers <asomers@gmail.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9741
Closes #11223
Closes #11648
Closes #12210
This might be related to issue #2335
System information
Describe the problem you're observing
We have two VMs, each running on a separate OpenSuSE 42.3 KVM/libvirt hypervisor on enterprise-class servers. There are two raidz3 zpools on the first VM and one raidz3 pool on the second. The vdevs for all of them are supplied by iSCSI targets (each an 8TB or 10TB enterprise drive) over 10G Ethernet (10 vdevs per pool).
The first VM is running two zfs incremental send processes on the first zpool, started 4 hours apart. One is piped to a local zfs recv process to "mirror" the first zpool to the second. The second zfs send process is piped over ssh (on a 1G link) to a zfs recv process on the other, remote VM to make a second "mirror" of the first pool. We've successfully used this process on multiple systems for a couple of years now without incident. We recently upgraded zfs from 0.7.13 to 0.8.1. In this system, the pools have on the order of 191 file systems.
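For clarity, the two replication streams look roughly like this (pool, dataset, and snapshot names are placeholders, and the exact flags may differ slightly):

    # Stream 1: local "mirror" of the first pool onto the second pool.
    zfs send -i pool1/fs@prev pool1/fs@new | zfs recv -F pool2/fs

    # Stream 2, started about 4 hours later: the same kind of incremental
    # stream, piped over ssh (1G link) to the recv process on the remote VM.
    zfs send -i pool1/fs@prev pool1/fs@new | ssh remote-vm zfs recv -F pool3/fs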
Over the weekend, the first VM reported a
VERIFY3(insert_inode_locked(ip) == 0) failed (-16 == 0)
followed by a PANIC at zfs_znode.c:611:zfs_znode_alloc(). The second VM reported the same error 6 hours later. The backtraces are 100% identical on both VMs. We found that the zfs recv processes remained present on both VMs and the first zfs send process was also present. Both recv processes are blocked on taskq_wait_id, [z_unlinked_drai] is blocked on spl_panic, and zfs send is blocked on pipe_wait. Cron-scheduled scrubs ran on all three zpools after these PANICs, on the following Sunday night, and reported no errors while the zfs send/recv processes were still present.
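If it helps, the kernel stacks of the stuck processes can be inspected along these lines (the match patterns are examples; requires root):

    # Dump the kernel stack of the hung receive and of the unlinked-drain
    # thread to confirm where they are blocked.
    cat /proc/$(pgrep -f 'zfs recv' | head -n1)/stack
    cat /proc/$(pgrep z_unlinked_drai | head -n1)/stack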
We use zfs-auto-snapshot to manage periodic snapshots. Two days before this panic, we increased the number of snapshots to retain (--keep) by factors of 2 to 12. For example, we increased the hourly snapshots from 24 to 48, the daily from 31 to 90, and the 15-minute snapshots from 8 to 96. The pool now has on the order of 173 snapshots per filesystem, for a pool total of about 50,000. The first pool is 72.5T in size and currently has only 9.45T allocated. The other two pools are 91TB in size. lz4 compression is in use, no dedup.
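For reference, the retention change amounts to something like the following in the zfs-auto-snapshot cron jobs (treat this as a sketch; the exact option spelling and labels may differ by packaging):

    # Before: hourly snapshots kept for 24 hours.
    zfs-auto-snapshot --quiet --syslog --label=hourly --keep=24 //
    # After: hourly retention doubled, 15-minute ("frequent") raised to 96.
    zfs-auto-snapshot --quiet --syslog --label=hourly --keep=48 //
    zfs-auto-snapshot --quiet --syslog --label=frequent --keep=96 //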
Describe how to reproduce the problem
So far this is a single incident. Our other systems have not encountered this problem although they have identical configurations with the exception that the --keep values on zfs-auto-snapshot have not been changed.
Include any warning/errors/backtraces from the system logs