VERIFY(HDR_EMPTY(hdr)) failed PANIC at arc.c:6787:arc_release() #12020

Closed
rincebrain opened this issue May 10, 2021 · 9 comments
Labels
Status: Triage Needed (new issue which needs to be triaged) · Type: Defect (incorrect behavior, e.g. crash, hang)

Comments

@rincebrain (Contributor)

(This might be the same as #9897, but since the stack seemed entirely different, I figured I'd file it separately.)

System information

Type                  Version/Name
Distribution Name     Debian
Distribution Version  buster
Linux Kernel          4.19.0-14-amd64
Architecture          x86_64
ZFS Version           0.8.6-1~bpo10+1

Describe the problem you're observing

While using my headless server with a couple of "long-running" (a day or so) kvm VMs, I got an exciting PANIC in my console, not obviously related to anything I was doing at the moment.

Describe how to reproduce the problem

???

Include any warning/errors/backtraces from the system logs

[5740457.383840] VERIFY(HDR_EMPTY(hdr)) failed
[5740457.383880] PANIC at arc.c:6787:arc_release()
[5740457.383912] Showing stack for process 101406
[5740457.383916] CPU: 14 PID: 101406 Comm: kvm Kdump: loaded Tainted: P           OE     4.19.0-14-amd64 #1 Debian 4.19.171-2
[5740457.383917] Hardware name: Supermicro Super Server/X10SDV-TLN4F, BIOS 2.1 11/22/2019
[5740457.383918] Call Trace:
[5740457.383940]  dump_stack+0x66/0x81
[5740457.383957]  spl_panic+0xd3/0xfb [spl]
[5740457.383961]  ? _cond_resched+0x15/0x30
[5740457.384045]  ? cityhash4+0x75/0x90 [zfs]
[5740457.384048]  ? _cond_resched+0x15/0x30
[5740457.384050]  ? mutex_lock+0xe/0x30
[5740457.384136]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[5740457.384240]  ? multilist_is_empty+0x83/0x150 [zfs]
[5740457.384345]  ? zfs_refcount_remove_many+0x118/0x1d0 [zfs]
[5740457.384348]  ? _cond_resched+0x15/0x30
[5740457.384437]  ? cityhash4+0x75/0x90 [zfs]
[5740457.384440]  ? _cond_resched+0x15/0x30
[5740457.384442]  ? mutex_lock+0xe/0x30
[5740457.384550]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[5740457.384657]  ? list_head+0x9/0x20 [zfs]
[5740457.384660]  ? _cond_resched+0x15/0x30
[5740457.384662]  ? mutex_lock+0xe/0x30
[5740457.384784]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[5740457.384793]  ? cv_destroy_wakeup+0x34/0x90 [spl]
[5740457.384802]  ? waitqueue_active+0x9/0x20 [spl]
[5740457.384811]  ? __cv_destroy+0xb0/0x230 [spl]
[5740457.384821]  ? spl_kmem_cache_free+0x11c/0x1c0 [spl]
[5740457.384907]  arc_release+0xb09/0xce0 [zfs]
[5740457.384997]  dbuf_redirty+0x3b/0xc0 [zfs]
[5740457.385086]  dmu_buf_will_dirty_impl+0xa9/0x1e0 [zfs]
[5740457.385177]  dmu_write_uio_dnode+0x6b/0x1d0 [zfs]
[5740457.385268]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[5740457.385381]  zfs_write+0x9fe/0xe80 [zfs]
[5740457.385387]  ? __switch_to_asm+0x35/0x70
[5740457.385496]  zpl_write_common_iovec+0xac/0x110 [zfs]
[5740457.385606]  zpl_iter_write_common+0x7f/0xa0 [zfs]
[5740457.385715]  zpl_iter_write+0x4a/0x70 [zfs]
[5740457.385720]  do_iter_readv_writev+0x13a/0x1b0
[5740457.385724]  do_iter_write+0x80/0x190
[5740457.385727]  vfs_writev+0xa7/0x100
[5740457.385733]  ? syscall_trace_enter+0x29a/0x2b0
[5740457.385736]  do_pwritev+0x89/0xe0
[5740457.385739]  do_syscall_64+0x53/0x110
[5740457.385742]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[5740457.385745] RIP: 0033:0x7fe4b18fd69a
[5740457.385748] Code: 89 d4 55 48 89 f5 53 89 fb 48 83 ec 18 e8 ce 60 01 00 45 31 c0 4d 89 ea 44 89 e2 41 89 c1 48 89 ee 89 df b8 28 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 32 44 89 cf 48 89 44 24 08 e8 01 61 01 00 48
[5740457.385749] RSP: 002b:00007fe4acdfe620 EFLAGS: 00000246 ORIG_RAX: 0000000000000128
[5740457.385752] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fe4b18fd69a
[5740457.385754] RDX: 0000000000000005 RSI: 00007fe3a010cf80 RDI: 000000000000000d
[5740457.385755] RBP: 00007fe3a010cf80 R08: 0000000000000000 R09: 0000000000000000
[5740457.385756] R10: 00000002ef06c000 R11: 0000000000000246 R12: 0000000000000005
[5740457.385757] R13: 00000002ef06c000 R14: 00007fe3a002f0f0 R15: 0000555c1a1a1820
rincebrain added the Status: Triage Needed and Type: Defect labels on May 10, 2021
@rincebrain (Contributor, Author)

It happened again~

[282103.640279] VERIFY(HDR_EMPTY(hdr)) failed
[282103.640337] PANIC at arc.c:6787:arc_release()
[282103.640385] Showing stack for process 17089
[282103.640391] CPU: 10 PID: 17089 Comm: nfsd Kdump: loaded Tainted: P           OE     4.19.0-16-amd64 #1 Debian 4.19.181-1
[282103.640393] Hardware name: Supermicro Super Server/X10SDV-TLN4F, BIOS 2.1 11/22/2019
[282103.640394] Call Trace:
[282103.640408]  dump_stack+0x66/0x81
[282103.640432]  spl_panic+0xd3/0xfb [spl]
[282103.640438]  ? _cond_resched+0x15/0x30
[282103.640441]  ? _cond_resched+0x15/0x30
[282103.640445]  ? mutex_lock+0xe/0x30
[282103.640590]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[282103.640732]  ? zfs_refcount_remove_many+0x118/0x1d0 [zfs]
[282103.640738]  ? avl_find+0x63/0xe0 [zavl]
[282103.640844]  ? dbuf_rele_and_unlock+0x211/0x5e0 [zfs]
[282103.640948]  ? cityhash4+0x75/0x90 [zfs]
[282103.640952]  ? _cond_resched+0x15/0x30
[282103.640955]  ? mutex_lock+0xe/0x30
[282103.641096]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[282103.641230]  ? multilist_is_empty+0x83/0x150 [zfs]
[282103.641362]  ? zfs_refcount_remove_many+0x118/0x1d0 [zfs]
[282103.641367]  ? _cond_resched+0x15/0x30
[282103.641469]  ? cityhash4+0x75/0x90 [zfs]
[282103.641570]  arc_release+0xb09/0xce0 [zfs]
[282103.641577]  ? _cond_resched+0x15/0x30
[282103.641580]  ? mutex_lock+0xe/0x30
[282103.641712]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[282103.641845]  ? zfs_refcount_remove_many+0x118/0x1d0 [zfs]
[282103.641949]  dbuf_redirty+0x3b/0xc0 [zfs]
[282103.642054]  dbuf_dirty+0xcd6/0x1220 [zfs]
[282103.642059]  ? _cond_resched+0x15/0x30
[282103.642062]  ? mutex_lock+0xe/0x30
[282103.642145]  ? dbuf_noread+0xca/0x250 [zfs]
[282103.642253]  dmu_write_uio_dnode+0xf2/0x1d0 [zfs]
[282103.642364]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[282103.642515]  zfs_write+0x9fe/0xe80 [zfs]
[282103.642672]  zpl_write_common_iovec+0xac/0x110 [zfs]
[282103.642827]  zpl_iter_write_common+0x7f/0xa0 [zfs]
[282103.642979]  zpl_iter_write+0x4a/0x70 [zfs]
[282103.642984]  do_iter_readv_writev+0x13a/0x1b0
[282103.642988]  do_iter_write+0x80/0x190
[282103.643011]  nfsd_vfs_write+0xe6/0x460 [nfsd]
[282103.643033]  nfsd4_write+0xf2/0x200 [nfsd]
[282103.643054]  nfsd4_proc_compound+0x342/0x660 [nfsd]
[282103.643071]  nfsd_dispatch+0x9e/0x210 [nfsd]
[282103.643104]  svc_process_common+0x345/0x740 [sunrpc]
[282103.643122]  ? nfsd_destroy+0x50/0x50 [nfsd]
[282103.643149]  svc_process+0xb7/0xf0 [sunrpc]
[282103.643165]  nfsd+0xe3/0x140 [nfsd]
[282103.643171]  kthread+0x112/0x130
[282103.643175]  ? kthread_bind+0x30/0x30
[282103.643180]  ret_from_fork+0x35/0x40

@rincebrain (Contributor, Author)

Well, there goes my "caused by server under heavy load" theory...

(It just happened right after a reboot, where the only thing running was an "apt install [ZFS 2.0.3]")

[  158.726568] VERIFY(HDR_EMPTY(hdr)) failed
[  158.726622] PANIC at arc.c:6787:arc_release()
[  158.726665] Showing stack for process 18
[  158.726671] CPU: 1 PID: 18 Comm: kworker/1:0 Kdump: loaded Tainted: P                                                                     OE     4.19.0-16-amd64 #1 Debian 4.19.181-1
[  158.726673] Hardware name: Supermicro Super Server/X10SDV-TLN4F, BIOS 2.1 11/                                                          22/2019
[  158.726684] Workqueue: events delayed_fput
[  158.726687] Call Trace:
[  158.726701]  dump_stack+0x66/0x81
[  158.726725]  spl_panic+0xd3/0xfb [spl]
[  158.726734]  ? update_dl_rq_load_avg+0x1a8/0x220
[  158.726739]  ? update_blocked_averages+0x861/0xa60
[  158.726745]  ? avl_find+0x63/0xe0 [zavl]
[  158.726882]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[  158.727002]  ? dnode_dirty_l1range+0x2c6/0x320 [zfs]
[  158.727007]  ? _cond_resched+0x15/0x30
[  158.727011]  ? _cond_resched+0x15/0x30
[  158.727014]  ? mutex_lock+0xe/0x30
[  158.727017]  ? _cond_resched+0x15/0x30
[  158.727020]  ? mutex_lock+0xe/0x30
[  158.727116]  arc_release+0xb09/0xce0 [zfs]
[  158.727218]  ? dbuf_rele_and_unlock+0x211/0x5e0 [zfs]
[  158.727222]  ? _cond_resched+0x15/0x30
[  158.727224]  ? _cond_resched+0x15/0x30
[  158.727327]  dbuf_undirty.isra.12+0x24c/0x6d0 [zfs]
[  158.727332]  ? mutex_lock+0xe/0x30
[  158.727431]  ? dbuf_destroy+0x399/0x6c0 [zfs]
[  158.727532]  dbuf_free_range+0x229/0x560 [zfs]
[  158.727667]  dnode_free_range+0x235/0x980 [zfs]
[  158.727831]  ? txg_rele_to_quiesce+0x39/0x70 [zfs]
[  158.727941]  dmu_free_long_range_impl+0x27e/0x440 [zfs]
[  158.728050]  dmu_free_long_range+0x67/0xb0 [zfs]
[  158.728230]  zfs_rmnode+0x297/0x420 [zfs]
[  158.728409]  ? zfs_zinactive+0x157/0x1a0 [zfs]
[  158.728585]  zfs_inactive+0x1c5/0x2c0 [zfs]
[  158.728590]  ? _cond_resched+0x15/0x30
[  158.728593]  ? down_write+0xe/0x40
[  158.728609]  ? unmap_mapping_pages+0x5e/0x130
[  158.728785]  zpl_evict_inode+0x36/0x50 [zfs]
[  158.728792]  evict+0xd2/0x1a0
[  158.728799]  __dentry_kill+0xd5/0x170
[  158.728805]  dentry_kill+0x4d/0x190
[  158.728811]  dput.part.34+0xf0/0x140
[  158.728817]  __fput+0xe7/0x1e0
[  158.728824]  delayed_fput+0x1c/0x30
[  158.728829]  process_one_work+0x1a7/0x3a0
[  158.728834]  worker_thread+0x30/0x390
[  158.728839]  ? create_worker+0x1a0/0x1a0
[  158.728845]  kthread+0x112/0x130
[  158.728851]  ? kthread_bind+0x30/0x30
[  158.728856]  ret_from_fork+0x35/0x40

@rincebrain (Contributor, Author) commented May 13, 2021

...oh boy, this time on 2.0.X! (Specifically, 2.0.3-1~bpo10+1)

I'm pretty tempted to blame NFS for this, since using NFS aggressively is the only thing that's really changed recently...

[ 1263.078686] VERIFY(HDR_EMPTY(hdr)) failed
[ 1263.078726] PANIC at arc.c:6439:arc_release()
[ 1263.078758] Showing stack for process 7614
[ 1263.078763] CPU: 1 PID: 7614 Comm: nfsd Kdump: loaded Tainted: P           OE     4.19.0-16-amd64 #1 Debian 4.19.181-1
[ 1263.078765] Hardware name: Supermicro Super Server/X10SDV-TLN4F, BIOS 2.1 11/22/2019
[ 1263.078766] Call Trace:
[ 1263.078778]  dump_stack+0x66/0x81
[ 1263.078796]  spl_panic+0xd3/0xfb [spl]
[ 1263.078803]  ? __switch_to_asm+0x35/0x70
[ 1263.078901]  ? zfs_btree_find+0x1e0/0x2e0 [zfs]
[ 1263.078904]  ? _cond_resched+0x15/0x30
[ 1263.078907]  ? mutex_lock+0xe/0x30
[ 1263.079028]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 1263.079035]  ? cityhash4+0x75/0x90 [zcommon]
[ 1263.079038]  ? _cond_resched+0x15/0x30
[ 1263.079040]  ? mutex_lock+0xe/0x30
[ 1263.079043]  ? _cond_resched+0x15/0x30
[ 1263.079047]  ? kmem_cache_alloc+0x167/0x1d0
[ 1263.079052]  ? cityhash4+0x75/0x90 [zcommon]
[ 1263.079197]  ? zrl_add_impl+0x14d/0x150 [zfs]
[ 1263.079203]  ? recalibrate_cpu_khz+0x10/0x10
[ 1263.079280]  arc_release+0xad1/0xca0 [zfs]
[ 1263.079285]  ? _cond_resched+0x15/0x30
[ 1263.079287]  ? mutex_lock+0xe/0x30
[ 1263.079427]  ? __raw_spin_unlock+0x5/0x10 [zfs]
[ 1263.079546]  ? zfs_refcount_remove_many+0x112/0x1c0 [zfs]
[ 1263.079629]  dbuf_redirty+0x3b/0xc0 [zfs]
[ 1263.079724]  dbuf_dirty+0xdf8/0x1180 [zfs]
[ 1263.079728]  ? _cond_resched+0x15/0x30
[ 1263.079731]  ? mutex_lock+0xe/0x30
[ 1263.079819]  dmu_write_uio_dnode+0xf2/0x1d0 [zfs]
[ 1263.079908]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[ 1263.080051]  zfs_write+0x854/0xeb0 [zfs]
[ 1263.080200]  zpl_iter_write+0xd6/0x140 [zfs]
[ 1263.080207]  do_iter_readv_writev+0x13a/0x1b0
[ 1263.080212]  do_iter_write+0x80/0x190
[ 1263.080231]  nfsd_vfs_write+0xe6/0x460 [nfsd]
[ 1263.080249]  nfsd4_write+0xf2/0x200 [nfsd]
[ 1263.080265]  nfsd4_proc_compound+0x342/0x660 [nfsd]
[ 1263.080280]  ? nfsd4_read_rsize+0x20/0x20 [nfsd]
[ 1263.080291]  nfsd_dispatch+0x9e/0x210 [nfsd]
[ 1263.080316]  svc_process_common+0x345/0x740 [sunrpc]
[ 1263.080330]  ? nfsd_destroy+0x50/0x50 [nfsd]
[ 1263.080351]  svc_process+0xb7/0xf0 [sunrpc]
[ 1263.080362]  nfsd+0xe3/0x140 [nfsd]
[ 1263.080368]  kthread+0x112/0x130
[ 1263.080372]  ? kthread_bind+0x30/0x30
[ 1263.080376]  ret_from_fork+0x35/0x40

@rincebrain (Contributor, Author)

...well, uh, now my machine panics within 5 minutes of booting, even if I don't actively run anything on it, so I'm pretty confident this is an interaction with NFS.

@rincebrain (Contributor, Author) commented Jun 13, 2021

This hadn't turned up in a long time, not since I stopped having my Linux/sparc64 system chroot over NFS.

Then tonight, with no sparc64 NFS consumer in sight...

[424104.522517] VERIFY(HDR_EMPTY(hdr)) failed
[424104.522598] PANIC at arc.c:6439:arc_release()
[424104.522667] Showing stack for process 78699
[424104.522674] CPU: 1 PID: 78699 Comm: kvm Kdump: loaded Tainted: P           OE     4.19.0-16-amd64 #1 Debian 4.19.181-1
[424104.522677] Hardware name: Supermicro Super Server/X10SDV-TLN4F, BIOS 2.1 11/22/2019
[424104.522679] Call Trace:
[424104.522698]  dump_stack+0x66/0x81
[424104.522726]  spl_panic+0xd3/0xfb [spl]
[424104.522941]  ? rs_get_start+0x9/0x20 [zfs]
[424104.523145]  ? range_tree_stat_incr+0x28/0xe0 [zfs]
[424104.523152]  ? _cond_resched+0x15/0x30
[424104.523158]  ? kmem_cache_alloc+0x167/0x1d0
[424104.523172]  ? cityhash4+0x75/0x90 [zcommon]
[424104.523432]  ? zrl_add_impl+0x14d/0x150 [zfs]
[424104.523438]  ? _cond_resched+0x15/0x30
[424104.523443]  ? mutex_lock+0xe/0x30
[424104.523696]  ? list_head+0x9/0x20 [zfs]
[424104.523701]  ? _cond_resched+0x15/0x30
[424104.523718]  ? waitqueue_active+0x9/0x20 [spl]
[424104.523733]  ? waitqueue_active+0x9/0x20 [spl]
[424104.523748]  ? __cv_destroy+0xb0/0x230 [spl]
[424104.523767]  ? spl_kmem_cache_free+0x115/0x1b0 [spl]
[424104.523874]  arc_release+0xad1/0xca0 [zfs]
[424104.523979]  dbuf_redirty+0x3b/0xc0 [zfs]
[424104.524083]  dmu_buf_will_dirty_impl+0x106/0x1b0 [zfs]
[424104.524194]  dmu_write_uio_dnode+0x6b/0x1d0 [zfs]
[424104.524305]  dmu_write_uio_dbuf+0x47/0x60 [zfs]
[424104.524518]  zfs_write+0x854/0xeb0 [zfs]
[424104.524530]  ? futex_wait+0x18a/0x240
[424104.524538]  ? __hrtimer_init+0xb0/0xb0
[424104.524794]  zpl_iter_write+0xd6/0x140 [zfs]
[424104.524806]  new_sync_write+0xfb/0x160
[424104.524815]  vfs_write+0xa5/0x1a0
[424104.524822]  ksys_pwrite64+0x61/0xa0
[424104.524831]  do_syscall_64+0x53/0x110
[424104.524839]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[424104.524846] RIP: 0033:0x7f3f0fb55edf
[424104.524851] Code: 41 54 49 89 d4 55 48 89 f5 53 89 fb 48 83 ec 18 e8 c6 f2 ff ff 4d 89 ea 4c 89 e2 48 89 ee 41 89 c0 89 df b8 12 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 fc f2 ff ff 48
[424104.524855] RSP: 002b:00007f3dc7ffe600 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
[424104.524861] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007f3f0fb55edf
[424104.524864] RDX: 0000000000001000 RSI: 00007f3e91c24000 RDI: 0000000000000009
[424104.524866] RBP: 00007f3e91c24000 R08: 0000000000000000 R09: 00000000ffffffff
[424104.524869] R10: 000000030c4b9000 R11: 0000000000000293 R12: 0000000000001000
[424104.524872] R13: 000000030c4b9000 R14: 00005600c2a14a60 R15: 00005600c001e185

edit: Well, it happened while the machine reported no NFS clients connected, so it's not solely an NFS complication.

Maybe it just hits whichever process tries to access the buffer first after whatever underlying problem creates it? It's odd, though, because I can go months without seeing it and then see it a couple of times in a short interval. I'd blame my RAM, but ECC hasn't reported even a single bit error on this system. So I guess it's heavily workload-dependent?

I was about to try reproducing this on a testbed in a VM when the most recent one happened, so I'll just go back to doing that.

@rincebrain (Contributor, Author)

Oh boy oh boy, I was very nice, and for my birthday I got a kdump of this happening in a VM and not my actual server.

I still really want to know why I can reproduce this consistently by doing very little IO from my little Netra T1 inside a chroot on NFS, but a whole chroot debootstrap and Linux kernel build inside it from an x86_64 host doesn't make it blink...

Presumably I could sniff all the traffic from the sparc64 system doing it and an x86_64 system using qemu-user-static to do it and see what differs.

If anyone's any better at crash or kgdb and wants to take a whack at this, it doesn't take long to repro, so just let me know.

PID: 7857   TASK: ffff98124feaad00  CPU: 2   COMMAND: "nfsd"
 #0 [ffffa46784973580] machine_kexec at ffffffffa9252f87
 #1 [ffffa467849735d8] __crash_kexec at ffffffffa9314ebd
 #2 [ffffa467849736a0] panic at ffffffffa9902677
 #3 [ffffa46784973720] spl_panic at ffffffffc0c2de00 [spl]
 #4 [ffffa467849738a8] arc_release at ffffffffc0dea0f9 [zfs]
 #5 [ffffa46784973930] dbuf_redirty at ffffffffc0dfc0eb [zfs]
 #6 [ffffa46784973948] dbuf_dirty at ffffffffc0e07734 [zfs]
 #7 [ffffa467849739f0] dmu_write_uio_dnode at ffffffffc0e142b6 [zfs]
 #8 [ffffa46784973a50] dmu_write_uio_dbuf at ffffffffc0e143c7 [zfs]
 #9 [ffffa46784973a78] zfs_write at ffffffffc0f2be0c [zfs]
#10 [ffffa46784973c08] zpl_iter_write at ffffffffc0f77c73 [zfs]
#11 [ffffa46784973c80] do_iter_readv_writev at ffffffffa945103a
#12 [ffffa46784973cc0] do_iter_write at ffffffffa9452a80
#13 [ffffa46784973cf0] nfsd_vfs_write at ffffffffc0485776 [nfsd]
#14 [ffffa46784973d80] nfsd4_write at ffffffffc04915c2 [nfsd]
#15 [ffffa46784973de0] nfsd4_proc_compound at ffffffffc0493a02 [nfsd]
#16 [ffffa46784973e40] nfsd_dispatch at ffffffffc04809be [nfsd]
#17 [ffffa46784973e70] svc_process_common at ffffffffc02ca9b5 [sunrpc]
#18 [ffffa46784973ed8] svc_process at ffffffffc02cae67 [sunrpc]
#19 [ffffa46784973ef0] nfsd at ffffffffc04804c3 [nfsd]
#20 [ffffa46784973f10] kthread at ffffffffa929b302
#21 [ffffa46784973f50] ret_from_fork at ffffffffa9a00215

@rincebrain (Contributor, Author)

Fascinatingly, afa7b34 repros this (using the sparc64 NFS method) on buster (4.19.0-16-amd64), but not bullseye (5.10.0-7-amd64) or Fedora 34 (5.12.9-300.fc34.x86_64). (I'll eventually try the "VM with virtio" method, but since that involves nesting virtualization, that's potentially a deep can of worms.)

Coincidence? Something subtly wrong in the NFSv4 codepath that got fixed? Compiler doing something differently? Who knows!

But if it's reachable by two different routes, it seems feasible that something could be going wrong inside OpenZFS itself, not just that something is misusing a function. Now that I have it reliably reproducing in a VM, I'll explore that...

@rincebrain (Contributor, Author)

Well, after some poking around, I just took this suggestion and commented the failing VERIFY out. (gulp)

So far, the testbed I've done this in hasn't caught fire in any noticeable way, and I know it's hit the changed behavior because I added an if (!HDR_EMPTY(hdr)) { zfs_dbgmsg(...); } in its place.
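A minimal sketch of that substitution, assuming it replaces the VERIFY at the spot in arc_release() where it fired; the message text and exact placement are illustrative, not the upstream diff:

/* Was: VERIFY(HDR_EMPTY(hdr)); which panicked if the header still had an identity. */
/* Instead, leave a breadcrumb in the ZFS debug log so we know this path was hit. */
if (!HDR_EMPTY(hdr)) {
	zfs_dbgmsg("arc_release: hdr %p unexpectedly non-empty, continuing",
	    (void *)hdr);
}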

So after I do a bunch of writing over NFS, I'll move on to the other way I know to trigger this - kvm with virtio. We'll see how well x86 nested virtualization works...

@behlendorf (Contributor)

Removing the ASSERT() looks to me like the right thing to do here. The case which triggers this is re-dirtying a buffer which has already been written out in the txg and thus has a block pointer (identity). As the comment in arc_release() states, in this case we need to unconditionally discard that identity for the released anonymous buffer. Repeatedly rewriting the same range in a file would be the best way to try to reproduce this.
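One way to exercise that, as a hedged sketch: a small C program that overwrites the same 128K range of one file in a loop (the path /tank/redirty-test and the block size are placeholders):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	/*
	 * Keep rewriting the same range of one file.  Once a copy of the
	 * block has been written out (so the buffer carries a block
	 * pointer), the next overwrite re-dirties it and takes the
	 * dbuf_redirty() -> arc_release() path seen in the traces above.
	 */
	static char buf[131072];
	int fd = open("/tank/redirty-test", O_CREAT | O_WRONLY, 0644);

	if (fd < 0)
		return (1);
	for (unsigned long i = 0; ; i++) {
		memset(buf, (int)(i & 0xff), sizeof (buf));
		if (pwrite(fd, buf, sizeof (buf), 0) < 0)
			break;
	}
	(void) close(fd);
	return (0);
}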

What's critical is that the following are true for the anonymous buffer (all of which are ASSERTed).

  1. the header isn't in the hash table (so it's not discoverable),
  2. it doesn't have an L2 header, and
  3. there is a single reference.

@rincebrain as long as nothing odd shows up in your testing I'd suggest opening a PR which drops the ASSERT and we'll get another set of eyes on it.
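For reference, a rough sketch of the three invariants above written as checks, using the header macros and refcount helper as they appear in arc.c (illustrative only, not the exact upstream assertions):

/* The released anonymous buffer must satisfy: */
ASSERT(!HDR_IN_HASH_TABLE(hdr));                              /* 1. not discoverable via the hash table */
ASSERT(!HDR_HAS_L2HDR(hdr));                                  /* 2. no L2ARC header */
ASSERT3U(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1);  /* 3. single reference */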

rincebrain added a commit to rincebrain/zfs that referenced this issue Jun 15, 2021
Unfortunately, there was an overzealous assertion that was (in pretty
specific circumstances) false, causing failure.

Let's not, and say we did.

Closes: openzfs#9897
Closes: openzfs#12020

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
tonyhutter pushed a commit to tonyhutter/zfs that referenced this issue Sep 15, 2021
Unfortunately, there was an overzealous assertion that was (in pretty
specific circumstances) false, causing failure.  This assertion was
added in error, so we're removing it.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#9897
Closes openzfs#12020
Closes openzfs#12246
rincebrain added a commit to rincebrain/zfs that referenced this issue Sep 22, 2021
Unfortunately, there was an overzealous assertion that was (in pretty
specific circumstances) false, causing failure.  This assertion was
added in error, so we're removing it.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#9897
Closes openzfs#12020
Closes openzfs#12246
datacore-rm pushed a commit to DataCoreSoftware/openzfs that referenced this issue Oct 4, 2022
Unfortunately, there was an overzealous assertion that was (in pretty
specific circumstances) false, causing failure.  This assertion was
added in error, so we're removing it.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#9897
Closes openzfs#12020
Closes openzfs#12246