Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZFS hangs on kernel error: VERIFY3(0 == dmu_bonus_hold_by_dnode #12001

Open
vicsn opened this issue May 5, 2021 · 19 comments
Open

ZFS hangs on kernel error: VERIFY3(0 == dmu_bonus_hold_by_dnode #12001

vicsn opened this issue May 5, 2021 · 19 comments
Labels
Component: Encryption "native encryption" feature Component: Send/Recv "zfs send/recv" feature Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@vicsn
Copy link

vicsn commented May 5, 2021

System information

Type Version/Name
Distribution Name Debian
Distribution Version Buster
Linux Kernel 5.10.0-0.bpo.4-amd64 / 4.19.0-16
Architecture amd64
ZFS Version 2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

Zpool properties:

              -o ashift=12
              -O checksum=sha256
              -O compression=lz4
              -O atime=off
              -O normalization=formD
              -O xattr=sa
              -O relatime=on

Example test machine properties:

  • CPU model name : AMD Ryzen Threadripper 2990WX 32-Core Processor
  • 6 HDDs of model: Gigabyte Technology Co., Ltd. X399 AORUS PRO/X399 AORUS PRO-CF, BIOS F2g 05/08/2019

Describe the problem you're observing

During an incremental send using zfs send -I | receive -F, ZFS hangs. Any subsequent call to ZFS or e.g. opening any file on the ZFS filesystem, will hang indefinitely.

Note that we also experienced: #12014 - perhaps it is related.

Describe how to reproduce the problem

This is happening frequently for us, on multiple machines, all using HDDs. The machines act as backup targets so most activity is receiving snapshots all day. Other zfs related processes on the machines are: snapshots are taken and pruned multiple times per day and a scrub happens once a week. We use encrypted filesystems and a single zpool. Historically the error has triggered for us between 1 and 48 hours.

Total data usage is 2.12T, with zpool iostat output showing:

              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
data        4.37T  10.2T     17    172   967K  5.46M

Current mitigation

Downgrading to zfs version 0.8.4-1~bpo10+1 seems to have resolved the issue - at the time of writing we had about a week month without ZFS hanging.

Include any warning/errors/backtraces from the system logs

The first error

|May  4 17:11:31 <host> kernel: [96617.381743] VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)                               
│May  4 17:11:31 <host> kernel: [96617.381882] PANIC at dmu_recv.c:1811:receive_object()                                                                
│May  4 17:11:31 <host> kernel: [96617.381990] Showing stack for process 97766                                                                          
│May  4 17:11:31 <host> kernel: [96617.382002] CPU: 55 PID: 97766 Comm: receive_writer Tainted: P           OE     4.19.0-16-amd64 #1 Debian 4.19.181-1 
│May  4 17:11:31 <host> kernel: [96617.382006] Hardware name: Gigabyte Technology Co., Ltd. X399 AORUS PRO/X399 AORUS PRO-CF, BIOS F2g 05/08/2019       
│May  4 17:11:31 <host> kernel: [96617.382008] Call Trace:                                                                                              
│May  4 17:11:31 <host> kernel: [96617.382058]  dump_stack+0x66/0x81                                                                                    
│May  4 17:11:31 <host> kernel: [96617.382098]  spl_panic+0xd3/0xfb [spl]                                                                               
│May  4 17:11:31 <host> kernel: [96617.382454]  ? dbuf_rele_and_unlock+0x30f/0x660 [zfs]                                                                
│May  4 17:11:31 <host> kernel: [96617.382562]  ? dbuf_read+0x1ca/0x520 [zfs]                                                                           
│May  4 17:11:31 <host> kernel: [96617.382642]  ? dnode_hold_impl+0x350/0xc20 [zfs]                                                                     
│May  4 17:11:31 <host> kernel: [96617.382652]  ? spl_kmem_cache_free+0xec/0x1c0 [spl]                                                                  
│May  4 17:11:31 <host> kernel: [96617.382728]  receive_object+0x923/0xc70 [zfs]                                                                        
│May  4 17:11:31 <host> kernel: [96617.382803]  receive_writer_thread+0x279/0xa00 [zfs]                                                                 
│May  4 17:11:31 <host> kernel: [96617.382817]  ? set_curr_task_fair+0x26/0x50                                                                          
│May  4 17:11:31 <host> kernel: [96617.382894]  ? receive_process_write_record+0x190/0x190 [zfs]                                                        
│May  4 17:11:31 <host> kernel: [96617.382901]  ? __thread_exit+0x20/0x20 [spl]                                                                         
│May  4 17:11:31 <host> kernel: [96617.382905]  ? thread_generic_wrapper+0x6f/0x80 [spl]                                                                
│May  4 17:11:31 <host> kernel: [96617.382910]  thread_generic_wrapper+0x6f/0x80 [spl]                                                                  
│May  4 17:11:31 <host> kernel: [96617.382920]  kthread+0x112/0x130                                                                                     
│May  4 17:11:31 <host> kernel: [96617.382926]  ? kthread_bind+0x30/0x30                                                                                
│May  4 17:11:31 <host> kernel: [96617.382935]  ret_from_fork+0x22/0x40                

Subsequent errors

│May 4 17:16:05 <host> kernel: [96890.895937] hrtimer: interrupt took 74060 ns
│May 4 17:16:20 <host> kernel: [96906.140617] INFO: task txg_quiesce:1694 blocked for more than 120 seconds.
│May 4 17:16:20 <host> kernel: [96906.140714] Tainted: P OE 4.19.0-16-amd64 #1 Debian 4.19.181-1
│May 4 17:16:20 <host> kernel: [96906.140799] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
│May 4 17:16:20 <host> kernel: [96906.140884] txg_quiesce D 0 1694 2 0x80000000
│May 4 17:16:20 <host> kernel: [96906.140889] Call Trace:
│May 4 17:16:20 <host> kernel: [96906.140903] __schedule+0x29f/0x840
│May 4 17:16:20 <host> kernel: [96906.140908] schedule+0x28/0x80
│May 4 17:16:20 <host> kernel: [96906.140924] cv_wait_common+0xfb/0x130 [spl]
│May 4 17:16:20 <host> kernel: [96906.140935] ? finish_wait+0x80/0x80
│May 4 17:16:20 <host> kernel: [96906.141077] txg_quiesce_thread+0x2a9/0x3a0 [zfs]
│May 4 17:16:20 <host> kernel: [96906.141169] ? txg_sync_thread+0x480/0x480 [zfs]
│May 4 17:16:20 <host> kernel: [96906.141176] ? __thread_exit+0x20/0x20 [spl]
│May 4 17:16:20 <host> kernel: [96906.141182] thread_generic_wrapper+0x6f/0x80 [spl]
│May 4 17:16:20 <host> kernel: [96906.141187] kthread+0x112/0x130
│May 4 17:16:20 <host> kernel: [96906.141190] ? kthread_bind+0x30/0x30
│May 4 17:16:20 <host> kernel: [96906.141193] ret_from_fork+0x22/0x40
│May 4 17:16:20 <host> kernel: [96906.141783] INFO: task receive_writer:97766 blocked for more than 120 seconds. ┤
│May 4 17:16:20 <host> kernel: [96906.141865] Tainted: P OE 4.19.0-16-amd64 #1 Debian 4.19.181-1 ┤
│May 4 17:16:20 <host> kernel: [96906.141963] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. ┤
│May 4 17:16:20 <host> kernel: [96906.142065] receive_writer D 0 97766 2 0x80000000 ┤
│May 4 17:16:20 <host> kernel: [96906.142068] Call Trace: ┤
│May 4 17:16:20 <host> kernel: [96906.142074] __schedule+0x29f/0x840 ┤
│May 4 17:16:20 <host> kernel: [96906.142078] ? ret_from_fork+0x21/0x40 ┤
│May 4 17:16:20 <host> kernel: [96906.142083] schedule+0x28/0x80 ┤
│May 4 17:16:20 <host> kernel: [96906.142091] spl_panic+0xf9/0xfb [spl] ┤
│May 4 17:16:20 <host> kernel: [96906.142192] ? dbuf_rele_and_unlock+0x30f/0x660 [zfs] ┤
│May 4 17:16:20 <host> kernel: [96906.142239] ? dbuf_read+0x1ca/0x520 [zfs] │
│May 4 17:16:20 <host> kernel: [96906.142289] ? dnode_hold_impl+0x350/0xc20 [zfs] ┤
│May 4 17:16:20 <host> kernel: [96906.142295] ? spl_kmem_cache_free+0xec/0x1c0 [spl] ┤
│May 4 17:16:20 <host> kernel: [96906.142343] receive_object+0x923/0xc70 [zfs] ┤
│May 4 17:16:20 <host> kernel: [96906.142394] receive_writer_thread+0x279/0xa00 [zfs] ┤
│May 4 17:16:20 <host> kernel: [96906.142399] ? set_curr_task_fair+0x26/0x50 ┤
│May 4 17:16:20 <host> kernel: [96906.142448] ? receive_process_write_record+0x190/0x190 [zfs] ┤
│May 4 17:16:20 <host> kernel: [96906.142455] ? __thread_exit+0x20/0x20 [spl] ┤
│May 4 17:16:20 <host> kernel: [96906.142460] ? thread_generic_wrapper+0x6f/0x80 [spl] ┤
│May 4 17:16:20 <host> kernel: [96906.142465] thread_generic_wrapper+0x6f/0x80 [spl] ┤
│May 4 17:16:20 <host> kernel: [96906.142469] kthread+0x112/0x130 ┤
│May 4 17:16:20 <host> kernel: [96906.142471] ? kthread_bind+0x30/0x30 ┤
│May 4 17:16:20 <host> kernel: [96906.142473] ret_from_fork+0x22/0x40

EDIT May 11th 2021: downgrading ZFS to 0.8.4-1~bpo10+1 resolved the issue.
EDIT June 14th 2021: the issue has still been resolved on the downgraded ZFS version.
EDIT December 20th 2021: the issue is still there. Also with zfs-2.0.3-9 on 5.10.0-8-amd64

@vicsn vicsn added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels May 5, 2021
@stewartadam
Copy link

stewartadam commented Nov 21, 2021

Same issue here, ran into this while copying many snapshots over from one pool to another. Stack trace is similar, but a little different on ZFS 2.1.1:

Nov 17 23:53:57 localhost kernel: VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
Nov 17 23:53:57 localhost kernel: PANIC at dmu_recv.c:1798:receive_object()
Nov 17 23:53:57 localhost kernel: Showing stack for process 1363685
Nov 17 23:53:57 localhost kernel: CPU: 2 PID: 1363685 Comm: receive_writer Tainted: P           OE     5.13.12-200.fc34.x86_64 #1
Nov 17 23:53:57 localhost kernel: Hardware name: Supermicro X10SLH-F/X10SLM+-F/X10SLH-F/X10SLM+-F, BIOS 3.2a 05/31/2019
Nov 17 23:53:57 localhost kernel: Call Trace:
Nov 17 23:53:57 localhost kernel: dump_stack+0x76/0x94
Nov 17 23:53:57 localhost kernel: spl_panic+0xd4/0xfc [spl]
Nov 17 23:53:57 localhost kernel: ? dbuf_rele_and_unlock+0x487/0x640 [zfs]
Nov 17 23:53:57 localhost kernel: ? dbuf_create+0x5c8/0x5f0 [zfs]
Nov 17 23:53:57 localhost kernel: ? __raw_callee_save___native_queued_spin_unlock+0x11/0x1e
Nov 17 23:53:57 localhost kernel: ? dnode_rele_and_unlock+0x64/0xc0 [zfs]
Nov 17 23:53:57 localhost kernel: receive_object+0xa59/0xc50 [zfs]
Nov 17 23:53:57 localhost kernel: receive_writer_thread+0x1ba/0xac0 [zfs]
Nov 17 23:53:57 localhost kernel: ? kfree+0x403/0x430
Nov 17 23:53:57 localhost kernel: ? thread_generic_wrapper+0x62/0x80 [spl]
Nov 17 23:53:57 localhost kernel: ? receive_read_payload_and_next_header+0x2f0/0x2f0 [zfs]
Nov 17 23:53:57 localhost kernel: ? thread_generic_wrapper+0x6f/0x80 [spl]
Nov 17 23:53:57 localhost kernel: thread_generic_wrapper+0x6f/0x80 [spl]
Nov 17 23:53:57 localhost kernel: ? __thread_exit+0x20/0x20 [spl]
Nov 17 23:53:57 localhost kernel: kthread+0x127/0x150
Nov 17 23:53:57 localhost kernel: ? set_kthread_struct+0x40/0x40
Nov 17 23:53:57 localhost kernel: ret_from_fork+0x22/0x30

It does seem random as after reboot I re-ran the exact same zfs send | zfs recv (with resume token) and it completed no issues.

@whyamiroot
Copy link

Any updates for this issue or #12785?

@behlendorf behlendorf added Component: Encryption "native encryption" feature Component: Send/Recv "zfs send/recv" feature and removed Status: Triage Needed New issue which needs to be triaged labels Jan 4, 2022
@vicsn
Copy link
Author

vicsn commented Jan 7, 2022

Some debugging output after a recent crash

The issue occured again on january 6 14:33:34 - I base this on default log output which I think is UTC+1 by default. This would translate to timestamp 1641476014.

zfs_get_all.txt
zfs_dbgmsg.txt
zpool_history.txt

EDIT: logs were set to UTC + 1

@vicsn
Copy link
Author

vicsn commented Jan 11, 2022

One more set of logs of the same event happening a day later at Jan 7 20:20:22 1641583222

dbgmsg_after_reboot.txt
zpool_history.txt
dbgmsg.txt

With the following zpool status -v output:

  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 12:50:05 with 0 errors on Tue Jan  4 00:12:53 2022
config:

	NAME        STATE     READ WRITE CKSUM
	data        ONLINE       0     0     0
	  raidz2-0  ONLINE       0     0     0
	    sda     ONLINE       0     0     0
	    sdb     ONLINE       0     0     0
	    sdd     ONLINE       0     0     0
	    sde     ONLINE       0     0     0
	spares
	  sdc       AVAIL   

errors: Permanent errors have been detected in the following files:

        data/private/private/private/private/private/%recv:<0x0>

@gamanakis
Copy link
Contributor

I am pretty sure the panic happens because dbuf_read() returns an EIO error in the codepath: receive_object() -> dmu_bonus_hold_by_dnode() -> dbuf_read().

Anyone who can replicate the bug:

  1. set echo 512 | sudo tee /sys/module/zfs/parameters/zfs_flags
  2. Trigger the bug
  3. post a cat /proc/spl/kstat/zfs/dbgmsg

@malfret
Copy link

malfret commented Apr 4, 2022

System information

Type Version/Name
Distribution Name Proxmox
Distribution Version 7.1.12
Linux Kernel 5.13.19-6-pve
Architecture amd64
ZFS Version 2.1.4-pve1

zpool get all.txt

zfs send -c -I | ssh ... receive -Fe
echo 512 > /sys/module/zfs/parameters/zfs_flags

syslog.txt
dbgmsg.txt

Time syslog UTC+4

@gamanakis
Copy link
Contributor

@malfret thank you for the logs. Based on the time the panic occurred I went to look for events in dbgmsg.txt around 1648983726. However the events logged into dbgmsg.txt are about 10 minutes later than the panic (starting at 1648984381).

If you can reproduce this, would you mind posting a dbgmsg (with echo 512 > /sys/module/zfs/parameters/zfs_flags set beforehand) right after the panic occurs?

@malfret
Copy link

malfret commented Apr 4, 2022

At the time of panic, the file system was not working properly and the shell script was unable to save the log. The log was saved manually.

How to increase the log size so that the data can be saved?

@malfret
Copy link

malfret commented Apr 4, 2022

If I change zfs_dbgmsg_maxsize - it will increase the size of the dbgmsg log?

echo $((128*1024*1024)) >> /sys/module/zfs/parameters/zfs_dbgmsg_maxsize

@gamanakis
Copy link
Contributor

If I change zfs_dbgmsg_maxsize - it will increase the size of the dbgmsg log?

echo $((128*1024*1024)) >> /sys/module/zfs/parameters/zfs_dbgmsg_maxsize

Right, that may help.

@malfret
Copy link

malfret commented Apr 6, 2022

2022-04-06_dbgmsg.log.zip
syslog 2022-04-06.txt

Time syslog UTC+4

@gamanakis
Copy link
Contributor

gamanakis commented May 4, 2022

The panic seems to happen because arc_untransform() returns an EIO error in the codepath: receive_object() -> dmu_bonus_hold_by_dnode() -> dbuf_read() -> arc_untransform().

And that happens because arc_buf_fill() returns an error ECKSUM (EBADE or 52 in linux).

@rockhouse
Copy link

@gamanakis We are still experiencing this bug on our production systems, any chance that this will be solved any time soon? We are running the latest version from debian stable 2.0.3-9. Could this be escalated maybe?

Any help is appreciated!

@siilike
Copy link

siilike commented Dec 18, 2022

Also happened with 2.1.5-1~bpo11+1. Nothing logged to syslog.

kernel:[9838382.870025] VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
kernel:[9838382.870083] PANIC at dmu_recv.c:1811:receive_object()

@singingtelegram
Copy link

Also occurred with 2.1.7 (with c8d2ab0 reverted, otherwise I would run into #14252).

Dec 23 12:30:18 jonathan-desktop kernel: VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
Dec 23 12:30:18 jonathan-desktop kernel: PANIC at dmu_recv.c:1806:receive_object()
Dec 23 12:30:18 jonathan-desktop kernel: Showing stack for process 3273208
Dec 23 12:30:18 jonathan-desktop kernel: CPU: 5 PID: 3273208 Comm: receive_writer Tainted: P           OE      6.1.1-x64v3-xanmod1 #0~20221222.0006b63
Dec 23 12:30:18 jonathan-desktop kernel: Hardware name: To Be Filled By O.E.M. X570 Taichi/X570 Taichi, BIOS P5.00 10/19/2022
Dec 23 12:30:18 jonathan-desktop kernel: Call Trace:
Dec 23 12:30:18 jonathan-desktop kernel:  <TASK>
Dec 23 12:30:18 jonathan-desktop kernel:  dump_stack_lvl+0x44/0x5c
Dec 23 12:30:18 jonathan-desktop kernel:  spl_panic+0xcb/0xe3 [spl]
Dec 23 12:30:18 jonathan-desktop kernel:  ? dbuf_rele_and_unlock+0x5ad/0x730 [zfs]
Dec 23 12:30:18 jonathan-desktop kernel:  ? dbuf_cons+0xa2/0xc0 [zfs]
Dec 23 12:30:18 jonathan-desktop kernel:  ? spl_kmem_cache_alloc+0x9f/0x740 [spl]
Dec 23 12:30:18 jonathan-desktop kernel:  ? dbuf_read+0x10a/0x5e0 [zfs]
Dec 23 12:30:18 jonathan-desktop kernel:  ? aggsum_add+0x16a/0x180 [zfs]
Dec 23 12:30:18 jonathan-desktop kernel:  ? dnode_rele_and_unlock+0x54/0xe0 [zfs]
Dec 23 12:30:18 jonathan-desktop kernel:  receive_object+0xb06/0xd30 [zfs]
Dec 23 12:30:18 jonathan-desktop kernel:  ? dnode_rele_and_unlock+0x54/0xe0 [zfs]
Dec 23 12:30:18 jonathan-desktop kernel:  ? dmu_object_info+0x4f/0x80 [zfs]
Dec 23 12:30:18 jonathan-desktop kernel:  receive_writer_thread+0x1b2/0xa90 [zfs]
Dec 23 12:30:18 jonathan-desktop kernel:  ? set_next_task_fair+0x66/0x90
Dec 23 12:30:18 jonathan-desktop kernel:  ? start_cfs_bandwidth.part.0+0x50/0x50
Dec 23 12:30:18 jonathan-desktop kernel:  ? set_user_nice.part.0+0x11b/0x330
Dec 23 12:30:18 jonathan-desktop kernel:  ? receive_process_write_record+0x160/0x160 [zfs]
Dec 23 12:30:18 jonathan-desktop kernel:  ? __thread_exit+0x10/0x10 [spl]
Dec 23 12:30:18 jonathan-desktop kernel:  thread_generic_wrapper+0x55/0x60 [spl]
Dec 23 12:30:18 jonathan-desktop kernel:  kthread+0xd5/0x100
Dec 23 12:30:18 jonathan-desktop kernel:  ? kthread_complete_and_exit+0x20/0x20
Dec 23 12:30:18 jonathan-desktop kernel:  ret_from_fork+0x22/0x30
Dec 23 12:30:18 jonathan-desktop kernel:  </TASK>

@lnagel
Copy link

lnagel commented Apr 23, 2023

[1687794.507453] VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
[1687794.507479] PANIC at dmu_recv.c:1806:receive_object()
[1687794.507493] Showing stack for process 2882666
[1687794.507505] CPU: 5 PID: 2882666 Comm: receive_writer Tainted: P      D    OE     5.19.0-38-generic #39~22.04.1-Ubuntu
[1687794.507529] Hardware name: Intel(R) Client Systems NUC8i7BEH/NUC8BEB, BIOS BECFL357.86A.0090.2022.0916.1942 09/16/2022
[1687794.507547] Call Trace:
[1687794.507557]  <TASK>
[1687794.507571]  show_stack+0x52/0x69
[1687794.507598]  dump_stack_lvl+0x49/0x6d
[1687794.507629]  dump_stack+0x10/0x18
[1687794.507647]  spl_dumpstack+0x29/0x35 [spl]
[1687794.507691]  spl_panic+0xd1/0xe9 [spl]
[1687794.507732]  ? arc_space_consume+0x54/0x130 [zfs]
[1687794.508027]  ? dbuf_create+0x5c1/0x5f0 [zfs]
[1687794.508306]  ? dbuf_read+0x11b/0x630 [zfs]
[1687794.508587]  ? dmu_bonus_hold_by_dnode+0x15b/0x1b0 [zfs]
[1687794.508860]  receive_object+0xae1/0xd40 [zfs]
[1687794.509186]  ? __slab_free+0x31/0x340
[1687794.509211]  ? spl_kmem_free+0x32/0x40 [spl]
[1687794.509244]  ? kfree+0x30f/0x330
[1687794.509259]  ? mutex_lock+0x13/0x50
[1687794.509277]  receive_writer_thread+0x1ce/0xb50 [zfs]
[1687794.509560]  ? set_next_task_fair+0x70/0xb0
[1687794.509578]  ? receive_process_write_record+0x1a0/0x1a0 [zfs]
[1687794.509861]  ? spl_kmem_free+0x32/0x40 [spl]
[1687794.509894]  ? kfree+0x30f/0x330
[1687794.509909]  ? receive_process_write_record+0x1a0/0x1a0 [zfs]
[1687794.510191]  ? __thread_exit+0x20/0x20 [spl]
[1687794.510225]  thread_generic_wrapper+0x61/0x80 [spl]
[1687794.510256]  ? thread_generic_wrapper+0x61/0x80 [spl]
[1687794.510288]  kthread+0xeb/0x120
[1687794.510303]  ? kthread_complete_and_exit+0x20/0x20
[1687794.510321]  ret_from_fork+0x1f/0x30
[1687794.510343]  </TASK>

@singingtelegram
Copy link

singingtelegram commented Apr 23, 2023 via email

@lnagel
Copy link

lnagel commented Dec 24, 2023

Using 2.2.0 now

[   46.799495] ZFS: Loaded module v2.2.0-0ubuntu1~23.10, ZFS pool version 5000, ZFS filesystem version 5

The error seems to be the same

[984462.595691] VERIFY0(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
[984462.595700] PANIC at dmu_recv.c:2093:receive_object()
[984462.595703] Showing stack for process 44582
[984462.595704] CPU: 4 PID: 44582 Comm: receive_writer Tainted: P           O       6.5.0-14-generic #14-Ubuntu
[984462.595706] Hardware name: Intel(R) Client Systems NUC8i7BEH/NUC8BEB, BIOS BECFL357.86A.0090.2022.0916.1942 09/16/2022
[984462.595708] Call Trace:
[984462.595710]  <TASK>
[984462.595713]  dump_stack_lvl+0x48/0x70
[984462.595722]  dump_stack+0x10/0x20
[984462.595724]  spl_dumpstack+0x29/0x40 [spl]
[984462.595742]  spl_panic+0xfc/0x120 [spl]
[984462.595757]  receive_object+0x9ce/0xaa0 [zfs]
[984462.596131]  ? __list_del_entry+0x1d/0x30 [zfs]
[984462.596427]  ? list_del+0xd/0x30 [zfs]
[984462.596674]  receive_process_record+0x84/0x2b0 [zfs]
[984462.596963]  receive_writer_thread+0xb7/0x1a0 [zfs]
[984462.597246]  ? __pfx_receive_writer_thread+0x10/0x10 [zfs]
[984462.597543]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[984462.597563]  thread_generic_wrapper+0x5c/0x70 [spl]
[984462.597583]  kthread+0xef/0x120
[984462.597587]  ? __pfx_kthread+0x10/0x10
[984462.597590]  ret_from_fork+0x44/0x70
[984462.597594]  ? __pfx_kthread+0x10/0x10
[984462.597597]  ret_from_fork_asm+0x1b/0x30
[984462.597601]  </TASK>

@SkyWriter
Copy link

Running 2.2.6 here:

[   14.997701] ZFS: Loaded module v2.2.6-1, ZFS pool version 5000, ZFS filesystem version 5

The system is used as a home NAS, runs 3 2-disk mirror VDEV's + a mirrored SLOG. The pool is encrypted with native ZFS encryption. It exposes NFS, iSCSI. I have regular tasks to receive remote snapshots from my servers and make local ones.

At the time of failure I was running a VM on another host using the NAS data via iSCSI and NFS. I was also receiving remote snapshots (judging by the logs).

The error seems to be the same:

Jan 14 18:21:18 cellar kernel: VERIFY0(dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
Jan 14 18:21:18 cellar kernel: PANIC at dmu_recv.c:2093:receive_object()
Jan 14 18:21:18 cellar kernel: Showing stack for process 3515068
Jan 14 18:21:18 cellar kernel: CPU: 1 PID: 3515068 Comm: receive_writer Tainted: P           O       6.6.52 #1-NixOS
Jan 14 18:21:18 cellar kernel: Hardware name: Default string Default string/Default string, BIOS 5.27 12/21/2023
Jan 14 18:21:18 cellar kernel: Call Trace:
Jan 14 18:21:18 cellar kernel:  <TASK>
Jan 14 18:21:18 cellar kernel:  dump_stack_lvl+0x47/0x60
Jan 14 18:21:18 cellar kernel:  spl_panic+0x100/0x120 [spl]
Jan 14 18:21:18 cellar kernel:  receive_object+0xb5b/0xd80 [zfs]
Jan 14 18:21:18 cellar kernel:  ? __wake_up_common_lock+0x8f/0xd0
Jan 14 18:21:18 cellar kernel:  receive_writer_thread+0x29b/0xb10 [zfs]
Jan 14 18:21:18 cellar kernel:  ? __pfx_receive_writer_thread+0x10/0x10 [zfs]
Jan 14 18:21:18 cellar kernel:  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
Jan 14 18:21:18 cellar kernel:  thread_generic_wrapper+0x5b/0x70 [spl]
Jan 14 18:21:18 cellar kernel:  kthread+0xe5/0x120
Jan 14 18:21:18 cellar kernel:  ? __pfx_kthread+0x10/0x10
Jan 14 18:21:18 cellar kernel:  ret_from_fork+0x31/0x50
Jan 14 18:21:18 cellar kernel:  ? __pfx_kthread+0x10/0x10
Jan 14 18:21:18 cellar kernel:  ret_from_fork_asm+0x1b/0x30
Jan 14 18:21:18 cellar kernel:  </TASK>

There's another thing: none of my data from the time of failure at 18:21:18 until my VM locked up due to iSCSI errors at about 19:20:00 got persisted to the underlying storage. iSCSI Ext4 volume ended up being corrupt, NFS contained no changes for the period. Both were accepting and confirming writes all that time.

NAS reboot was required to recover from the failed state, during import I observed 800 Mb's worth of SLOG getting dumped to disks.

I am attaching a more detailed system log:

zfs_failure_log.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Encryption "native encryption" feature Component: Send/Recv "zfs send/recv" feature Type: Defect Incorrect behavior (e.g. crash, hang)
Projects
None yet
Development

No branches or pull requests