ZFS hangs on kernel error: VERIFY3(0 == dmu_bonus_hold_by_dnode #12001
Comments
Same issue here, ran into this while copying many snapshots over from one pool to another. Stack trace is similar, but a little different on ZFS 2.1.1:
It does seem random as after reboot I re-ran the exact same |
Any updates for this issue or #12785? |
Some debugging output after a recent crash. The issue occurred again on January 6 at 14:33:34. I base this on the default log output, which I think is UTC+1 by default. This would translate to timestamp 1641476014. zfs_get_all.txt EDIT: logs were set to UTC+1 |
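The conversion from log time to Unix timestamp can be double-checked with a short sketch. The year 2022 is an assumption (the comment only gives the day and time), and the fixed UTC+1 offset follows the EDIT note; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the log clock runs at a fixed UTC+1 offset, as stated in the EDIT note.
LOG_TZ = timezone(timedelta(hours=1))


def log_time_to_epoch(year, month, day, hour, minute, second, tz=LOG_TZ):
    """Convert a wall-clock log time in the given timezone to a Unix timestamp."""
    local = datetime(year, month, day, hour, minute, second, tzinfo=tz)
    return int(local.timestamp())


# Assumption: the crash happened in 2022; the comment only says "January 6 14:33:34".
print(log_time_to_epoch(2022, 1, 6, 14, 33, 34))  # 1641476014
```

Under those assumptions the result matches the timestamp quoted in the comment, so entries in the debug log can be searched around that value.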
One more set of logs of the same event happening a day later at dbgmsg_after_reboot.txt With the following
|
I am pretty sure the panic happens because Anyone who can replicate the bug:
|
System information
Time syslog UTC+4 |
@malfret thank you for the logs. Based on the time the panic occurred I went to look for events in dbgmsg.txt around 1648983726. However the events logged into dbgmsg.txt are about 10 minutes later than the panic (starting at 1648984381). If you can reproduce this, would you mind posting a dbgmsg (with |
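To make the mismatch concrete, here is a small sketch that computes the gap between the panic and the first event present in dbgmsg.txt, and renders the panic time in the UTC+4 syslog time reported earlier. The two timestamps are taken from the comment; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

PANIC_TS = 1648983726         # time of the panic, from the comment
DBGMSG_FIRST_TS = 1648984381  # first event found in dbgmsg.txt


def gap_seconds(earlier, later):
    """Seconds elapsed between two Unix timestamps."""
    return later - earlier


def to_syslog_time(ts, utc_offset_hours=4):
    """Render a Unix timestamp in a fixed-offset timezone (UTC+4 per the report)."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(ts, tz).strftime("%Y-%m-%d %H:%M:%S")


gap = gap_seconds(PANIC_TS, DBGMSG_FIRST_TS)
print(f"dbgmsg starts {gap} s (~{gap / 60:.1f} min) after the panic")
print("panic in syslog time:", to_syslog_time(PANIC_TS))
```

A gap of this size means the events leading up to the panic had already been dropped from the debug log by the time it was saved.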
At the time of the panic, the file system was not working properly and the shell script was unable to save the log, so the log was saved manually. How can I increase the log size so that the data can be saved? |
If I change zfs_dbgmsg_maxsize, will it increase the size of the dbgmsg log?
|
Right, that may help. |
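A minimal sketch of raising the limit at runtime, assuming a Linux system where OpenZFS exposes the tunable under /sys/module/zfs/parameters and the script runs as root. The function name and the example size are hypothetical; the path is passed as a parameter so the logic can be tried against an ordinary file first.

```python
from pathlib import Path

# Assumption: Linux OpenZFS exposes the tunable at this sysfs path.
PARAM_PATH = Path("/sys/module/zfs/parameters/zfs_dbgmsg_maxsize")


def set_dbgmsg_maxsize(size_bytes, param_path=PARAM_PATH):
    """Write a new dbgmsg size limit (in bytes) and return the value read back."""
    if size_bytes <= 0:
        raise ValueError("size_bytes must be positive")
    param_path = Path(param_path)
    param_path.write_text(f"{size_bytes}\n")
    return int(param_path.read_text().strip())


# Example call (needs root on a host with the zfs module loaded).
# 64 MiB is an arbitrary illustration, not a value recommended in this thread:
# set_dbgmsg_maxsize(64 * 1024 * 1024)
```

A value written this way does not survive a reboot or module reload; making it persistent would normally be done through the module options, which is outside this sketch.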
2022-04-06_dbgmsg.log.zip Time syslog UTC+4 |
The panic seems to happen because And that happens because |
@gamanakis We are still experiencing this bug on our production systems. Is there any chance that this will be solved soon? We are running the latest version from Debian stable, 2.0.3-9. Could this be escalated? Any help is appreciated! |
Also happened with 2.1.5-1~bpo11+1. Nothing logged to syslog.
|
Also occurred with 2.1.7 (with c8d2ab0 reverted, otherwise I would run into #14252).
|
|
You might want to update to the latest ZFS release. jonathonf's PPA hasn't
been updated in quite a while now.
What I did was that I grabbed the zfs-linux source tarball from the lunar
repo and compiled it with dpkg-buildpackage.
…On Sat, Apr 22, 2023 at 9:21 PM Lenno Nagel ***@***.***> wrote:
[1687794.507453] VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) failed (0 == 5)
[1687794.507479] PANIC at dmu_recv.c:1806:receive_object()
[1687794.507493] Showing stack for process 2882666
[1687794.507505] CPU: 5 PID: 2882666 Comm: receive_writer Tainted: P D OE 5.19.0-38-generic #39~22.04.1-Ubuntu
[1687794.507529] Hardware name: Intel(R) Client Systems NUC8i7BEH/NUC8BEB, BIOS BECFL357.86A.0090.2022.0916.1942 09/16/2022
[1687794.507547] Call Trace:
[1687794.507557] <TASK>
[1687794.507571] show_stack+0x52/0x69
[1687794.507598] dump_stack_lvl+0x49/0x6d
[1687794.507629] dump_stack+0x10/0x18
[1687794.507647] spl_dumpstack+0x29/0x35 [spl]
[1687794.507691] spl_panic+0xd1/0xe9 [spl]
[1687794.507732] ? arc_space_consume+0x54/0x130 [zfs]
[1687794.508027] ? dbuf_create+0x5c1/0x5f0 [zfs]
[1687794.508306] ? dbuf_read+0x11b/0x630 [zfs]
[1687794.508587] ? dmu_bonus_hold_by_dnode+0x15b/0x1b0 [zfs]
[1687794.508860] receive_object+0xae1/0xd40 [zfs]
[1687794.509186] ? __slab_free+0x31/0x340
[1687794.509211] ? spl_kmem_free+0x32/0x40 [spl]
[1687794.509244] ? kfree+0x30f/0x330
[1687794.509259] ? mutex_lock+0x13/0x50
[1687794.509277] receive_writer_thread+0x1ce/0xb50 [zfs]
[1687794.509560] ? set_next_task_fair+0x70/0xb0
[1687794.509578] ? receive_process_write_record+0x1a0/0x1a0 [zfs]
[1687794.509861] ? spl_kmem_free+0x32/0x40 [spl]
[1687794.509894] ? kfree+0x30f/0x330
[1687794.509909] ? receive_process_write_record+0x1a0/0x1a0 [zfs]
[1687794.510191] ? __thread_exit+0x20/0x20 [spl]
[1687794.510225] thread_generic_wrapper+0x61/0x80 [spl]
[1687794.510256] ? thread_generic_wrapper+0x61/0x80 [spl]
[1687794.510288] kthread+0xeb/0x120
[1687794.510303] ? kthread_complete_and_exit+0x20/0x20
[1687794.510321] ret_from_fork+0x1f/0x30
[1687794.510343] </TASK>
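For anyone comparing traces across reports, the following sketch pulls the failed expression and the returned value out of a VERIFY3 line and maps the value to an errno name. Treating the right-hand value as a standard errno is an assumption made for illustration, and the regular expression and function name are hypothetical.

```python
import errno
import re

# Matches lines such as: VERIFY3(<expr>) failed (<lhs> == <rhs>)
VERIFY3_RE = re.compile(
    r"VERIFY3\((?P<expr>.+)\) failed \((?P<lhs>-?\d+) == (?P<rhs>-?\d+)\)"
)


def decode_verify3(line):
    """Return (expression, value, errno name) for a VERIFY3 failure line, or None."""
    match = VERIFY3_RE.search(line)
    if match is None:
        return None
    value = int(match.group("rhs"))
    # Assumption: the right-hand value is a standard errno number.
    return match.group("expr"), value, errno.errorcode.get(value, "unknown")


line = (
    "[1687794.507453] VERIFY3(0 == dmu_bonus_hold_by_dnode(dn, FTAG, &db, flags)) "
    "failed (0 == 5)"
)
print(decode_verify3(line))
```

For the trace above this gives the value 5, which corresponds to EIO under that assumption.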
|
Using 2.2.0 now
The error seems to be the same
|
Running 2.2.6 here:
The system is used as a home NAS and runs three 2-disk mirror vdevs plus a mirrored SLOG. The pool is encrypted with native ZFS encryption. It exposes NFS and iSCSI. I have regular tasks that receive remote snapshots from my servers and make local ones. At the time of failure I was running a VM on another host using the NAS data via iSCSI and NFS. I was also receiving remote snapshots (judging by the logs). The error seems to be the same:
There's another thing: none of my data from the time of failure at 18:21:18 until my VM locked up due to iSCSI errors at about 19:20:00 was persisted to the underlying storage. The iSCSI ext4 volume ended up corrupt, and NFS contained no changes for that period, even though both were accepting and confirming writes all that time. A NAS reboot was required to recover from the failed state; during import I observed 800 MB worth of SLOG getting dumped to disks. I am attaching a more detailed system log: |
System information
Zpool properties:
Example test machine properties:
Describe the problem you're observing
During an incremental send using zfs send -I | receive -F, ZFS hangs. Any subsequent call to ZFS, or e.g. opening any file on the ZFS filesystem, will hang indefinitely.
Note that we also experienced #12014; perhaps it is related.
Describe how to reproduce the problem
This is happening frequently for us, on multiple machines, all using HDDs. The machines act as backup targets so most activity is receiving snapshots all day. Other zfs related processes on the machines are: snapshots are taken and pruned multiple times per day and a scrub happens once a week. We use encrypted filesystems and a single zpool. Historically the error has triggered for us between 1 and 48 hours.
Total data usage is 2.12T, with zpool iostat output showing:
Current mitigation
Downgrading to zfs version 0.8.4-1~bpo10+1 seems to have resolved the issue. At the time of writing we had about a month without ZFS hanging.
Include any warning/errors/backtraces from the system logs
The first error
Subsequent errors
EDIT May 11th 2021: downgrading ZFS to 0.8.4-1~bpo10+1 resolved the issue.
EDIT June 14th 2021: the issue has still been resolved on the downgraded ZFS version.
EDIT December 20th 2021: the issue is still there. Also with zfs-2.0.3-9 on 5.10.0-8-amd64