PANIC at dmu_recv.c on receiving snapshot to encrypted file system #12732
Comments
I think this is just another flavor of #11679, where a NULL winds up where a valid buffer is expected, and fireworks ensue.
Update on this: I'm now seeing this at least once a day, and have had it occur on a host that receives from this host. As it stands, the backup host is effectively non-functional.
Further update: after running two complete scrubs of the pool, the corrupted data errors went away and we were able to resume transfers. As a test, Sanoid was disabled so that there should be no snapshot creation or deletion on a file system whilst it was being modified by a receive, and things were much more stable. However, leaving the large (~20TB) transfer running over the weekend, I returned this morning to find it hung up again, but this time with a different error:

Nov 20 12:41:15 fs3 kernel: Modules linked in: 8021q garp mrp bonding ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat n

Data errors are back (on different file systems), so I have some hope that scrubs will clear these.
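(For reference, the recovery path described above boils down to roughly the following; the pool name tank is a placeholder, not taken from this report.)

```sh
# Hedged sketch of the scrub-based recovery described above ("tank" is a placeholder).
zpool scrub tank        # first full scrub
zpool status -v tank    # once it completes, review the permanent error list
zpool scrub tank        # errors referencing already-deleted snapshots are usually
zpool status -v tank    # only dropped from the list after a second complete scrub
```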
Is there an error message above the "Modules linked in: ..."? Because I would usually expect one, I think.
...and a bit later on in the logs...
Correction: there was an NMI watchdog line; I must have failed to copy that the first time. It's for the same process ID as the second call trace.
Huh. I wonder what state is wrong that it's stuck on forever, because I wouldn't expect an ordinary "blocked taskq" to trigger one of those.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
Bumping this, as I've been running into exactly the same symptoms as above. This essentially makes native ZFS encryption completely unusable for me. There's also another somewhat similar issue: sending from an encrypted fs also leads to seemingly recoverable "permanent data errors" on intermediate from-snapshots. It doesn't lead to any kernel errors, and recovering from it is simply a matter of deleting the offending snapshot.
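(In concrete terms, the recovery described here is roughly the following; the pool and snapshot names are placeholders.)

```sh
# Hedged sketch with placeholder names: find the snapshot zpool reports as
# holding permanent errors, destroy it, then let a scrub age the entry out.
zpool status -v tank              # errors appear as e.g. tank/data@from-snap:<0x0>
zfs destroy tank/data@from-snap   # delete the offending intermediate snapshot
zpool scrub tank                  # subsequent scrub(s) clear the error listing
```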
You might find the fix for #13709 useful for the latter, w23, because I don't know what else might break those.
Thanks for the link. After reading it a bit, I'm not sure it's relevant to the issues I'm seeing; the pools and encrypted filesystems get unlocked and mounted just fine, and it's send/recv that has issues. It feels like some sort of pool-wide race condition, as doing more ops on completely different filesystem subtrees (i.e. unrelated by nesting, and using different keys) severely exacerbates the problem. These ops include stuff like:
Transfer speed also affects the probability of hitting it. It can be fine for weeks if transfers are around a few hundred KiB/sec; if it gets to a few dozen MiB/sec, it usually panics within a few minutes. I contemplated making a standalone test, e.g. a minimal qemu image with a script doing …
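(A standalone test along those lines might look roughly like the sketch below; every pool and dataset name is a placeholder, and this is only an illustration of the workload's shape, not a confirmed reproducer.)

```sh
#!/bin/sh
# Illustrative only: stream a large snapshot into an encrypted dataset at full
# speed while unrelated datasets on the same pool see snapshot churn.
zfs send -R tank/src@big | zfs receive -F tank/enc/dst &

while true; do
    ts=$(date +%s)
    zfs snapshot tank/other@t$ts
    zfs destroy  tank/other@t$ts
done
```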
I was talking about the intermediate snapshots failing to unlock, as I was remarking that it might be that problem.
System information
Describe the problem you're observing
This issue seems very similar to #12270 and #12001 but has a different stack trace.
The system is a backup target and uses Sanoid/Syncoid to manage and transfer snapshots from several systems; it has been in operation for several years. Whilst running ZFS 0.8.6 it suffered from regular OOM-killer events on the zpool command but was otherwise stable, so it was upgraded from 0.8.6 to 2.0.6 in mid-October in the hope that this would improve reliability. It hasn't!
The system has ca. 150 ZFS file systems that are either ZFS receive targets, SSH/rsync targets, or run a minio server (as an experiment with object stores, but currently unused). All local file systems are encrypted with aes-256-ccm. Remote sources are a mixture of unencrypted, aes-256-ccm and aes-256-gcm (depending on host ZFS capabilities and age). In addition, an offsite ZFS target pulls these snapshots to provide for disaster recovery. That remote system is stable and runs on largely identical hardware, but only pulls from this host (no transfers out).
Host has 64GB RAM.
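(For illustration, a receive target of the kind described above might be set up roughly as follows; the pool, dataset and host names are placeholders and the key handling is an assumption, not taken from this report.)

```sh
# Hedged sketch: an aes-256-ccm encrypted hierarchy used as a receive target.
zfs create -o encryption=aes-256-ccm -o keyformat=passphrase tank/backups

# A non-raw incremental receive into it; the incoming stream is re-encrypted
# with the destination's aes-256-ccm key as it is written.
ssh src "zfs send -I tank/data@prev tank/data@now" | zfs receive -F tank/backups/data
```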
Snapshots are created and destroyed on both the source and the destination throughout long transfers; several of these transfers are multi-TB in size and are failing to complete between hangs.
On hang:
Syncoid is called with:
--no-sync-snap --create-bookmark -r --skip-parent
(I have experimented with changing the mbuffer-size, with little effect.)
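(For reference, a full invocation with those options might look like the line below; the hosts, dataset names and mbuffer size are placeholders, only the flags are taken from above.)

```sh
syncoid --no-sync-snap --create-bookmark -r --skip-parent \
    --mbuffer-size=128M \
    root@source-host:tank/data backup/hosts/source-host/data
```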
At the time of the hang the system log contains:
I believe the minio hangs are just a symptom of the underlying issue, not the cause.
Post reboot, the permanent errors are typically in snapshots (or files within snapshots), and mostly from deleted snapshots:
This happened again today (after a reboot this morning to recover from the above).
At this point, zfs sends appear to be still in operation (remote target continues to receive data) but anything involving writes is hung.
Describe how to reproduce the problem
Use sanoid/syncoid to generate and transfer (push and pull) snapshots between systems. Wait...
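(Concretely, the workload is of roughly this shape; the schedule, paths and dataset names below are illustrative assumptions, not the actual configuration from this system.)

```
# Illustrative /etc/cron.d entries: sanoid creates and prunes snapshots on a
# cadence while syncoid transfers them, so receives overlap with snapshot churn.
*/15 * * * * root /usr/sbin/sanoid --cron
0 * * * * root /usr/sbin/syncoid --no-sync-snap --create-bookmark -r --skip-parent root@source-host:tank/data backup/hosts/source-host/data
```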
Include any warning/errors/backtraces from the system logs