ZFS corruption related to snapshots post-2.0.x upgrade #12014

Open
jgoerzen opened this issue May 8, 2021 · 201 comments
Labels
Component: Encryption "native encryption" feature Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@jgoerzen

jgoerzen commented May 8, 2021

System information

Type Version/Name
Distribution Name Debian
Distribution Version Buster
Linux Kernel 5.10.0-0.bpo.5-amd64
Architecture amd64
ZFS Version 2.0.3-1~bpo10+1
SPL Version 2.0.3-1~bpo10+1

Describe the problem you're observing

Since upgrading to 2.0.x and enabling crypto, every week or so, I start to have issues with my zfs send/receive-based backups. Upon investigating, I will see output like this:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:03:37 with 0 errors on Mon May  3 16:58:33 2021
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>

Of note, the <0xeb51> is sometimes a snapshot name; if I zfs destroy the snapshot, it is replaced by this tag.

Bug #11688 implies that zfs destroy on the snapshot and then a scrub will fix it. For me, it did not. If I run a scrub without rebooting after seeing this kind of zpool status output, I get the following in very short order, and the scrub (and eventually much of the system) hangs:

[393801.328126] VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
[393801.328129] PANIC at arc.c:3790:arc_buf_destroy()
[393801.328130] Showing stack for process 363
[393801.328132] CPU: 2 PID: 363 Comm: z_rd_int Tainted: P     U     OE     5.10.0-0.bpo.5-amd64 #1 Debian 5.10.24-1~bpo10+1
[393801.328133] Hardware name: Dell Inc. XPS 15 7590/0VYV0G, BIOS 1.8.1 07/03/2020
[393801.328134] Call Trace:
[393801.328140]  dump_stack+0x6d/0x88
[393801.328149]  spl_panic+0xd3/0xfb [spl]
[393801.328153]  ? __wake_up_common_lock+0x87/0xc0
[393801.328221]  ? zei_add_range+0x130/0x130 [zfs]
[393801.328225]  ? __cv_broadcast+0x26/0x30 [spl]
[393801.328275]  ? zfs_zevent_post+0x238/0x2a0 [zfs]
[393801.328302]  arc_buf_destroy+0xf3/0x100 [zfs]
[393801.328331]  arc_read_done+0x24d/0x490 [zfs]
[393801.328388]  zio_done+0x43d/0x1020 [zfs]
[393801.328445]  ? zio_vdev_io_assess+0x4d/0x240 [zfs]
[393801.328502]  zio_execute+0x90/0xf0 [zfs]
[393801.328508]  taskq_thread+0x2e7/0x530 [spl]
[393801.328512]  ? wake_up_q+0xa0/0xa0
[393801.328569]  ? zio_taskq_member.isra.11.constprop.17+0x60/0x60 [zfs]
[393801.328574]  ? taskq_thread_spawn+0x50/0x50 [spl]
[393801.328576]  kthread+0x116/0x130
[393801.328578]  ? kthread_park+0x80/0x80
[393801.328581]  ret_from_fork+0x22/0x30

However I want to stress that this backtrace is not the original cause of the problem, and it only appears if I do a scrub without first rebooting.

After that panic, the scrub stalled -- and a second error appeared:

zpool status -v
  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub in progress since Sat May  8 08:11:07 2021
	152G scanned at 132M/s, 1.63M issued at 1.41K/s, 172G total
	0B repaired, 0.00% done, no estimated completion time
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xeb51>:<0x0>
        rpool/crypt/debian-1/home/jgoerzen/no-backup@[elided]-hourly-2021-05-07_02.17.01--2d:<0x0>

I have found that the solution to this issue is to reboot into single-user mode and run a scrub. Sometimes it takes several scrubs, maybe even with some reboots in between, but eventually it will clear up the issue. If I reboot before scrubbing, I do not get the panic or the hung scrub.
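
Roughly, that recovery sequence looks like this (pool name taken from the output above; this is just a sketch of the manual steps, not a guaranteed fix):

# after rebooting into single-user/rescue mode
zpool scrub rpool
zpool status -v rpool   # repeat the scrub (and reboot) until the permanent errors clear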

I run this same version of ZoL on two other machines, one of which runs this same kernel version. What is unique about this machine?

  • It is a laptop
  • It uses ZFS crypto (the others use LUKS)

I made a significant effort to rule out hardware issues, including running several memory tests and the built-in Dell diagnostics. I believe I have ruled that out.

Describe how to reproduce the problem

I can't at will. I have to wait for a spell.

Include any warning/errors/backtraces from the system logs

See above

Potentially related bugs

@jgoerzen jgoerzen added Status: Triage Needed New issue which needs to be triaged Type: Defect Incorrect behavior (e.g. crash, hang) labels May 8, 2021
@jgoerzen
Author

jgoerzen commented May 8, 2021

Two other interesting tidbits...

When I do the reboot after this issue occurs, the mounting of the individual zfs datasets is S L O W. Several seconds each, and that normally just flies by. After scrubbing, it is back to normal speed of mounting.

The datasets that have snapshot issues vary with each occurrence. Sometimes it's just one, sometimes many. But var is almost always included. (Though its parent, which has almost no activity ever, also is from time to time, so that's odd.)

@jstenback
Contributor

Same symptoms here, more or less. See also issue #11688.

@glueckself

glueckself commented May 9, 2021

I also have the symptom with the corrupted snapshots, without kernel panics so far.

So far it only affected my Debian system with Linux 5.10 and zfs 2.0.3 (I've turned the server off for today, I can check the exact versions tomorrow). Also, while the system has the 2.0.3 zfs utils + module, the pool is still left on the 0.8.6 format. I wasn't able to execute zfs list -r -t all <affected dataset> - it displayed "cannot iterate filesystems" and listed only a few snapshots (instead of the tens it should have). Also, I couldn't destroy the affected snapshots because it said they didn't exist anymore. I couldn't send the dataset with syncoid at all.

On the corrupted system, after I got the mail from ZED, I manually ran a scrub at first, after which the zpool status said that there were no errors. However, the next zpool status, seconds after the first, again said that there were errors. Subsequent scrubs didn't clean the errors.

I've rebooted the server into an Ubuntu 20.10 live with zfs 0.8.4-1ubuntu11 (again, sorry that I haven't noted the version, can add it tomorrow) and after a scrub the errors were gone. Following scrubs haven't detected errors anymore. zfs list -r -t all ... again displayed a large list of snapshots.

The errors didn't seem to affect the data on the zvols (all 4 affected snapshots are of zvols). The zvols are used as disks for VMs with ext4 on them. I will verify them tomorrow.
EDIT: I checked one of the VM disks, neither fsck nor dpkg -V (verify checksums of all files installed from a package) could find any errors (except mismatching dpkg-checksums of config files I've changed - that is to be expected).

I have two other Ubuntu 21.04-based systems with zfs-2.0.2-1ubuntu5 which are not affected so far. However, they already have their pools upgraded to 2. All are snapshotted with sanoid and have the datasets encrypted.

My next step will be to downgrade zfs back to 0.8.6 on the Debian system and see what happens.

EDIT:
More points I've noted while investigating with 0.8.4-1ubuntu11:

  • Creating new snapshots continued working for affected datasets; however, destroying them didn't (right now I have 127 "frequently" snapshots (sanoid's term for the most frequent snapshot interval - in my case 15 minutes) instead of the 10 sanoid is configured to keep).
  • With 0.8, the destroying of the affected snapshots worked. Scrubbing afterwards didn't find any errors.

EDIT 2:

  • On 2.0.2 (Ubuntu 21.04 again), sanoid managed to successfully prune (destroy) all remaining snapshots that were past their valid time. A scrub afterwards didn't find any errors. I'll be running 2.0.2 for a while and see what happens.

@dcarosone

dcarosone commented May 21, 2021

I'm seeing this too, on Ubuntu 21.04, also using zfs encryption

I have znapzend running, and it makes a lot of snapshots. Sometimes, some of them are bad, and can't be used (for example, attempting to send them to a replica destination fails). I now use the skipIntermediates option, and so at least forward progress is made on the next snapshot interval.

In the most recent case (this morning) I had something like 4300 errors (many more than I'd seen previously). There are no block-level errors (read/write/cksum). They're cleared after destroying the affected snapshots and scrubbing (and maybe a reboot, depending on .. day?)

Warning! Speculation below:

  • this may be related to a race condition?
  • znapzend wakes up and makes recursive snapshots of about 6 first-level child datasets of rpool (ROOT, home, data, ...) all at the same time (as well as a couple of other pools, some of those still using LUKS for encryption underneath instead).
  • I have been having trouble with the ubuntu-native zsysd, which gets stuck at 100% cpu. Normally I get frustrated and just disable it.
  • However, recently, I have been trying to understand what it's doing and what's going wrong (it tries to collect every dataset and snapshot and property in memory on startup). It seems like this has happened several times in the past few days while I have been letting zsysd run (so more contention for libzfs operations)
  • Update I haven't seen this again since disabling zsysd .. ~3 weeks and counting.

@aerusso
Contributor

aerusso commented Jun 12, 2021

@jgoerzen Can you

  1. Capture the zpool events -v report when one of these "bad" snapshots is created?
  2. Try to zfs send that snapshot (i.e., to zfs send ... | cat >/dev/null; notice the need to use cat).
  3. Reboot, and try to zfs send the snapshot.

In my case, #11688 (which you already reference), I've discovered that rebooting "heals" the snapshot -- at least using the patchset I mentioned there.
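
Concretely, a minimal sketch of those three steps (dataset and snapshot names are placeholders):

# 1. capture the event log around the time the "bad" snapshot is created
zpool events -v > /root/zpool-events-$(date +%s).log

# 2. try to send the suspect snapshot; cat forces the stream to actually be consumed
zfs send rpool/crypt/some/dataset@suspect-snap | cat > /dev/null; echo "send exit status: $?"

# 3. reboot, then repeat step 2 and compare the result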

@jgoerzen
Author

I'll be glad to. Unfortunately, I rebooted the machine yesterday, so I expect it will be about a week before the problem recurs.

It is interesting to see the discussion today in #11688. The unique factor about the machine that doesn't work for me is that I have encryption enabled. It wouldn't surprise me to see the same thing here, but I will of course wait for it to recur and let you know.

@jgoerzen
Author

Hello @aerusso,

The problem recurred over the weekend and I noticed it this morning.

Unfortunately, the incident that caused it had already expired out of the zpool events buffer (apparently), as it only went as far back as less than an hour ago. However, I did find this in syslog:

Jun 20 01:17:39 athena zed: eid=34569 class=authentication pool='rpool' bookmark=12680:0:0:98
Jun 20 01:17:39 athena zed: eid=34570 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=12680:0:0:242
Jun 20 01:17:40 athena zed: eid=34571 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=12680:0:0:261
...
Jun 20 17:17:39 athena zed: eid=37284 class=authentication pool='rpool' bookmark=19942:0:0:98
Jun 20 17:17:39 athena zed: eid=37285 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=19942:0:0:242
Jun 20 17:17:40 athena zed: eid=37286 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=19942:0:0:261
...
Jun 20 18:17:28 athena zed: eid=37376 class=data pool='rpool' priority=2 err=5 flags=0x180 bookmark=21921:0:0:2072
Jun 20 18:17:29 athena zed: eid=37377 class=authentication pool='rpool' priority=2 err=5 flags=0x80 bookmark=21921:0:0:2072
Jun 20 18:17:29 athena zed: eid=37378 class=data pool='rpool' priority=2 err=5 flags=0x80 bookmark=21921:0:0:2072
Jun 20 18:17:40 athena zed: eid=37411 class=authentication pool='rpool' bookmark=21923:0:0:0

It should be noted that my hourly snap/send stuff runs at 17 minutes past the hour, so that may explain this timestamp correlation.

zpool status reported:

  pool: rpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:04:12 with 0 errors on Sun Jun 13 00:28:13 2021
config:

	NAME         STATE     READ WRITE CKSUM
	rpool        ONLINE       0     0     0
	  nvme0n1p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x5c81>:<0x0>
        <0x3188>:<0x0>
        rpool/crypt/debian-1@athena-hourly-2021-06-20_23.17.01--2d:<0x0>
        rpool/crypt/debian-1/var@athena-hourly-2021-06-20_23.17.01--2d:<0x0>
        <0x4de6>:<0x0>

Unfortunately I forgot to attempt to do a zfs send before reboot. Those snapshots, though not referenced directly, would have been included in a send -I that would have been issued. From my logs:

Jun 20 18:17:03 athena simplesnapwrap[4740]: Running: /sbin/zfs send -I rpool/crypt/debian-1/var@__simplesnap_bakfs1_2021-06-20T22:17:02__ rpool/crypt/debian-1/var@__simplesnap_bakfs1_2021-06-20T23:17:03__
Jun 20 18:17:03 athena simplesnap[2466/simplesnapwrap]: internal error: warning: cannot send 'rpool/crypt/debian-1/var@athena-hourly-2021-06-20_23.17.01--2d': Invalid argument

So I think that answers the question.

After a reboot but before a scrub, the zfs send you gave executes fine.

@cbreak-black

I have similar symptoms, on an encrypted single-ssd ubuntu 21.04 boot pool, using stock zfs from ubuntu's repos. Deleting the affected snapshots and scrubbing previously cleared the errors, but on recurrence, repeated scrubbing (without deleting them) caused a deadlock. My system has ECC memory, so it's probably not RAM related.

  • Does this problem happen with slower pools (like hard disk pools?)
  • Does this problem happen with pools that have redundancy?
  • Does this problem happen with pools that don't trim (hard disk pools again)?

@aerusso
Contributor

aerusso commented Jul 4, 2021

@cbreak-black Was there a system restart between the occurrence of the corrupted snapshot and the problems? Restarting has "fixed" this symptom for me (though you will need to scrub twice for the message to disappear, I believe).

I have a suspicion that this may be a version of #10737 , which has an MR under way there. The behavior I am experiencing could be explained by that bug (syncoid starts many zfs sends on my machine, some of which are not finished; SSDs do the send much faster, so are more likely to get deeper into the zfs send before the next command in the pipeline times out; a reboot heals the issue, for me; there's no on disk corruption, as far as I can tell).

I'm holding off on trying to bisect this issue (at least) until testing that MR. (And all the above is conjecture!)

@cbreak-black

@aerusso No, without a restart I got into the scrub-hang, and had to restart hard. Afterwards, the scrub finished, and several of the errors vanished. The rest of the errors vanished after deleting the snapshots and scrubbing again.

@InsanePrawn
Contributor

InsanePrawn commented Jul 4, 2021

Can I join the club too? #10019
Note how it's also at 0x0. Sadly I deleted said snapshot and dataset by now.

@aerusso
Contributor

aerusso commented Jul 4, 2021

@InsanePrawn I can't seem to find commit 4d5b4a33d in any repository I know of (and apparently neither can GitHub). However, in your report you say this was a "recent git master" and the commit I'm currently betting on being guilty is da92d5c, which was committed in November of the previous year, so I can't use your data point to rule out my theory!

Also, it sounds like you didn't have any good way to reproduce the error --- however, you were using a test pool. Compared to my reproduction strategy (which is just, turn my computer on and browse the web, check mail, etc.) it might be easier to narrow in on a test case (or might have been easier a year and a half ago, when this was all fresh). Anyway, if you have any scripts or ideas of what you were doing that caused this besides "snapshots being created and deleted every couple minutes", it might be useful too. (I already tried lots of snapshot creations and deletions during fio on several datasets in a VM).

@InsanePrawn
Contributor

InsanePrawn commented Jul 4, 2021

Yeah, idk why I didn't go look for the commit in my issue - luckily for us, that server (and pool; it does say yolo, but it's my private server's root pool - it's just that I won't cry much if it breaks; originally built for the then-unreleased crypto) and the git repo on it still exist. Looks like 4d5b4a33d was two systemd-generator commits by me after 610eec4.

@InsanePrawn
Contributor

InsanePrawn commented Jul 4, 2021

FWIW the dataset the issue appeared on was an empty filesystem dataset (maybe a single small file inside) that had snapshots (without actual fs activity) taken at quick intervals (somewhere between 30s and 5m) in parallel with a few (5-15) other similarly empty datasets.
Edit: These were being snapshotted and replicated by zrepl, probably in a similar manner to what znapzend does.

The pool is a raidz2 on 3.5" spinning SATA disks.
I'm afraid I have nothing more to add in terms of reproduction :/

Edit: Turns out the dataset also still exists, the defective snapshot however does not anymore. I doubt that's helpful?

@aerusso
Contributor

aerusso commented Jul 5, 2021

@InsanePrawn Does running the zrepl workload reproduce the bug on 2.0.5 (or another recent release?)

I don't think the snapshot is terribly important --- unless you're able to really dig into it with zdb (which I have not developed sufficient expertise to do). Rather, I think it's the workload, hardware setup, and (possibly, but I don't understand the mechanism at all) the dataset itself. Encryption also is a common theme, but that might just affect the presentation (i.e., there's no MAC to fail in the unencrypted, unauthenticated, case).

Getting at zpool events -v showing the error would probably tell us something (see mine).

@cbreak-black

I've since added redundancy to my pool (it's now a mirror with two devices), and disabled autotrim. The snapshot corruption still happens. Still don't know what is causing it. And I also don't know if the corruption happens when creating the snapshot, and only later gets discovered (when I try to zfs send the snapshots), or if snapshots get corrupted some time in between creation and sending.

@aerusso
Contributor

aerusso commented Aug 14, 2021

@cbreak-black Can you enable the all-debug.sh ZEDlet, and put the temporary directory somewhere permanent (i.e., not the default of /tmp/zed.debug.log)?

This will get the output of zpool events -v as it is generated, and will give times, which you can conceivably triangulate with your other logs. There's other information in those logs that is probably useful, too.

I'll repeat this here: if anyone gets me a reliable reproducer on a new pool, I have no doubt we'll be able to solve this in short order.

@wohali

wohali commented Sep 1, 2021

Just mentioning here that we saw this on TrueNAS 12.0-U5 with OpenZFS 2.0.5 as well -- see #11688 (comment) for our story.

@rincebrain
Contributor

Since I don't see anyone mentioning it here yet, #11679 contains a number of stories about the ARC getting confused when encryption is involved and, in a very similar looking illumos bug linked from there, eating data at least once.

@gamanakis
Contributor

gamanakis commented Sep 30, 2021

@jgoerzen are you using raw send/receive? If yes this is closely related to #12594.

@jgoerzen
Author

@gamanakis Nope, I'm not using raw (-w).

@phreaker0

it's present in v2.1.1 as well:

Okt 09 01:01:14 tux sanoid[2043026]: taking snapshot ssd/container/debian-test@autosnap_2021-10-08_23:01:14_hourly
Okt 09 01:01:16 tux sanoid[2043026]: taking snapshot ssd/container/debian-test@autosnap_2021-10-08_23:01:14_frequently
Okt 09 01:01:16 tux kernel: VERIFY3(0 == remove_reference(hdr, NULL, tag)) failed (0 == 1)
Okt 09 01:01:16 tux kernel: PANIC at arc.c:3836:arc_buf_destroy()
Okt 09 01:01:16 tux kernel: Showing stack for process 435
Okt 09 01:01:16 tux kernel: CPU: 2 PID: 435 Comm: z_rd_int_1 Tainted: P           OE     5.4.0-84-generic #94-Ubuntu
Okt 09 01:01:16 tux kernel: Hardware name: GIGABYTE GB-BNi7HG4-950/MKHM17P-00, BIOS F1 05/24/2016
Okt 09 01:01:16 tux kernel: Call Trace:
Okt 09 01:01:16 tux kernel:  dump_stack+0x6d/0x8b
Okt 09 01:01:16 tux kernel:  spl_dumpstack+0x29/0x2b [spl]
Okt 09 01:01:16 tux kernel:  spl_panic+0xd4/0xfc [spl]
Okt 09 01:01:16 tux kernel:  ? kfree+0x231/0x250
Okt 09 01:01:16 tux kernel:  ? spl_kmem_free+0x33/0x40 [spl]
Okt 09 01:01:16 tux kernel:  ? kfree+0x231/0x250
Okt 09 01:01:16 tux kernel:  ? zei_add_range+0x140/0x140 [zfs]
Okt 09 01:01:16 tux kernel:  ? spl_kmem_free+0x33/0x40 [spl]
Okt 09 01:01:16 tux kernel:  ? zfs_zevent_drain+0xd3/0xe0 [zfs]
Okt 09 01:01:16 tux kernel:  ? zei_add_range+0x140/0x140 [zfs]
Okt 09 01:01:16 tux kernel:  ? zfs_zevent_post+0x234/0x270 [zfs]
Okt 09 01:01:16 tux kernel:  arc_buf_destroy+0xfa/0x100 [zfs]
Okt 09 01:01:16 tux kernel:  arc_read_done+0x251/0x4a0 [zfs]
Okt 09 01:01:16 tux kernel:  zio_done+0x407/0x1050 [zfs]
Okt 09 01:01:16 tux kernel:  zio_execute+0x93/0xf0 [zfs]
Okt 09 01:01:16 tux kernel:  taskq_thread+0x2fb/0x510 [spl]
Okt 09 01:01:16 tux kernel:  ? wake_up_q+0x70/0x70
Okt 09 01:01:16 tux kernel:  ? zio_taskq_member.isra.0.constprop.0+0x60/0x60 [zfs]
Okt 09 01:01:16 tux kernel:  kthread+0x104/0x140
Okt 09 01:01:16 tux kernel:  ? task_done+0xb0/0xb0 [spl]
Okt 09 01:01:16 tux kernel:  ? kthread_park+0x90/0x90
Okt 09 01:01:16 tux kernel:  ret_from_fork+0x1f/0x40

@phreaker0

@aerusso you wrote that da92d5c may be the cause of this issue. My workstation at work panics after a couple of days and I need to reset it. Could you provide a branch of 2.1.1 with this commit reverted (as the revert causes merge conflicts I can't fix myself) so I could test whether the machine no longer crashes?

@aerusso
Contributor

aerusso commented Oct 14, 2021

@phreaker0 Unfortunately, the bug that da92d5c introduced (#10737) was fixed by #12299, which I believe is present in all maintained branches now. It does not fix #11688, (which I suspect is the same as this bug).

I'm currently running 0.8.6 on Linux 5.4.y, and am hoping to wait out this bug (I don't have a lot of time right now, or for the foreseeable future). But, if you have a reliable reproducer (or a whole lot of time) you could bisect while running 5.4 (or some other pre-5.10 kernel). I can help anyone who wants to do that. If we can find the guilty commit, I have no doubt this can be resolved.

@Germano0

@cbreak-black Can you enable the all-debug.sh ZEDlet, and put the temporary directory somewhere permanent (i.e., not the default of /tmp/zed.debug.log)?

@aerusso I have done

git clone https://github.com/openzfs/zfs.git
cd zfs/cmd/zed/zed.d
# sh all-debug.sh ZEDlet
all-debug.sh: line 10: /zed-functions.sh: File or directory does not exist

What am I missing?

@aerusso
Contributor

aerusso commented Dec 12, 2024

@Germano0

First of all: it is a complete coincidence that I happened to read this message (so if I don't respond to something like this in the future, it's almost certainly just because I didn't see it at all).

To get the zedlet working, you need that script in /etc/zfs/zed.d, assuming your prefixes are set up the same way as they are on Debian:

cp all-debug.sh /etc/zfs/zed.d

There should be a bunch of scripts in there already (as symlinks). Even better would be to add an appropriate symlink for all-debug.sh.

Afterwards, you'll need to restart the ZED. Something like systemctl restart zfs-zed.service. You may also need to install and/or enable that service. For instance, on Debian, you'd need to run apt install zfs-zed.
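
A minimal sketch of that setup, assuming a git checkout of openzfs/zfs and Debian-style paths (treat the source path and the log-location variable as assumptions; the packaged zedlet location varies by distribution):

# from the root of the zfs checkout (or from wherever your package ships zedlets)
cp cmd/zed/zed.d/all-debug.sh /etc/zfs/zed.d/
# the debug log defaults to /tmp/zed.debug.log; check zed.rc / the zedlet itself for
# the setting that controls it and point it somewhere permanent
systemctl restart zfs-zed.service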

As a side note, I'd like to add that getting a reliable reproducer using a pair of VMs would be incredible for this bug. I am not aware of any reproducer that doesn't rely on two pieces of hardware connected by a physical network using someone's production system. That said, performing the bisection after making this reproducer will be another herculean task, since you'll need to check everything between zfs-0.8.6 and zfs-2.0.0. This will have to be done on Linux 5.1 or earlier, since the common merge-base is 78fac8d from 2019, and META back there indicates that as the maximum supported kernel. I suspect that commit is good (doesn't cause the bug), but you would have to test it.

The estimate git gave me is 9 steps, plus that commit. I can provide high-level assistance doing that git bisect if someone has a reliable reproducer, but I would not be surprised if that bisection takes a month of run time, given how long it took for me to personally experience the errors. Furthermore, you'd have to be aware that, during that bisection, you'd be running through other, possibly very serious, buggy versions.

@Germano0

Thank you Antonio, I understand your concerns about the reproducibility of the bug.
In any case I would like to share the test environment I am using.
I created 2 Alma Linux 9 VMs with 4 additional disks each.

VM N.1

# dnf config-manager --set-enabled crb
# dnf install epel-release -y
# dnf install https://zfsonlinux.org/epel/zfs-release-2-3$(rpm --eval "%{dist}").noarch.rpm -y
# dnf install -y kernel-devel
# dnf install -y zfs
unpacked sanoid 2.2.0
# dnf install -y perl-Config-IniFiles perl-Data-Dumper perl-Capture-Tiny perl-Getopt-Long lzop mbuffer mhash pv
# reboot
# dkms status
zfs/2.1.16, 5.14.0-503.15.1.el9_5.x86_64, x86_64: installed
# modprobe zfs
# zpool create zpool_1 raidz2 /dev/vdb /dev/vdc /dev/vdd /dev/vde
# zfs create -o encryption=aes-256-gcm -o keyformat=passphrase zpool_1/dataset_1
# zfs mount -l zpool_1/dataset_1

followed instructions at https://github.com/jimsalterjrs/sanoid/blob/master/INSTALL.md
# cat /etc/sanoid/sanoid.conf 
[zpool_1/dataset_1]
        frequent_period = 15
        frequently = 0
        hourly = 36
        daily = 30
        monthly = 3
        yearly = 0
        autosnap = yes
        autoprune = yes

then continuously running a script that simulates user activity by:

  • creating files with a random size (up to 100 MB)
  • modifying files
  • deleting files

and trying not to exceed 100 GB of available space

#!/bin/bash

# Configure the path of the directory to operate on
TARGET_DIR="/zpool_1/dataset_1/data"
LOG_FILE="/zpool_1/dataset_1/zfs_simulator.log"
MAX_SIZE=$((100 * 1024 * 1024 * 1024)) # 100 GB in bytes

# Check if the target directory exists
if [[ ! -d "$TARGET_DIR" ]]; then
  echo "The target directory $TARGET_DIR does not exist. Create it before running the script."
  exit 1
fi

# Function to calculate the total size of the directory
calculate_total_size() {
  du -sb "$TARGET_DIR" | awk '{print $1}'
}

# Function to create random files
create_random_file() {
  local current_size=$(calculate_total_size)
  if (( current_size >= MAX_SIZE )); then
    echo "$(date) - Size limit reached: $current_size bytes" >> "$LOG_FILE"
    return
  fi
  
  local file_size=$(( (RANDOM * 32768 + RANDOM) % 104857600 + 1 )) # Random file between 1 byte and 100 MB
  local filename="$TARGET_DIR/file_$(date +%s)_$RANDOM.txt"
  
  # Check if the file can be created without exceeding the limit
  if (( current_size + file_size > MAX_SIZE )); then
    file_size=$((MAX_SIZE - current_size))
  fi

  head -c "$file_size" </dev/zero | base64 > "$filename"
  echo "$(date) - Created file: $filename ($file_size bytes)" >> "$LOG_FILE"
}

# Function to modify a random file
modify_random_file() {
  local file=$(find "$TARGET_DIR" -type f | shuf -n 1)
  if [[ -n "$file" ]]; then
    echo "Random modification" >> "$file"
    echo "$(date) - Modified file: $file" >> "$LOG_FILE"
  fi
}

# Function to delete a random file
delete_random_file() {
  local file=$(find "$TARGET_DIR" -type f | shuf -n 1)
  if [[ -n "$file" ]]; then
    local file_size=$(stat -c%s "$file")
    rm "$file"
    echo "$(date) - Deleted file: $file ($file_size bytes)" >> "$LOG_FILE"
  fi
}

# Main function to execute random operations
simulate_user_activity() {
  local operations=("create" "modify" "delete")
  local op=${operations[$((RANDOM % ${#operations[@]}))]}
  
  case "$op" in
    "create")
      create_random_file
      ;;
    "modify")
      modify_random_file
      ;;
    "delete")
      delete_random_file
      ;;
  esac
}

# Infinite loop to simulate activity
while true; do
  simulate_user_activity
  sleep $((RANDOM % 5 + 1)) # Wait between 1 and 5 seconds
done

Aside from this, I started to manually sync (unencrypted send) the VM N.1 encrypted zfs dataset to the VM N.2 unencrypted zfs dataset. I will soon schedule this with crontab:

# syncoid --no-sync-snap zpool_1/dataset_1 root@vm_2_ip:zpool_1/dataset_1

VM N.2

# dnf config-manager --set-enabled crb
# dnf install epel-release -y
# dnf install https://zfsonlinux.org/epel/zfs-release-2-3$(rpm --eval "%{dist}").noarch.rpm -y
# dnf install -y kernel-devel
# dnf install -y zfs
# reboot
# dkms status
zfs/2.1.16, 5.14.0-503.15.1.el9_5.x86_64, x86_64: installed
# modprobe zfs
# zpool create zpool_1 raidz2 /dev/vdb /dev/vdc /dev/vdd /dev/vde

@ofthesun9
Contributor

After reading the thread there's one detail for me that seems different and that is that I have been using syncoid for replication, and I was using -w in a pull configuration. I hadn't seen any data corruption errors until I also started backing up to another machine in the same way but without using -w (I wanted an on-prem backup that wasn't encrypted just in case something were to go wrong with zfs encryption and I couldn't restore from my remote backup).

TL;DR - one backup with -w was fine for months, then I added the local non-raw-send backup and started seeing this error within a few days.

Yes, I concur with your statement, based on my own experience (currently zfs 2.2.6)
I had to modify all my syncoid periodic tasks to be "raw-send" to get rid of the issue.

@pypeaday

pypeaday commented Dec 18, 2024

I put together a script to potentially mitigate the mixed raw/non-raw send issues, although I see that there are folks here who are having the snapshot corruption bug even without the mixed sends... But this script does the following:

  1. Mounts the most recent snapshot of any dataset in a pool to a temp location
  2. rsyncs the snapshots to a target pool
  3. unmounts the snapshots

🐢 Obviously this will be much slower since it's at the file level, not the block level

⚠️ BACKUP YOUR DATA FIRST ⚠️

I am fairly confident this will not mess up anything on an existing backup and it doesn't do any operations on the source pool

I also welcome any thoughts on this as a possible alternative to avoid mixed raw sends!

‼️ Use at your own risk... I obviously can't take any responsibility for data loss ‼️

https://gist.github.com/pypeaday/57a8a0ec18f7db797540174512b0c4eb

@ckruijntjens

After reading the thread there's one detail for me that seems different and that is that I have been using syncoid for replication, and I was using -w in a pull configuration. I hadn't seen any data corruption errors until I also started backing up to another machine in the same way but without using -w (I wanted an on-prem backup that wasn't encrypted just in case something were to go wrong with zfs encryption and I couldn't restore from my remote backup).
TL;DR - one backup with -w was fine for months, then I added the local non-raw-send backup and started seeing this error within a few days.

Yes, I concur with your statement, based on my own experience (currently zfs 2.2.6) I had to modify all my syncoid periodic tasks to be "raw-send" to get rid of the issue.

Hi, how does the raw send work?

I now have the following:

source = encrypted
target = encrypted (with its own encryption)

Now I can mount zfs filesystems on the target if needed. When I do raw sends, can I still mount filesystems on the target? I believe it's double encrypted this way? Am I correct?

@Maltz42

Maltz42 commented Dec 19, 2024

source = encrypted target = encrypted (with its own encryption)

Now I can mount zfs filesystems on the target if needed. When I do raw sends, can I still mount filesystems on the target? I believe it's double encrypted this way? Am I correct?

It's not double-encrypted. But if you haven't been doing raw sends, you can't just switch, you have to rebuild the target from scratch.

A raw send just sends the blocks as-is, so if they're compressed and/or encrypted on the source, they're sent as-is to the target, where they remain in the same compressed and/or encrypted state. The nice thing about raw sends is that you don't have to unlock the target to do the send, which is good for cloud storage or other environments you don't want your data to ever be accessible to anyone but you. To mount datasets on the target, you would use the same process as mounting on the source, including the same passphrase.

If you don't do a raw send, the block is decrypted and uncompressed before it's sent, then written to the target using the target pool's settings, which may be different - different compression algorithm, different encryption key, etc. So when sending non-raw streams, the target must be unlocked before it can receive.

My preference has always been raw sends, so it's serendipitous that raw sends appear to be a workaround for these problems.
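
To illustrate (pool, dataset, and host names are placeholders; this is a sketch of the general pattern, not a prescription):

# full raw send: blocks stay encrypted/compressed exactly as they are on the source
zfs snapshot tank/secure@base
zfs send -w tank/secure@base | ssh backuphost zfs receive backup/secure

# on the target, to read the replica, load the *source* key and mount as usual
zfs load-key backup/secure
zfs mount backup/secure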

@ckruijntjens

ckruijntjens commented Dec 19, 2024

source = encrypted target = encrypted (with its own encryption)
Now I can mount zfs filesystems on the target if needed. When I do raw sends, can I still mount filesystems on the target? I believe it's double encrypted this way? Am I correct?

It's not double-encrypted. But if you haven't been doing raw sends, you can't just switch, you have to rebuild the target from scratch.

A raw send just sends the blocks as-is, so if they're compressed and/or encrypted on the source, they're sent as-is to the target, where they remain in the same compressed and/or encrypted state. The nice thing about raw sends is that you don't have to unlock the target to do the send, which is good for cloud storage or other environments you don't want your data to ever be accessible to anyone but you. To mount datasets on the target, you would use the same process as mounting on the source, including the same passphrase.

If you don't do a raw send, the block is decrypted and uncompressed before its sent, then written to the target using the target pool's settings, which may be different - different compression algorithm, different encryption key, etc. So when sending non-raw streams, the target must be unlocked before it can receive.

My preference has always been raw sends, so it's serendipitous that raw sends appear to be a workaround for these problems.

Thank you for the info. Perfect, I am going to try raw sends.

Does this way also support incremental? And what if I have a snapshot mounted on the target machine and syncoid starts a raw send? Is it not going to corrupt the data?

@Maltz42

Maltz42 commented Dec 19, 2024

Does this way also support incremental?

Yes. Just do a raw send to build the target initially, then subsequent incremental sends can be raw.

And what if I have a snapshot mounted on the target machine and syncoid starts a raw send? Is it not going to corrupt the data?

In that respect, incremental using raw works just like incremental without it. It will roll the target back to the starting snapshot then update the target to the new incremental snapshot. Any data that was written on the target in between those source snapshots will be lost. But the rollback/update steps are atomic, so from the application layer's perspective, they both happen instantly after the send/receive transfer is complete. No file system corruption will occur. I.e., if it's working for you now, it'll work the same using raw.
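
A sketch of the incremental case (names are placeholders; syncoid effectively drives the same commands for you when raw sends are enabled):

# incremental raw send between two source snapshots; -F on the receive side
# rolls the target back to @snap1 before applying the changes up to @snap2
zfs send -w -i tank/secure@snap1 tank/secure@snap2 | ssh backuphost zfs receive -F backup/secure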

@ckruijntjens

Does this way also support incremental?

Yes. Just do a raw send to build the target initially, then subsequent incremental sends can be raw.

And what if I have a snapshot mounted on the target machine and syncoid starts a raw send? Is it not going to corrupt the data?

In that respect, incremental using raw works just like incremental without it. It will roll the target back to the starting snapshot then update the target to the new incremental snapshot. Any data that was written on the target in between those source snapshots will be lost. But the rollback/update steps are atomic, so from the application layer's perspective, they both happen instantly after the send/receive transfer is complete. No file system corruption will occur. I.e., if it's working for you now, it'll work the same using raw.

hmmm,

I fully deleted the target filesystem. Now when I try to syncoid to the target I get this error:

cannot receive new filesystem stream: incompatible embedded data stream feature with encrypted receive

@ckruijntjens

--compress=zstd-fast --sendoptions="wLecp" --no-sync-snap --no-clone-handling --no-privilege-elevation

When I am using these options it's working:

--compress=zstd-fast --sendoptions="wLecp" --no-sync-snap --no-clone-handling --no-privilege-elevation

@ckruijntjens

--compress=zstd-fast --sendoptions="wLecp" --no-sync-snap --no-clone-handling --no-privilege-elevation

When I am using these options it's working:

--compress=zstd-fast --sendoptions="wLecp" --no-sync-snap --no-clone-handling --no-privilege-elevation

And I see that when using raw send I have to load-key manually on the target for every zfs filesystem (that I want to mount), but this is no problem.

I am curious whether the issue will return now that I am using raw sends.

@Sieboldianus

Sieboldianus commented Dec 22, 2024

I had to modify all my syncoid periodic tasks to be "raw-send" to get rid of the issue.

Just a single observation point here, but I have also been using syncoid in pull mode with --sendoptions=Rw for all my ZFS datasets for about a year now. I have never experienced any problems with zfs-2.1.11-1 (Debian bookworm, receiving backup server) and zfs-2.2.6-pve1 (Proxmox, main server from which snapshots are pulled).

@HankB

HankB commented Dec 22, 2024

Good morning,
As a S/W dev I understand how important it is to be able to reproduce bugs in order to track them down and fix them. To that end I have prepared two scripts to help with this. One creates multiple filesystems within a pool and populates them with compressible and incompressible files. The second one walks through the files and (slightly) modifies some in order to mimic normal filesystem usage. The project is at https://github.com/HankB/provoke_ZFS_corruption.

To date I have reproduced file system corruption at least twice, sending data via syncoid from an encrypted pool to an unencrypted pool. One of my configurations is on a low-spec host (Intel j1900, 8GB RAM) with a single SSD for boot and the remainder in two pools. Testing, including populating the pools and provoking errors, takes a couple of days. The test can run for hours w/out provoking errors, but once the errors occur, they seem to cascade.

There are literally tens of thousands of lines of information, but I have tried to summarize the results of the most recent test at https://github.com/HankB/provoke_ZFS_corruption/blob/main/X86_trial_2.md#2024-12-22-early-am

This test was using Debian 12 (stable) and ZFS 2.2.6 from bookworm-backports. I have preserved the host and can perform any further investigations that anyone who understands this better than me might suggest.

This issue is getting a little long, and if desired, I could start another issue or continue discussion on the zfs-discussion mailing list. My only preference is whatever will move this bug closer to resolution.

Thanks!

@aerusso
Contributor

aerusso commented Dec 22, 2024

This is incredible, @HankB. Very good job! My suggestion is to try to bisect the commit that creates (or exposes!) the buggy behavior. The first step to that is getting the reproducer working with Linux kernel 5.1 (or earlier), and demonstrating that, with that kernel, 2.2.6 continues to reproduce the bug. Debian buster (oldoldstable) may be the easiest way to do that. After that, you'll want to demonstrate that 78fac8d does NOT reproduce the bug. Git bisect will then take about 9 more compile/run cycles to isolate the guilty commit.
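
A rough outline of that bisection, for anyone attempting it (only 78fac8d comes from this thread; the bad endpoint is whatever tag you have personally confirmed reproduces the bug):

git bisect start
git bisect bad zfs-2.0.0     # or the earliest tag you have confirmed bad
git bisect good 78fac8d      # suspected-good merge-base from 2019 -- confirm it first
# at each step: build and install that commit, run the reproducer long enough to trust
# a "good" verdict, then mark the result:
git bisect good              # or: git bisect bad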

Another prong of attack should be to get this reproducer working in a VM so that it can be run as part of the CI pipeline, and catch these kinds of bugs before they hit stable branches.

Just again, this is really incredible! I'll be looking at the repository to understand the details that I suspect are very important.

@HankB

HankB commented Dec 22, 2024

My suggestion is to try to bisect the commit that creates (or exposes!) the buggy behavior.

That seems pretty sensible. I'm also thinking of performing this test on an old (but reliable) host that has proper ECC RAM. The way this problem cascades makes me wonder if some access pattern provokes something like a Rowhammer effect (though a wild pointer or buffer overrun seems more likely.)

First step is to reproduce the issue with the present configuration on the upgraded motherboard. Then I can proceed to bisect.

@Maltz42

Maltz42 commented Dec 22, 2024

That is awesome! As a reminder, the first issues appeared with the 0.8.x to 2.0 transition, so bisecting that might be the way to get at the original root cause, which was never fully identified, afaik.

@no-usernames-left

@jimsalterjrs Your expertise could come in handy here!

@jimsalterjrs

@jimsalterjrs Your expertise could come in handy here!

There isn't much I can offer here; this is probably more up @allanjude 's alley than mine.

The one thing I'll note here is that some folks think this could be triggered by syncoid and sanoid walking the tree of datasets rather than using ZFS recursion. They do that for good reason--native ZFS recursion gets REAL screwy when you add and remove datasets from a tree you've already interacted with using ZFS native recursion!--but you have the option of NOT letting them walk the tree, if you'd like to further bisect.

In sanoid, use recursive=zfs instead of recursive=yes if you want to use ZFS native recursion. In syncoid, instead of using the -r argument, use --sendoptions=R.

Note that you WILL discover why my tools walk the tree manually if you decide to do this, and end up adding or removing datasets after you've begun. I tend to forget the exact details of what triggers which errors and undesired behaviors when... But it doesn't take long to find out the same way I did.

BUT, if you understand those limitations and want to do things that way--either in general, or to bisect this particular problem from a new direction--that's how you'd go about it.
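
For concreteness, a sketch of both switches (dataset and host names are placeholders; other sanoid settings omitted):

# /etc/sanoid/sanoid.conf -- let ZFS handle the recursion instead of sanoid walking the tree
[zpool_1/dataset_1]
        recursive = zfs

# syncoid -- drop -r and pass ZFS-native recursion to zfs send instead
syncoid --sendoptions=R --no-sync-snap zpool_1/dataset_1 root@backuphost:backup/dataset_1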

@HankB

HankB commented Dec 27, 2024

Good evening,
I've just installed Debian Buster and pulled the .tgz for 0.8.6 and am attempting to build it. I'm going to need a bit of help with this. The instructions at https://openzfs.github.io/openzfs-docs/Developer%20Resources/Building%20ZFS.html don't work with this older version. The 0.8.6 README is pretty brief, suggesting

# Installation

Full documentation for installing ZoL on your favorite Linux distribution can
be found at [our site](http://zfsonlinux.org/).

Poking around here I find the link above. Pointers to instructions that work with Buster and 0.8.6 will be very helpful to my progress. Specifically:

root@orcus:~# sudo apt install alien autoconf automake build-essential debhelper-compat dh-autoreconf dh-dkms dh-python dkms fakeroot gawk git libaio-dev libattr1-dev libblkid-dev libcurl4-openssl-dev libelf-dev libffi-dev libpam0g-dev libssl-dev libtirpc-dev libtool libudev-dev linux-headers-generic parallel po-debconf python3 python3-all-dev python3-cffi python3-dev python3-packaging python3-setuptools python3-sphinx uuid-dev zlib1g-dev
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'debhelper' instead of 'debhelper-compat'
Package linux-headers-generic is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Unable to locate package dh-dkms
E: Package 'linux-headers-generic' has no installation candidate
root@orcus:~# 

[whinge]
It would be great if each release included the relevant instructions as these will surely change over time.
[/whinge]

Thanks!

In other news, I'm working on a different platform (X8DTL). Good: Builds ZFS a lot quicker and disk operations are a lot faster. Bad news: It took several days to produce the corruption errors. :-/

@aerusso
Contributor

aerusso commented Dec 27, 2024

The make native-deb target is, I believe, a relatively new addition. Better to just make install after you build both the kernel and userspace. I would not trust an uninstall when you're doing a bisection where the test time is so long (days). In that case, you don't need any of the dh-* dependencies on Debian; they are for building the .deb files (the dh is for debhelper).

Specifically: are you running on bare metal or a VM? If a VM, I'd snapshot the VM image and then do the build/install/test. Rollback afterwards && repeat. In that case, you can just make install and not futz around with dkms at all. If you're on bare metal, I'd honestly script a wipe and reinstall. Sure, it adds an hour, but that's probably short compared to the time to reproduce the corruption, and you can be confident that you're not leaving some artifact around that could cause heartache.
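
For reference, one build/install/test cycle from a tarball or git checkout would look roughly like this (standard autotools flow; the exact configure checks and dependencies vary between 0.8.x and 2.x):

cd zfs                       # unpacked tarball or checkout at the commit under test
sh autogen.sh                # git checkouts only; release tarballs already ship configure
./configure
make -j"$(nproc)"
make install && ldconfig && depmod -a
modprobe zfs
# run the reproducer workload, record the verdict, then wipe/reinstall for the next step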

In other news, I'm working on a different platform (X8DTL). Good: Builds ZFS a lot quicker and disk operations are a lot faster. Bad news: It took several days to produce the corruption errors. :-/

You reproduced this on ZoL 2.2.6 and what kernel version? If it's a much earlier Linux version, it might just be that the bug takes longer to reproduce.

@HankB

HankB commented Dec 27, 2024

Many thanks for the quick reply.

Bare metal. And I agree, it makes sense to reinstall and start with a clean slate for each test. At the moment I'm working through ./configure and working out the missing dependencies.

what kernel version?

That was a fully up to date Debian 12 install (since repaved with Buster) but would have been running

Linux oak 6.1.0-27-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.115-1 (2024-11-01) x86_64 GNU/Linux

It's the same version as the low resource (J1900) host. Might be related to that or might just be luck of the draw. It just means that any negative (e.g. no corruption) tests will need to run for about a week before I consider them conclusive.

@Maltz42

Maltz42 commented Dec 27, 2024

In other news, I'm working on a different platform (X8DTL). Good: Builds ZFS a lot quicker and disk operations are a lot faster. Bad news: It took several days to produce the corruption errors. :-/

So is that a good trade then? Or would you be better off to take longer to build ZFS and be able to reproduce the corruption faster?

@HankB

HankB commented Dec 27, 2024

So is that a good trade then? Or would you be better off to take longer to build ZFS and be able to reproduce the corruption faster?

It would be best to be able to run on both hosts, but the low resource host is presently malfunctioning. The other host is solid.

@grahamperrin
Contributor

#12014 (comment)

… determine if all the issues around encrypted snapshots have been fixed … if there are still bugs, then I'd really like fresh, clear bug reports. …

@robn could you link from issue #13755 to some more recent point in this issue (#12014)?

@ckruijntjens

I now use a different approach with sanoid/syncoid: I use sendoptions to recursively send the snapshots. Now I don't have errors on the encrypted pool. It's now running 6 days without errors; before, I always had errors within 5 days.

I will keep you informed if it keeps running without errors.

@ckruijntjens

Guys,

I can confirm. If I use the following I get encryption errors within 5 days.

syncoid --recursive

But when I use it like this the errors are not happening:

syncoid --sendoptions="R"

I am using it now to send in raw mode, so now the errors are not happening anymore.

@AndrewJDR

Just throwing this out there for those of you doing automated tests (@HankB and @Germano0) who are trying to reduce the amount of time it takes to reproduce the issue:
Since there's now some evidence that the bug is exacerbated by the manual walking of the dataset tree that syncoid performs during transfers, it may make sense to use more complex (deeper and broader) dataset hierarchies in your test cases. And perhaps when generating writes to the datasets, distribute those writes to files strewn throughout that entire dataset hierarchy...

@HankB

HankB commented Jan 2, 2025

it may make sense to use more complex (deeper and broader) dataset hierarchies in your testcases.

Agreed.

The script I wrote eventually goes 5 deep for nested data sets with a total of 35 in my present test.

distribute those writes to files strewn throughout that entire dataset hierarchy

And it writes files throughout the pool (using find -type f and skipping randomly from 0-20 files through the resulting list on each pass). There is a mix of compressible (text) files and incompressible (random) files.

I run the syncoid command and the modify script in their own loops, each pass followed by a 750s sleep. Since the syncoid and modify script take different times to complete, that results in unsynchronized overlap between the two (roughly as sketched below). sanoid runs on a 15-minute schedule and includes frequent snapshots.
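
Roughly, the two loops look like this (script, pool, and flag names are placeholders for the ones actually used in the repo):

# loop 1: keep mutating files in the source (encrypted) pool
while true; do
    ./modify_files.sh /encrypted_pool      # hypothetical name for the modify script
    sleep 750
done &

# loop 2: keep replicating to the unencrypted target pool
while true; do
    syncoid --recursive --no-sync-snap encrypted_pool plain_pool
    sleep 750
done &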

I'm open to other suggestions, but in the interest of reproducible results, I can't make too many changes at this point. I'd suggest that someone interested in additional strategies try them either on available H/W or VMs and this can reduce the overall time to get an answer.

best,

Edit: I'm adding a daily scrub to the testing process (manually invoked.) It occurs to me that the corruption could be happening and just not detected.
