
Missing mechanism to fix permanent errors/delete file over all snapshots #4732

Closed
CySlider opened this issue Jun 5, 2016 · 21 comments

@CySlider

CySlider commented Jun 5, 2016

Situation

  • Desktop PC setup
  • 9 TB RAID-Z1 pool built from 4 × 3 TB disks
  • One failing disk replaced
  • Resilvering turned up a number of permanent errors like:

tank/Shared@zfs-auto-snap_monthly-2016-01-08-1240:/Games/Steam/SteamApps/common/SatelliteReign/SatelliteReignLinux_Data/sharedassets1.assets

  • The corrupted files cannot be accessed at all (it might only be a bit flip in a text file, so why block access entirely?).
  • The new disk gets resilvered on every reboot of the PC (which takes 24 hours to complete), even though, apart from the permanent errors, the pool state shows ONLINE and everything seems fine.

Expected behaviour

Being offered one or more ways to fix the errors, such as:

  • Deleting damaged files from snapshots
  • Accepting a corrupted file as-is and updating its checksum to reflect the corruption, making the file accessible again and the pool consistent.
  • Allowing the original, uncorrupted file to be supplied from another backup to repair it.
  • Brute-forcing a broken sector by flipping single bits until the checksum hopefully matches again (IMHO this could realistically repair up to two random bit flips, in addition to any FEC that might already be in place [is there any?])

Actual behaviour

I might be wrong, but according to my research over the past few days, this state is only fixable by...

  • getting nasty with zdb
  • pulling a "clone & rsync all snapshots" stunt that leaves out the damaged files (a rough sketch of this follows the list)
  • deleting the affected snapshot and every later one that contains the file, plus the file itself if it has not yet been deleted, losing history for all other files as well
  • destroying and rebuilding the whole pool from a backup
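
For illustration, that "clone & rsync" stunt would look roughly like the sketch below. This is only a sketch: the target dataset tank/Shared_restored and the excluded path are made up for the example, and /tank/Shared is assumed to be the dataset's mountpoint.

    # replay every snapshot of tank/Shared into a fresh dataset, minus the damaged path
    zfs create tank/Shared_restored
    for snap in $(zfs list -H -r -t snapshot -o name -s creation tank/Shared | cut -d@ -f2); do
        rsync -a --delete --exclude='Games/Steam/SteamApps/common/SatelliteReign/' \
            /tank/Shared/.zfs/snapshot/"$snap"/ /tank/Shared_restored/
        zfs snapshot tank/Shared_restored@"$snap"   # recreate the snapshot on the copy
    done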

These options all seem ridiculous and totally unfit for an otherwise well-written and well-thought-out file system. Every other file system naturally has ways to handle and correct even uncorrectable errors, in the sense that the file system itself at least becomes consistent again, and, as a second objective, to restore as much of the original data as possible (a bit flip in a text file could be totally unproblematic).

A solution

I would highly recommend at least making it possible to delete a single file from all snapshots without deleting the snapshots themselves.

This would also come in handy in other situations where you simply want to delete a file, such as a virus, or want to free disk space by deleting a whole folder that was never supposed to be snapshotted or is no longer needed.
One could then also write a tool that opens all snapshots at once and presents a "merged file system" containing the files from all snapshots layered on top of each other, for the purpose of cleaning the pool of unneeded files or file trees without losing the history of everything else.
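
Short of such a tool, the hidden .zfs/snapshot directory already gives a read-only view of every snapshot, so at least finding out which snapshots still hold a given file can be scripted today. A rough sketch, using one of the paths from my error list (mountpoint assumed to be /tank/Shared):

    # list every snapshot of tank/Shared that still contains the damaged file
    f='Games/Steam/SteamApps/common/SatelliteReign/SatelliteReignLinux_Data/sharedassets1.assets'
    for snap in /tank/Shared/.zfs/snapshot/*/; do
        [ -e "$snap$f" ] && echo "present in: $(basename "$snap")"
    done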

@DeHackEd
Contributor

DeHackEd commented Jun 5, 2016

Snapshots are STRICTLY read-only. That is the entire point of them. The entire architecture of ZFS is designed around this assumption and very special code would need to be written to handle that. This might even enter Block Pointer Rewrite territory.

  • Allowing the original, uncorrupted file to be supplied from another backup to repair it.
  • Brute-forcing a broken sector by flipping single bits until [...]

Pool redundancy (RAID-Z and mirrors) is supposed to provide the restoration point. One issue I have with these ideas is that the default checksum (fletcher4) isn't cryptographically secure, making this a very risky process.

As for the filesystem being consistent, ZFS writes all metadata twice (at least) by default so automatic recovery is even more likely. The filesystem is consistent, your data is not.
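
For what it's worth, the knobs involved here (checksums and extra copies) are ordinary dataset properties, and changing them only affects blocks written afterwards, so this is prevention rather than repair (the dataset name is just an example):

    zfs get checksum,copies,redundant_metadata tank/Shared
    zfs set checksum=sha256 tank/Shared   # stronger checksum for newly written blocks
    zfs set copies=2 tank/Shared          # extra data copies on top of RAID-Z parity (costs space)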

@CySlider
Author

CySlider commented Jun 5, 2016

Yes, but at least allowing the user to manually "accept" the corruption as the new status quo, updating the checksums to reflect it, should be possible.

And besides that, I still think making files deletable from snapshots is a worthwhile feature to implement.

@CySlider
Author

CySlider commented Jun 5, 2016

What the heck is happening now?
Three disks resilvering at the same time? How is this even possible? I get more and more confused by ZFS behaviour, and confused = scared for my data.

zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jun  5 03:40:40 2016
    8,52T scanned out of 8,90T at 165M/s, 0h40m to go
    2,13T resilvered, 95,75% done
config:

NAME            STATE     READ WRITE CKSUM
tank            ONLINE       0     0    42
  raidz1-0      ONLINE       0     0    84
    sdb_crypt   ONLINE       0     0     0  (resilvering)
    sdz_crypt2  ONLINE       0     0     1  (resilvering)
    sdz_crypt3  ONLINE       0     0     0
    sdz_crypt4  ONLINE       0     0     2  (resilvering)
spares
  sdz_crypt5    AVAIL   

errors: Permanent errors have been detected in the following files:

    tank/Shared@zfs-auto-snap_daily-2016-05-20-0615:/Spiele/Steam/SteamApps/common/Sword Coast Legends/SwordCoast_Data/resources.assets.resS
    tank/Shared@zfs-auto-snap_monthly-2016-01-08-1240:/Spiele/Steam/SteamApps/common/SatelliteReign/SatelliteReignLinux_Data/sharedassets1.assets
    ... 30 more

@richardelling
Contributor

On Jun 5, 2016, at 9:39 AM, DeHackEd notifications@github.com wrote:

Snapshots are STRICTLY read-only. That is the entire point of them. The entire architecture of ZFS is designed around this assumption and very special code would need to be written to handle that. This might even enter Block Pointer Rewrite territory.

Agreed. A report of an error isn’t a fault.

Allowing the original, uncorrupted file to be supplied from another backup to repair it.
Brute-forcing a broken sector by flipping single bits until [...]
Pool redundancy (RAID-Z and mirrors) is supposed to provide the restoration point. One issue I have with these ideas is that the default checksum (fletcher4) isn't cryptographically secure, making this a very risky process.

The checksum really doesn’t matter because you’ll hit the birthday problem with any of
the checksums. In other words, trying to recreate based on bit flips aiming to match a
checksum of 256 bits against a nominal sized block of 1M bits is futile.
— richard


@richardelling
Contributor

On Jun 5, 2016, at 9:55 AM, Torge Kummerow notifications@github.com wrote:

What the heck is happening now?

From the data presented, it appears as though the “sd?_crypt?” devices are corrupting data.
As the scrub identifies bad data, it tries to repair it. Since you’re seeing mostly checksum errors,
it is likely the “devices” themselves are corrupting data.
— richard


@CySlider
Author

CySlider commented Jun 5, 2016

Snapshots are STRICTLY read-only. That is the entire point of them.

Well, this depends on the use case. I doubt that the "entire point" is being read-only; the most important property is providing access to data from an earlier point in time. Making snapshots read-only makes sense to protect them, but I don't see a reason to make it a religion if the user wants different behaviour.
I think the use cases I provided are valid and reasonable, at least for desktop deployments.

As for the filesystem being consistent, ZFS writes all metadata twice (at least) by default so automatic recovery is even more likely. The filesystem is consistent, your data is not.

So ZFS does not consider the actual data it stores to be part of its file system? That is news to me. Is ZFS's mindset basically: "I don't care about your data, my metadata is consistent, live with it"? I highly doubt that.
From my point of view, my ZFS file system is currently in a corrupted state and offers no way to recover from it. Also, if it were no longer in a corrupted state, why does it try to resilver on every startup?

The checksum really doesn’t matter because you’ll hit the birthday problem with any of the checksums. In other words, trying to recreate based on bit flips aiming to match a checksum of 256 bits against a nominal sized block of 1M bits is futile.

With SHA-256 this is an extremely unlikely event. With CRC or even MD5 you might have been right, but with SHA-256 the chance that trying a few million combinations produces an accidental match is nothing to sweat about; otherwise cryptography would have a big problem by now.
That aside, I don't propose this as an "automated" mechanism, just a manual "hey, you could try this to save your data" lifeline. But let's keep this out of the discussion, as it was just an additional idea.

The absolute minimum I would expect, and still do, is a way to "accept" the corrupted data and modify the checksums accordingly, making the corrupted data accessible again for investigation.

Is that not a reasonable request?

Or is there anything I'm missing that I can do now, with an amount of effort appropriate for a few unimportant files being corrupt on a 9 TB pool?
If it would just keep showing me these errors forever and stop the resilvering, I would even be fine with that; a bit disturbed, but fine.
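
As far as I can tell, the closest thing that exists today is to restore or remove each damaged file where it is still writable, drop the snapshots that reference the bad blocks, and then let a scrub rebuild the error list. A sketch of that sequence, as I understand it:

    # after the damaged files / the snapshots holding them are gone:
    zpool clear tank       # reset the per-device error counters
    zpool scrub tank       # re-verify everything (takes ~24 hours here)
    zpool status -v tank   # the permanent-error entries should eventually drop off
                           # once nothing references the bad blocks any more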

@CySlider
Author

CySlider commented Jun 5, 2016

On Jun 5, 2016, at 9:55 AM, Torge Kummerow wrote:

What the heck is happening now?

From the data presented, it appears as though the “sd?_crypt?” devices are corrupting data.
As the scrub identifies bad data, it tries to repair it. Since you’re seeing mostly checksum errors,
it is likely the “devices” themselves are corrupting data.

Well, I doubt that. I admit I did not scrub regularly, so when the disk failed, I guess those 40 files that are now corrupted are fully on me, since the resilver never had a chance to recreate them from the remaining 3 disks. Other than that, this tank has run fine for nearly 3 years.
But even if I had scrubbed and one disk failed, a single case of bit rot or a sector failure during the resilver would lose the data and produce a permanent error, so I assume this is not such an uncommon event. *

Another strange thing I see now is that it is still resilvering. The stats it shows are totally wrong.

3 Hours ago:
8,52T scanned out of 8,90T at 165M/s, 0h40m to go
2,13T resilvered, 95,75% done

now:
8,63T scanned out of 8,90T at 140M/s, 0h33m to go
2,16T resilvered, 97,00% done

And no matter when I run zpool status -v tank, it shows resilvering at over 100M/s.
Maybe because it is resilvering 3 devices simultaneously now?

Just to be clear: when the resilvering is done, the status shows fine on all 4 devices until I reboot.
I think the next thing I will do is delete all the snapshots (6 months' worth) that contain these files to eliminate the corruption, and scrub again afterwards. THIS SUCKS, but I see no other viable option. So I have to delete 6 months of snapshots because "snapshots are supposed to be read-only" (a sketch of what that would look like is below the footnote).


* And I can absolutely live with an occasional file corruption, as long as I at least know which file it is. My crucial core data is all backed up separately; the rest is not so important, but I cannot afford a full backup. I fear scrubbing (24 hours) too often will drastically shorten the life of my consumer drives, and my money is not so limitless that I can replace a drive every month. I never imagined ending up in such a mess because of one file corruption, though.
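
If I do end up deleting those snapshots, the range syntax of zfs destroy at least makes it a single command. A sketch, using two snapshot names from my error list as example endpoints (oldest%newest, dry run first with -n):

    zfs destroy -nv 'tank/Shared@zfs-auto-snap_monthly-2016-01-08-1240%zfs-auto-snap_daily-2016-05-20-0615'
    zfs destroy -v  'tank/Shared@zfs-auto-snap_monthly-2016-01-08-1240%zfs-auto-snap_daily-2016-05-20-0615'
    zpool scrub tank   # re-verify afterwards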

@CySlider
Author

CySlider commented Jun 5, 2016

Another strange thing I see now is that it is still resilvering. The stats it shows are totally wrong.

3 Hours ago:
8,52T scanned out of 8,90T at 165M/s, 0h40m to go
2,13T resilvered, 95,75% done

now:
8,63T scanned out of 8,90T at 140M/s, 0h33m to go
2,16T resilvered, 97,00% done

And no matter when I run zpool status -v tank, it shows resilvering at over 100M/s.
Maybe because it is resilvering 3 devices simultaneously now?

OK, solved this at least. There was another backup running, rsyncing data from the tank and slowing down the resilver considerably. That does not really explain why the status showed the wrong speed and time estimate, though; maybe it means "if I am allowed to resilver at full priority, this is the speed I can go".

@rincebrain
Contributor

As has been said, some parts of how ZFS is designed at its core make it impractical (to put it mildly) for the contents of a snapshot to be made read/write.

It might be possible to manually rewrite the checksums to be "correct", but since one of ZFS's goals is not to allow silent data corruption, I would be mildly surprised if code to permit that landed anywhere except maybe in zhack.

If you're on a RAID-Z1, then losing an entire disk does mean you have no redundancy left in that RAID-Z1 until it finishes resilvering, and any errors found while resilvering are going to be uncorrectable. (This is why people often choose either higher RAID-Z levels for important data, or keep backups, or ideally both.)

If you've not done any sort of periodic scrubs to detect errors of this nature, then the only time this will show up is on failure - leading to situations like this.

It should let you read the contents of those files that don't fail reconstruction, but in an N-disk wide RAID-ZX, for any logical stretch of {block1 block2 ... block(N-X)}, it's going to be written out as {block1 block2 ... block(N-X) parity1 ... parityX}, and if you try to compute the (N-X) blocks from some combination of exactly enough blocks and parity data, and the checksums in the result don't match, you have no idea which disks/blocks are at fault, and so ZFS won't let you read any of those blocks. (Good luck trying to bruteforce [blocksize] * [stripesize] presuming a single bit error, let alone double-bit.)
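
To put rough numbers on that, assuming the default 128K recordsize: 128 KiB is 1,048,576 bits, so even a single-bit error means about a million candidate reconstructions to checksum per damaged block, and two flipped bits already means roughly 1,048,576 × 1,048,575 / 2 ≈ 5.5 × 10^11 combinations, before you even know which disk in the stripe is the one at fault.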

What you might have found useful (or still find useful) would have been using something like ddrescue to extract the blocks you could read from the failing disk onto a new disk, put the new disk in place of the failed disk, and see what you can recover. (Ideally, you'd have backup block-for-block copies of all the disks involved so you don't accidentally scribble over any data you want to recover.)
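
A sketch of that, with placeholder device names; the map file is what lets ddrescue resume and later retry only the unreadable areas:

    # first pass: grab everything easily readable, skip the slow scraping phase
    ddrescue -f -n /dev/FAILING_DISK /dev/NEW_DISK /root/failing.map
    # second pass: retry the remaining bad areas a few times
    ddrescue -f -r3 /dev/FAILING_DISK /dev/NEW_DISK /root/failing.map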

In general, you should scrub periodically, even if you don't have any disk failures, to avoid minor failure cases turning into catastrophic ones, particularly in RAID-Zx scenarios.
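
For example, a monthly scrub from cron is enough for many setups (the schedule and file name are just an example):

    # /etc/cron.d/zpool-scrub
    0 2 1 * * root /sbin/zpool scrub tank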

(You might also want to look into whether your system has some other reason for checksum errors on disks at times; having run a bunch of different ZFS configs, I don't really expect to see any CKSUM errors outside of some failing piece of hardware.)

@CySlider
Author

CySlider commented Jun 6, 2016

Thank you. I understand most of what you said and already knew much of it.

But recalculating the checksums to match the corruption and accepting the data loss should always be an option in any file system; see the "lost+found" approach. I understand that ZFS is very proud of not letting this happen in the first place, but as you pointed out yourself, on RAID-Z1 this can still happen during reconstruction even when doing everything else right. So please implement a way to handle this, or I will have to take a look at btrfs in the hope that it handles this better. I think I read that snapshots are writable in btrfs.

Also, can someone please confirm whether a permanent file corruption in the pool will, by design, lead to a resilver on each reboot? Or is there something else off in my tank?

PS:
The resilver has now finished and everything is fine (apart from the corrupted files, which stay the same). But as pointed out, it still resilvers after a reboot:

pool: tank
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
entire pool from backup.
  see: http://zfsonlinux.org/msg/ZFS-8000-8A
 scan: resilvered 2,22T in 22h51m with 49 errors on Mon Jun  6 02:32:24 2016
 config:

NAME            STATE     READ WRITE CKSUM
tank            ONLINE       0     0   128
  raidz1-0      ONLINE       0     0   256
    sdb_crypt   ONLINE       0     0     0
    sdz_crypt2  ONLINE       0     0     1
    sdz_crypt3  ONLINE       0     0     0
    sdz_crypt4  ONLINE       0     0     2
spares
  sdz_crypt5    AVAIL   

errors: 35 data errors, use '-v' for a list
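
In case it helps with diagnosing the repeated resilver, this is what I plan to check next to see what keeps triggering it (commands as I understand them, corrections welcome):

    zpool history -i tank | grep -iE 'scrub|scan|resilver' | tail -n 20   # when scans were started
    zpool events -v | less                                                # device faults around boot time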

@midjji

midjji commented Nov 7, 2019

Is there a way to delete a single corrupted file, assuming there are no snapshots?

@FlorianHeigl

It would seem that the request to continue on the mailing list was simply a diversion to turn the question into a "support request" (which it wasn't).

It's not easy to see this as having been done in good faith.

@behlendorf
Contributor

behlendorf commented Nov 12, 2019

Is there a way to delete a single corrupted file, assuming there are no snapshots?

@midjji normally you can use the unlink command for this, assuming there are no snapshots.

@FlorianHeigl there has been recent work proposed in this area. I'd encourage you to take a look at PR #9372 and see if the functionality proposed there would meet your needs.
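
For example, assuming the dataset is mounted at /tank/Shared and the file is only damaged in the live filesystem (not held by any snapshot):

    unlink '/tank/Shared/Games/Steam/SteamApps/common/SatelliteReign/SatelliteReignLinux_Data/sharedassets1.assets'
    zpool scrub tank   # the error entry should drop off once nothing references the bad blocks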

@FlorianHeigl

See, that's a really nice answer that will help everyone here, now and in the future.
The recent improvements for dealing with inconsistent data are very welcome to many people.

@midjji

midjji commented Nov 12, 2019 via email

@midjji

midjji commented Nov 12, 2019 via email

@midjji

midjji commented Nov 13, 2019 via email

@openzfs locked as resolved and limited conversation to collaborators Nov 13, 2019