Rewriting Records Via Scrub #15335

Open
Haravikk opened this issue Sep 30, 2023 · 12 comments
Labels
Type: Feature Feature request or new feature

Comments

@Haravikk

Describe the feature you would like to see added to OpenZFS

I would like to see the ability to fully "rewrite" records added to ZFS as an alternative to the elusive "block pointer rewrite" feature that so often blocks other features from being implemented.

The idea is simple: given one or more records, ZFS will write out the same data as "new" records before atomically retiring the old ones, no block pointer rewrite necessary. This would behave exactly as if the user had copied a file and renamed it into place, but at record granularity and with guaranteed atomicity.

To allow it to be put to use immediately, this feature could be accompanied by a "rewrite records" option on zpool scrub which, when enabled, tells scrub to rewrite any record it encounters that does not match certain properties of its dataset (e.g. a different compression algorithm). The option would require a free-space threshold (in bytes or as a percentage); if rewriting a record would take the dataset below this amount of free space, scrub will skip the rewrite but continue scrubbing as normal. This allows datasets to be progressively rewritten via multiple scrubs over time until the rewriting is complete, without risking consumption of all remaining free space. The number of records rewritten or skipped would be reported by zpool status.
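
To make the shape of the proposal concrete, here is a purely hypothetical invocation; neither the rewrite flag nor the free-space threshold exists in OpenZFS today, and the names are made up for illustration only:

```sh
# Hypothetical syntax only: the -R/--rewrite-min-free option is part of
# this proposal, not of any released OpenZFS.
zpool scrub -R --rewrite-min-free=100G tank

# zpool status (a real command) would then additionally report the
# number of records rewritten and the number skipped because the
# free-space floor was reached.
zpool status tank
```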

How will this feature improve OpenZFS?

This will enable records to be rewritten to take advantage of new properties, special devices etc. without the need to perform a full zfs send/receive cycle to recreate the entire dataset (which also requires downtime for the switchover). If we ever do gain a proper block pointer rewrite capability that can account for snapshots etc., it can then simply be used to optimise this feature, applying the benefit to anything that was implemented using this "full rewrite" method.

Implementing this feature could allow issues such as #9762 to be closed, since the rewriting scrub should meet their requirements, while issues such as #15226 could be addressed by adding support for their cases to the rewriting scrub (i.e. check whether a record should be on the special device and, if so, rewrite it so that it is).

Haravikk added the Type: Feature label on Sep 30, 2023
@GregorKopka
Contributor

This feature would permanently inflate the allocated space requirements of all datasets that have snapshots.
Or require all snapshots prior to the operation to be dropped, which is impossible if there are clones.
Or would require block pointer rewrite.

The only benefit over employing send/recv (which could keep the snapshot chain and, with careful ordering, could also preserve the clone hierarchy) would be that it happens online, compared to the minimal offline time needed for a final incremental: [stop services]; send|recv; destroy; rename; [start services].
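
For reference, a minimal sketch of that switchover sequence, assuming an earlier full send already seeded the replica dataset and that the @prev snapshot exists on both sides (dataset, snapshot, and service names are placeholders):

```sh
# [stop services]
systemctl stop myservice

# final incremental send into the already-seeded replica
zfs snapshot tank/data@final
zfs send -i tank/data@prev tank/data@final | zfs receive tank/data_new

# swap the datasets
zfs destroy -r tank/data
zfs rename tank/data_new tank/data

# [start services]
systemctl start myservice
```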

@mschilli87
Contributor

@GregorKopka: Maybe I am missing something, but isn't there the added advantage of using whatever space is available to migrate the blocks little by little, compared to send/recv requiring enough free space to (temporarily) double the entire dataset?

@Haravikk
Author

Haravikk commented Oct 2, 2023

@GregorKopka: Maybe I am missing something, but isn't there the added advantage of using whatever space is available to migrate the blocks little by little, compared to send/recv requiring enough free space to (temporarily) double the entire dataset?

This is exactly what I meant; I'm not sure why @GregorKopka ignored that part.

I specifically mentioned that the rewrite option for scrub requires a free-space limit, below which it will stop rewriting records for the dataset/pool. This allows the rewriting to occur incrementally over time as more free space becomes available, i.e. as older snapshots are destroyed.

And as you say, it allows this to be done without requiring at least double the space to be available, either on the same or another pool, for handling the send/receive. We don't all have a handy second, unused storage array lying around, and a full send/receive is overkill when only a fraction of records may actually require rewriting.

@GregorKopka
Contributor

You two seem to blissfully ignore that snapshots, clones, bookmarks and checkpoints exist and are used by people using ZFS.

How should they be treated, in regard to this feature?

@Haravikk
Author

Haravikk commented Oct 5, 2023

You two seem to blissfully ignore that snapshots, clones, bookmarks and checkpoints exist and are used by people using ZFS.

How about, instead of insulting us, you try reading what we said? The free-space limit is intended specifically to address this; it means that if the "new" records are consuming too much space (because snapshots won't release the "old" records) then the rewriting will stop until a later scrub.

So long as snapshots are being destroyed when they're no longer needed, this allows multiple rounds of scrubbing (which is something you should be doing periodically anyway) to eventually rewrite everything that needs rewriting. If a pool never discards snapshots the rewriting will never be able to complete, but it's up to the documentation to make that clear.

For example, if your dataset/pool has 150 GB of free space and you set the limit to 100 GB, then you can do up to 50 GB of rewrites. Following that, if you free up 25 GB of space as a result of destroying old or intermediate snapshots, then the next scrub can complete 25 GB of rewrites, and so on until eventually it is finished (no more rewrites left to find).
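
The skip/rewrite decision implied by those numbers could be expressed roughly like this (pure illustration; the function name and the way the threshold would be configured are assumptions, not existing OpenZFS behaviour):

```sh
#!/bin/sh
# Illustration only: decide whether a single record may be rewritten
# while keeping free space above the configured floor.
GiB=$((1024 * 1024 * 1024))

maybe_rewrite() {
    record_bytes=$1 free_bytes=$2 min_free_bytes=$3
    if [ $(( free_bytes - record_bytes )) -ge "$min_free_bytes" ]; then
        echo rewrite   # old copy stays referenced by snapshots for now
    else
        echo skip      # counted, reported, and retried on a later scrub
    fi
}

# With 150G free and a 100G floor there is roughly 50G of headroom:
maybe_rewrite $((128 * 1024)) $((150 * GiB)) $((100 * GiB))   # -> rewrite
maybe_rewrite $((128 * 1024)) $((100 * GiB)) $((100 * GiB))   # -> skip
```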

If block pointer rewriting is ever implemented, this caveat can then be removed as it will no longer be a problem.

@GregorKopka
Contributor

GregorKopka commented Oct 6, 2023

Let's look at the simplest possible backup scheme with send|recv (a rough shell sketch follows the list):

  1. create snapshot on source
  2. send incremental (or, if there is no prior snapshot like on the first iteration, everything) to destination
  3. destroy all snapshots on the source except the last one successfully sent
  4. goto 1
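
A minimal shell sketch of that loop (pool, dataset, and host names are placeholders; a real script would add error handling):

```sh
#!/bin/sh
# Simplest possible send|recv backup loop, as described above.
SRC=tank/data
DST=backup/data
PREV=""

while true; do
    SNAP="$SRC@backup-$(date +%Y%m%d-%H%M%S)"
    zfs snapshot "$SNAP"

    if [ -z "$PREV" ]; then
        # first iteration: send everything
        zfs send "$SNAP" | ssh backuphost zfs receive "$DST"
    else
        # later iterations: send only the increment
        zfs send -i "$PREV" "$SNAP" | ssh backuphost zfs receive "$DST"
    fi

    # destroy all source snapshots except the last one successfully sent
    [ -n "$PREV" ] && zfs destroy "$PREV"
    PREV="$SNAP"

    sleep 3600   # "goto 1"
done
```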

Logical conclusion #1: All existing data in the dataset is always held by a snapshot.
Logical conclusion #2: None of the existing data will ever be eligible to be affected by this feature.
Logical conclusion #3: This feature is an expensive NOP, for all real-world setups.

@mschilli87
Contributor

I disagree with your third conclusion. In all of my setups, there is an additional step that expires snapshots on the backup/receiver side, which means that data deleted on the source side eventually frees its previously used space as well. As far as I can tell, the proposed feature would allow most if not all data (depending on the use case / data turnover) to eventually be rewritten without the need for block pointer rewrites.

@GregorKopka
Contributor

As long as an incremental backup routine is active, there will always be a snapshot on both the source and destination.

New data written on the source will be distributed to the special vdevs anyway: feature is a NOP.
Newly received data on the destination will be distributed to the special vdevs anyway: feature is a NOP.
Existing on-disk data for which a snapshot exists (=always): feature is a NOP.

@Haravikk
Author

Haravikk commented Oct 6, 2023

New data written on the source will be distributed to the special vdevs anyway: feature is a NOP.

The problem isn't newly written data. So this "argument" is a NOP.

Newly received data on the destination will be distributed to the special vdevs anyway: feature is a NOP.

The problem isn't data being received (which is essentially also newly written data) but data that is already present. So another "argument" that is a NOP.

Existing on-disk data for which a snapshot exists (=always): feature is a NOP.

You're assuming that all snapshots are retained indefinitely? So another "argument" that is a NOP!

Even on a backup target (which isn't really what this feature is for) you don't want to retain every snapshot, because you'd need to continuously upgrade storage even if the amount of current data isn't growing (additions are matched by deletions), and if it is growing you'd increase the rate at which capacity needs to be upgraded even further.

But on the sending side of most setups you're not going to need such a high level of snapshot retention, and you're certainly not going to want to keep snapshots indefinitely unless you can throw a continuous stream of new hardware at it. Most setups in reality keep local snapshots for a shorter time for potential emergency rollbacks, and more snapshots on a backup for recovery in the event the main pool fails entirely. If you don't want local snapshots you also have the option of using bookmarks (or of using them specifically for backups), as they don't tie up old records; that's the whole point of them: they enable you to discard the snapshot they were created from and still send incrementally to another target.
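
For anyone unfamiliar with that workflow, a hedged example of the bookmark approach (dataset, snapshot, and host names are placeholders):

```sh
# Initial full send, then convert the snapshot into a bookmark so the
# local snapshot can be destroyed without breaking the incremental chain.
zfs snapshot tank/data@monday
zfs send tank/data@monday | ssh backuphost zfs receive backup/data
zfs bookmark tank/data@monday tank/data#monday
zfs destroy tank/data@monday        # frees the old blocks locally

# Later: incremental send from the bookmark to a new snapshot.
zfs snapshot tank/data@tuesday
zfs send -i tank/data#monday tank/data@tuesday | ssh backuphost zfs receive backup/data
```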

Sure, there will be a point at which both "old" and "new" versions of records exist thanks to snapshots/clones/whatever, but again that's exactly what the free-space limit is intended to address. Once a record has been rewritten, newer snapshots will no longer reference the old version at all, so as older snapshots are destroyed the old version of the record eventually disappears entirely, just like any other data that is no longer referenced, such as files that were deleted.

But we've now been over this at least three times, I'm sick of explaining it, and I'm starting to think the problem here is ideological rather than practical.

@GregorKopka
Contributor

There are quite a few professions where the ability to prove that certain data existed at a certain point in time is important, as is the ability to restore data from points back in time (30 years of professional IT have taught me that users take between days and even years to notice that they accidentally deleted something that they need now).

Having a snapshot regime that thins out frequent snapshots to a reasonable daily/weekly/monthly/quarterly/yearly retention is by no means unusual; tools like https://github.com/zfsonlinux/zfs-auto-snapshot or https://github.com/jimsalterjrs/sanoid (which aim at doing exactly that) are quite popular for a reason.

Please don't ass-u-me that your way of using technology is the norm for everyone, or even close to a relevant fraction of users.

@Haravikk
Author

Haravikk commented Oct 7, 2023

This feature doesn't prevent anyone from keeping snapshots indefinitely if they want to or need to, but you're acting like this use case is a blocker to the feature when it's not at all. This is a bad faith straw-man argument and a distraction that has gone on more than long enough.

Keeping snapshots indefinitely means you need to be aware of the extra storage cost of doing so, and any documentation around this feature (which it's going to require, since it's a new option) merely needs to make clear that the data is copied (so it will appear twice for as long as snapshots hold onto the old version), if that's how it needs to be implemented.

That's just a caveat of using this entirely optional feature, which users with indefinite snapshots are perfectly free not to use until block pointer rewriting is implemented (if ever). They need to balance storage used against updating properties for older records, exactly as they already do with the "overkill" methods (a full send/receive, or copying the affected files).

@NicholasRush

That's just a caveat of using this entirely optional feature, which users with indefinite snapshots are perfectly free not to use until block pointer rewriting is implemented (if ever). They need to balance storage used against updating properties for older records, exactly as they already do with the "overkill" methods (a full send/receive, or copying the affected files).

So you can practically say from the start: if ZFS development does not implement this feature in the future, ZFS will eventually disappear into obscurity in the professional sector, because it will not be competitive with other copy-on-write filesystems.

This does not mean that it will no longer be used in the semi-professional sector, but rather that it will become increasingly less important in storage management and long-term data storage.

In the same way, proper storage tiering could finally be implemented. The solution using special devices works, but it is not a proper solution, especially since the special devices can no longer be removed from the pool.

A function could also be implemented that scans older data and then compresses and de-duplicates it to a higher degree, thus saving more storage space.

This would mean that ZFS would finally be competitive again on the storage market.
