Rewriting Records Via Scrub #15335

Open
Haravikk opened this issue Sep 30, 2023 · 12 comments
Labels
Type: Feature Feature request or new feature

Comments

@Haravikk

Describe the feature you would like to see added to OpenZFS

I would like to see the ability to fully "rewrite" records added to ZFS as an alternative to the elusive "block pointer rewrite" feature that so often blocks other features from being implemented.

The idea is simple: given one or more records, ZFS will write out the same data as "new" records before atomically retiring the old ones, no block pointer rewrite necessary. This would behave exactly as if the user had copied a file and renamed it into place, but at record granularity and with guaranteed atomicity.

To allow it to be put to use immediately, this feature could be accompanied by a "rewrite records" option on zpool scrub which, when enabled, tells scrub to rewrite any record it encounters that does not match certain properties of its dataset (e.g. a different compression algorithm). The option would require a free-space threshold (in bytes or as a percentage); if rewriting a record would take the dataset below this amount of free space, scrub will skip the rewrite but continue scrubbing as normal. This allows datasets to be progressively rewritten via multiple scrubs over time until the rewriting is complete, without risking consumption of all remaining free space. The number of records rewritten or skipped would be reported by zpool status.
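
To make the shape of the proposal concrete, here is a purely hypothetical invocation; neither the rewrite flag nor the free-space threshold exists in OpenZFS today, and the names are made up for illustration only:

```sh
# Hypothetical syntax only: the -R/--rewrite-min-free option is part of
# this proposal, not of any released OpenZFS.
zpool scrub -R --rewrite-min-free=100G tank

# zpool status (a real command) would then additionally report the
# number of records rewritten and the number skipped because the
# free-space floor was reached.
zpool status tank
```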

How will this feature improve OpenZFS?

This will enable records to be rewritten to take advantage of new properties, special devices etc. without the need to perform a full zfs send/receive cycle to recreate the entire dataset (which also requires downtime for the switchover). If we ever do gain a proper block pointer rewrite capability that can account for snapshots etc., it can then simply be used to optimise this feature, applying the benefit to anything that was implemented using this "full rewrite" method.

Implementing this feature could allow issues such as #9762 to be closed, since the rewriting scrub should meet their requirements, while issues such as #15226 could be addressed by adding support for their cases to the rewriting scrub (i.e. check whether a record should be on the special device and, if so, rewrite it so that it is).

Haravikk added the Type: Feature label on Sep 30, 2023
@GregorKopka
Contributor

This feature would permanently inflate the allocated space requirements of all datasets that have snapshots.
Or require all snapshots prior to the operation to be dropped, which is impossible if there are clones.
Or would require block pointer rewrite.

The only benefit over employing send/recv (which could keep the snapshot chain and, with careful ordering, could also preserve the clone hierarchy) would be that it happens online, compared to the minimal offline time needed for a final incremental: [stop services]; send|recv; destroy; rename; [start services].
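
For reference, a minimal sketch of that switchover sequence, assuming an earlier full send already seeded the replica dataset and that the @prev snapshot exists on both sides (dataset, snapshot, and service names are placeholders):

```sh
# [stop services]
systemctl stop myservice

# final incremental send into the already-seeded replica
zfs snapshot tank/data@final
zfs send -i tank/data@prev tank/data@final | zfs receive tank/data_new

# swap the datasets
zfs destroy -r tank/data
zfs rename tank/data_new tank/data

# [start services]
systemctl start myservice
```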

@mschilli87
Contributor

@GregorKopka: Maybe I am missing something, but isn't there the added advantage of using whatever space is available to migrate the blocks little by little, compared to send/recv requiring enough free space to (temporarily) double the entire dataset?

@Haravikk
Author

Haravikk commented Oct 2, 2023

@GregorKopka: Maybe I am missing something, but isn't there the added advantage of using whatever space is available to migrate the blocks little by little, compared to send/recv requiring enough free space to (temporarily) double the entire dataset?

This is exactly what I meant; I'm not sure why @GregorKopka ignored that part.

I specifically mentioned that the rewrite option for scrub requires a free-space limit, below which it will stop rewriting records for the dataset/pool. This allows the rewriting to occur incrementally over time as more free space becomes available, i.e. as older snapshots are destroyed.

And as you say, it allows this to be done without requiring at least double the space to be available, either on the same or another pool, for handling the send/receive. We don't all have a handy second, unused storage array lying around, and a full send/receive is overkill when only a fraction of records may actually require rewriting.

@GregorKopka
Contributor

You two seem to blissfully ignore that snapshots, clones, bookmarks and checkpoints exist and are used by people using ZFS.

How should they be treated, in regard to this feature?

@Haravikk
Author

Haravikk commented Oct 5, 2023

You two seem to blissfully ignore that snapshots, clones, bookmarks and checkpoints exist and are used by people using ZFS.

How about, instead of insulting us, you try reading what we said? The free-space limit is intended specifically to address this; it means that if the "new" records are consuming too much space (because snapshots won't release the "old" records) then the rewriting will stop until a later scrub.

So long as snapshots are being destroyed when they're no longer needed, this allows multiple rounds of scrubbing (which is something you should be doing periodically anyway) to eventually rewrite everything that needs rewriting. If a pool never discards snapshots the rewriting will never be able to complete, but it's up to the documentation to make that clear.

For example, if your dataset/pool has 150 GB of free space and you set the limit to 100 GB, then you can do up to 50 GB of rewrites. Following that, if you free up 25 GB of space as a result of destroying old or intermediate snapshots, then the next scrub can complete 25 GB of rewrites, and so on until eventually it is finished (no more rewrites left to find).
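
The skip/rewrite decision implied by those numbers could be expressed roughly like this (pure illustration; the function name and the way the threshold would be configured are assumptions, not existing OpenZFS behaviour):

```sh
#!/bin/sh
# Illustration only: decide whether a single record may be rewritten
# while keeping free space above the configured floor.
GiB=$((1024 * 1024 * 1024))

maybe_rewrite() {
    record_bytes=$1 free_bytes=$2 min_free_bytes=$3
    if [ $(( free_bytes - record_bytes )) -ge "$min_free_bytes" ]; then
        echo rewrite   # old copy stays referenced by snapshots for now
    else
        echo skip      # counted, reported, and retried on a later scrub
    fi
}

# With 150G free and a 100G floor there is roughly 50G of headroom:
maybe_rewrite $((128 * 1024)) $((150 * GiB)) $((100 * GiB))   # -> rewrite
maybe_rewrite $((128 * 1024)) $((100 * GiB)) $((100 * GiB))   # -> skip
```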

If block pointer rewriting is ever implemented, this caveat can then be removed as it will no longer be a problem.

@GregorKopka
Contributor

GregorKopka commented Oct 6, 2023

Let's look at the simplest possible backup scheme with send|recv (a rough shell sketch follows the list):

  1. create snapshot on source
  2. send incremental (or, if there is no prior snapshot like on the first iteration, everything) to destination
  3. destroy all snapshots on the source except the last one successfully sent
  4. goto 1
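
A minimal shell sketch of that loop (pool, dataset, and host names are placeholders; a real script would add error handling):

```sh
#!/bin/sh
# Simplest possible send|recv backup loop, as described above.
SRC=tank/data
DST=backup/data
PREV=""

while true; do
    SNAP="$SRC@backup-$(date +%Y%m%d-%H%M%S)"
    zfs snapshot "$SNAP"

    if [ -z "$PREV" ]; then
        # first iteration: send everything
        zfs send "$SNAP" | ssh backuphost zfs receive "$DST"
    else
        # later iterations: send only the increment
        zfs send -i "$PREV" "$SNAP" | ssh backuphost zfs receive "$DST"
    fi

    # destroy all source snapshots except the last one successfully sent
    [ -n "$PREV" ] && zfs destroy "$PREV"
    PREV="$SNAP"

    sleep 3600   # "goto 1"
done
```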

Logical conclusion #1: All existing data in the dataset is always held by a snapshot.
Logical conclusion #2: None of the existing data will ever be eligible to be affected by this feature.
Logical conclusion #3: This feature is an expensive NOP, for all real-world setups.

@mschilli87
Contributor

I disagree with your third conclusion. In all of my setups, there is an additional step that expires snapshots on the backup/receiver side, which means that data deleted on the source side eventually frees its previously used space as well. As far as I can tell, the proposed feature would allow most if not all data (depending on the use case / data turnover) to eventually be rewritten without the need for block pointer rewrites.

@GregorKopka
Contributor

As long as an incremental backup routine is active, there will always be a snapshot on both the source and destination.

New data written on the source will be distributed to the special vdevs anyway: feature is a NOP.
Newly received data on the destination will be distributed to the special vdevs anyway: feature is a NOP.
Existing on-disk data for which a snapshot exists (=always): feature is a NOP.

@Haravikk
Author

Haravikk commented Oct 6, 2023

New data written on the source will be distributed to the special vdevs anyway: feature is a NOP.

The problem isn't newly written data. So this "argument" is a NOP.

Newly received data on the destination will be distributed to the special vdevs anyway: feature is a NOP.

The problem isn't data being received (which is essentially also newly written data) but data that is already present. So another "argument" that is a NOP.

Existing on-disk data for which a snapshot exists (=always): feature is a NOP.

You're assuming that all snapshots are retained indefinitely? So another "argument" that is a NOP!

Even on a backup target (which isn't really what this feature is for) you don't want to retain every snapshot, because you'd need to continuously upgrade storage even if the amount of current data isn't growing (additions are matched by deletions), and if it is growing you'd increase the rate at which capacity needs to be upgraded even further.

But on the sending side of most setups you're not going to need such a high level of snapshot retention, and you're certainly not going to want to keep snapshots indefinitely unless you can throw a continuous stream of new hardware at it. Most setups in reality keep local snapshots for a shorter time for potential emergency rollbacks, and more snapshots on a backup for recovery in the event the main pool fails entirely. If you don't want local snapshots you also have the option of using bookmarks (or of using them specifically for backups), as they don't tie up old records; that's the whole point of them: they enable you to discard the snapshot they were created from and still send incrementally to another target.
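
For anyone unfamiliar with that workflow, a hedged example of the bookmark approach (dataset, snapshot, and host names are placeholders):

```sh
# Initial full send, then convert the snapshot into a bookmark so the
# local snapshot can be destroyed without breaking the incremental chain.
zfs snapshot tank/data@monday
zfs send tank/data@monday | ssh backuphost zfs receive backup/data
zfs bookmark tank/data@monday tank/data#monday
zfs destroy tank/data@monday        # frees the old blocks locally

# Later: incremental send from the bookmark to a new snapshot.
zfs snapshot tank/data@tuesday
zfs send -i tank/data#monday tank/data@tuesday | ssh backuphost zfs receive backup/data
```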

Sure, there will be a point at which both "old" and "new" versions of records exist thanks to snapshots/clones/whatever, but again that's exactly what the free-space limit is intended to address. Once a record has been rewritten, newer snapshots will no longer reference the old version at all, so as older snapshots are destroyed the old version of the record eventually disappears entirely, just like any other data that is no longer referenced, such as files that were deleted.

But we've now been over this at least three times, I'm sick of explaining it, and I'm starting to think the problem here is ideological rather than practical.

@GregorKopka
Contributor

There are quite a few professions where the ability to prove that certain data existed at a certain point in time is important, as is the ability to restore data from points back in time (30 years of professional IT have taught me that users take between days and even years to notice that they accidentally deleted something that they need now).

Having a snapshot regime that thins out frequent snapshots to a reasonable daily/weekly/monthly/quarterly/yearly retention is by no means unusual; tools like https://github.com/zfsonlinux/zfs-auto-snapshot or https://github.com/jimsalterjrs/sanoid (which aim at doing exactly that) are quite popular for a reason.

Please don't ass-u-me that your way of using technology is the norm for everyone, or even close to a relevant fraction of users.

@Haravikk
Author

Haravikk commented Oct 7, 2023

This feature doesn't prevent anyone from keeping snapshots indefinitely if they want to or need to, but you're acting like this use case is a blocker to the feature when it's not at all. This is a bad faith straw-man argument and a distraction that has gone on more than long enough.

Keeping snapshots indefinitely means you need to be aware of the extra storage cost of doing so, and any documentation around this feature (which it's going to require, since it's a new option) merely needs to make clear that the data is copied (so it will appear twice for as long as snapshots hold onto the old version), if that's how it needs to be implemented.

That's just a caveat of using this entirely optional feature, which users with indefinite snapshots are perfectly free not to use until block pointer rewriting is implemented (if ever). They need to balance storage used against updating properties for older records, exactly as they already do with the "overkill" methods (a full send/receive, or copying the affected files).

@NicholasRush

That's just a caveat of using this entirely optional feature, which users with indefinite snapshots are perfectly free not to use until block pointer rewriting is implemented (if ever). They need to balance storage used against updating properties for older records, exactly as they already do with the "overkill" methods (a full send/receive, or copying the affected files).

So you can practically say from the start: if ZFS development does not implement this feature in the future, ZFS will eventually disappear into obscurity in the professional sector, because it will not be competitive with other copy-on-write filesystems.

This does not mean that it will no longer be used in the semi-professional sector, but rather that it will become increasingly less important in storage management and long-term data storage.

In the same way, proper storage tiering could finally be implemented. The solution using special devices works, but it is not a proper solution, especially since the special devices can no longer be removed from the pool.

A function could also be implemented that scans older data and then compresses and de-duplicates it to a higher degree, thus saving more storage space.

This would mean that ZFS would finally be competitive again on the storage market.
