implement corruption correcting recv #9372
Conversation
This feature is indeed really nice to have. However, I am curious about whether it would - even theoretically - be possible to heal metadata using a corrective receive. Support for that would definitely be a killer feature.
I agree that it would be great to be able to heal metadata but as far as I know, there isn't enough information in the send file to do that.
I've fixed the re-encryption code so this is ready for review now.
Codecov Report
@@            Coverage Diff            @@
##           master   #9372     +/-   ##
========================================
- Coverage      80%     79%     -<1%
========================================
  Files         384     384
  Lines      121788  122069     +281
========================================
- Hits        96900   96897       -3
- Misses      24888   25172     +284
Continue to review full report at Codecov.
lib/libzfs/libzfs_sendrecv.c
"key must be loaded to do a non-raw correc" | ||
"tive recv on an encrypted dataset.")); |
For human readability of the source code, I'd recommend line-breaking the string after a space.
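For illustration, the break being suggested might look like this (just the string literals; the enclosing call is not shown in the quoted diff):

    "key must be loaded to do a non-raw "
    "corrective recv on an encrypted dataset."));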
lib/libzfs/libzfs_sendrecv.c
"corrective receive was not able to recon" | ||
"struct the data needed for healing.")); |
For human readability of the source code, I'd recommend line-breaking the string after a space.
/*
 * Removes all of the recv healed errors from both on-disk error logs
 */
static void
spa_remove_healed_errors(spa_t *spa, avl_tree_t *s, avl_tree_t *l, dmu_tx_t *tx)
{
    char name[NAME_MAX_LEN];
    spa_error_entry_t *se;
    void *cookie = NULL;

    ASSERT(MUTEX_HELD(&spa->spa_errlog_lock));

    while ((se = avl_destroy_nodes(&spa->spa_errlist_healed,
        &cookie)) != NULL) {
        remove_error_from_list(spa, s, &se->se_bookmark);
        remove_error_from_list(spa, l, &se->se_bookmark);
        bookmark_to_name(&se->se_bookmark, name, sizeof (name));
        kmem_free(se, sizeof (spa_error_entry_t));
        (void) zap_remove(spa->spa_meta_objset,
            spa->spa_errlog_last, name, tx);
        (void) zap_remove(spa->spa_meta_objset,
            spa->spa_errlog_scrub, name, tx);
    }
}

/*
 * Stash away healed bookmarks to remove them from the on-disk error logs
 * later in spa_remove_healed_errors().
 */
void
spa_remove_error(spa_t *spa, zbookmark_phys_t *zb)
{
    char name[NAME_MAX_LEN];

    bookmark_to_name(zb, name, sizeof (name));

    spa_add_healed_error(spa, spa->spa_errlog_last, zb);
    spa_add_healed_error(spa, spa->spa_errlog_scrub, zb);
}
I'm not entirely convinced of the utility of this, because (AIUI) we only remove errors that were reported against the snapshot that we are doing the healing receive into. A block with a checksum error will often be referenced by multiple datasets (due to snapshots), and an error will be reported on any datasets via which we access the bad block. Typical cases are:
- running a scrub, in which case the error will be reported against the first snapshot that references it
- reading from a filesystem, in which case the error will be reported against the filesystem.
In either case, the dataset that we are receiving into may be a different dataset than the one that the error was reported against, in which case the "remove healed errors" logic accomplishes nothing.
Running a scrub after the healing receive (as recommended in the manpage additions) is really the only way to get an updated list of errors.
That said, with the addition of #9175, we could potentially remove all relevant error reports.
I can remove the error clean up easily if we think that's cleaner.
I figured a relatively common use case for healing recv would be trying to heal using the snapshot that a scrub has IDed as corrupted. We then take this snapshot from the remote side and use it for healing. In this scenario, I figured it would make sense to then remove the errlog entry associated with the fixed blocks.
I see, that does seem like a realistic use case.
module/zfs/dmu_recv.c
    int buf_size = MIN(drrw->drr_logical_size, 32);
    void *buf = kmem_alloc(buf_size, KM_SLEEP);
When would drr_logical_size be less than 32? I think it couldn't since the minimum block size is 512. Why are we reading 32 bytes? Why do we need to kmem_alloc 32 bytes, vs allocating on the stack? Why not just one byte? Do we want to check that drr_logical_size is the same as the object's block size (which I think it might not be if we are toggling the --large-block flag).
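One possible shape of what is being suggested here (a sketch, not the PR's code): since the smallest block is 512 bytes, a one-byte probe on the stack should be enough to trigger checksum verification of the underlying block, with no kmem_alloc needed.

    char probe;
    err = dmu_read(rwa->os, drrw->drr_object, drrw->drr_offset,
        1, &probe, DMU_READ_NO_PREFETCH);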
module/zfs/dmu_recv.c
     * We only try to heal when dmu_read() returns a ECKSUMs.
     * Other errors (even EIO) get returned to caller
This explains what is happening, which the code also makes fairly obvious. It would be helpful for the comment to explain why we want to do this. For example, EIO indicates that the device is not present/accessible, so writing to it will likely fail. And if the block is healthy, we don't want the added i/o cost, and/or we don't want to overwrite stuff unnecessarily.
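One possible expansion of that comment, along the lines being asked for (the wording below is illustrative only, not the PR's actual comment):

    /*
     * We only attempt to heal when dmu_read() returns ECKSUM.  Other
     * errors (even EIO) are passed back to the caller: EIO usually
     * means the device is missing or inaccessible, so a rewrite would
     * likely fail as well, and if the read succeeds the block is
     * already healthy and we don't want the extra I/O of overwriting
     * it unnecessarily.
     */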
module/zfs/dmu_recv.c
    return (SET_ERROR(EACCES));
    }

    err = zio_do_crypt_abd(B_TRUE,  &dck->dck_key,
[nit] there's an extra space (2 spaces) after the first comma: ",  &"
module/zfs/dmu_recv.c
    return (do_corrective_recv(rwa, drrw->drr_object, abuf,
        drrw->drr_logical_size, bp, blkid, drrw->drr_offset));
It might simplify the error handling in do_corrective_recv() if we expected it to never consume the abuf, in which case we would do "if (err == 0) dmu_return_arcbuf(abuf);" here.
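For concreteness, the call-site pattern being described might look like this (a sketch; it assumes do_corrective_recv() is changed to never free abuf, and error-path cleanup is handled elsewhere):

    err = do_corrective_recv(rwa, drrw->drr_object, abuf,
        drrw->drr_logical_size, bp, blkid, drrw->drr_offset);
    if (err == 0)
        dmu_return_arcbuf(abuf);
    return (err);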
module/zfs/dmu_recv.c
    dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG);
    dsl_pool_config_exit(dp, FTAG);

    if (err != 0 || no_crypt) {
If err == 0 but no_crypt != 0, I think that we are expected to consume arc_buf (per our caller's requirements). However, in practice I don't think that we can get no_crypt != 0 since we are not dealing with DMU_OT_[INTENT_LOG,DNODE]. I think we could instead ASSERT0(no_crypt), and simply return (err) below.
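A minimal sketch of that suggestion (surrounding cleanup is omitted):

    ASSERT0(no_crypt);
    if (err != 0)
        return (err);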
module/zfs/dmu_recv.c
    if (BP_GET_COMPRESS(bp) != ZIO_COMPRESS_OFF) {
        /* Recompress the data */
        if (buf != NULL)
            abd_free(abd);
        size = zio_compress_data(BP_GET_COMPRESS(bp), abd, buf,
            lsize);
    }
I'm pretty sure that compression happens before encryption. Compressing the encrypted data will rarely yield any savings. Assuming I'm right, it seems that these code paths have not been tested. It would be good to add some tests to the test suite to exercise them. e.g.:
- uncompressed stream heals compressed block
- unencrypted stream heals encrypted block
- stream w/different compression heals differently-compressed block
- uncompressed (& unencrypted) stream heals compressed & encrypted block
fixed
module/zfs/dmu_recv.c
    io = zio_rewrite(NULL, rwa->os->os_spa, 0, bp, abd,
        size, NULL, NULL, ZIO_PRIORITY_SYNC_WRITE, flags, &zb);

    /* compute new bp checksum value and make sure it matches the old one */
    zio_checksum_compute(io, BP_GET_CHECKSUM(bp), abd, size);
    if (size != BP_GET_PSIZE(bp) ||
        !ZIO_CHECKSUM_EQUAL(bp_cksum, io->io_bp->blk_cksum)) {
This might be simplified by using zio_checksum_error_impl(), in which case you could verify the checksum before creating the zio and not have to save the bp_cksum.
I'm not sure that using this rather complicated function is simpler than stashing the BP and using the macro.
Thanks for the review Matt! I'll start working through these comments this weekend.
    datasetexists $TESTPOOL/$TESTFS1 && \
        log_must zfs destroy -r $TESTPOOL/$TESTFS1
    datasetexists $TESTPOOL/$TESTFS2 && \
        log_must zfs destroy -r $TESTPOOL/$TESTFS2
    datasetexists $TESTPOOL/testfs3 && \
        log_must zfs destroy -r $TESTPOOL/testfs3
We're already destroying the pool, this isn't necessary.
    for f in $ibackup $backup; do
        [[ -f $f ]] && log_must rm -f $f
    done
Why not just rm -f $ibackup $raw_backup $backup.
    typeset snap2="$TESTPOOL/$TESTFS1@snap2"
    typeset file="/$TESTPOOL/$TESTFS1/$TESTFILE0"

    log_must zpool destroy $TESTPOOL
Can you use a different pool name and leave $TESTPOOL, so you don't have to worry about destroying and recreating it? Otherwise, this should be a bit more tolerant, with poolexists $TESTPOOL && destroy_pool $TESTPOOL here and in cleanup, and cleanup should recreate the pool more like it is set up in setup.ksh to keep surprises to a minimum for the next test (when added).
Thanks for the review Ryan, I need to expand the testing to include spill block healing and will try to incorporate your feedback then.
I've rebased this code; it's close to ready, but it still needs more work with regard to how abd/zio is handled. There are some leaks present in the current version...
Codecov Report
@@            Coverage Diff             @@
##           master    #9372      +/-   ##
==========================================
+ Coverage   75.17%   79.59%    +4.41%
==========================================
  Files         402      395        -7
  Lines      128071   125378     -2693
==========================================
+ Hits        96283    99789     +3506
+ Misses      31788    25589     -6199
Flags with carried forward coverage won't be shown.
Continue to review full report at Codecov.
Hi @alek-p - regarding the memory leaks: receive_process_write_record (with rwa->heal set) is returning EAGAIN, but it does not add the receive_record_arg structure to the rwa->write_batch listhead. The kmem_free was also removed for the non-corrective use cases when it doesn't hit the EAGAIN case. I modified the writer thread to circumvent this situation in a local branch:
Thanks for looking into this Tony! AFAIK the only thing left now is making sure the testing is robust enough and includes spill record healing.
I've had a hard time trying to get spill blocks generated, so I did a manual test by running the added corrective recv test with the following patch applied:
I think this is ready for the next round of reviews. |
This type of recv is used to heal corrupted data when a replica of the data already exists (in the form of a send file for example). With the provided send stream, corrective receive will read from disk blocks described by the WRITE records. When any of the reads come back with ECKSUM we use the data from the corresponding WRITE record to rewrite the corrupted block.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@axcient.com>
Closes openzfs#9372
Would this fix make it into a release soon?
This feature will make it into the OpenZFS 2.2 release.
This is a huge improvement. It's the last missing piece for my use case: offsite backup, no RAIDZ, and a low-throughput link that makes it difficult to recreate datasets with ease in case of errors. Quite certain I am not the only one in this boat.
@pepsinio you're not the only one in that boat. I've followed this topic since before the PR was opened. No WAN use case here, but the general understanding is that this is a major resilience feature that stabilizes many related bits and bytes.
New features:
- Fully adaptive ARC eviction (openzfs#14359)
- Block cloning (openzfs#13392)
- Scrub error log (openzfs#12812, openzfs#12355)
- Linux container support (openzfs#14070, openzfs#14097, openzfs#12263)
- BLAKE3 Checksums (openzfs#12918)
- Corrective "zfs receive" (openzfs#9372)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch implements a new type of zfs receive: corrective receive (-c). This type of recv is used to heal corrupted data when a replica of the data already exists (in the form of a send file for example).
Metadata can not be healed using a corrective receive.
This patch enables us to receive a send stream into an existing snapshot for the purpose of correcting data corruption.
This is the updated version of the patch in #9323
Motivation and Context
In the past in the rare cases where ZFS has experienced permanent data corruption, full recovery of the dataset(s) has not always been possible even if replicas existed.
This patch makes recovery from permanent data corruption possible.
Description
For every write record in the send stream, we read the corresponding block from disk and if that read fails with a checksum error we overwrite that block with data from the send stream.
After the data is healed we reread the block to make sure it's healed and remove the healed blocks from the corruption lists seen in zpool status.
To make sure we have correctly matched the data in the send stream to the right dataset to heal, there is a restriction that the GUID for the snapshot being received into must match the GUID in the send stream. There are likely several snapshots referring to the same potentially corrupted data, so there may be many snapshots satisfying the above condition that are able to heal a single block.
The other thing to point out is that we can only correct data. Specifically, we are only able to heal records of type DRR_WRITE.
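As a rough illustration of this flow, here is simplified pseudocode (this is not the actual dmu_recv.c implementation; the helper name heal_from_write_record and the reduced argument list of do_corrective_recv are invented for the sketch):

    static int
    heal_from_write_record(struct receive_writer_arg *rwa,
        struct drr_write *drrw, arc_buf_t *abuf)
    {
        char probe[32];
        int err;

        /* Read the on-disk block described by this WRITE record. */
        err = dmu_read(rwa->os, drrw->drr_object, drrw->drr_offset,
            sizeof (probe), probe, DMU_READ_NO_PREFETCH);

        if (err == 0)
            return (0);        /* block is healthy; nothing to heal */
        if (err != ECKSUM)
            return (err);      /* e.g. EIO: not a healable failure */

        /*
         * The on-disk copy failed its checksum: rewrite it in place
         * with the data carried by the send stream, re-compressing and
         * re-encrypting as needed so the rewritten block matches the
         * block pointer's original checksum.
         */
        return (do_corrective_recv(rwa, drrw, abuf));
    }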
To help with the review you can see my OpenZFS dev summit 2019 talk for more context on this work:
video: https://www.youtube.com/watch?v=JldbtDATrOo
slides: https://drive.google.com/file/d/1Ysc_3bJWmsJCETFNTRCzyvpseDpzjjf2/view
How Has This Been Tested?
I've been running unit tests very similar to the test that I've added to the zfs-tests suite.
Future Work
Since DRR_SPILL record also (like DRR_WRITE) contains all of the data needed to recreate the damaged block - a future project could add support for healing of DRR_SPILL records.
The next logical extension for part two of this work is to provide a way for a corrupted pool to tell a backup system to generate a minimal send stream in such a way as to enable the corrupted pool to be healed with this generated send stream.
The interface could be something like the following, but maybe there are better suggestions?