implement corruption correcting recv #9323
Conversation
The next logical extension, for part two of this work, is to provide a way for a corrupted pool to tell a backup system to generate a minimal send stream that can then be used to heal the corrupted pool.
Force-pushed from 5412575 to 1052058, and later from f0859db to 3dd910a.
Using zfs recv in this way to correct damaged blocks is an interesting idea. I've left some initial comments, and we should be able to get you some additional feedback on the approach.
module/zfs/dmu_recv.c (Outdated)

```c
	/* We can only heal write and spill records; other ones get ignored */
	if (drr.drr_type != DRR_WRITE && drr.drr_type != DRR_SPILL)
		goto cleanup;
```
See comment below, but I'd suggest moving this check to the top of receive_process_record for healing receives.
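A minimal sketch of that placement, assuming the rwa->heal flag and the rrd->header layout used elsewhere in this PR:

```c
/* Sketch only -- not the PR's actual code. */
static int
receive_process_record(struct receive_writer_arg *rwa,
    struct receive_record_arg *rrd)
{
	/*
	 * For healing receives, only DRR_WRITE and DRR_SPILL records
	 * carry the data needed to rewrite a damaged block, so bail
	 * out here instead of deep inside the record handlers.
	 */
	if (rwa->heal && rrd->header.drr_type != DRR_WRITE &&
	    rrd->header.drr_type != DRR_SPILL)
		return (0);

	/* ... existing per-record-type dispatch continues here ... */
	return (0);
}
```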
module/zfs/dmu_recv.c (Outdated)

```c
	    DMU_READ_NO_PREFETCH);
	kmem_free(buf, lsize);
	if (err != ECKSUM)
		goto cleanup; /* no corruption found */
```
Functionally this looks right, but it is possible for dmu_read to return errors other than ECKSUM, for example EIO. Could you add a comment to clarify that it's also possible the block couldn't be read at all and was skipped?
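One way to address that, sketched against the quoted lines (the comment wording is mine, not the PR's):

```c
	err = dmu_read(os, obj, offset, lsize, buf,
	    DMU_READ_NO_PREFETCH);
	kmem_free(buf, lsize);
	/*
	 * Only ECKSUM identifies a healable block: err == 0 means no
	 * corruption was found, and any other errno (e.g. EIO) means
	 * the block couldn't be read at all, so it is skipped too.
	 */
	if (err != ECKSUM)
		goto cleanup;
```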
module/zfs/dmu_recv.c (Outdated)

```c
			break;
		}
		default:
			ASSERT0(1);
```
nit: this should be unreachable, but please go ahead and remove this debugging.
module/zfs/dmu_recv.c (Outdated)

```c
	err = dmu_buf_hold_noread(os, obj, offset, FTAG, &dbp);
	if (err != 0) {
		err = SET_ERROR(EBUSY);
```
Why EBUSY rather than returning the actual errno?
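The alternative the reviewer implies, as a sketch:

```c
	err = dmu_buf_hold_noread(os, obj, offset, FTAG, &dbp);
	if (err != 0)
		goto cleanup;	/* propagate the real errno, not EBUSY */
```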
module/zfs/dmu_recv.c (Outdated)

```diff
@@ -2577,7 +2775,10 @@ receive_writer_thread(void *arg)
 		 * can exit.
 		 */
 		if (rwa->err == 0) {
-			rwa->err = receive_process_record(rwa, rrd);
+			if (rwa->heal)
+				rwa->err = receive_heal_record(rwa, rrd);
```
Rather than adding a new top-level receive_heal_record(), did you try moving this logic into receive_write() and receive_spill() respectively? That would let you leverage all of the existing stream sanity checks, and the cleanup logic which consumes the arc_buf automatically on error. The common bits at the end of receive_heal_record(), which rewrite in place, can be left in their own function which is called by both.
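A rough sketch of that shape, with a hypothetical helper name (heal_block_in_place is not in the PR); the zio_rewrite() call mirrors the one already quoted below:

```c
/* Common tail shared by receive_write() and receive_spill(). */
static int
heal_block_in_place(objset_t *os, blkptr_t *bp, abd_t *abd, uint64_t size)
{
	/* Rewrite the existing BP in place, as the PR already does. */
	return (zio_wait(zio_rewrite(NULL, os->os_spa, 0, bp, abd, size,
	    NULL, NULL, ZIO_PRIORITY_SYNC_WRITE, ZIO_FLAG_CANFAIL, NULL)));
}
```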
```sh
{
	log_must dd bs=512 count=1 if=garbage conv=notrunc \
	    oflag=sync of=$DEV_RDSKDIR/$DISK seek=$((($1 / 512) + (0x400000 / 512)))
}
```
Good news: commit b63e2d8 added support to master to inject targeted damage into file blocks. Please use it to ensure this test is reliable.
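For example, something along these lines (flags hedged from zinject(8); the exact invocation the test should use is an assumption, not from this PR):

```sh
# Corrupt the file's level-0 data blocks on both DVAs via zinject,
# instead of dd'ing garbage at a computed raw-device offset.
log_must zinject -a -t data -C 0,1 -e checksum $TESTDIR/$TESTFILE
```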
```sh
corrupt_offset "$mid_offset"

log_must zpool scrub $TESTPOOL
log_must sleep 5 # let scrub finish
```
Also added to master is the new zpool wait subcommand; you can now run log_must zpool wait -t scrub $TESTPOOL and avoid this ugliness.
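Applied to the quoted snippet, that would read:

```sh
log_must zpool scrub $TESTPOOL
log_must zpool wait -t scrub $TESTPOOL	# replaces the fixed sleep
```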
```sh
log_assert "ZFS corrective receive should be able to heal corruption"

# we will use this as the source of corruption
log_must dd if=/dev/urandom of=garbage bs=512 count=1 oflag=sync
```
You should be able to get rid of the garbage file after switching to the targeted injection.
How does this interact with encryption and zfs send --raw? E.g. if the snapshot is encrypted, do they need to do a raw send, or does the receive encrypt it on the fly (using the crypt params in the BP)?
```diff
@@ -4487,7 +4487,7 @@ zfs_do_receive(int argc, char **argv)
 		nomem();
 
 	/* check options */
-	while ((c = getopt(argc, argv, ":o:x:dehnuvFsA")) != -1) {
+	while ((c = getopt(argc, argv, ":o:x:dehnuvFsAc")) != -1) {
```
I think it's time to add --long-opts for zfs receive. It's really too bad we didn't do this from the beginning for all of the subcommands.
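For illustration, a minimal sketch of what long-opt support could look like (the long option names here are hypothetical, not something this PR adds):

```c
#include <getopt.h>

static const struct option recv_long_opts[] = {
	{ "force",	no_argument,	NULL, 'F' },
	{ "resumable",	no_argument,	NULL, 's' },
	{ "corrective",	no_argument,	NULL, 'c' },
	{ NULL,		0,		NULL, 0 }
};

/* Inside zfs_do_receive(): same short opts, plus the long aliases. */
while ((c = getopt_long(argc, argv, ":o:x:dehnuvFsAc",
    recv_long_opts, NULL)) != -1) {
	/* existing switch (c) handling is unchanged */
}
```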
Part two of this work will likely need long-opts, but I'd like to avoid adding them now.
```diff
-    boolean_t force, boolean_t resumable, boolean_t raw, int input_fd,
-    const dmu_replay_record_t *begin_record, int cleanup_fd,
+    boolean_t force, boolean_t heal, boolean_t resumable, boolean_t raw,
+    int input_fd, const dmu_replay_record_t *begin_record, int cleanup_fd,
```
I think that we don't want to change existing libzfs_core function signatures if we can help it. Let's add a new function instead.
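Sketched as a declaration, with a hypothetical name and parameter list modeled on the existing signature quoted above:

```c
/*
 * New entry point carrying the heal flag; the existing lzc_receive*()
 * signatures stay untouched for ABI stability.
 */
int lzc_receive_with_heal(const char *snapname, nvlist_t *props,
    const char *origin, boolean_t force, boolean_t heal,
    boolean_t resumable, boolean_t raw, int input_fd,
    const dmu_replay_record_t *begin_record, int cleanup_fd,
    uint64_t *read_bytes, uint64_t *errflags, nvlist_t **errors);
```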
module/zfs/dmu_recv.c (Outdated)

```c
		 * snapshot as the one we are trying to heal.
		 */
		struct drr_begin *drrb = drba->drba_cookie->drc_drrb;
		error = dsl_dataset_hold_obj(dp, val, FTAG, &snap);
```
If error is nonzero, shouldn't we be returning that?
module/zfs/dmu_recv.c (Outdated)

```diff
@@ -361,12 +365,16 @@ recv_begin_check_existing_impl(dmu_recv_begin_arg_t *drba, dsl_dataset_t *ds,
 	if (dsl_dataset_has_resume_receive_state(ds))
 		return (SET_ERROR(EBUSY));
 
-	/* New snapshot name must not exist. */
+	/* New snapshot name must not exist if we're not healing it */
 	error = zap_lookup(dp->dp_meta_objset,
 	    dsl_dataset_phys(ds)->ds_snapnames_zapobj,
 	    drba->drba_cookie->drc_tosnap, 8, 1, &val);
```
Since we're now using val as the snapshot's object number, how about renaming it to reflect that, e.g. snapobj?
module/zfs/spa_errlog.c (Outdated)

```c
	if (avl_is_empty(&spa->spa_errlist_healed)) {
		mutex_exit(&spa->spa_errlist_lock);
		return;
	}
```
I don't think this avl_is_empty case is needed; the right thing will happen below (i.e. the first call to avl_destroy_nodes will return NULL).
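The point being that the standard AVL teardown loop is already a no-op for an empty tree; a sketch of that idiom:

```c
	void *cookie = NULL;
	spa_error_entry_t *se;

	/* avl_destroy_nodes() returns NULL immediately on an empty tree. */
	while ((se = avl_destroy_nodes(&spa->spa_errlist_healed,
	    &cookie)) != NULL)
		kmem_free(se, sizeof (spa_error_entry_t));
```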
module/zfs/dmu_recv.c (Outdated)

```c
		goto cleanup;
	}
	blkid =
	    dbuf_whichblock(DB_DNODE((dmu_buf_impl_t *)dbp), 0, offset);
```
You need to use DB_DNODE_ENTER/EXIT around the DB_DNODE(), or better yet dmu_buf_dnode_enter() as you do below. Or, if you're going to cast the dbuf, you might as well just dereference db_blkid.
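The suggested pattern, sketched against the quoted lines:

```c
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)dbp;

	DB_DNODE_ENTER(db);	/* pin the dnode while we dereference it */
	blkid = dbuf_whichblock(DB_DNODE(db), 0, offset);
	DB_DNODE_EXIT(db);
	/* ...or, since the dbuf was held at this offset, use db->db_blkid. */
```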
module/zfs/dmu_recv.c (Outdated)

```c
	buf = kmem_alloc(lsize, KM_SLEEP);
	/* Try to read the object to see if it needs healing */
	err = dmu_read(os, obj, offset, lsize, buf,
	    DMU_READ_NO_PREFETCH);
```
Why no prefetching?
module/zfs/dmu_recv.c (Outdated)

```c
	/* Correct the corruption in place */
	err = zio_wait(zio_rewrite(NULL, os->os_spa, 0, bp, abd, size, NULL,
	    NULL, ZIO_PRIORITY_SYNC_WRITE, flags, NULL));
```
Is there any check that the size is the same as the BP's psize (possibly rounded up to ashift)? Or at least that this can't write past the end of what's allocated for the BP?
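A possible guard, as a sketch (not something the PR currently has):

```c
	/* Refuse to rewrite unless the payload matches the BP's psize. */
	if (size != BP_GET_PSIZE(bp)) {
		err = SET_ERROR(EINVAL);
		goto cleanup;
	}
```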
module/zfs/dmu_recv.c (Outdated)

```c
	if (arc_get_compression(rrd->arc_buf) != BP_GET_COMPRESS(bp)) {
		/*
		 * The compression in the stream doesn't match what we had
		 * on disk; we need to re-compress the buf into the
```
Do we really need to handle this case? Seems like we could say that it needs to be a zfs send --compressed stream to use it with zfs recv -c. Plus, trying to recompress it and get the exact same byte stream adds more restrictions on changing the compression algorithm implementations, which we'd like to avoid. (cc @allanjude)
module/zfs/dmu_recv.c (Outdated)

```c
	/* Correct the corruption in place */
	err = zio_wait(zio_rewrite(NULL, os->os_spa, 0, bp, abd, size, NULL,
	    NULL, ZIO_PRIORITY_SYNC_WRITE, flags, NULL));
```
Given the current functionality, we would expect there to be a LOT more records for non-corrupt blocks than for corrupt blocks. So I'm even more concerned about the synchronous (and not eligible for predictive prefetch) dmu_read()'s used to determine whether we want to correct each block.
We'd typically (e.g. with >1 vdev) get better performance by simply always asynchronously zio_rewrite()'ing every record (i.e. without checking whether it's actually corrupt), since you'd have Nrecords async writes vs the current PR's Nrecords sync reads. Plus the code would be a lot simpler.
I wonder if we should wait until the extensions (to send only the corrupt blocks) are ready, or at least design recv -c to work best in that mode, e.g. by assuming that nearly all records will be for corrupt blocks.
Force-pushed from 537c5bb to 90a196a.
Thanks for the reviews, guys. I've addressed most of the comments in the version I just pushed. I'm still thinking about the right way to do the rewrite I/O. Perhaps always rewriting (instead of reading first) is the right way to go, hmm... It seems to be too big a limitation to impose saying we have to use zfs send --compressed streams.
Force-pushed from 90a196a to 1857323.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #9323      +/-   ##
==========================================
- Coverage   79.79%   79.12%   -0.68%
==========================================
  Files         279      401     +122
  Lines       81396   122676   +41280
==========================================
+ Hits        64951    97065   +32114
- Misses      16445    25611    +9166
```

Continue to review the full report at Codecov.
After talking with my colleagues at @datto, we think it may be too dangerous to rewrite everything we encounter in the send stream, since writing has the potential to do more damage in cases where the corruption is coming from failing HW, for example. The other open question was how to handle compression algorithms that may not recompress the same block the same way. The suggestion here: to avoid this problem, we will only heal when the checksum of the data to be used for healing matches the checksum already on disk. The last thing I'm still working on is making sure raw and large-block send streams are compatible with healing recv.
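A sketch of that checksum-match rule (the checksum plumbing here is illustrative, assuming an abd holding the stream payload; it is not the PR's code):

```c
	zio_cksum_t cksum;

	/* Compute the stream payload's checksum with the BP's algorithm. */
	zio_checksum_table[BP_GET_CHECKSUM(bp)].ci_func[0](abd, size,
	    NULL, &cksum);
	/* Heal only if it matches what the on-disk BP expects. */
	if (!ZIO_CHECKSUM_EQUAL(cksum, bp->blk_cksum))
		return (SET_ERROR(ECKSUM));
```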
This patch implements a new type of zfs receive: corrective receive (-c). This type of recv is used to heal corrupted data when a replica of the data already exists (in the form of a send file, for example). Metadata cannot be healed using a corrective receive. Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Force-pushed from 1857323 to adb30d5.
Superseded by #9372
This patch implements a new type of zfs receive: corrective receive (-c). This type of recv is used to heal corrupted data when a replica of the data already exists (in the form of a send file, for example).
Metadata cannot be healed using a corrective receive.
This patch enables us to receive a send stream into an existing snapshot for the purpose of correcting data corruption.
Motivation and Context
In the past, in the rare cases where ZFS has experienced permanent data corruption, full recovery of the dataset(s) has not always been possible even if replicas existed.
This patch makes recovery from permanent data corruption possible.
Description
For every write and spill record in the send stream, we read the corresponding block from disk, and if that read fails with a checksum error we overwrite the block with data from the send stream.
After the data is healed we reread the block to make sure it's healed, and remove the healed blocks from the corruption lists seen in zpool status.
To make sure we have correctly matched the data in the send stream to the right dataset to heal, there is a restriction that the GUID of the snapshot being received into must match the GUID in the send stream. Since there are likely several snapshots referring to the same potentially corrupted data, there may be many snapshots satisfying this condition that are able to heal a single block.
The other thing to point out is that we can only correct data. Specifically, we can only heal records of type DRR_WRITE and DRR_SPILL, since those are the only ones that contain all of the data needed to recreate the damaged block.
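As a hypothetical end-to-end example of the workflow this enables (pool, dataset, and file names are illustrative):

```sh
# A replica taken before the corruption occurred:
zfs send tank/data@snap1 > /backup/data-snap1.zstream

# Later, a scrub reports permanent errors in tank/data@snap1...
zpool scrub tank
zpool status -v tank

# ...so feed the replica back as a corrective receive:
zfs recv -c tank/data@snap1 < /backup/data-snap1.zstream

# Re-scrub; healed blocks drop off the error list in zpool status.
zpool scrub tank
```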
How Has This Been Tested?
I've been running unit testing very similar to the test I've added to the zfs-tests suite.