implement corruption correcting recv #9323
Conversation
The next logical extension, for part two of this work, is to provide a way for a corrupted pool to tell a backup system to generate a minimal send stream that can then be used to heal the corrupted pool.
Force-pushed from 5412575 to 1052058, and later from f0859db to 3dd910a.
Using zfs recv in this way to correct damaged blocks is an interesting idea. I've left some initial comments, and we should be able to get you some additional feedback on the approach.
module/zfs/dmu_recv.c (Outdated)

```c
	/* We can only heal write and spill records; other ones get ignored */
	if (drr.drr_type != DRR_WRITE && drr.drr_type != DRR_SPILL)
		goto cleanup;
```
See comment below, but I'd suggest moving this check to the top of receive_process_record for healing receives.
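A minimal sketch of that placement, assuming the rwa->heal flag and the rrd->header layout used elsewhere in this PR:

```c
/* Sketch only -- not the PR's actual code. */
static int
receive_process_record(struct receive_writer_arg *rwa,
    struct receive_record_arg *rrd)
{
	/*
	 * For healing receives, only DRR_WRITE and DRR_SPILL records
	 * carry the data needed to rewrite a damaged block, so bail
	 * out here instead of deep inside the record handlers.
	 */
	if (rwa->heal && rrd->header.drr_type != DRR_WRITE &&
	    rrd->header.drr_type != DRR_SPILL)
		return (0);

	/* ... existing per-record-type dispatch continues here ... */
	return (0);
}
```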
module/zfs/dmu_recv.c (Outdated)

```c
	    DMU_READ_NO_PREFETCH);
	kmem_free(buf, lsize);
	if (err != ECKSUM)
		goto cleanup; /* no corruption found */
```
Functionally this looks right, but it is possible for dmu_read to return errors other than ECKSUM, for example EIO. Could you add a comment to clarify that it's also possible the block couldn't be read at all and was skipped?
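One way to address that, sketched against the quoted lines (the comment wording is mine, not the PR's):

```c
	err = dmu_read(os, obj, offset, lsize, buf,
	    DMU_READ_NO_PREFETCH);
	kmem_free(buf, lsize);
	/*
	 * Only ECKSUM identifies a healable block: err == 0 means no
	 * corruption was found, and any other errno (e.g. EIO) means
	 * the block couldn't be read at all, so it is skipped too.
	 */
	if (err != ECKSUM)
		goto cleanup;
```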
module/zfs/dmu_recv.c (Outdated)

```c
			break;
		}
		default:
			ASSERT0(1);
```
nit: this should be unreachable, but please go ahead and remove this debugging.
module/zfs/dmu_recv.c (Outdated)

```c
	err = dmu_buf_hold_noread(os, obj, offset, FTAG, &dbp);
	if (err != 0) {
		err = SET_ERROR(EBUSY);
```
Why EBUSY rather than returning the actual errno?
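The alternative the reviewer implies, as a sketch:

```c
	err = dmu_buf_hold_noread(os, obj, offset, FTAG, &dbp);
	if (err != 0)
		goto cleanup;	/* propagate the real errno, not EBUSY */
```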
module/zfs/dmu_recv.c (Outdated)

```diff
@@ -2577,7 +2775,10 @@ receive_writer_thread(void *arg)
 		 * can exit.
 		 */
 		if (rwa->err == 0) {
-			rwa->err = receive_process_record(rwa, rrd);
+			if (rwa->heal)
+				rwa->err = receive_heal_record(rwa, rrd);
```
Rather than adding a new top-level receive_heal_record(), did you try moving this logic into receive_write() and receive_spill() respectively? That would let you leverage all of the existing stream sanity checks, and the cleanup logic which consumes the arc_buf automatically on error. The common bits at the end of receive_heal_record(), which rewrite in place, can be left in their own function which is called by both.
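A rough sketch of that shape, with a hypothetical helper name (heal_block_in_place is not in the PR); the zio_rewrite() call mirrors the one already quoted below:

```c
/* Common tail shared by receive_write() and receive_spill(). */
static int
heal_block_in_place(objset_t *os, blkptr_t *bp, abd_t *abd, uint64_t size)
{
	/* Rewrite the existing BP in place, as the PR already does. */
	return (zio_wait(zio_rewrite(NULL, os->os_spa, 0, bp, abd, size,
	    NULL, NULL, ZIO_PRIORITY_SYNC_WRITE, ZIO_FLAG_CANFAIL, NULL)));
}
```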
```sh
{
	log_must dd bs=512 count=1 if=garbage conv=notrunc \
	    oflag=sync of=$DEV_RDSKDIR/$DISK seek=$((($1 / 512) + (0x400000 / 512)))
}
```
Good news: commit b63e2d8 added support to master to inject targeted damage into file blocks. Please use it to ensure this test is reliable.
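For example, something along these lines (flags hedged from zinject(8); the exact invocation the test should use is an assumption, not from this PR):

```sh
# Corrupt the file's level-0 data blocks on both DVAs via zinject,
# instead of dd'ing garbage at a computed raw-device offset.
log_must zinject -a -t data -C 0,1 -e checksum $TESTDIR/$TESTFILE
```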
```sh
corrupt_offset "$mid_offset"

log_must zpool scrub $TESTPOOL
log_must sleep 5 # let scrub finish
```
Also added to master is the new zpool wait subcommand; you can now run log_must zpool wait -t scrub $TESTPOOL and avoid this ugliness.
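Applied to the quoted snippet, that would read:

```sh
log_must zpool scrub $TESTPOOL
log_must zpool wait -t scrub $TESTPOOL	# replaces the fixed sleep
```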
```sh
log_assert "ZFS corrective receive should be able to heal corruption"

# we will use this as the source of corruption
log_must dd if=/dev/urandom of=garbage bs=512 count=1 oflag=sync
```
You should be able to get rid of the garbage file after switching to the targeted injection.
How does this interact with encryption and zfs send --raw? E.g. if the snapshot is encrypted, do they need to do a raw send, or does the receive encrypt it on the fly (using the crypt params in the BP)?
```diff
@@ -4487,7 +4487,7 @@ zfs_do_receive(int argc, char **argv)
 		nomem();
 
 	/* check options */
-	while ((c = getopt(argc, argv, ":o:x:dehnuvFsA")) != -1) {
+	while ((c = getopt(argc, argv, ":o:x:dehnuvFsAc")) != -1) {
```
I think it's time to add --long-opts for zfs receive. It's really too bad we didn't do this from the beginning for all of the subcommands.
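For illustration, a minimal sketch of what long-opt support could look like (the long option names here are hypothetical, not something this PR adds):

```c
#include <getopt.h>

static const struct option recv_long_opts[] = {
	{ "force",	no_argument,	NULL, 'F' },
	{ "resumable",	no_argument,	NULL, 's' },
	{ "corrective",	no_argument,	NULL, 'c' },
	{ NULL,		0,		NULL, 0 }
};

/* Inside zfs_do_receive(): same short opts, plus the long aliases. */
while ((c = getopt_long(argc, argv, ":o:x:dehnuvFsAc",
    recv_long_opts, NULL)) != -1) {
	/* existing switch (c) handling is unchanged */
}
```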
Part two of this work will likely need long-opts, but I'd like to avoid adding them now.
```diff
-    boolean_t force, boolean_t resumable, boolean_t raw, int input_fd,
-    const dmu_replay_record_t *begin_record, int cleanup_fd,
+    boolean_t force, boolean_t heal, boolean_t resumable, boolean_t raw,
+    int input_fd, const dmu_replay_record_t *begin_record, int cleanup_fd,
```
I think that we don't want to change existing libzfs_core function signatures if we can help it. Let's add a new function instead.
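Sketched as a declaration, with a hypothetical name and parameter list modeled on the existing signature quoted above:

```c
/*
 * New entry point carrying the heal flag; the existing lzc_receive*()
 * signatures stay untouched for ABI stability.
 */
int lzc_receive_with_heal(const char *snapname, nvlist_t *props,
    const char *origin, boolean_t force, boolean_t heal,
    boolean_t resumable, boolean_t raw, int input_fd,
    const dmu_replay_record_t *begin_record, int cleanup_fd,
    uint64_t *read_bytes, uint64_t *errflags, nvlist_t **errors);
```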
module/zfs/dmu_recv.c (Outdated)

```c
		 * snapshot as the one we are trying to heal.
		 */
		struct drr_begin *drrb = drba->drba_cookie->drc_drrb;
		error = dsl_dataset_hold_obj(dp, val, FTAG, &snap);
```
If error is nonzero, shouldn't we be returning that?
module/zfs/dmu_recv.c (Outdated)

```diff
@@ -361,12 +365,16 @@ recv_begin_check_existing_impl(dmu_recv_begin_arg_t *drba, dsl_dataset_t *ds,
 	if (dsl_dataset_has_resume_receive_state(ds))
 		return (SET_ERROR(EBUSY));
 
-	/* New snapshot name must not exist. */
+	/* New snapshot name must not exist if we're not healing it */
 	error = zap_lookup(dp->dp_meta_objset,
 	    dsl_dataset_phys(ds)->ds_snapnames_zapobj,
 	    drba->drba_cookie->drc_tosnap, 8, 1, &val);
```
Since we're now using val as the snapshot's object number, how about renaming it to reflect that, e.g. snapobj?
module/zfs/spa_errlog.c (Outdated)

```c
	if (avl_is_empty(&spa->spa_errlist_healed)) {
		mutex_exit(&spa->spa_errlist_lock);
		return;
	}
```
I don't think this avl_is_empty case is needed; the right thing will happen below (i.e. the first call to avl_destroy_nodes will return NULL).
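The point being that the standard AVL teardown loop is already a no-op for an empty tree; a sketch of that idiom:

```c
	void *cookie = NULL;
	spa_error_entry_t *se;

	/* avl_destroy_nodes() returns NULL immediately on an empty tree. */
	while ((se = avl_destroy_nodes(&spa->spa_errlist_healed,
	    &cookie)) != NULL)
		kmem_free(se, sizeof (spa_error_entry_t));
```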
module/zfs/dmu_recv.c (Outdated)

```c
		goto cleanup;
	}
	blkid =
	    dbuf_whichblock(DB_DNODE((dmu_buf_impl_t *)dbp), 0, offset);
```
You need to use DB_DNODE_ENTER/EXIT around the DB_DNODE(), or better yet dmu_buf_dnode_enter() as you do below. Or, if you're going to cast the dbuf, you might as well just dereference db_blkid.
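The suggested pattern, sketched against the quoted lines:

```c
	dmu_buf_impl_t *db = (dmu_buf_impl_t *)dbp;

	DB_DNODE_ENTER(db);	/* pin the dnode while we dereference it */
	blkid = dbuf_whichblock(DB_DNODE(db), 0, offset);
	DB_DNODE_EXIT(db);
	/* ...or, since the dbuf was held at this offset, use db->db_blkid. */
```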
module/zfs/dmu_recv.c (Outdated)

```c
	buf = kmem_alloc(lsize, KM_SLEEP);
	/* Try to read the object to see if it needs healing */
	err = dmu_read(os, obj, offset, lsize, buf,
	    DMU_READ_NO_PREFETCH);
```
Why no prefetching?
module/zfs/dmu_recv.c (Outdated)

```c
	/* Correct the corruption in place */
	err = zio_wait(zio_rewrite(NULL, os->os_spa, 0, bp, abd, size, NULL,
	    NULL, ZIO_PRIORITY_SYNC_WRITE, flags, NULL));
```
Is there any check that the size is the same as the BP's psize (possibly rounded up to ashift)? Or at least that this can't write past the end of what's allocated for the BP?
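A possible guard, as a sketch (not something the PR currently has):

```c
	/* Refuse to rewrite unless the payload matches the BP's psize. */
	if (size != BP_GET_PSIZE(bp)) {
		err = SET_ERROR(EINVAL);
		goto cleanup;
	}
```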
module/zfs/dmu_recv.c (Outdated)

```c
	if (arc_get_compression(rrd->arc_buf) != BP_GET_COMPRESS(bp)) {
		/*
		 * The compression in the stream doesn't match what we had
		 * on disk; we need to re-compress the buf into the
```
Do we really need to handle this case? Seems like we could say that it needs to be a zfs send --compressed stream to use it with zfs recv -c. Plus, trying to recompress it and get the exact same byte stream adds more restrictions on changing the compression algorithm implementations, which we'd like to avoid. (cc @allanjude)
module/zfs/dmu_recv.c (Outdated)

```c
	/* Correct the corruption in place */
	err = zio_wait(zio_rewrite(NULL, os->os_spa, 0, bp, abd, size, NULL,
	    NULL, ZIO_PRIORITY_SYNC_WRITE, flags, NULL));
```
Given the current functionality, we would expect there to be a LOT more records for non-corrupt blocks than for corrupt blocks. So I'm even more concerned about the synchronous (and not eligible for predictive prefetch) dmu_read()'s used to determine whether we want to correct each block.
We'd typically (e.g. with >1 vdev) get better performance by simply always asynchronously zio_rewrite()'ing every record (i.e. without checking whether it's actually corrupt), since you'd have Nrecords async writes vs the current PR's Nrecords sync reads. Plus the code would be a lot simpler.
I wonder if we should wait until the extensions (to send only the corrupt blocks) are ready, or at least design recv -c to work best in that mode, e.g. by assuming that nearly all records will be for corrupt blocks.
Force-pushed from 537c5bb to 90a196a.
Thanks for the reviews, guys. I've addressed most of the comments in the version I just pushed. I'm still thinking about the right way to do the rewrite I/O. Perhaps always rewriting (instead of reading first) is the right way to go, hmm... It seems to be too big a limitation to impose saying we have to use zfs send --compressed streams.
Force-pushed from 90a196a to 1857323.
Codecov Report

```
@@            Coverage Diff             @@
##           master    #9323      +/-   ##
==========================================
- Coverage   79.79%   79.12%   -0.68%
==========================================
  Files         279      401     +122
  Lines       81396   122676   +41280
==========================================
+ Hits        64951    97065   +32114
- Misses      16445    25611    +9166
```

Continue to review the full report at Codecov.
After talking with my colleagues at @datto, we think it may be too dangerous to rewrite everything we encounter in the send stream, since writing has the potential to do more damage in cases where the corruption is coming from failing HW, for example. The other open question was how to handle compression algorithms that may not recompress the same block the same way. The suggestion here: to avoid this problem, we will only heal when the checksum of the data to be used for healing matches the checksum already on disk. The last thing I'm still working on is making sure raw and large-block send streams are compatible with healing recv.
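A sketch of that checksum-match rule (the checksum plumbing here is illustrative, assuming an abd holding the stream payload; it is not the PR's code):

```c
	zio_cksum_t cksum;

	/* Compute the stream payload's checksum with the BP's algorithm. */
	zio_checksum_table[BP_GET_CHECKSUM(bp)].ci_func[0](abd, size,
	    NULL, &cksum);
	/* Heal only if it matches what the on-disk BP expects. */
	if (!ZIO_CHECKSUM_EQUAL(cksum, bp->blk_cksum))
		return (SET_ERROR(ECKSUM));
```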
This patch implements a new type of zfs receive: corrective receive (-c). This type of recv is used to heal corrupted data when a replica of the data already exists (in the form of a send file, for example). Metadata cannot be healed using a corrective receive. Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Force-pushed from 1857323 to adb30d5.
Superseded by #9372
This patch implements a new type of zfs receive: corrective receive (-c). This type of recv is used to heal corrupted data when a replica of the data already exists (in the form of a send file, for example).
Metadata cannot be healed using a corrective receive.
This patch enables us to receive a send stream into an existing snapshot for the purpose of correcting data corruption.
Motivation and Context
In the past, in the rare cases where ZFS has experienced permanent data corruption, full recovery of the dataset(s) has not always been possible even if replicas existed.
This patch makes recovery from permanent data corruption possible.
Description
For every write and spill record in the send stream, we read the corresponding block from disk, and if that read fails with a checksum error we overwrite the block with data from the send stream.
After the data is healed we reread the block to make sure it's healed, and remove the healed blocks from the corruption lists seen in zpool status.
To make sure we have correctly matched the data in the send stream to the right dataset to heal, there is a restriction that the GUID of the snapshot being received into must match the GUID in the send stream. Since there are likely several snapshots referring to the same potentially corrupted data, there may be many snapshots satisfying this condition that are able to heal a single block.
The other thing to point out is that we can only correct data. Specifically, we can only heal records of type DRR_WRITE and DRR_SPILL, since those are the only ones that contain all of the data needed to recreate the damaged block.
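As a hypothetical end-to-end example of the workflow this enables (pool, dataset, and file names are illustrative):

```sh
# A replica taken before the corruption occurred:
zfs send tank/data@snap1 > /backup/data-snap1.zstream

# Later, a scrub reports permanent errors in tank/data@snap1...
zpool scrub tank
zpool status -v tank

# ...so feed the replica back as a corrective receive:
zfs recv -c tank/data@snap1 < /backup/data-snap1.zstream

# Re-scrub; healed blocks drop off the error list in zpool status.
zpool scrub tank
```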
How Has This Been Tested?
I've been running unit testing very similar to the test I've added to the zfs-tests suite.