Fixed data integrity issue when underlying disk returns error to zfs #12443
Conversation
Force-pushed from d781ef5 to 4e24f58
The zil_lwb_write_done error was not propagated to zil_lwb_flush_vdevs_done, due to which zil_commit_impl returned and the application got a write success even though ZFS was not able to write the data to disk. Signed-off-by: Arun KV <arun.kv@datacore.com>
@behlendorf the zloop test fails even without my changes; what should I do next?
module/zfs/zil.c
Outdated
@@ -1179,7 +1179,8 @@ zil_lwb_flush_vdevs_done(zio_t *zio)
 	ASSERT3P(zcw->zcw_lwb, ==, lwb);
 	zcw->zcw_lwb = NULL;

-	zcw->zcw_zio_error = zio->io_error;
+	if (zio->io_error != 0)
+		zcw->zcw_zio_error = zio->io_error;
There should be a comment here explaining why we need to do this check. Also, we should VERIFY that zcw->zcw_zio_error == 0 before overwriting it with zio->io_error. IIUC, we can assert that because, IIUC, we don't issue the flush if the write fails. We should actually VERIFY because a) it's not on a hot code path and b) it's critical for correctness. The comment should address the assertion as well.
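For illustration, a minimal sketch of how that suggestion might read once folded into the hunk above, assuming (per the reasoning here) that the flush is only issued after a successful write; the comment wording and exact placement are illustrative, not the final patch:

	if (zio->io_error != 0) {
		/*
		 * If the flush failed, the write must have succeeded
		 * (we only issue the flush after a successful write),
		 * so no error should have been recorded for this
		 * waiter yet.
		 */
		VERIFY3S(zcw->zcw_zio_error, ==, 0);
		zcw->zcw_zio_error = zio->io_error;
	}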
module/zfs/zil.c
Outdated
* error is not propagated to zil_lwb_flush_vdevs_done, which
* will cause zil_commit_impl to return without committing
* the data.
* Refer https://github.com/openzfs/zfs/issues/12391
- typo in "becaues"
- I find the comment here more confusing than useful. I'd prefer a comment in zil_commit_impl, in the place where we check zcw_zio_error, that explains the entire error propagation path (for both the write and flush done callbacks).
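A sketch of the kind of comment being asked for, placed at the zcw_zio_error check; the surrounding code shown here (including the txg_wait_synced() fallback) is assumed from the discussion rather than quoted from the tree:

	/*
	 * zcw_zio_error is set by the lwb "write done" and "flush done"
	 * callbacks (zil_lwb_write_done() and zil_lwb_flush_vdevs_done()).
	 * A non-zero value means either the lwb write or the subsequent
	 * flush failed, so we cannot claim the log record is on stable
	 * storage; fall back to waiting for the txg to sync before
	 * returning to the caller (e.g. fsync()).
	 */
	if (zcw->zcw_zio_error != 0)
		txg_wait_synced(zilog->zl_dmu_pool, 0);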
@problame thank you for the comments. I have updated the PR.
module/zfs/zil.c
Outdated
mutex_enter(&zcw->zcw_lock);
ASSERT(list_link_active(&zcw->zcw_node));
ASSERT3P(zcw->zcw_lwb, ==, lwb);
zcw->zcw_zio_error = zio->io_error;
AFAIK this is the first place where we might set zcw_zio_error => if the zcw is zero-initialized, we could ASSERT3S(zcw->zcw_zio_error, ==, 0);
"if the zcw is zero-initialized": it is, in zil_alloc_commit_waiter. Further, we already rely on this in the flush function (with these changes): VERIFY(zcw->zcw_zio_error == 0); so I think we could do this verification here too.
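In code form, a sketch of that verification at the point in zil_lwb_write_done() where the error is first recorded; the exact placement is an assumption:

	/*
	 * zcw_zio_error was zeroed in zil_alloc_commit_waiter() and this
	 * is the first place an error can be recorded for this waiter.
	 */
	ASSERT3S(zcw->zcw_zio_error, ==, 0);
	zcw->zcw_zio_error = zio->io_error;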
Signed-off-by: Arun KV <arun.kv@datacore.com>
Signed-off-by: Arun KV <arun.kv@datacore.com>
The changes in this PR seem correct to me and like a net improvement. However, I just quickly reviewed lines 1756 to 1764 in 41bee40:
module/zfs/zil.c
Outdated
* If the flush has failed, then the write must have
* been successful. VERIFY the same.
*/
VERIFY(zcw->zcw_zio_error == 0);
let's use VERIFY3S
module/zfs/zil.c
Outdated
@@ -1253,6 +1267,17 @@ zil_lwb_write_done(zio_t *zio)
 	 * written out.
 	 */
 	if (zio->io_error != 0) {
+		zil_commit_waiter_t *zcw;
I'm curious.. this block of code you're adding looks similar to the block of code in the "flush done" function, i.e. starting at line 1173, but I see some differences, such as the fact that this block you're adding doesn't call cv_broadcast(&zcw->zcw_cv); nor does it set zcw_done.
Are these differences intentional? It's been awhile since I've been in this code, so I'm just curious if we should be using the same exact logic in both cases, here and in the flush function?
In this error case, do we still call the "flush done" function? I presume not, which is why this change is needed.. but please correct me if I'm wrong.
Sorry if this is a silly question, it's been awhile since I've been in this code, but if we create the ZIO such that "flush done" is the completion handler, why isn't it getting called and setting zcw_zio_error correctly?
lwb->lwb_root_zio = zio_root(zilog->zl_spa,
    zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);
Or is it that zil_lwb_flush_vdevs_done is getting called, but it's called with a zio->io_error of zero.. and called after zil_lwb_write_done is called, with a zio->io_error of non-zero?
Again, I apologize if these answers are obvious, but I'm trying to refresh my memory on how I intended this to work, originally.. and understand what exactly is going awry, now..
Now I see my question has already been discussed here: #12391 (comment). So, if I'm understanding correctly, if we don't set zcw_zio_error in the "write done" handler, the write error never reaches zil_commit_impl.
The prior code (i.e. prior to 1ce23dc) did this:
So, my best guess is we're currently using ZIO_FLAG_DONT_PROPAGATE on the lwb write zio. I think my changes in 1ce23dc were expecting any errors seen in zil_lwb_write_done to also reach zil_lwb_flush_vdevs_done. So, I think it'd be reasonable to more simply not use ZIO_FLAG_DONT_PROPAGATE there. With that said, I do worry about how the errors would then propagate: if "lwb 1 write" fails, we'd want that error to propagate to "lwb 1" (the root zio), but not to "lwb 2" or "lwb 3" (assuming "lwb 2 write" and "lwb 3 write" did not fail).. right now, if we removed the ZIO_FLAG_DONT_PROPAGATE flag..
Perhaps we should remove ZIO_FLAG_DONT_PROPAGATE from the lwb_write_zio and instead add it to the lwb_root_zio?
Would that solve the problem, while also ensuring errors don't get propagated up to unrelated LWBs?
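As a rough illustration of what ZIO_FLAG_DONT_PROPAGATE means in this discussion, here is a toy model (standalone C, not ZFS code; the struct, flag and error value are simplified stand-ins): a child I/O's error is normally copied into its parent's io_error when the child completes, unless the child was created with a "don't propagate" flag.

#include <stdio.h>

/* Toy stand-ins for zio parent/child error propagation; not real ZFS code. */
struct toy_zio {
	int io_error;
	int dont_propagate;
	struct toy_zio *parent;
};

/* On completion, a child's error bubbles up to its parent unless suppressed. */
static void
toy_zio_done(struct toy_zio *child, int error)
{
	child->io_error = error;
	if (child->parent != NULL && !child->dont_propagate &&
	    child->parent->io_error == 0)
		child->parent->io_error = error;
}

int
main(void)
{
	struct toy_zio root = { 0, 0, NULL };	/* stands in for lwb_root_zio */
	struct toy_zio wr = { 0, 1, &root };	/* lwb_write_zio with DONT_PROPAGATE */

	/* With the flag: the write fails but the root zio never sees it. */
	toy_zio_done(&wr, 5 /* EIO */);
	printf("with flag:    write error=%d root error=%d\n",
	    wr.io_error, root.io_error);

	/* Without the flag: the failure reaches the root zio's done callback. */
	wr.io_error = 0;
	wr.dont_propagate = 0;
	root.io_error = 0;
	toy_zio_done(&wr, 5 /* EIO */);
	printf("without flag: write error=%d root error=%d\n",
	    wr.io_error, root.io_error);

	return (0);
}

In ZFS itself this bookkeeping happens inside the zio pipeline rather than in a done callback, but the visible effect on the io_error seen by zil_lwb_flush_vdevs_done() is the same.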
@prakashsurya Thank you for the details. @jxdking suggested ZIO_FLAG_DONT_PROPAGATE in #12391, but I proceeded with the current fix because I was not sure about the implications of removing ZIO_FLAG_DONT_PROPAGATE.
The below change solves my issue, but I'm not sure how to test "ensuring errors don't get propagated up to unrelated LWBs":
diff --git a/module/zfs/zil.c b/module/zfs/zil.c
index 78d0711cc..8ed5793d3 100644
--- a/module/zfs/zil.c
+++ b/module/zfs/zil.c
@@ -1393,13 +1393,14 @@ zil_lwb_write_open(zilog_t *zilog, lwb_t *lwb)
prio = ZIO_PRIORITY_ASYNC_WRITE;
lwb->lwb_root_zio = zio_root(zilog->zl_spa,
- zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);
+ zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL |
+ ZIO_FLAG_DONT_PROPAGATE);
ASSERT3P(lwb->lwb_root_zio, !=, NULL);
lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
- prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
+ prio, ZIO_FLAG_CANFAIL |
ZIO_FLAG_FASTWRITE, &zb);
ASSERT3P(lwb->lwb_write_zio, !=, NULL);
I am looking at the code. There is another issue, which has not been discussed yet.
I'm not sure I'm following this. Why don't we care if the flush failed? I ask, because I believe we do care.. e.g. if the flush fails, the data may not be on stable storage (this is the point of issuing the flush to begin with, to ensure it's safe on disk), so we wouldn't want to tell the consumer (i.e. the caller of fsync()) that it is.
Thinking a bit more, perhaps we do want to propagate errors "up the chain" of LWBs. Since if an earlier LWB fails (either the write or the flush), I think it means the later LWBs can't be trusted, due to the on-disk linked-list structure of the ZIL. If one of the "middle" LWBs on disk did not get written out properly, then it necessarily means the LWBs later in the on-disk linked list will be "leaked", since the chain will have been broken. So, in that case, I think we want all LWBs after that failed one to result in an error as well. I hope that makes sense...?
In current ZFS head, ZIO_FLAG_DONT_PROPAGATE is hard-coded inside zio_flush(). Here is my speculation (maybe I am wrong): since we never really handled errors from zio_flush() in zil_lwb_flush_vdevs_done(), and we also deliberately skip the flush operation for some lwbs for better performance (see zil_lwb_flush_defer()), yet no one complained about losing unflushed data because of zil_lwb_flush_defer(), that gives me the impression that the chance of losing data solely because of zio_flush() is so low that it is never discussed.
OK, thanks for your patience, I think we're on the same page now. So, based on the above discussion, I recommend we only remove the ZIO_FLAG_DONT_PROPAGATE flag from the lwb write zio (rather than also adding it to the lwb root zio). Further, if we do (perhaps later, in another bug/PR; or maybe not at all) address the flush error propagation, the same error handling in zil_lwb_flush_vdevs_done would apply. IMO, removing the ZIO_FLAG_DONT_PROPAGATE flag is preferable to the currently proposed changes, since it consolidates the error handling for flush and write errors to a single location (rather than having error handling split between the "write done" and "flush done" handlers). Let me know how you all feel about my recommendation.
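To make the "single location" point concrete, a sketch of the consolidated path, assuming the lwb write zio is a child of lwb_root_zio so that, with ZIO_FLAG_DONT_PROPAGATE removed, a failed write surfaces as a non-zero io_error on the root zio handed to the flush-done callback:

	/*
	 * In zil_lwb_flush_vdevs_done(): zio is the lwb's root zio, so a
	 * failed write (and, if ever propagated, a failed flush) both
	 * land here as zio->io_error; a single assignment covers both.
	 */
	zcw->zcw_zio_error = zio->io_error;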
Agree.
Great. @arun-kv assuming you're also in agreement with the recommendation, can you update the proposed changes in the PR; then I can ping some folks and try and get another review on this.
since it consolidates the error handling for flush and write errors to a single location (rather than having error handling split between the "write done" and "flush done" handlers). Signed-off-by: Arun KV <arun.kv@datacore.com>
@prakashsurya @jxdking Thank you for the details. I have updated the PR.
@behlendorf @ahrens @grwilson @mmaybee Can we get another pair of eyes on this? I think it looks good to me.
@problame @jxdking @prakashsurya Thank you all for the detailed explanation, it was really helpful.
IMHO there should be a comment in zil_lwb_flush_vdevs_done, where we assign zcw_zio_error, that summarizes #12443 (comment) and subsequent comments.
Yea, I agree that'd be good. Here's my take on coming up with some helpful comments:
Does this help? If folks think this is helpful and sufficient, feel free to incorporate this into the PR. Feel free to modify it as necessary too. I also think it'd be nice to incorporate the reproducer into the test suite, too, if possible (as I believe @problame also mentioned). If we could modify step 7 to avoid the reboot (e.g. perhaps use export/import instead? would that work with the disk offlined?), it seems like it might not be too difficult. With that said, I'm not opposed to landing this without the test; it just means it'll be more difficult to verify this error handling as future changes are made.
+1 on adding the test - that's a fun error condition and absolutely non-trivial to discern for users.
@arun-kv when you get a chance it'd be great if you could incorporate the suggested comments and rebase this PR. I'd love to see the test case added for this, but I agree with @prakashsurya we could land this without it.
Signed-off-by: Arun KV <arun.kv@datacore.com>
@behlendorf sorry for the late reply. I have made the relevant changes.
Errors in zil_lwb_write_done() are not propagated to zil_lwb_flush_vdevs_done(), which can result in zil_commit_impl() not returning an error to applications even when zfs was not able to write data to the disk. Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to allow errors to propagate and consolidate the error handling for flush and write errors to a single location (rather than having error handling split between the "write done" and "flush done" handlers).

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Arun KV <arun.kv@datacore.com>
Closes openzfs#12391
Closes openzfs#12443
Since the beginning, ZFS' "flush" operation has always ignored errors[1]. Write errors are captured and dealt with, but if a write succeeds but the subsequent flush fails, the operation as a whole will appear to succeed[2].

In the end-of-transaction uberblock+label write+flush ceremony, it's very difficult for this situation to occur. Since all devices are written to, typically the first write will succeed, the first flush will fail unobserved, but then the second write will fail, and the entire transaction is aborted. It's difficult to imagine a real-world scenario where all the writes in that sequence could succeed even as the flushes are failing (understanding that the OS is still seeing hardware problems and taking devices offline).

In the ZIL however, it's another story. Since only the write response is checked, if that write succeeds but the flush then fails, the ZIL will believe that it succeeds, and zil_commit() (and thus fsync()) will return success rather than the "correct" behaviour of falling back into txg_wait_synced()[3].

This commit fixes this by adding a simple flag to zio_flush() to indicate whether or not the caller wants to receive flush errors. This flag is enabled for ZIL calls. The existing zio chaining inside the ZIL and the flush handler zil_lwb_flush_vdevs_done() already has all the necessary support to properly handle a flush failure and fail the entire zio chain. This causes zil_commit() to correctly fall back to txg_wait_synced() rather than returning success prematurely.

1. The ZFS birth commit (illumos/illumos-gate@fa9e4066f0) had support for flushing devices with write caches with the DKIOCFLUSHWRITECACHE ioctl. No errors are checked. The comment in `zil_flush_vdevs()` from the time shows the thinking:

   /*
    * Wait for all the flushes to complete. Not all devices actually
    * support the DKIOCFLUSHWRITECACHE ioctl, so it's OK if it fails.
    */

2. It's not entirely clear from the code history why this was acceptable for devices that _do_ have write caches. Our best guess is that this was an oversight: between the combination of hardware, pool topology and application behaviour required to hit this, it basically didn't come up.

3. Somewhat frustratingly, zil.c contains comments describing this exact behaviour, and further discussion in openzfs#12443 (September 2021). It appears that those involved saw the potential, but were looking at a different problem and so didn't have the context to recognise it for what it was.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
The zil_lwb_write_done error was not propagated to zil_lwb_flush_vdevs_done, due to which zil_commit_impl returned and the application got a write success even though ZFS was not able to write the data to disk.
Motivation and Context
zvol (sync=always): Write(O_SYNC) returns success when the disk is not accessible.
The application gets a write success from a zvol created with the option sync=always even when the disk is not accessible.
Note: only the first write after the disk becomes inaccessible is wrongly acknowledged as successfully written; after that, all other writes from the application wait.
#12391
Description
Capture the zil_lwb_write_done error in the zil_commit_waiter, and in zil_lwb_flush_vdevs_done, if zio->io_error is 0, use the zil_lwb_write_done error instead. This makes sure zil_commit_impl won't return to the user until we fix the disk/network error and clear the zpool suspended state.
How Has This Been Tested?
Below is the program I used to write data to the zvol:
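A minimal stand-in for such a program (illustrative only, not the original; the device path /dev/zd0, block size, and loop structure are assumptions):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *dev = (argc > 1) ? argv[1] : "/dev/zd0";
	char buf[4096];

	memset(buf, 0xab, sizeof (buf));

	/* O_SYNC: write(2) should only return success once the data is stable. */
	int fd = open(dev, O_WRONLY | O_SYNC);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	for (int i = 0; ; i++) {
		ssize_t n = pwrite(fd, buf, sizeof (buf),
		    (off_t)i * sizeof (buf));
		if (n != (ssize_t)sizeof (buf)) {
			fprintf(stderr, "write %d failed: %s\n",
			    i, strerror(errno));
			break;
		}
		printf("write %d ok\n", i);
		sleep(1);
	}

	(void) close(fd);
	return (0);
}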