
Fixed data integrity issue when underlying disk returns error to zfs #12443

Merged
merged 12 commits into openzfs:master on Sep 13, 2021

Conversation

arun-kv
Contributor

@arun-kv commented Jul 28, 2021

The zil_lwb_write_done error was not propagated to zil_lwb_flush_vdevs_done, so zil_commit_impl
returned and the application got a write success even though zfs was not able to write the data to the disk.

Motivation and Context

zvol (sync=always): write(O_SYNC) returns success when the disk is not accessible.
The application gets a write success from a zvol created with sync=always even when the disk is not accessible.
Note: only the first write after the disk becomes inaccessible is wrongly acknowledged as successful; after that, all further writes from the application wait.

#12391

Description

Capture the zil_lwb_write_done error in the zil_commit_waiter, and in zil_lwb_flush_vdevs_done fall back to that captured error when the flush zio's io_error is 0.
This ensures zil_commit_impl does not return to the caller until the disk/network error is fixed and the pool's suspended state is cleared.
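
In outline, zil_lwb_write_done records its error in each attached zil_commit_waiter, and zil_lwb_flush_vdevs_done is changed so it does not clobber that value (a sketch of the intent only; the actual hunks are in the review threads below):

	/*
	 * In zil_lwb_flush_vdevs_done(): only overwrite the waiter's error
	 * when the flush zio itself failed, so a write error captured by
	 * zil_lwb_write_done() is not clobbered by a "successful" flush.
	 */
	if (zio->io_error != 0)
		zcw->zcw_zio_error = zio->io_error;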

How Has This Been Tested?

  1. zpool create -o ashift=12 zpool /dev/sdc -f
  2. zfs create -o volblocksize=4k -o dedup=off -o compression=off -o sync=always -o primarycache=metadata -V 6G zpool/zvol
  3. Write some data (keep the application waiting for user input).
  4. echo offline > /sys/block/sdc/device/state
  5. Enter new data; the write returns success (ideally this should wait until I bring the disk online).
  6. Do another write; this time the application waits until I fix the disk issue and run zpool clear.
  7. Reboot the system and read back the offset written in step 5; the data is not there.

Below is the program I used for write data to zvol,

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <fcntl.h>

#define SIZE    (4 * 1024)
#define DEVICE  "/dev/zd0"

int main(void)
{
    char *rbuf, *wbuf;
    int ret = 0;
    int fd;
    int len = 0;
    int pagesize = getpagesize();

    /* Allocate a page-aligned write buffer (required if O_DIRECT is used). */
    rbuf = malloc(SIZE + pagesize);
    if (rbuf == NULL) {
        printf("Failed to allocate buffer\n");
        return (1);
    }
    wbuf = (char *)((((unsigned long)rbuf + pagesize - 1) / pagesize) * pagesize);

//  fd = open(DEVICE, O_RDWR|O_DIRECT);
    fd = open(DEVICE, O_RDWR|O_SYNC);
    if (fd < 0) {
        printf("Failed to open device %s: %d\n", DEVICE, -errno);
        free(rbuf);
        return (1);
    }
    while (1) {
        printf("Enter char to write: ");
        int c = getchar();
        if (c == EOF)
            break;
        getchar();    /* consume the trailing newline */
        if (c != '\n') {
            printf("%c\n", c);
            memset(wbuf, c, SIZE);
            ret = write(fd, wbuf, SIZE);
            if (ret < 0) {
                printf("Write failed error: %d\n", -errno);
                break;
            } else {
                printf("write success ret: %d offset: %d\n", ret, len);
            }
        }
        len += SIZE;
    }
    close(fd);
    free(rbuf);
    return (0);
}
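
(The program builds with a plain C compiler, e.g. "gcc -o zvol-write write.c"; the file name is arbitrary, and DEVICE assumes the zvol appears as /dev/zd0.)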

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

@behlendorf added the Status: Code Review Needed (Ready for review and testing) label on Jul 28, 2021
@arun-kv force-pushed the master branch 2 times, most recently from d781ef5 to 4e24f58 on July 29, 2021 04:08
zil_lwb_write_done error was not propagated to
zil_lwb_flush_vdevs_done, due to which zil_commit_impl was
returning and application gets write success even though zfs was
not able to write data to the disk.

Signed-off-by: Arun KV <arun.kv@datacore.com>
@arun-kv
Contributor Author

arun-kv commented Jul 30, 2021

@behlendorf the zloop test fails even without my changes; what should I do next?

module/zfs/zil.c Outdated
@@ -1179,7 +1179,8 @@ zil_lwb_flush_vdevs_done(zio_t *zio)
ASSERT3P(zcw->zcw_lwb, ==, lwb);
zcw->zcw_lwb = NULL;

zcw->zcw_zio_error = zio->io_error;
if (zio->io_error != 0)
zcw->zcw_zio_error = zio->io_error;
Contributor

There should be a comment here explaining why we need to do this check.
Also, we should VERIFY that zcw->zcw_zio_error == 0 before overwriting it with zio->io_error.
IIUC, we can assert that because we don't issue the flush if the write fails.
We should actually VERIFY because
a) it's not on a hot code path and
b) it's critical for correctness.

The comment should address the assertion as well.
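
Concretely (a sketch of the suggestion, not the merged patch), the assignment would become something like:

	/*
	 * The write must have succeeded if we are handling a flush error,
	 * since the flush is not issued when the write fails; VERIFY that
	 * before recording the flush error.
	 */
	if (zio->io_error != 0) {
		VERIFY3S(zcw->zcw_zio_error, ==, 0);
		zcw->zcw_zio_error = zio->io_error;
	}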

module/zfs/zil.c Outdated
* error is not propagated to zil_lwb_flush_vdevs_done, which
* will cause zil_commit_impl to return without committing
* the data.
* Refer https://github.com/openzfs/zfs/issues/12391
Contributor

  1. typo in becaues
  2. I find the comment here more confusing than useful. I'd prefer a comment in zil_commit_impl in the place where we check zcw_zio_error that explains the entire error propagation path (for both the write and flush done callbacks)

Contributor Author

@problame thank you for the comments. I have updated the PR.

module/zfs/zil.c Outdated
mutex_enter(&zcw->zcw_lock);
ASSERT(list_link_active(&zcw->zcw_node));
ASSERT3P(zcw->zcw_lwb, ==, lwb);
zcw->zcw_zio_error = zio->io_error;
Contributor

AFAIK this is the first place where we might set zcw_zio_error => if the zcw is zero-initialized, we could ASSERT3S(zcw->zcw_zio_error, ==, 0);

Member

if the zcw is zero-initialized

it is in zil_alloc_commit_waiter.. further, we already rely on this in the flush function (with these changes):

VERIFY(zcw->zcw_zio_error == 0);

so I think we could do this verification here too.

@problame
Contributor

The changes in this PR seem correct to me and like a net improvement.
The cherry on top would be a regression test that uses zinject or similar to exercise the error handling paths.

However, I just quickly reviewed the WR_INDIRECT code path, and it seems like there is no proper error propagation either:

zfs/module/zfs/zil.c

Lines 1756 to 1764 in 41bee40

	if (error == EIO) {
		txg_wait_synced(zilog->zl_dmu_pool, txg);
		return (lwb);
	}
	if (error != 0) {
		ASSERT(error == ENOENT || error == EEXIST ||
		    error == EALREADY);
		return (lwb);
	}

module/zfs/zil.c Outdated
* If the flush has failed, then the write must have
* been successful. VERIFY the same.
*/
VERIFY(zcw->zcw_zio_error == 0);
Member

Let's use VERIFY3S here.
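
That is, presumably something along the lines of:

	/* The write must have succeeded if we are handling a flush error. */
	VERIFY3S(zcw->zcw_zio_error, ==, 0);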

module/zfs/zil.c Outdated
@@ -1253,6 +1267,17 @@ zil_lwb_write_done(zio_t *zio)
* written out.
*/
if (zio->io_error != 0) {
zil_commit_waiter_t *zcw;
Member

@prakashsurya commented Aug 16, 2021

I'm curious.. this block of code you're adding, looks similar to the block of code in the "flush done" function.. i.e. starting at line 1173.. but I see some differences, such as the fact that this block you're adding doesn't call:

cv_broadcast(&zcw->zcw_cv);

nor does it set zcw_done..

are these differences intentional? it's been awhile since I've been in this code, so I'm just curious if we should be using the same exact logic in both cases, here and in the flush function?

in this error case, do we still call the "flush done" function? I presume not, which is why this change is needed.. but please correct me if I'm wrong.

Member

@prakashsurya left a comment

Sorry if this is a silly question, it's been awhile since I've been in this code, but if we create the ZIO such that "flush done" is the completion handler, why isn't it getting called, and setting zcw_zio_error correctly?

lwb->lwb_root_zio = zio_root(zilog->zl_spa,
    zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);

or is it that zil_lwb_flush_vdevs_done is getting called, but it's called with a zio->io_error of zero.. and called after zil_lwb_write_done is called, with a zio->io_error of non-zero?

Again, I apologize if these answers are obvious, but I'm trying to refresh my memory on how I intended this to work, originally.. and understand what exactly is going awry, now..

@prakashsurya
Member

why isn't it getting called, and setting zcw_zio_error correctly?

Now I see my question has already been discussed here: #12391 (comment)

So, if I'm understanding correctly, if we don't set ZIO_FLAG_DONT_PROPAGATE, then I think the error from "write done" would get propagated to the "flush done" call.. let me think about that, and see if that'd be a cleaner solution than what's being proposed here.. right now, I can't recall why I used ZIO_FLAG_DONT_PROPAGATE to begin with.

@prakashsurya
Member

prakashsurya commented Aug 16, 2021

The prior code (i.e. prior to 1ce23dc) did this:

lwb->lwb_zio = zio_rewrite(zilog->zl_root_zio, zilog->zl_spa,
    0, &lwb->lwb_blk, lwb_abd, BP_GET_LSIZE(&lwb->lwb_blk),
    zil_lwb_write_done, lwb, prio,
    ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
    ZIO_FLAG_FASTWRITE, &zb);

so, my best guess is that we're currently using ZIO_FLAG_DONT_PROPAGATE because we used it prior to my changes, and I didn't really consider it when making the changes in 1ce23dc.

I think my changes in 1ce23dc were expecting that any errors seen in zil_lwb_write_done would also be seen in zil_lwb_flush_vdevs_done, which is why zil_lwb_write_done doesn't do much on an error; I figured zil_lwb_flush_vdevs_done would handle it. But as has been pointed out, it appears that's not occurring, due to the use of ZIO_FLAG_DONT_PROPAGATE.

So, I think it'd be reasonable to simply not use ZIO_FLAG_DONT_PROPAGATE when calling zio_rewrite to create the lwb_write_zio.

With that said, I do worry that if we stop using ZIO_FLAG_DONT_PROPAGATE, we may propagate the error to more LWBs than strictly necessary. For example, given LWB linkages like the following:

[diagram "changes-after-01": LWB root/write ZIO linkages for lwb 1, lwb 2, and lwb 3]

If "lwb 1 write" fails, we'd want that error to propagate to "lwb 1" (the root zio), but not to "lwb 2" or "lwb 3" (assuming "lwb 2 write" and "lwb 3 write" did not fail).. right now, if we removed the ZIO_FLAG_DONT_PROPAGATE flag, I'm not sure if the error would propagate to "lwb 2" and "lwb 3", or not.

@prakashsurya
Member

prakashsurya commented Aug 16, 2021

Perhaps we should remove ZIO_FLAG_DONT_PROPAGATE when creating the write ZIO (i.e. the call to zio_rewrite), and add it when creating the root ZIO (i.e. the call to zio_root)?

lwb->lwb_root_zio = zio_root(zilog->zl_spa,
    zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);
ASSERT3P(lwb->lwb_root_zio, !=, NULL);

lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
    zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
    BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
    prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
    ZIO_FLAG_FASTWRITE, &zb);
ASSERT3P(lwb->lwb_write_zio, !=, NULL);

Would that solve the problem, while also ensuring errors don't get propagated up to unrelated LWBs?

@arun-kv
Contributor Author

arun-kv commented Aug 17, 2021

@prakashsurya Thank you for the details. @jxdking suggested ZIO_FLAG_DONT_PROPAGATE in #12391, but I proceeded with the current fix because I was not sure about the implications of removing it.
I will try moving ZIO_FLAG_DONT_PROPAGATE from zio_rewrite to zio_root and let you know the result. Thank you.

@arun-kv
Contributor Author

arun-kv commented Aug 17, 2021

The change below solves my issue, but I'm not sure how to test that errors don't get propagated up to unrelated LWBs.

diff --git a/module/zfs/zil.c b/module/zfs/zil.c
index 78d0711cc..8ed5793d3 100644
--- a/module/zfs/zil.c
+++ b/module/zfs/zil.c
@@ -1393,13 +1393,14 @@ zil_lwb_write_open(zilog_t *zilog, lwb_t *lwb)
                        prio = ZIO_PRIORITY_ASYNC_WRITE;
 
                lwb->lwb_root_zio = zio_root(zilog->zl_spa,
-                   zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);
+                   zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL |
+                   ZIO_FLAG_DONT_PROPAGATE);
                ASSERT3P(lwb->lwb_root_zio, !=, NULL);
 
                lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
                    zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
                    BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
-                   prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
+                   prio, ZIO_FLAG_CANFAIL |
                    ZIO_FLAG_FASTWRITE, &zb);
                ASSERT3P(lwb->lwb_write_zio, !=, NULL);

@jxdking
Contributor

jxdking commented Aug 17, 2021

Perhaps we should remove ZIO_FLAG_DONT_PROPAGATE when creating the write ZIO (i.e. the call to zio_rewrite), and add it when creating the root ZIO (i.e. the call to zio_root)?
Would that solve the problem, while also ensuring errors don't get propagated up to unrelated LWBs?

I am looking at the code.
We call zio_add_child(lwb->lwb_write_zio, last_lwb_opened->lwb_write_zio) in zil_lwb_set_zio_dependency() when (last_lwb_opened->lwb_state != LWB_STATE_WRITE_DONE).
This may propagate the error.

Another issue, which has not been discussed yet: inside zio_flush(), ZIO_FLAG_DONT_PROPAGATE is hard-coded, so I don't think zil_lwb_flush_vdevs_done() can catch any error from zio_flush(). That makes the error handling in zil_lwb_flush_vdevs_done() unnecessary. The question is, do we care if the flush operation fails? Probably not. The flush may be delayed and never issued for the current lwb.
If that is the case, we may remove all the error handling from zil_lwb_flush_vdevs_done() and just keep it in zil_lwb_write_done().

@prakashsurya
Member

The question is, do we care if the flush operation is failed? Probably not. The flush may be delayed and never issued for current lwb.

I'm not sure I'm following this. Why don't we care if the flush failed? I ask, because I believe we do care.. e.g. if the flush fails, the data may not be on stable storage (this is the point of issuing the flush to begin with, to ensure it's safe on disk), so we wouldn't want to tell the consumer (i.e. the caller of zil_commit) that it is..

@prakashsurya
Member

Thinking a bit more, perhaps we do want to propagate errors "up the chain" of LWBs: if an earlier LWB fails (either the write or the flush), I think the later LWBs can't be trusted, due to the on-disk linked-list structure of the ZIL. If one of the "middle" LWBs on disk did not get written out properly, then the LWBs later in the on-disk linked list will be "leaked", since the chain will have been broken.

So, in that case, I think we want all LWBs after the failed one to result in a txg_wait_synced, which is what would occur if we propagate the ZIO error to all the parent LWBs; thus, we do not want to use ZIO_FLAG_DONT_PROPAGATE for the root ZIO, nor for the write ZIO.

I hope that makes sense...?

@jxdking
Contributor

jxdking commented Aug 17, 2021

I'm not sure I'm following this. Why don't we care if the flush failed? I ask, because I believe we do care.. e.g. if the flush fails, the data may not be on stable storage (this is the point of issuing the flush to begin with, to ensure it's safe on disk), so we wouldn't want to tell the consumer (i.e. the caller of zil_commit) that it is..

In the current zfs head, ZIO_FLAG_DONT_PROPAGATE is hard-coded inside zio_flush().
Thus, you will not get any flush-related error in zil_lwb_flush_vdevs_done(), even if you put error handling code there, unless we fix zio_flush() as well.

Here is my speculation (maybe I am wrong). Since we never really handled zio_flush errors in zil_lwb_flush_vdevs_done(), and we also deliberately skip the flush for some lwbs for performance (see zil_lwb_flush_defer()), and no one has complained about losing unflushed data because of zil_lwb_flush_defer(), I get the impression that the chance of losing data solely because of zio_flush() is so low that it has never been discussed.

@prakashsurya
Member

OK, thanks for your patience, I think we're on the same page now.

So, based on the above discussion, I recommend we only remove the ZIO_FLAG_DONT_PROPAGATE flag from the call to zio_rewrite. That should ensure the write error is seen when zil_lwb_flush_vdevs_done is called, and handled at that point.
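
Concretely, relative to the zil_lwb_write_open() excerpt quoted above, that recommendation would look roughly like this (a sketch only; everything else unchanged):

	lwb->lwb_root_zio = zio_root(zilog->zl_spa,
	    zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);
	ASSERT3P(lwb->lwb_root_zio, !=, NULL);

	lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
	    zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
	    BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
	    prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_FASTWRITE, &zb);	/* DONT_PROPAGATE dropped */
	ASSERT3P(lwb->lwb_write_zio, !=, NULL);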

Further, if we do (perhaps later, in another bug/PR; or maybe not at all) address the flush error propagation, the same error handling in zil_lwb_flush_vdevs_done will handle both cases; i.e. it'll handle the case of a LWB write, or LWB flush error (as I originally intended).

IMO, removing the ZIO_FLAG_DONT_PROPAGATE flag is a better solution than the proposed changes, since it consolidates the error handling for flush and write errors to a single location (rather than having error handling split between the "write done" and "flush done" handlers).

Let me know how you all feel about my recommendation.

@jxdking
Contributor

jxdking commented Aug 17, 2021

So, based on the above discussion, I recommend we only remove the ZIO_FLAG_DONT_PROPAGATE flag from the call to zio_rewrite. That should ensure the write error is seen when zil_lwb_flush_vdevs_done is called, and handled at that point.

Let me know how you all feel about my recommendation.

Agree.

@prakashsurya
Member

Great.

@arun-kv assuming you're also in agreement with the recommendation, can you update the proposed changes in the PR; then I can ping some folks and try and get another review on this.

arun-kv and others added 2 commits August 18, 2021 17:45
since it consolidates the error handling for flush and write errors to a
single location (rather than having error handling split between the
"write done" and "flush done" handlers).

Signed-off-by: Arun KV <arun.kv@datacore.com>
@arun-kv
Contributor Author

arun-kv commented Aug 18, 2021

@prakashsurya @jxdking Thank you for the details. I have updated the PR.

@prakashsurya
Member

@behlendorf @ahrens @grwilson @mmaybee Can we get another pair of eyes on this? It looks good to me.

@arun-kv
Contributor Author

arun-kv commented Aug 18, 2021

@problame @jxdking @prakashsurya Thank you all for the detailed explanation, it was really helpful.

Contributor

@problame left a comment

IMHO there should be a comment in zil_lwb_flush_vdevs_done, where we assign zcw_zio_error, that summarizes #12443 (comment) and subsequent comments.

@prakashsurya
Member

prakashsurya commented Aug 19, 2021

IMHO there should be a comment in zil_lwb_flush_vdevs_done, where we assign zcw_zio_error, that summarizes #12443 (comment) and subsequent comments.

Yea, I agree that'd be good.

Here's my take on coming up with some helpful comments:

diff --git a/module/zfs/zil.c b/module/zfs/zil.c
index 2eeb4fa4fe..6e3d15d57d 100644
--- a/module/zfs/zil.c
+++ b/module/zfs/zil.c
@@ -1179,6 +1179,20 @@ zil_lwb_flush_vdevs_done(zio_t *zio)
        ASSERT3P(zcw->zcw_lwb, ==, lwb);
        zcw->zcw_lwb = NULL;
 
+       /*
+        * We expect any ZIO errors from child ZIOs to have been
+        * propagated "up" to this specific LWB's root ZIO, in
+        * order for this error handling to work correctly. This
+        * includes ZIO errors from either this LWB's write or
+        * flush, as well as any errors from other dependent LWBs
+        * (e.g. a root LWB ZIO that might be a child of this LWB).
+        *
+        * With that said, it's important to note that LWB flush
+        * errors are not propagated up to the LWB root ZIO.
+        * This is incorrect behavior, and results in VDEV flush
+        * errors not being handled correctly here. See the
+        * comment above the call to "zio_flush" for details.
+        */
        zcw->zcw_zio_error = zio->io_error;
 
        ASSERT3B(zcw->zcw_done, ==, B_FALSE);
@@ -1251,6 +1265,12 @@ zil_lwb_write_done(zio_t *zio)
     * nodes. We avoid calling zio_flush() since there isn't any
     * good reason for doing so, after the lwb block failed to be
     * written out.
+    *
+    * Additionally, we don't perform any further error handling at
+    * this point (e.g. setting "zcw_zio_error" appropriately), as
+    * we expect that to occur in "zil_lwb_flush_vdevs_done" (thus,
+    * we expect any error seen here, to have been propagated to
+    * that function).
     */
    if (zio->io_error != 0) {
        while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
@@ -1281,8 +1301,19 @@ zil_lwb_write_done(zio_t *zio)
 
    while ((zv = avl_destroy_nodes(t, &cookie)) != NULL) {
        vdev_t *vd = vdev_lookup_top(spa, zv->zv_vdev);
-       if (vd != NULL)
+
+       if (vd != NULL) {
+           /*
+            * The "ZIO_FLAG_DONT_PROPAGATE" is currently
+            * always used within "zio_flush". This means,
+            * any errors when flushing the vdev(s), will
+            * (unfortunately) not be handled correctly,
+            * since these "zio_flush" errors will not be
+            * propagated up to "zil_lwb_flush_vdevs_done".
+            */
            zio_flush(lwb->lwb_root_zio, vd);
+       }
+
        kmem_free(zv, sizeof (*zv));
    }
 }

Does this help?

If folks think this is helpful and sufficient, feel free to incorporate this into the PR. Feel free to modify it as necessary too.

I also think it'd be nice to incorporate the reproducer into the test suite, too, if possible (as I believe @problame also mentioned). If we could modify step 7 to avoid the reboot (e.g. perhaps use export/import instead? would that work with the disk offlined?), it seems like it might not be too difficult. With that said, I'm not opposed to landing this without the test; it just means it'll be more difficult to verify this error handling as future changes are made.

@sempervictus
Contributor

+1 on adding the test - that's a fun error condition and absolutely non-trivial for users to discern.

@behlendorf
Contributor

@arun-kv when you get a chance it'd be great if you could incorporate the suggested comments and rebase this PR. I'd love to see the test case added for this, but I agree with @prakashsurya that we could land this without it.

arun-kv and others added 2 commits August 31, 2021 17:22
@arun-kv
Contributor Author

arun-kv commented Aug 31, 2021

@behlendorf sorry for the late reply. I have made the relevant changes.

@behlendorf added the Status: Accepted (Ready to integrate: reviewed, tested) label and removed the Status: Code Review Needed (Ready for review and testing) label on Aug 31, 2021
@behlendorf merged commit f82f027 into openzfs:master on Sep 13, 2021
tonyhutter pushed a commit to tonyhutter/zfs that referenced this pull request Sep 15, 2021
Errors in zil_lwb_write_done() are not propagated to
zil_lwb_flush_vdevs_done() which can result in zil_commit_impl()
not returning an error to applications even when zfs was not able
to write data to the disk.

Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to
allow errors to propagate and consolidate the error handling for
flush and write errors to a single location (rather than having
error handling split between the "write done" and "flush done"
handlers).

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Arun KV <arun.kv@datacore.com>
Closes openzfs#12391
Closes openzfs#12443
rincebrain pushed a commit to rincebrain/zfs that referenced this pull request Sep 22, 2021
Errors in zil_lwb_write_done() are not propagated to
zil_lwb_flush_vdevs_done() which can result in zil_commit_impl()
not returning an error to applications even when zfs was not able
to write data to the disk.

Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to
allow errors to propagate and consolidate the error handling for
flush and write errors to a single location (rather than having
error handling split between the "write done" and "flush done"
handlers).

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Arun KV <arun.kv@datacore.com>
Closes openzfs#12391
Closes openzfs#12443
robn added a commit to robn/zfs that referenced this pull request Jun 30, 2024
Since the beginning, ZFS' "flush" operation has always ignored
errors[1]. Write errors are captured and dealt with, but if a write
succeeds but the subsequent flush fails, the operation as a whole will
appear to succeed[2].

In the end-of-transaction uberblock+label write+flush ceremony, it's
very difficult for this situation to occur. Since all devices are
written to, typically the first write will succeed, the first flush will
fail unobserved, but then the second write will fail, and the entire
transaction is aborted. It's difficult to imagine a real-world scenario
where all the writes in that sequence could succeed even as the flushes
are failing (understanding that the OS is still seeing hardware problems
and taking devices offline).

In the ZIL however, it's another story. Since only the write response is
checked, if that write succeeds but the flush then fails, the ZIL will
believe that it succeeds, and zil_commit() (and thus fsync()) will
return success rather than the "correct" behaviour of falling back into
txg_wait_synced()[3].

This commit fixes this by adding a simple flag to zio_flush() to
indicate whether or not the caller wants to receive flush errors. This
flag is enabled for ZIL calls. The existing zio chaining inside the ZIL
and the flush handler zil_lwb_flush_vdevs_done() already has all the
necessary support to properly handle a flush failure and fail the entire
zio chain. This causes zil_commit() to correctly fall back to
txg_wait_synced() rather than returning success prematurely.

1. The ZFS birth commit (illumos/illumos-gate@fa9e4066f0) had support
   for flushing devices with write caches with the DKIOCFLUSHWRITECACHE
   ioctl. No errors are checked. The comment in `zil_flush_vdevs()` from
   the time shows the thinking:

   /*
    * Wait for all the flushes to complete.  Not all devices actually
    * support the DKIOCFLUSHWRITECACHE ioctl, so it's OK if it fails.
    */

2. It's not entirely clear from the code history why this was acceptable
   for devices that _do_ have write caches. Our best guess is that this
   was an oversight: between the combination of hardware, pool topology
   and application behaviour required to hit this, it basically didn't
   come up.

3. Somewhat frustratingly, zil.c contains comments describing this exact
   behaviour, and further discussion in openzfs#12443 (September 2021). It
   appears that those involved saw the potential, but were looking at a
   different problem and so didn't have the context to recognise it for
   what it was.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
robn added a commit to robn/zfs that referenced this pull request Jun 30, 2024
robn added a commit to robn/zfs that referenced this pull request Jul 11, 2024
robn added a commit to robn/zfs that referenced this pull request Jul 19, 2024
robn added a commit to robn/zfs that referenced this pull request Jul 22, 2024
robn added a commit to robn/zfs that referenced this pull request Aug 1, 2024
robn added a commit to robn/zfs that referenced this pull request Aug 17, 2024
robn added a commit to robn/zfs that referenced this pull request Aug 17, 2024