Fixed data integrity issue when underlying disk returns error to zfs #12443
Conversation
Force-pushed from d781ef5 to 4e24f58
The zil_lwb_write_done error was not propagated to zil_lwb_flush_vdevs_done, due to which zil_commit_impl returned and the application got a write success even though ZFS was not able to write the data to disk. Signed-off-by: Arun KV <arun.kv@datacore.com>
@behlendorf the zloop test fails even without my changes; what should I do next?
module/zfs/zil.c
Outdated
@@ -1179,7 +1179,8 @@ zil_lwb_flush_vdevs_done(zio_t *zio)
 	ASSERT3P(zcw->zcw_lwb, ==, lwb);
 	zcw->zcw_lwb = NULL;

-	zcw->zcw_zio_error = zio->io_error;
+	if (zio->io_error != 0)
+		zcw->zcw_zio_error = zio->io_error;
There should be a comment here explaining why we need to do this check. Also, we should VERIFY that zcw->zcw_zio_error == 0 before overwriting it with zio->io_error. IIUC, we can assert that because, IIUC, we don't issue the flush if the write fails. We should actually VERIFY because a) it's not on a hot code path and b) it's critical for correctness. The comment should address the assertion as well.
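For illustration, a minimal sketch of how that suggestion might read once folded into the hunk above, assuming (per the reasoning here) that the flush is only issued after a successful write; the comment wording and exact placement are illustrative, not the final patch:

	if (zio->io_error != 0) {
		/*
		 * If the flush failed, the write must have succeeded
		 * (we only issue the flush after a successful write),
		 * so no error should have been recorded for this
		 * waiter yet.
		 */
		VERIFY3S(zcw->zcw_zio_error, ==, 0);
		zcw->zcw_zio_error = zio->io_error;
	}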
module/zfs/zil.c
Outdated
* error is not propagated to zil_lwb_flush_vdevs_done, which
* will cause zil_commit_impl to return without committing
* the data.
* Refer https://github.com/openzfs/zfs/issues/12391
- typo in "becaues"
- I find the comment here more confusing than useful. I'd prefer a comment in zil_commit_impl, in the place where we check zcw_zio_error, that explains the entire error propagation path (for both the write and flush done callbacks).
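A sketch of the kind of comment being asked for, placed at the zcw_zio_error check; the surrounding code shown here (including the txg_wait_synced() fallback) is assumed from the discussion rather than quoted from the tree:

	/*
	 * zcw_zio_error is set by the lwb "write done" and "flush done"
	 * callbacks (zil_lwb_write_done() and zil_lwb_flush_vdevs_done()).
	 * A non-zero value means either the lwb write or the subsequent
	 * flush failed, so we cannot claim the log record is on stable
	 * storage; fall back to waiting for the txg to sync before
	 * returning to the caller (e.g. fsync()).
	 */
	if (zcw->zcw_zio_error != 0)
		txg_wait_synced(zilog->zl_dmu_pool, 0);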
@problame thank you for the comments. I have updated the PR.
module/zfs/zil.c
Outdated
mutex_enter(&zcw->zcw_lock);
ASSERT(list_link_active(&zcw->zcw_node));
ASSERT3P(zcw->zcw_lwb, ==, lwb);
zcw->zcw_zio_error = zio->io_error;
AFAIK this is the first place where we might set zcw_zio_error => if the zcw is zero-initialized, we could ASSERT3S(zcw->zcw_zio_error, ==, 0);
"if the zcw is zero-initialized": it is, in zil_alloc_commit_waiter. Further, we already rely on this in the flush function (with these changes): VERIFY(zcw->zcw_zio_error == 0); so I think we could do this verification here too.
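In code form, a sketch of that verification at the point in zil_lwb_write_done() where the error is first recorded; the exact placement is an assumption:

	/*
	 * zcw_zio_error was zeroed in zil_alloc_commit_waiter() and this
	 * is the first place an error can be recorded for this waiter.
	 */
	ASSERT3S(zcw->zcw_zio_error, ==, 0);
	zcw->zcw_zio_error = zio->io_error;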
Signed-off-by: Arun KV <arun.kv@datacore.com>
Signed-off-by: Arun KV <arun.kv@datacore.com>
The changes in this PR seem correct to me and like a net improvement. However, I just quickly reviewed lines 1756 to 1764 in 41bee40:
module/zfs/zil.c
Outdated
* If the flush has failed, then the write must have
* been successful. VERIFY the same.
*/
VERIFY(zcw->zcw_zio_error == 0);
let's use VERIFY3S
module/zfs/zil.c
Outdated
@@ -1253,6 +1267,17 @@ zil_lwb_write_done(zio_t *zio)
 	 * written out.
 	 */
 	if (zio->io_error != 0) {
+		zil_commit_waiter_t *zcw;
I'm curious.. this block of code you're adding looks similar to the block of code in the "flush done" function, i.e. starting at line 1173, but I see some differences, such as the fact that this block you're adding doesn't call cv_broadcast(&zcw->zcw_cv); nor does it set zcw_done.
Are these differences intentional? It's been awhile since I've been in this code, so I'm just curious if we should be using the same exact logic in both cases, here and in the flush function?
In this error case, do we still call the "flush done" function? I presume not, which is why this change is needed.. but please correct me if I'm wrong.
Sorry if this is a silly question, it's been awhile since I've been in this code, but if we create the ZIO such that "flush done" is the completion handler, why isn't it getting called and setting zcw_zio_error correctly?
lwb->lwb_root_zio = zio_root(zilog->zl_spa,
    zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);
Or is it that zil_lwb_flush_vdevs_done is getting called, but it's called with a zio->io_error of zero.. and called after zil_lwb_write_done is called, with a zio->io_error of non-zero?
Again, I apologize if these answers are obvious, but I'm trying to refresh my memory on how I intended this to work, originally.. and understand what exactly is going awry, now..
Now I see my question has already been discussed here: #12391 (comment). So, if I'm understanding correctly, if we don't set zcw_zio_error in the "write done" handler, the write error never reaches zil_commit_impl.
The prior code (i.e. prior to 1ce23dc) did this:
So, my best guess is we're currently using ZIO_FLAG_DONT_PROPAGATE on the lwb write zio. I think my changes in 1ce23dc were expecting any errors seen in zil_lwb_write_done to also reach zil_lwb_flush_vdevs_done. So, I think it'd be reasonable to more simply not use ZIO_FLAG_DONT_PROPAGATE there. With that said, I do worry about how the errors would then propagate: if "lwb 1 write" fails, we'd want that error to propagate to "lwb 1" (the root zio), but not to "lwb 2" or "lwb 3" (assuming "lwb 2 write" and "lwb 3 write" did not fail).. right now, if we removed the ZIO_FLAG_DONT_PROPAGATE flag..
Perhaps we should remove ZIO_FLAG_DONT_PROPAGATE from the lwb_write_zio and instead add it to the lwb_root_zio?
Would that solve the problem, while also ensuring errors don't get propagated up to unrelated LWBs?
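As a rough illustration of what ZIO_FLAG_DONT_PROPAGATE means in this discussion, here is a toy model (standalone C, not ZFS code; the struct, flag and error value are simplified stand-ins): a child I/O's error is normally copied into its parent's io_error when the child completes, unless the child was created with a "don't propagate" flag.

#include <stdio.h>

/* Toy stand-ins for zio parent/child error propagation; not real ZFS code. */
struct toy_zio {
	int io_error;
	int dont_propagate;
	struct toy_zio *parent;
};

/* On completion, a child's error bubbles up to its parent unless suppressed. */
static void
toy_zio_done(struct toy_zio *child, int error)
{
	child->io_error = error;
	if (child->parent != NULL && !child->dont_propagate &&
	    child->parent->io_error == 0)
		child->parent->io_error = error;
}

int
main(void)
{
	struct toy_zio root = { 0, 0, NULL };	/* stands in for lwb_root_zio */
	struct toy_zio wr = { 0, 1, &root };	/* lwb_write_zio with DONT_PROPAGATE */

	/* With the flag: the write fails but the root zio never sees it. */
	toy_zio_done(&wr, 5 /* EIO */);
	printf("with flag:    write error=%d root error=%d\n",
	    wr.io_error, root.io_error);

	/* Without the flag: the failure reaches the root zio's done callback. */
	wr.io_error = 0;
	wr.dont_propagate = 0;
	root.io_error = 0;
	toy_zio_done(&wr, 5 /* EIO */);
	printf("without flag: write error=%d root error=%d\n",
	    wr.io_error, root.io_error);

	return (0);
}

In ZFS itself this bookkeeping happens inside the zio pipeline rather than in a done callback, but the visible effect on the io_error seen by zil_lwb_flush_vdevs_done() is the same.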
@prakashsurya Thank you for the details. @jxdking suggested ZIO_FLAG_DONT_PROPAGATE in #12391, but I proceeded with the current fix because I was not sure about the implications of removing ZIO_FLAG_DONT_PROPAGATE.
The below change solves my issue, but I'm not sure how to test "ensuring errors don't get propagated up to unrelated LWBs":
diff --git a/module/zfs/zil.c b/module/zfs/zil.c
index 78d0711cc..8ed5793d3 100644
--- a/module/zfs/zil.c
+++ b/module/zfs/zil.c
@@ -1393,13 +1393,14 @@ zil_lwb_write_open(zilog_t *zilog, lwb_t *lwb)
prio = ZIO_PRIORITY_ASYNC_WRITE;
lwb->lwb_root_zio = zio_root(zilog->zl_spa,
- zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL);
+ zil_lwb_flush_vdevs_done, lwb, ZIO_FLAG_CANFAIL |
+ ZIO_FLAG_DONT_PROPAGATE);
ASSERT3P(lwb->lwb_root_zio, !=, NULL);
lwb->lwb_write_zio = zio_rewrite(lwb->lwb_root_zio,
zilog->zl_spa, 0, &lwb->lwb_blk, lwb_abd,
BP_GET_LSIZE(&lwb->lwb_blk), zil_lwb_write_done, lwb,
- prio, ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE |
+ prio, ZIO_FLAG_CANFAIL |
ZIO_FLAG_FASTWRITE, &zb);
ASSERT3P(lwb->lwb_write_zio, !=, NULL);
I am looking at the code. There is another issue, which has not been discussed yet.
I'm not sure I'm following this. Why don't we care if the flush failed? I ask, because I believe we do care.. e.g. if the flush fails, the data may not be on stable storage (this is the point of issuing the flush to begin with, to ensure it's safe on disk), so we wouldn't want to tell the consumer (i.e. the caller of fsync()) that it is.
Thinking a bit more, perhaps we do want to propagate errors "up the chain" of LWBs. Since if an earlier LWB fails (either the write or the flush), I think it means the later LWBs can't be trusted, due to the on-disk linked-list structure of the ZIL. If one of the "middle" LWBs on disk did not get written out properly, then it necessarily means the LWBs later in the on-disk linked list will be "leaked", since the chain will have been broken. So, in that case, I think we want all LWBs after that failed one to result in an error as well. I hope that makes sense...?
In current ZFS head, ZIO_FLAG_DONT_PROPAGATE is hard-coded inside zio_flush(). Here is my speculation (maybe I am wrong): since we never really handled errors from zio_flush() in zil_lwb_flush_vdevs_done(), and we also deliberately skip the flush operation for some lwbs for better performance (see zil_lwb_flush_defer()), yet no one complained about losing unflushed data because of zil_lwb_flush_defer(), that gives me the impression that the chance of losing data solely because of zio_flush() is so low that it is never discussed.
OK, thanks for your patience, I think we're on the same page now. So, based on the above discussion, I recommend we only remove the ZIO_FLAG_DONT_PROPAGATE flag from the lwb write zio (rather than also adding it to the lwb root zio). Further, if we do (perhaps later, in another bug/PR; or maybe not at all) address the flush error propagation, the same error handling in zil_lwb_flush_vdevs_done would apply. IMO, removing the ZIO_FLAG_DONT_PROPAGATE flag is preferable to the currently proposed changes, since it consolidates the error handling for flush and write errors to a single location (rather than having error handling split between the "write done" and "flush done" handlers). Let me know how you all feel about my recommendation.
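To make the "single location" point concrete, a sketch of the consolidated path, assuming the lwb write zio is a child of lwb_root_zio so that, with ZIO_FLAG_DONT_PROPAGATE removed, a failed write surfaces as a non-zero io_error on the root zio handed to the flush-done callback:

	/*
	 * In zil_lwb_flush_vdevs_done(): zio is the lwb's root zio, so a
	 * failed write (and, if ever propagated, a failed flush) both
	 * land here as zio->io_error; a single assignment covers both.
	 */
	zcw->zcw_zio_error = zio->io_error;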
Agree.
Great. @arun-kv assuming you're also in agreement with the recommendation, can you update the proposed changes in the PR; then I can ping some folks and try and get another review on this.
since it consolidates the error handling for flush and write errors to a single location (rather than having error handling split between the "write done" and "flush done" handlers). Signed-off-by: Arun KV <arun.kv@datacore.com>
@prakashsurya @jxdking Thank you for the details. I have updated the PR.
@behlendorf @ahrens @grwilson @mmaybee Can we get another pair of eyes on this? I think it looks good to me.
@problame @jxdking @prakashsurya Thank you all for the detailed explanation, it was really helpful.
IMHO there should be a comment in zil_lwb_flush_vdevs_done, where we assign zcw_zio_error, that summarizes #12443 (comment) and subsequent comments.
Yea, I agree that'd be good. Here's my take on coming up with some helpful comments:
Does this help? If folks think this is helpful and sufficient, feel free to incorporate this into the PR. Feel free to modify it as necessary too. I also think it'd be nice to incorporate the reproducer into the test suite, too, if possible (as I believe @problame also mentioned). If we could modify step 7 to avoid the reboot (e.g. perhaps use export/import instead? would that work with the disk offlined?), it seems like it might not be too difficult. With that said, I'm not opposed to landing this without the test; it just means it'll be more difficult to verify this error handling as future changes are made.
+1 on adding the test - that's a fun error condition and absolutely non-trivial to discern for users.
@arun-kv when you get a chance it'd be great if you could incorporate the suggested comments and rebase this PR. I'd love to see the test case added for this, but I agree with @prakashsurya we could land this without it.
Signed-off-by: Arun KV <arun.kv@datacore.com>
@behlendorf sorry for the late reply. I have made the relevant changes.
Errors in zil_lwb_write_done() are not propagated to zil_lwb_flush_vdevs_done(), which can result in zil_commit_impl() not returning an error to applications even when zfs was not able to write data to the disk. Remove the ZIO_FLAG_DONT_PROPAGATE flag from zio_rewrite() to allow errors to propagate and consolidate the error handling for flush and write errors to a single location (rather than having error handling split between the "write done" and "flush done" handlers).

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Arun KV <arun.kv@datacore.com>
Closes openzfs#12391
Closes openzfs#12443
Since the beginning, ZFS' "flush" operation has always ignored errors[1]. Write errors are captured and dealt with, but if a write succeeds but the subsequent flush fails, the operation as a whole will appear to succeed[2].

In the end-of-transaction uberblock+label write+flush ceremony, it's very difficult for this situation to occur. Since all devices are written to, typically the first write will succeed, the first flush will fail unobserved, but then the second write will fail, and the entire transaction is aborted. It's difficult to imagine a real-world scenario where all the writes in that sequence could succeed even as the flushes are failing (understanding that the OS is still seeing hardware problems and taking devices offline).

In the ZIL however, it's another story. Since only the write response is checked, if that write succeeds but the flush then fails, the ZIL will believe that it succeeds, and zil_commit() (and thus fsync()) will return success rather than the "correct" behaviour of falling back into txg_wait_synced()[3].

This commit fixes this by adding a simple flag to zio_flush() to indicate whether or not the caller wants to receive flush errors. This flag is enabled for ZIL calls. The existing zio chaining inside the ZIL and the flush handler zil_lwb_flush_vdevs_done() already has all the necessary support to properly handle a flush failure and fail the entire zio chain. This causes zil_commit() to correctly fall back to txg_wait_synced() rather than returning success prematurely.

1. The ZFS birth commit (illumos/illumos-gate@fa9e4066f0) had support for flushing devices with write caches with the DKIOCFLUSHWRITECACHE ioctl. No errors are checked. The comment in `zil_flush_vdevs()` from the time shows the thinking:

   /*
    * Wait for all the flushes to complete. Not all devices actually
    * support the DKIOCFLUSHWRITECACHE ioctl, so it's OK if it fails.
    */

2. It's not entirely clear from the code history why this was acceptable for devices that _do_ have write caches. Our best guess is that this was an oversight: between the combination of hardware, pool topology and application behaviour required to hit this, it basically didn't come up.

3. Somewhat frustratingly, zil.c contains comments describing this exact behaviour, and further discussion in openzfs#12443 (September 2021). It appears that those involved saw the potential, but were looking at a different problem and so didn't have the context to recognise it for what it was.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
The zil_lwb_write_done error was not propagated to zil_lwb_flush_vdevs_done, due to which zil_commit_impl returned and the application got a write success even though ZFS was not able to write the data to disk.
Motivation and Context
zvol (sync=always): Write(O_SYNC) returns success when the disk is not accessible.
The application gets a write success from a zvol created with the option sync=always even when the disk is not accessible.
Note: only the first write after the disk becomes inaccessible is wrongly acknowledged as successfully written; after that, all other writes from the application wait.
#12391
Description
Capture the zil_lwb_write_done error in the zil_commit_waiter, and in zil_lwb_flush_vdevs_done, if zio->io_error is 0, use the zil_lwb_write_done error instead. This makes sure zil_commit_impl won't return to the user until we fix the disk/network error and clear the zpool suspended state.
How Has This Been Tested?
Below is the program I used to write data to the zvol:
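A minimal stand-in for such a program (illustrative only, not the original; the device path /dev/zd0, block size, and loop structure are assumptions):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	const char *dev = (argc > 1) ? argv[1] : "/dev/zd0";
	char buf[4096];

	memset(buf, 0xab, sizeof (buf));

	/* O_SYNC: write(2) should only return success once the data is stable. */
	int fd = open(dev, O_WRONLY | O_SYNC);
	if (fd < 0) {
		perror("open");
		return (1);
	}

	for (int i = 0; ; i++) {
		ssize_t n = pwrite(fd, buf, sizeof (buf),
		    (off_t)i * sizeof (buf));
		if (n != (ssize_t)sizeof (buf)) {
			fprintf(stderr, "write %d failed: %s\n",
			    i, strerror(errno));
			break;
		}
		printf("write %d ok\n", i);
		sleep(1);
	}

	(void) close(fd);
	return (0);
}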