Always wait for txg sync when umounting dataset #7795
Conversation
Nice fix and explanation.
Since this behavior may end up being slightly different from the other OpenZFS ports, can you update the 'Evict cached data' comment to explain why readonly datasets do need to be synced?
Also, please add your reproducer to the existing cli_root/zfs_mount/zfs_mount_remount.ksh test case.
Force-pushed from d26756d to ad8bf00
@behlendorf I have addressed your comments. However, after running the fix repeatedly in a loop, I think there may be something else going on... I'm still getting the assert occasionally.
Force-pushed from ad8bf00 to ea1ca6a
Codecov Report
@@            Coverage Diff             @@
##           master    #7795      +/-   ##
==========================================
+ Coverage    78.3%    78.41%    +0.11%
==========================================
  Files         374       373        -1
  Lines      112907    112803      -104
==========================================
+ Hits        88413     88456       +43
+ Misses      24494     24347      -147
Continue to review full report at Codecov.
Force-pushed from ea1ca6a to 43a56ef
I have verified that this patch seems to work by running it on the broken machine 10 times in a row without issue. Unfortunately, it is a bit hard to script, since I have to type the decryption password into the virtual console at boot time on each iteration, or I would do more.
@tcaputi If you want to do some more tests with scripts, don't forget about the early-boot SSH server on that machine. You should be able to unlock it with …
Looks good, just two minor things.
module/zfs/txg.c (Outdated)
@@ -786,39 +788,56 @@ txg_list_create(txg_list_t *tl, spa_t *spa, size_t offset)
	tl->tl_head[t] = NULL;
}

boolean_t
Should be static.
module/zfs/txg.c (Outdated)
	boolean_t ret;

	mutex_enter(&tl->tl_lock);
	ret = txg_list_empty_impl(tl, txg);
This could be boolean_t ret = txg_list_empty_impl(tl, txg);
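Taken together, the two review suggestions point at a common pattern: a lock-free static helper that asserts the list lock is held, plus a public wrapper that takes the lock around it. A minimal sketch of that shape (illustrative; the function bodies in the actual patch may differ):

static boolean_t
txg_list_empty_impl(txg_list_t *tl, uint64_t txg)
{
	int t = txg & TXG_MASK;

	/* Caller must already hold the list lock. */
	ASSERT(MUTEX_HELD(&tl->tl_lock));
	return (tl->tl_head[t] == NULL);
}

boolean_t
txg_list_empty(txg_list_t *tl, uint64_t txg)
{
	mutex_enter(&tl->tl_lock);
	boolean_t ret = txg_list_empty_impl(tl, txg);
	mutex_exit(&tl->tl_lock);

	return (ret);
}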
This patch simply adds some missing locking to the txg_list functions and refactors txg_verify() so that it is only compiled in for debug builds.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
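For the txg_verify() part of that commit, one common way to compile a check only into debug builds is a ZFS_DEBUG guard; the sketch below is illustrative, and the macro name and arguments used by the actual patch may differ:

#ifdef ZFS_DEBUG
#define	TXG_VERIFY(spa, txg)	txg_verify(spa, txg)
extern void txg_verify(spa_t *spa, uint64_t txg);
#else
#define	TXG_VERIFY(spa, txg)
#endif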
Currently, when unmounting a filesystem, ZFS will only wait for a txg sync if the dataset is dirty and not readonly. However, this can be problematic in cases where a dataset is remounted readonly immediately before being unmounted, which often happens when the system is being shut down. Since encrypted datasets require that all I/O is completed before the dataset is disowned, this issue causes problems when write I/Os leak into the txgs after the dataset is disowned, which can happen when sync=disabled.

While looking into fixes for this issue, it was discovered that dsl_dataset_is_dirty() does not return B_TRUE when the dataset has been removed from the txg dirty datasets list but has not actually been processed yet. Furthermore, the implementation is completely different from dmu_objset_is_dirty(), adding to the confusion. Rather than relying on this function, this patch forces the umount code path (and the remount readonly code path) to always perform a txg sync on read-write datasets and removes the function altogether.

Fixes: openzfs#7753
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Force-pushed from 43a56ef to b0f4290
@behlendorf Your recommendations have been addressed.
Currently, when unmounting a filesystem, ZFS will only wait for a txg sync if the dataset is dirty and not readonly. However, this can be problematic in cases where a dataset is remounted readonly immediately before being unmounted, which often happens when the system is being shut down. Since encrypted datasets require that all I/O is completed before the dataset is disowned, this issue causes problems when write I/Os leak into the txgs after the dataset is disowned, which can happen when sync=disabled.

While looking into fixes for this issue, it was discovered that dsl_dataset_is_dirty() does not return B_TRUE when the dataset has been removed from the txg dirty datasets list but has not actually been processed yet. Furthermore, the implementation is completely different from dmu_objset_is_dirty(), adding to the confusion. Rather than relying on this function, this patch forces the umount code path (and the remount readonly code path) to always perform a txg sync on read-write datasets and removes the function altogether.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7753
Closes #7795
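The change described above amounts to a small adjustment in the teardown path. A rough sketch, assuming the surrounding code resembles the Linux zfsvfs_teardown()/umount path of that era (not the verbatim patch):

	/*
	 * Evict cached data. Dirty data must be written out before the
	 * dataset can be disowned, so wait for a txg sync whenever the
	 * filesystem is still writable, rather than consulting a
	 * dirty-dataset check. (Sketch only.)
	 */
	if (!zfs_is_readonly(zfsvfs))
		txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0);
	dmu_objset_evict_dbufs(zfsvfs->z_os);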
Since 8c4fb36 (PR #7795) spa_has_pending_synctask() started to take two more locks per write inside txg_all_lists_empty(). I am surprised those pool-wide locks are not contended, but still their operations are visible in CPU profiles under the contended vdev lock.

This commit slightly changes the vdev_queue_max_async_writes() flow to not call the function if we are going to return max_active anyway due to a high amount of dirty data. It allows saving some CPU time exactly when the pool is busy.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Tom Caputi <caputit1@tcnj.edu>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11280
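A sketch of the reordering that commit describes, assuming the shape of vdev_queue_max_async_writes() at the time; dirty, dp, and max_bytes stand in for locals of that function:

	/*
	 * Check the cheap pool-wide dirty-data threshold first, so the
	 * lock-taking spa_has_pending_synctask() call is skipped whenever
	 * we would return the maximum anyway. (Sketch only.)
	 */
	uint64_t dirty = dp->dp_dirty_total;
	if (dirty > max_bytes || spa_has_pending_synctask(spa))
		return (zfs_vdev_async_write_max_active);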
This applies the same change in #9115 to FreeBSD. This was actually the old behavior in FreeBSD 12; it only regressed when FreeBSD support was added to OpenZFS. As far as I can tell, the timeline went like this:
* Illumos's zfsvfs_teardown used an unconditional txg_wait_synced
* Illumos added the dirty data check [^4]
* FreeBSD merged in Illumos's conditional check [^3]
* OpenZFS forked from Illumos
* OpenZFS removed the dirty data check in #7795 [^5]
* @mattmacy forked the OpenZFS repo and began to add FreeBSD support
* OpenZFS PR #9115 [^1] recreated the same dirty data check that Illumos used, in slightly different form. At this point the OpenZFS repo did not yet have multi-OS support.
* Matt Macy merged in FreeBSD support in #8987 [^2], but it was based on slightly outdated OpenZFS code.

In my local testing, this vastly improves the reboot speed of a server with a large pool that has 1000 datasets and is resilvering an HDD.

[^1]: #9115
[^2]: #8987
[^3]: freebsd/freebsd-src@10b9d77
[^4]: illumos/illumos-gate@5aaeed5
[^5]: #7795

Sponsored by: Axcient
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #16268
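A sketch of the restored check being described, mirroring the #9115 logic (illustrative, not the verbatim FreeBSD patch):

	/*
	 * Only wait for a txg sync if the objset actually has dirty data
	 * in one of the in-flight txgs; otherwise teardown proceeds
	 * without forcing a pool-wide sync. (Sketch only.)
	 */
	boolean_t os_dirty = B_FALSE;
	for (int t = 0; t < TXG_SIZE; t++) {
		if (dmu_objset_is_dirty(os, t)) {
			os_dirty = B_TRUE;
			break;
		}
	}
	if (!zfs_is_readonly(zfsvfs) && os_dirty)
		txg_wait_synced(dmu_objset_pool(os), 0);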
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled. This
patch simply enforces that all dirty datasets should wait for a
txg sync before umount completes.
Fixes: #7753
Signed-off-by: Tom Caputi <tcaputi@datto.com>
How Has This Been Tested?
The following commands reproduce the issue without this patch: