
Always wait for txg sync when umounting dataset #7795

Closed
tcaputi wants to merge 2 commits into master from sync_disabled_fix

Conversation


@tcaputi tcaputi commented Aug 16, 2018

Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled. This
patch simply enforces that all dirty datasets should wait for a
txg sync before umount completes.

Fixes: #7753

Signed-off-by: Tom Caputi tcaputi@datto.com

How Has This Been Tested?

The following commands reproduce the issue without this patch:

zpool create -f pool sdb
echo 'password' | zfs create -o sync=disabled -o encryption=on -o keyformat=passphrase pool/test
dd if=/dev/urandom of=/pool/test/fileb bs=4M count=1 iflag=fullblock
mount -o remount,ro /pool/test
umount -f /pool/test
zpool sync pool
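
For illustration, the behavior this patch introduces in the umount path can be sketched in C as follows. This is a sketch under assumptions rather than the literal diff: zfsvfs stands for the filesystem's zfsvfs_t as used in zfs_vfsops.c, and per the commit message the remount-readonly path gets the same treatment. zfs_is_readonly(), dmu_objset_pool(), txg_wait_synced() and dmu_objset_evict_dbufs() are existing ZFS interfaces.

/*
 * Sketch only: always wait for a txg sync before disowning a read-write
 * dataset, even if it does not currently appear dirty, so that no write
 * I/O can leak into txgs that complete after the disown.
 */
objset_t *os = zfsvfs->z_os;

if (!zfs_is_readonly(zfsvfs))
        txg_wait_synced(dmu_objset_pool(os), 0);
dmu_objset_evict_dbufs(os);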

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.


@behlendorf behlendorf left a comment


Nice fix and explanation.

Since this behavior may end up being slightly different from the other OpenZFS ports, can you update the 'Evict cached data' comment to explain why readonly datasets do need to be synced.

Also please add your reproducer to the existing cli_root/zfs_mount/zfs_mount_remount.ksh test case.

@tcaputi tcaputi force-pushed the sync_disabled_fix branch from d26756d to ad8bf00 Compare August 16, 2018 18:56

tcaputi commented Aug 16, 2018

@behlendorf I have addressed your comments. However, after running the fix repeatedly in a loop, I think there may be something else going on; I'm still getting the assert occasionally.

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Aug 17, 2018
@tcaputi tcaputi force-pushed the sync_disabled_fix branch from ad8bf00 to ea1ca6a Compare August 20, 2018 20:49
@codecov

codecov bot commented Aug 21, 2018

Codecov Report

Merging #7795 into master will increase coverage by 0.11%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #7795      +/-   ##
==========================================
+ Coverage    78.3%   78.41%   +0.11%     
==========================================
  Files         374      373       -1     
  Lines      112907   112803     -104     
==========================================
+ Hits        88413    88456      +43     
+ Misses      24494    24347     -147
Flag Coverage Δ
#kernel 78.6% <100%> (-0.2%) ⬇️
#user 67.58% <100%> (+0.19%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tcaputi tcaputi force-pushed the sync_disabled_fix branch from ea1ca6a to 43a56ef Compare August 21, 2018 17:10
@tcaputi tcaputi removed the Status: Work in Progress Not yet ready for general review label Aug 21, 2018

tcaputi commented Aug 21, 2018

I have verified that this patch seems to work by running it on the broken machine 10 times in a row without issue. Unfortunately, it is a bit hard to script, since I have to type the decryption password into the virtual console at boot time on each iteration, or I would do more runs.

@phryneas

@tcaputi if you want to do some more tests with scripts, don't forget about the early-boot SSH server on that machine. You should be able to unlock it with echo password | ssh -p 2222 root@ip; I just gave that a test.


@behlendorf behlendorf left a comment


Looks good, just two minor things.

module/zfs/txg.c Outdated
@@ -786,39 +788,56 @@ txg_list_create(txg_list_t *tl, spa_t *spa, size_t offset)
tl->tl_head[t] = NULL;
}

boolean_t

Should be static.

module/zfs/txg.c Outdated
boolean_t ret;

mutex_enter(&tl->tl_lock);
ret = txg_list_empty_impl(tl, txg);

This could be boolean_t ret = txg_list_empty_impl(tl, txg);.
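
Putting the two review suggestions together, the wrapper in module/zfs/txg.c would presumably end up looking roughly like the sketch below; details such as the exact assertion are assumptions, not a quote of the committed code.

/* Sketch: lockless check, callable only with tl_lock already held. */
static boolean_t
txg_list_empty_impl(txg_list_t *tl, uint64_t txg)
{
        ASSERT(MUTEX_HELD(&tl->tl_lock));
        return (tl->tl_head[txg & TXG_MASK] == NULL);
}

/* Sketch: public wrapper that takes the list lock around the check. */
boolean_t
txg_list_empty(txg_list_t *tl, uint64_t txg)
{
        mutex_enter(&tl->tl_lock);
        boolean_t ret = txg_list_empty_impl(tl, txg);
        mutex_exit(&tl->tl_lock);

        return (ret);
}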

Tom Caputi added 2 commits August 24, 2018 14:28
This patch simply adds some missing locking to the txg_list
functions and refactors txg_verify() so that it is only compiled
in for debug builds.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
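
The debug-only compilation mentioned here could take roughly the following shape; ZFS_DEBUG is the existing debug guard, while the TXG_VERIFY wrapper name is an assumption used for illustration, not a quote of the final header.

/*
 * Sketch only: declare txg_verify() for debug builds and compile the
 * call sites down to nothing otherwise, so production builds pay no
 * cost for the list verification.
 */
#ifdef ZFS_DEBUG
extern void txg_verify(spa_t *spa, uint64_t txg);
#define TXG_VERIFY(spa, txg)    txg_verify(spa, txg)
#else
#define TXG_VERIFY(spa, txg)
#endif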
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.

While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is completely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.

Fixes: openzfs#7753

Signed-off-by: Tom Caputi <tcaputi@datto.com>
@tcaputi tcaputi force-pushed the sync_disabled_fix branch from 43a56ef to b0f4290 Compare August 24, 2018 18:29
@tcaputi
Copy link
Contributor Author

tcaputi commented Aug 24, 2018

@behlendorf your recommendations have been addressed.

@behlendorf behlendorf added Reviewed Status: Accepted Ready to integrate (reviewed, tested) labels Aug 24, 2018
behlendorf pushed a commit that referenced this pull request Aug 27, 2018
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.

While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is completely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7753
Closes #7795
amotin added a commit to amotin/zfs that referenced this pull request Dec 4, 2020
Since 8c4fb36 (PR openzfs#7795) spa_has_pending_synctask() started to
take two more locks per write inside txg_all_lists_empty().  I am
surprised those pool-wide locks are not contended, but still their
operations are visible in CPU profiles under contended vdev lock.

This commit slightly changes the vdev_queue_max_async_writes() flow so
that spa_has_pending_synctask() is not called when we are going to
return max_active anyway due to a high amount of dirty data.  This
saves some CPU time exactly when the pool is busy.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
behlendorf pushed a commit that referenced this pull request Dec 6, 2020
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Tom Caputi <caputit1@tcnj.edu>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11280
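
The reordering described in the commit message above would presumably look something like the following sketch of vdev_queue_max_async_writes(). The tunables referenced (zfs_dirty_data_max, zfs_vdev_async_write_active_max_dirty_percent, zfs_vdev_async_write_min_active/max_active) are existing ZFS symbols, but the body is illustrative rather than the committed diff.

/*
 * Sketch only: check the cheap dirty-data threshold first, so the
 * (now lock-taking) spa_has_pending_synctask() call is skipped whenever
 * max_active would be returned anyway.
 */
static int
vdev_queue_max_async_writes(spa_t *spa)
{
        dsl_pool_t *dp = spa_get_dsl(spa);
        uint64_t max_bytes = zfs_dirty_data_max *
            zfs_vdev_async_write_active_max_dirty_percent / 100;

        if (dp->dp_dirty_total > max_bytes ||
            spa_has_pending_synctask(spa))
                return (zfs_vdev_async_write_max_active);

        /* Interpolation between the min and max tunables elided here. */
        return (zfs_vdev_async_write_min_active);
}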
sterlingjensen pushed a commit to sterlingjensen/zfs that referenced this pull request Dec 7, 2020
ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Dec 23, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
asomers added a commit to asomers/zfs that referenced this pull request Jun 14, 2024
This applies the same change in openzfs#9115 to FreeBSD.  This was actually the
old behavior in FreeBSD 12; it only regressed when FreeBSD support was
added to OpenZFS.  As far as I can tell, the timeline went like this:

* Illumos's zfsvfs_teardown used an unconditional txg_wait_synced
* Illumos added the dirty data check [^4]
* FreeBSD merged in Illumos's conditional check [^3]
* OpenZFS forked from Illumos
* OpenZFS removed the dirty data check in openzfs#7795 [^5]
* @mattmacy forked the OpenZFS repo and began to add FreeBSD support
* OpenZFS PR openzfs#9115[^1] recreated the same dirty data check that Illumos
  used, in slightly different form.  At this point the OpenZFS repo did
  not yet have multi-OS support.
* Matt Macy merged in FreeBSD support in openzfs#8987[^2] , but it was based on
  slightly outdated OpenZFS code.

In my local testing, this vastly improves the reboot speed of a server
with a large pool that has 1000 datasets and is resilvering an HDD.

[^1]: openzfs#9115
[^2]: openzfs#8987
[^3]: freebsd/freebsd-src@10b9d77
[^4]: illumos/illumos-gate@5aaeed5
[^5]: openzfs#7795

Sponsored by:	Axcient
Signed-off-by:	Alan Somers <asomers@gmail.com>
behlendorf pushed a commit that referenced this pull request Aug 9, 2024
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #16268
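
For reference, the conditional dirty-data check discussed in this timeline (recreated by #9115 and applied here to the FreeBSD teardown path) looks roughly like the sketch below; the exact form and placement in zfsvfs_teardown() may differ.

/*
 * Sketch only: wait for a txg sync on teardown only when the objset is
 * still mounted read-write or actually has dirty data in an in-flight
 * txg, instead of syncing unconditionally.
 */
objset_t *os = zfsvfs->z_os;
boolean_t os_dirty = B_FALSE;

for (int t = 0; t < TXG_SIZE; t++) {
        if (dmu_objset_is_dirty(os, t)) {
                os_dirty = B_TRUE;
                break;
        }
}
if (!zfs_is_readonly(zfsvfs) || os_dirty)
        txg_wait_synced(dmu_objset_pool(os), 0);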
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Labels
Status: Accepted Ready to integrate (reviewed, tested)

Development
Successfully merging this pull request may close these issues:

kernel: PANIC at dmu_objset.c:1844:do_userquota_cacheflush()

3 participants