
Always wait for txg sync when umounting dataset #7795

Closed
tcaputi wants to merge 2 commits into master from sync_disabled_fix

Conversation


@tcaputi tcaputi commented Aug 16, 2018

Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled. This
patch simply enforces that all dirty datasets should wait for a
txg sync before umount completes.

Fixes: #7753

Signed-off-by: Tom Caputi tcaputi@datto.com

How Has This Been Tested?

The following commands reproduce the issue without this patch:

zpool create -f pool sdb
echo 'password' | zfs create -o sync=disabled -o encryption=on -o keyformat=passphrase pool/test
dd if=/dev/urandom of=/pool/test/fileb bs=4M count=1 iflag=fullblock
mount -o remount,ro /pool/test
umount -f /pool/test
zpool sync pool
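
For illustration, the behavior this patch introduces in the umount path can be sketched in C as follows. This is a sketch under assumptions rather than the literal diff: zfsvfs stands for the filesystem's zfsvfs_t as used in zfs_vfsops.c, and per the commit message the remount-readonly path gets the same treatment. zfs_is_readonly(), dmu_objset_pool(), txg_wait_synced() and dmu_objset_evict_dbufs() are existing ZFS interfaces.

/*
 * Sketch only: always wait for a txg sync before disowning a read-write
 * dataset, even if it does not currently appear dirty, so that no write
 * I/O can leak into txgs that complete after the disown.
 */
objset_t *os = zfsvfs->z_os;

if (!zfs_is_readonly(zfsvfs))
        txg_wait_synced(dmu_objset_pool(os), 0);
dmu_objset_evict_dbufs(os);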

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.


@behlendorf behlendorf left a comment


Nice fix and explanation.

Since this behavior may end up being slightly different from the other OpenZFS ports, can you update the 'Evict cached data' comment to explain why readonly datasets do need to be synced.

Also please add your reproducer to the existing cli_root/zfs_mount/zfs_mount_remount.ksh test case.

@tcaputi tcaputi force-pushed the sync_disabled_fix branch from d26756d to ad8bf00 Compare August 16, 2018 18:56

tcaputi commented Aug 16, 2018

@behlendorf I have addressed your comments. However, after running the fix repeatedly in a loop, I think there may be something else going on; I'm still getting the assert occasionally.

@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Aug 17, 2018
@tcaputi tcaputi force-pushed the sync_disabled_fix branch from ad8bf00 to ea1ca6a Compare August 20, 2018 20:49
@codecov

codecov bot commented Aug 21, 2018

Codecov Report

Merging #7795 into master will increase coverage by 0.11%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master    #7795      +/-   ##
==========================================
+ Coverage    78.3%   78.41%   +0.11%     
==========================================
  Files         374      373       -1     
  Lines      112907   112803     -104     
==========================================
+ Hits        88413    88456      +43     
+ Misses      24494    24347     -147
Flag Coverage Δ
#kernel 78.6% <100%> (-0.2%) ⬇️
#user 67.58% <100%> (+0.19%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tcaputi tcaputi force-pushed the sync_disabled_fix branch from ea1ca6a to 43a56ef Compare August 21, 2018 17:10
@tcaputi tcaputi removed the Status: Work in Progress Not yet ready for general review label Aug 21, 2018

tcaputi commented Aug 21, 2018

I have verified that this patch seems to work by running it on the broken machine 10 times in a row without issue. Unfortunately, it is a bit hard to script, since I have to type the decryption password into the virtual console at boot time on each iteration, or I would do more runs.

@phryneas

@tcaputi if you want to do some more tests with scripts, don't forget about the early-boot SSH server on that machine. You should be able to unlock it with echo password | ssh -p 2222 root@ip; I just gave that a test.


@behlendorf behlendorf left a comment


Looks good, just two minor things.

module/zfs/txg.c Outdated
@@ -786,39 +788,56 @@ txg_list_create(txg_list_t *tl, spa_t *spa, size_t offset)
tl->tl_head[t] = NULL;
}

boolean_t

Should be static.

module/zfs/txg.c Outdated
boolean_t ret;

mutex_enter(&tl->tl_lock);
ret = txg_list_empty_impl(tl, txg);

This could be boolean_t ret = txg_list_empty_impl(tl, txg);.
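
Putting the two review suggestions together, the wrapper in module/zfs/txg.c would presumably end up looking roughly like the sketch below; details such as the exact assertion are assumptions, not a quote of the committed code.

/* Sketch: lockless check, callable only with tl_lock already held. */
static boolean_t
txg_list_empty_impl(txg_list_t *tl, uint64_t txg)
{
        ASSERT(MUTEX_HELD(&tl->tl_lock));
        return (tl->tl_head[txg & TXG_MASK] == NULL);
}

/* Sketch: public wrapper that takes the list lock around the check. */
boolean_t
txg_list_empty(txg_list_t *tl, uint64_t txg)
{
        mutex_enter(&tl->tl_lock);
        boolean_t ret = txg_list_empty_impl(tl, txg);
        mutex_exit(&tl->tl_lock);

        return (ret);
}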

Tom Caputi added 2 commits August 24, 2018 14:28
This patch simply adds some missing locking to the txg_list
functions and refactors txg_verify() so that it is only compiled
in for debug builds.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
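
The debug-only compilation mentioned here could take roughly the following shape; ZFS_DEBUG is the existing debug guard, while the TXG_VERIFY wrapper name is an assumption used for illustration, not a quote of the final header.

/*
 * Sketch only: declare txg_verify() for debug builds and compile the
 * call sites down to nothing otherwise, so production builds pay no
 * cost for the list verification.
 */
#ifdef ZFS_DEBUG
extern void txg_verify(spa_t *spa, uint64_t txg);
#define TXG_VERIFY(spa, txg)    txg_verify(spa, txg)
#else
#define TXG_VERIFY(spa, txg)
#endif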
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.

While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is completely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.

Fixes: openzfs#7753

Signed-off-by: Tom Caputi <tcaputi@datto.com>
@tcaputi tcaputi force-pushed the sync_disabled_fix branch from 43a56ef to b0f4290 Compare August 24, 2018 18:29
@tcaputi
Copy link
Contributor Author

tcaputi commented Aug 24, 2018

@behlendorf your recommendations have been addressed.

@behlendorf behlendorf added Reviewed Status: Accepted Ready to integrate (reviewed, tested) labels Aug 24, 2018
behlendorf pushed a commit that referenced this pull request Aug 27, 2018
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.

While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is completely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7753
Closes #7795
amotin added a commit to amotin/zfs that referenced this pull request Dec 4, 2020
Since 8c4fb36 (PR openzfs#7795) spa_has_pending_synctask() started to
take two more locks per write inside txg_all_lists_empty().  I am
surprised those pool-wide locks are not contended, but still their
operations are visible in CPU profiles under contended vdev lock.

This commit slightly changes the vdev_queue_max_async_writes() flow so
that spa_has_pending_synctask() is not called when we are going to
return max_active anyway due to a high amount of dirty data.  This
saves some CPU time exactly when the pool is busy.

Signed-off-by: Alexander Motin <mav@FreeBSD.org>
behlendorf pushed a commit that referenced this pull request Dec 6, 2020
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Tom Caputi <caputit1@tcnj.edu>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #11280
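
The reordering described in the commit message above would presumably look something like the following sketch of vdev_queue_max_async_writes(). The tunables referenced (zfs_dirty_data_max, zfs_vdev_async_write_active_max_dirty_percent, zfs_vdev_async_write_min_active/max_active) are existing ZFS symbols, but the body is illustrative rather than the committed diff.

/*
 * Sketch only: check the cheap dirty-data threshold first, so the
 * (now lock-taking) spa_has_pending_synctask() call is skipped whenever
 * max_active would be returned anyway.
 */
static int
vdev_queue_max_async_writes(spa_t *spa)
{
        dsl_pool_t *dp = spa_get_dsl(spa);
        uint64_t max_bytes = zfs_dirty_data_max *
            zfs_vdev_async_write_active_max_dirty_percent / 100;

        if (dp->dp_dirty_total > max_bytes ||
            spa_has_pending_synctask(spa))
                return (zfs_vdev_async_write_max_active);

        /* Interpolation between the min and max tunables elided here. */
        return (zfs_vdev_async_write_min_active);
}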
sterlingjensen pushed a commit to sterlingjensen/zfs that referenced this pull request Dec 7, 2020
ghost pushed a commit to zfsonfreebsd/ZoF that referenced this pull request Dec 23, 2020
jsai20 pushed a commit to jsai20/zfs that referenced this pull request Mar 30, 2021
sempervictus pushed a commit to sempervictus/zfs that referenced this pull request May 31, 2021
asomers added a commit to asomers/zfs that referenced this pull request Jun 14, 2024
This applies the same change in openzfs#9115 to FreeBSD.  This was actually the
old behavior in FreeBSD 12; it only regressed when FreeBSD support was
added to OpenZFS.  As far as I can tell, the timeline went like this:

* Illumos's zfsvfs_teardown used an unconditional txg_wait_synced
* Illumos added the dirty data check [^4]
* FreeBSD merged in Illumos's conditional check [^3]
* OpenZFS forked from Illumos
* OpenZFS removed the dirty data check in openzfs#7795 [^5]
* @mattmacy forked the OpenZFS repo and began to add FreeBSD support
* OpenZFS PR openzfs#9115[^1] recreated the same dirty data check that Illumos
  used, in slightly different form.  At this point the OpenZFS repo did
  not yet have multi-OS support.
* Matt Macy merged in FreeBSD support in openzfs#8987[^2] , but it was based on
  slightly outdated OpenZFS code.

In my local testing, this vastly improves the reboot speed of a server
with a large pool that has 1000 datasets and is resilvering an HDD.

[^1]: openzfs#9115
[^2]: openzfs#8987
[^3]: freebsd/freebsd-src@10b9d77
[^4]: illumos/illumos-gate@5aaeed5
[^5]: openzfs#7795

Sponsored by:	Axcient
Signed-off-by:	Alan Somers <asomers@gmail.com>
behlendorf pushed a commit that referenced this pull request Aug 9, 2024
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes #16268
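
For reference, the conditional dirty-data check discussed in this timeline (recreated by #9115 and applied here to the FreeBSD teardown path) looks roughly like the sketch below; the exact form and placement in zfsvfs_teardown() may differ.

/*
 * Sketch only: wait for a txg sync on teardown only when the objset is
 * still mounted read-write or actually has dirty data in an in-flight
 * txg, instead of syncing unconditionally.
 */
objset_t *os = zfsvfs->z_os;
boolean_t os_dirty = B_FALSE;

for (int t = 0; t < TXG_SIZE; t++) {
        if (dmu_objset_is_dirty(os, t)) {
                os_dirty = B_TRUE;
                break;
        }
}
if (!zfs_is_readonly(zfsvfs) || os_dirty)
        txg_wait_synced(dmu_objset_pool(os), 0);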
lundman pushed a commit to openzfsonwindows/openzfs that referenced this pull request Sep 4, 2024
Labels
Status: Accepted Ready to integrate (reviewed, tested)

Development
Successfully merging this pull request may close these issues:

kernel: PANIC at dmu_objset.c:1844:do_userquota_cacheflush()

3 participants