Make txg_wait_synced conditional in zfsvfs_teardown, for FreeBSD #16268
Conversation
This applies the same change in openzfs#9115 to FreeBSD. This was actually the old behavior in FreeBSD 12; it only regressed when FreeBSD support was added to OpenZFS. As far as I can tell, the timeline went like this:

* Illumos's zfsvfs_teardown used an unconditional txg_wait_synced
* Illumos added the dirty data check [^4]
* FreeBSD merged in Illumos's conditional check [^3]
* OpenZFS forked from Illumos
* OpenZFS removed the dirty data check in openzfs#7795 [^5]
* @mattmacy forked the OpenZFS repo and began to add FreeBSD support
* OpenZFS PR openzfs#9115 [^1] recreated the same dirty data check that Illumos used, in slightly different form. At this point the OpenZFS repo did not yet have multi-OS support.
* Matt Macy merged in FreeBSD support in openzfs#8987 [^2], but it was based on slightly outdated OpenZFS code.

In my local testing, this vastly improves the reboot speed of a server with a large pool that has 1000 datasets and is resilvering an HDD.

[^1]: openzfs#9115
[^2]: openzfs#8987
[^3]: freebsd/freebsd-src@10b9d77
[^4]: illumos/illumos-gate@5aaeed5
[^5]: openzfs#7795

Sponsored by: Axcient
Signed-off-by: Alan Somers <asomers@gmail.com>
I haven't looked too deeply, but looking at the original panic in #7753, I suspect there may be a race between the last dnode of the dataset being synced in dmu_objset_sync_dnodes() and userquota_updates_task() actually completing in another task.
@amotin can you give a little more detail? The Linux codepath in master:

```c
// module/os/linux/zfs/zfs_vfsops.c:zfsvfs_teardown()
/*
 * Evict cached data. We must write out any dirty data before
 * disowning the dataset.
 */
objset_t *os = zfsvfs->z_os;
boolean_t os_dirty = B_FALSE;
for (int t = 0; t < TXG_SIZE; t++) {
    if (dmu_objset_is_dirty(os, t)) {
        os_dirty = B_TRUE;
        break;
    }
}
if (!zfs_is_readonly(zfsvfs) && os_dirty) {
    txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0);
}
dmu_objset_evict_dbufs(zfsvfs->z_os);
dsl_dir_t *dd = os->os_dsl_dataset->ds_dir;
dsl_dir_cancel_waiters(dd);
```

This PR:

```c
// module/os/freebsd/zfs/zfs_vfsops.c:zfsvfs_teardown()
/*
 * Evict cached data. We must write out any dirty data before
 * disowning the dataset.
 */
objset_t *os = zfsvfs->z_os;
boolean_t os_dirty = B_FALSE;
for (int t = 0; t < TXG_SIZE; t++) {
    if (dmu_objset_is_dirty(os, t)) {
        os_dirty = B_TRUE;
        break;
    }
}
if (!zfs_is_readonly(zfsvfs) && os_dirty)
    txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0);
dmu_objset_evict_dbufs(zfsvfs->z_os);
dd = zfsvfs->z_os->os_dsl_dataset->ds_dir;
dsl_dir_cancel_waiters(dd);
```
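For contrast, here is a minimal sketch of what the FreeBSD path looked like before this change, where the wait was unconditional. It is paraphrased from the PR description rather than quoted from the old source, so treat it as illustrative only:

```c
/*
 * Pre-change behavior (sketch): always block until the pool syncs out
 * the current txg before evicting dbufs, even when this objset has no
 * dirty data at all. During a long scan this wait dominates teardown.
 */
if (!zfs_is_readonly(zfsvfs))
    txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0);
dmu_objset_evict_dbufs(zfsvfs->z_os);
```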
@tonyhutter dmu_objset_is_dirty() checks whether the objset has any elements in its os_dirty_dnodes multilist. Entries are deleted from there by dmu_objset_sync_dnodes() before the dnode is actually synced and moved into os_synced_dnodes, so the last dnode may not yet be synced at that point, having barely started. I am not sure that flushing dbufs from under dnode_sync() is really dangerous, since they should still have references, but I guess they may not be evicted, at the very least.

On top of that, dsl_pool_sync() calls dmu_objset_sync_done() for all datasets only after they have all synced their dnodes, which even more likely has not completed yet. Meanwhile, dmu_objset_sync_done() may call userquota_updates_task(), which as far as I can see may dirty a few more dnodes. dnode_sync() for those dnodes is explicitly called later by dmu_objset_sync(), which again makes me think about flushing dbufs under it. After that the meta-dnode is also synced, which may also have a bunch of dbufs.

TL;DR: I am not sure it is legal to call dmu_objset_is_dirty() outside of syncing context, and while I am not sure it is fatal in this case, it seems to me it may end up not evicting some of the dbufs from the cache, as was desired. I also wonder whether this code is expected to guarantee that the dataset is synced in case of a crash, if unmount is expected to provide such guarantees.
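To make the ordering concern concrete, here is a small userspace toy model. It is not ZFS code; the types and counters are invented stand-ins, with "dirty" playing the role of os_dirty_dnodes and "syncing" the role of os_synced_dnodes. It shows how a check that inspects only the dirty list can report "clean" while sync work for the objset is still in flight:

```c
#include <stdbool.h>
#include <stdio.h>

#define TXG_SIZE 4

struct toy_objset {
    int dirty[TXG_SIZE];   /* dnodes waiting to be picked up by sync */
    int syncing[TXG_SIZE]; /* dnodes picked up but not yet on disk */
};

/* Same shape as the teardown check above: it looks only at "dirty". */
static bool toy_objset_is_dirty(const struct toy_objset *os, int txg)
{
    return (os->dirty[txg & (TXG_SIZE - 1)] != 0);
}

int main(void)
{
    struct toy_objset os = {{0}, {0}};
    int txg = 7;

    /*
     * The sync thread picks up the last dnode: it leaves the dirty
     * list immediately, but its data has not been written out yet.
     */
    os.dirty[txg & (TXG_SIZE - 1)] = 0;
    os.syncing[txg & (TXG_SIZE - 1)] = 1;

    /* Teardown path: decide whether to wait, as the patched code does. */
    bool os_dirty = false;
    for (int t = 0; t < TXG_SIZE; t++) {
        if (toy_objset_is_dirty(&os, t)) {
            os_dirty = true;
            break;
        }
    }

    printf("dirty check says %s, but %d dnode(s) are still syncing\n",
        os_dirty ? "dirty" : "clean",
        os.syncing[txg & (TXG_SIZE - 1)]);
    return (0);
}
```

The point of the model is only the ordering: "not on the dirty list" is not the same as "already on disk", which is why skipping txg_wait_synced() based on that check alone is the part being questioned here.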
As a quick thought, I wonder if …
This applies the same change in openzfs#9115 to FreeBSD. In my local testing, this vastly improves the reboot speed of a server with a large pool that has 1000 datasets and is resilvering an HDD.
Sponsored by: Axcient
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Alan Somers <asomers@gmail.com>
Closes openzfs#16268
Motivation and Context
Without this change, unmount speed is very slow, especially when a scan is ongoing.
Description
Copy the same changes as in #9115 to FreeBSD.
How Has This Been Tested?
Tested locally on a server with about 240 HDDs, 1000 datasets, and an ongoing resilver. Unmount speed went from 1-3 datasets per 10 seconds to "too fast to measure".
Footnotes
https://github.com/illumos/illumos-gate/commit/5aaeed5c617553c4cec6328c1f4c19079a5a495a
https://github.com/freebsd/freebsd-src/commit/10b9d77bf1ccf2f3affafa6261692cb92cf7e992
https://github.com/openzfs/zfs/pull/7795
https://github.com/openzfs/zfs/pull/9115
https://github.com/openzfs/zfs/pull/8987