ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced() #13368

ahrens · 2022-04-22T18:32:00Z

Motivation and Context

lseek(SEEK_DATA) and lseek(SEEK_HOLE) are only accurate when the on-disk
blocks reflect all writes, i.e. when there are no dirty data blocks. To
ensure this, if the target dnode is dirty, they wait for the open txg to
be synced, so we can call them "stabilizing operations". If they call
txg_wait_synced() often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them. In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily. For
example, this happens if we do a read() syscall on each file before we
do its SEEK. This applies even with relatime=on, when the read() is the
first read after the last write. The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk. They don't care about metadata
like the contents of the bonus or spill blocks. (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

Description

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.
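
The decision described above can be sketched with a toy model. This is not the code in module/zfs/dnode.c: the types, the TOY_* constants, and the function name are hypothetical stand-ins chosen for illustration, and the sketch assumes each pending change is tagged with the block id it touches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ids for the two metadata-only blocks (assumption). */
#define	TOY_BONUS_BLKID	(-1LL)
#define	TOY_SPILL_BLKID	(-2LL)

/* Hypothetical stand-in for one pending (dirty) change to an object. */
typedef struct toy_dirty_record {
	long long blkid;	/* block the pending change applies to */
} toy_dirty_record_t;

/*
 * Return true if a SEEK_DATA/SEEK_HOLE would still have to wait for a
 * sync, i.e. if any pending change touches something other than the
 * bonus or spill block. Changes that touch only those two blocks do
 * not alter the tree of indirect and data blocks, so they are ignored.
 */
static bool
toy_seek_needs_sync(const toy_dirty_record_t *recs, size_t n)
{
	assert(recs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (recs[i].blkid != TOY_BONUS_BLKID &&
		    recs[i].blkid != TOY_SPILL_BLKID)
			return (true);
	}
	return (false);
}
```

In this model, a file whose only pending change is to the bonus block (for example after a read()) yields false, while a file with a pending data block write yields true.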

How Has This Been Tested?

Tested with a workload that does:

write lots of files
for each file:
    read 1 block of file
    SEEK_DATA/HOLE of the file

Previously, the first SEEK_DATA of each file caused a txg_wait_synced(). Now only the first SEEK_DATA of the first file causes a txg_wait_synced().
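
For reference, the per-file steps of that workload could look roughly like the following. This is a minimal sketch, not the actual test program used for this PR: the helper names write_file() and read_then_seek() are made up here, the 4096-byte read size is an assumption for "1 block", and the fallback SEEK_DATA/SEEK_HOLE values apply to Linux only.

```c
#define	_GNU_SOURCE	/* exposes SEEK_DATA / SEEK_HOLE on glibc */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback to the Linux ABI values if the headers did not define them. */
#if !defined(SEEK_DATA) && defined(__linux__)
#define	SEEK_DATA	3
#define	SEEK_HOLE	4
#endif

/* Create (or truncate) path and fill it with len bytes. Returns 0 or -1. */
static int
write_file(const char *path, size_t len)
{
	char buf[4096];
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return (-1);
	memset(buf, 'x', sizeof (buf));
	while (len > 0) {
		size_t chunk = len < sizeof (buf) ? len : sizeof (buf);
		ssize_t n = write(fd, buf, chunk);
		if (n <= 0) {
			(void) close(fd);
			return (-1);
		}
		len -= (size_t)n;
	}
	return (close(fd));
}

/*
 * Read one block (the interleaved metadata modification), then issue
 * SEEK_DATA and SEEK_HOLE from offset 0. Returns 0 or -1.
 */
static int
read_then_seek(const char *path, off_t *data_off, off_t *hole_off)
{
	char buf[4096];
	int fd = open(path, O_RDONLY);

	assert(data_off != NULL && hole_off != NULL);
	if (fd < 0)
		return (-1);
	if (read(fd, buf, sizeof (buf)) < 0) {
		(void) close(fd);
		return (-1);
	}
	*data_off = lseek(fd, 0, SEEK_DATA);
	*hole_off = lseek(fd, 0, SEEK_HOLE);
	(void) close(fd);
	return ((*data_off < 0 || *hole_off < 0) ? -1 : 0);
}
```

A driver would call write_file() for every file first and then read_then_seek() on each one in turn; timing the lseek() calls is one way to see whether each file pays for a sync.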

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

module/zfs/dnode.c

adamdmoss · 2022-08-26T02:28:34Z

(polite ping! wondering if @behlendorf's review comment is a show-stopper.)

behlendorf · 2022-08-26T21:13:36Z

The idea here is great. The issue caught by the test case failure just needs to be resolved before it can be merged.

amotin

Seems to make sense.

`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks reflect all writes, i.e. when there are no dirty data blocks. To ensure this, if the target dnode is dirty, they wait for the open txg to be synced, so we can call them "stabilizing operations". If they cause txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE are performed on them. In this case, the first SEEK does a txg_wait_synced(), and subsequent SEEKs don't need to wait, so performance is good.

However, if a workload involves an interleaved metadata modification, the subsequent SEEK may do a txg_wait_synced() unnecessarily. For example, if we do a `read()` syscall to each file before we do its SEEK. This applies even with `relatime=on`, when the `read()` is the first read after the last write. The txg_wait_synced() is unnecessary because the SEEK operations only care that the structure of the tree of indirect and data blocks is up to date on disk. They don't care about metadata like the contents of the bonus or spill blocks. (They also don't care if an existing data block is modified, but this would be more involved to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that they do not call txg_wait_synced() if there is only a pending change to the bonus or spill block.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes openzfs#13368
Issue openzfs#14594
Issue openzfs#14512
Issue openzfs#14009

Finix1979 · 2023-03-15T13:42:20Z

Hi @ahrens, could we say the dnode is also clean if the blkid of the lseek offset is bigger than the blkids of all dirty records and free ranges?

ahrens · 2023-03-15T14:46:52Z

@Finix1979 that is probably true, but it might take too long to determine that, with the current data structures. (dn_dirty_records is not sorted so you'd have to look through them all)
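
To make the cost concrete, here is a rough sketch of the check being discussed, using a hypothetical unsorted array in place of dn_dirty_records. The type and function names are invented for illustration, and free ranges are left out for brevity. Because the records are unsorted, every one of them has to be visited before the answer can be "yes".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one dirty record (assumption). */
typedef struct toy_dirty_block {
	uint64_t blkid;
} toy_dirty_block_t;

/*
 * Return true if every dirty block id is strictly below seek_blkid.
 * The input is unsorted, so this is a full linear scan: O(n) in the
 * number of dirty records for each lseek() call.
 */
static bool
toy_all_dirty_below(const toy_dirty_block_t *recs, size_t n,
    uint64_t seek_blkid)
{
	assert(recs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (recs[i].blkid >= seek_blkid)
			return (false);
	}
	return (true);
}
```

With a sorted structure the same question could be answered by looking at the largest dirty blkid only, which is the trade-off the comment above points at.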

thesamesam · 2023-04-16T04:01:55Z

This seems to cause #14753.

ahrens requested review from behlendorf, grwilson and mmaybee April 22, 2022 18:32

behlendorf added the "Status: Code Review Needed" label Apr 28, 2022

behlendorf reviewed May 4, 2022

This was referenced Mar 8, 2023

severe performance regression on virtual disk migration for qcow2 on ZFS with ZFS 2.1.5 #14594

Open

SEEK_HOLE loop & forced syncing causes never-ending delay in grep #14512

Closed

ahrens force-pushed the dnode_dirty branch from dea1000 to c4ae54e Compare March 9, 2023 22:55

thesamesam mentioned this pull request Mar 11, 2023

ZFS big write performance hit upgrading from 2.1.4 to 2.1.5 or 2.1.6 #14009

Open

ahrens added 2 commits March 13, 2023 09:53

new file is holey

60ef5d5

ahrens force-pushed the dnode_dirty branch from c4ae54e to 60ef5d5 Compare March 13, 2023 16:54

amotin approved these changes Mar 13, 2023

behlendorf approved these changes Mar 14, 2023

behlendorf added the "Status: Accepted" label and removed the "Status: Code Review Needed" label Mar 14, 2023

behlendorf merged commit 5198511 into openzfs:master Mar 14, 2023

behlendorf mentioned this pull request Mar 14, 2023

[2.1.10] ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced() #14627

Merged
