
ZFS big write performance hit upgrading from 2.1.4 to 2.1.5 or 2.1.6 #14009

Open
ppwaskie opened this issue Oct 10, 2022 · 13 comments
Labels
Status: Stale (No recent activity for issue), Type: Regression (Indicates a functional regression)

Comments

@ppwaskie

System information

Type Version/Name
Distribution Name Gentoo
Distribution Version Rolling
Kernel Version 5.19.14-gentoo-x86_64, 5.15.72-gentoo-x86_64
Architecture x86_64
OpenZFS Version 2.1.6 or 2.1.5

Describe the problem you're observing

I've been running ZFS 2.1.4 for quite some time on my main ZFS array, a RAIDz3 pool with a very large dataset (85TB online). On Gentoo, this version only builds against a 5.15.x or lower kernel; to move to a 5.18 or 5.19 kernel I need to upgrade to ZFS 2.1.6 so it will compile for the newer kernel. When I do, my write performance drops from 100-150 MB/sec on 5.15 with ZFS 2.1.4 (tested with emerge -a =sys-kernel/gentoo-sources-5.10.144) to about 100 kB/sec on 5.19.14 with ZFS 2.1.6.

I've tried ZFS 2.1.5 and 2.1.6 with a 5.15.72 kernel, and had the exact same performance regression.

The big issue is ZFS 2.1.4 has now been removed from the main world list after an emerge --sync, so I can't revert my installed version of 2.1.6.

Describe how to reproduce the problem

Upgrade an existing host to ZFS 2.1.5 or 2.1.6, write a larger package made up of many small files (e.g. a Linux kernel source package), and observe write performance drop by a factor of roughly 1,000 (from ~100-150 MB/sec to ~100 kB/sec).
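For reference, one possible stand-in for the emerge step, assuming the slowdown is triggered by hole-probing copies of freshly written files; the paths are illustrative and behaviour will depend on the coreutils version in use:

    # Unpack a kernel source tree onto a dataset in the pool, then immediately
    # copy it; with a coreutils that probes holes via lseek(SEEK_DATA/SEEK_HOLE),
    # the copy runs against still-dirty files:
    cd /tank/test
    tar xf linux-5.19.tar.xz
    time cp -a linux-5.19 linux-5.19.copy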

Include any warning/errors/backtraces from the system logs

I see nothing indicating anything is going wrong. Nothing in dmesg, nothing in syslogs, and zpool status is clean.

Rebooting into a 5.15 kernel with ZFS 2.1.4 on the exact same array returns the expected performance.

@ppwaskie ppwaskie added the Type: Defect Incorrect behavior (e.g. crash, hang) label Oct 10, 2022
@ryao ryao added Type: Regression Indicates a functional regression and removed Type: Defect Incorrect behavior (e.g. crash, hang) labels Oct 10, 2022
@ryao
Contributor

ryao commented Oct 10, 2022

Would you try ZFS master via the 9999 ebuild and see if the issue is present there too?

As long as you do not run a zpool upgrade $pool command, it should be safe to go to ZFS master and then back to 2.1.4.
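For reference, a minimal sketch of pulling in the live (9999) ebuilds on Gentoo, assuming the usual package.accept_keywords mechanism; the file path is conventional and may differ on your system:

    # /etc/portage/package.accept_keywords/zfs  (illustrative file name)
    # Live ebuilds carry no KEYWORDS, so "**" is needed to accept them.
    =sys-fs/zfs-9999 **
    =sys-fs/zfs-kmod-9999 **

    # Rebuild both packages without adding them to the world set:
    emerge --ask --oneshot sys-fs/zfs-kmod sys-fs/zfs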

@satarsa

satarsa commented Oct 11, 2022

The big issue is ZFS 2.1.4 has now been removed from the main world list after an emerge --sync, so I can't revert my installed version of 2.1.6.

Actually, you can. You could clone the official Gentoo repo from https://gitweb.gentoo.org/repo/gentoo.git/
as your local repo, check out the revision where zfs-kmod-2.1.4-r1 had not yet been dropped (I believe that would be 33344d7dd6b44bd93c17485d77d60c0e25ef71ee), and locally mask everything >=2.1.5.
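For reference, a rough sketch of that approach, assuming a standard repos.conf layout; paths and file names are illustrative, and the commit hash is the one mentioned above:

    # Clone the Gentoo tree and pin it to a revision that still has 2.1.4:
    git clone https://gitweb.gentoo.org/repo/gentoo.git /var/db/repos/gentoo-pinned
    git -C /var/db/repos/gentoo-pinned checkout 33344d7dd6b44bd93c17485d77d60c0e25ef71ee

    # /etc/portage/repos.conf/gentoo.conf  (point the main repo at the pinned
    # clone and keep it from being synced forward)
    [gentoo]
    location = /var/db/repos/gentoo-pinned
    auto-sync = no

    # /etc/portage/package.mask/zfs
    >=sys-fs/zfs-2.1.5
    >=sys-fs/zfs-kmod-2.1.5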

@ppwaskie
Author

@satarsa thanks for that. And @ryao I connected with one of the Gentoo maintainers for ZFS offline, and he provided me with some instructions on how to use the 9999 ebuild along with bisecting between 2.1.4 and 2.1.5. I’m happy to try and find the commit where the perf regression showed up, at least for my ZFS setup.

I honestly didn’t think this would get so much activity so soon after I opened the issue! I’m currently not at home where this server is, but I’ll try to run some of these bisect steps while I’m away this week. Worst case, I can get this nailed down this coming weekend.

All of the support is greatly appreciated!!

@ryao
Contributor

ryao commented Oct 11, 2022

I did not expect you to bisect it, but if you do, that would be awesome. I should be able to figure this out quickly if you identify the bad patch through a bisect.
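For reference, the bisect can be driven straight from the OpenZFS source tree; a minimal sketch, assuming the module is rebuilt, installed, and reloaded from the checkout at each step:

    git clone https://github.com/openzfs/zfs.git && cd zfs
    git bisect start
    git bisect bad zfs-2.1.5      # first release tag showing the slowdown
    git bisect good zfs-2.1.4     # last known-good release tag
    # At each step: build and load the module, rerun the write test, then mark
    # the result with `git bisect good` or `git bisect bad` until git reports
    # the first bad commit.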

@scineram

@ryao From the release notes, only #13405 looks like it could really impact general performance.

@ppwaskie
Author

I haven’t started bisecting yet, but more info on my system/setup where I’m seeing this issue:

  • Intel Xeon SP system, Skylake Platinum, 2 socket, 112 cores (with SMT enabled)
  • 128GB RAM
  • 13 x 10TB Seagate Exos drives in RAIDz3
  • 2 x 1TB Intel NVMe SSDs. Half of each is split between the log (SLOG) and cache (L2ARC) devices; the other half of each is in a RAID-1 mirror for the host’s root filesystem.

So I do have many cores in the system running. In that RAIDz3 pool, I have many datasets carved out, where I’m pushing about 31TB used total. Most of it is video-based streaming content for Plex, so not lots of tiny files.

I hope to have more info once I can coordinate with home and bisect on the live system.

@ppwaskie
Author

ppwaskie commented Oct 16, 2022

Apologies for the delay on this. I was finally able to get some time on the box and bisect this.

This is the offending commit that is killing write performance on my system:

9f6943504aec36f897f814fb7ae5987425436b11 is the first bad commit
commit 9f6943504aec36f897f814fb7ae5987425436b11
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Tue Nov 30 10:38:09 2021 -0800

    Default to zfs_dmu_offset_next_sync=1
    
    Strict hole reporting was previously disabled by default as a
    performance optimization.  However, this has lead to confusion
    over the expected behavior and a variety of workarounds being
    adopted by consumers of ZFS.  Change the default behavior to
    always report holes and force the TXG sync.
    
    Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Upstream-commit: 05b3eb6d232009db247882a39d518e7282630753
    Ref: #13261
    Closes #12746

 man/man4/zfs.4   |  8 ++++----
 module/zfs/dmu.c | 12 ++++++++----
 2 files changed, 12 insertions(+), 8 deletions(-)

I've taken this a step further: while still on the build with this patch, I turned off that tunable:

# echo 0 > /sys/module/zfs/parameters/zfs_dmu_offset_next_sync

And then re-tested immediately after. The issue went away: write performance went from about 100 kB/sec to 150 MB/sec (roughly three orders of magnitude).

UPDATE: I went ahead and built the 2.1.6 ebuilds, and confirmed I still had this issue. I then turned off the same tunable, and the performance issue went away.

Hope this helps inform how to deal with this upstream.
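For reference, the echo above only lasts until reboot; a minimal sketch of making the setting persistent via the standard module options file (the path is conventional and may vary by distribution):

    # /etc/modprobe.d/zfs.conf
    options zfs zfs_dmu_offset_next_sync=0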

@ryao
Contributor

ryao commented Oct 16, 2022

Nice find.

@rincebrain
Contributor

I should warn you: with that turned off, files that haven't been synced out yet may sometimes be treated as dense even when they're sparse (IIRC), so if that's a use case you care about, you may be unhappy.

Of course, when you're handing data to ZFS with compression on, it will recover the sparseness one way or another; it's just a question of whether you unnecessarily copied some zeroes only to throw them out. So if this works for you, great. Just be aware that it adds extra I/O overhead if you come looking for performance bottlenecks again.
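For reference, a quick way to check whether sparseness survived a copy is to compare a file's apparent size with its allocated size; a minimal sketch using standard coreutils, paths illustrative:

    du --apparent-size -h /tank/test/copy.img   # logical size
    du -h /tank/test/copy.img                   # space actually allocated
    # If the allocated size of a copy is close to its apparent size even though
    # the source was sparse, the holes were not detected during the copy
    # (though with compression=on the copied zeroes will largely compress away,
    # as noted above).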

@amotin
Member

amotin commented Oct 18, 2022

I don't think it is great to allow a regular unprivileged user to force, or depend on, pool TXG commits. There should be a better solution.

@amotin
Member

amotin commented Oct 18, 2022

I think at the very least the code could be optimized not to even consider committing the TXG if the file is below a certain size, especially if it is below one block, which means it can't contain holes unless it is one big hole. If I understood correctly that the workload is updating a Linux source tree, then I would guess most/many of the source files fit within one block.

@thesamesam
Contributor

thesamesam commented Mar 11, 2023

See also #14512 and #14594. #13368 may or may not help.

behlendorf pushed a commit that referenced this issue Mar 14, 2023
`lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks
reflect all writes, i.e. when there are no dirty data blocks.  To ensure
this, if the target dnode is dirty, they wait for the open txg to be
synced, so we can call them "stabilizing operations".  If they cause
txg_wait_synced often, it can be detrimental to performance.

Typically, a group of files are all modified, and then SEEK_DATA/HOLE
are performed on them.  In this case, the first SEEK does a
txg_wait_synced(), and subsequent SEEKs don't need to wait, so
performance is good.

However, if a workload involves an interleaved metadata modification,
the subsequent SEEK may do a txg_wait_synced() unnecessarily.  For
example, if we do a `read()` syscall to each file before we do its SEEK.
This applies even with `relatime=on`, when the `read()` is the first
read after the last write.  The txg_wait_synced() is unnecessary because
the SEEK operations only care that the structure of the tree of indirect
and data blocks is up to date on disk.  They don't care about metadata
like the contents of the bonus or spill blocks.  (They also don't care
if an existing data block is modified, but this would be more involved
to filter out.)

This commit changes the behavior of SEEK_DATA/HOLE operations such that
they do not call txg_wait_synced() if there is only a pending change to
the bonus or spill block.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #13368 
Issue #14594 
Issue #14512 
Issue #14009
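For reference, a rough sketch of the interleaved pattern the commit message describes, assuming a cp that probes holes via lseek(SEEK_DATA/SEEK_HOLE); paths are illustrative:

    # Step 1: modify a batch of files, leaving all of their dnodes dirty.
    for f in /tank/src/*.c; do printf '\n' >> "$f"; done

    # Step 2: read and then copy each file.  The first copy forces a txg sync;
    # before this fix, each read() (the first read after the last write, even
    # with relatime=on) re-dirtied the dnode's metadata, so every later copy's
    # hole probe waited for yet another sync.
    for f in /tank/src/*.c; do
        cat "$f" > /dev/null
        cp "$f" /tank/dst/
    done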
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Mar 14, 2023 (same commit message as above)
behlendorf pushed a commit that referenced this issue Mar 15, 2023 (same commit message as above)
lundman pushed a commit to openzfsonwindows/openzfs that referenced this issue Mar 17, 2023 (same commit message as above, pushed four times)

stale bot commented Mar 13, 2024

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale No recent activity for issue label Mar 13, 2024