
BRT: Rework structures and locks to be per-vdev #16740

Closed
amotin wants to merge 4 commits

Conversation

amotin (Member) commented Nov 10, 2024

While the block cloning operation was made per-vdev from the beginning, before this change most of its data was protected by two pool-wide locks, which created a lot of lock contention in many workloads.

This change makes most of the block cloning data structures per-vdev, which allows them to be locked separately. The only remaining pool-wide lock is spa_brt_lock, protecting the array of per-vdev pointers and in most cases taken as reader. This change also splits the per-vdev locks into three: bv_pending_lock protects the AVL tree of pending operations in open context, bv_mos_entries_lock protects the BRT ZAP object from destruction while it is being prefetched from open context, and bv_lock protects the rest of the per-vdev state during the TXG commit process. There should be no functional difference aside from some optimizations.
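
For orientation, here is a minimal sketch of the layout described above. Only the lock names (bv_pending_lock, bv_mos_entries_lock, bv_lock, spa_brt_lock) come from this change; the remaining members, their exact types, and the headers are illustrative assumptions, not the actual brt.h.

```c
/*
 * Hedged sketch of the reworked per-vdev BRT state described above.
 * The lock names come from this PR; everything else is illustrative.
 */
#include <sys/zfs_context.h>	/* kmutex_t, krwlock_t, TXG_SIZE */
#include <sys/avl.h>		/* avl_tree_t */

typedef struct brt_vdev {
	/* Open context: AVL tree of not-yet-synced clone operations. */
	kmutex_t	bv_pending_lock;
	avl_tree_t	bv_pending_tree[TXG_SIZE];

	/*
	 * Open context: keeps the BRT ZAP object from being destroyed
	 * underneath a concurrent prefetch.
	 */
	krwlock_t	bv_mos_entries_lock;
	uint64_t	bv_mos_entries;		/* on-disk BRT ZAP object */

	/* TXG commit (syncing) context: everything else. */
	kmutex_t	bv_lock;
} brt_vdev_t;

/*
 * The one remaining pool-wide lock: an rwlock over the per-vdev
 * pointer array, taken as reader on the hot paths.
 *
 *	krwlock_t	  spa_brt_lock;
 *	brt_vdev_t	**spa_brt_vdevs;
 */
```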

While there, add by_dnode variants to lookup/prefetch_uint64 ZAP calls.
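
The by_dnode variants follow the pattern of the existing by-dnode ZAP interfaces: they take an already-held dnode instead of an (objset, object) pair, saving a dnode lookup on each BRT ZAP access. A sketch of the declarations, with the exact signatures treated as assumptions modeled on zap_lookup_uint64()/zap_prefetch_uint64():

```c
#include <sys/zap.h>

/*
 * Sketch of the by_dnode variants added here; exact signatures are
 * assumptions based on the existing uint64 ZAP interfaces.
 */
int zap_lookup_uint64_by_dnode(dnode_t *dn, const uint64_t *key,
    int key_numints, uint64_t integer_size, uint64_t num_integers,
    void *buf);
int zap_prefetch_uint64_by_dnode(dnode_t *dn, const uint64_t *key,
    int key_numints);
```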

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


amotin (Member Author) commented Nov 12, 2024

Added a fix for the bclone/bclone_prop_sync ZTS test. Not my fault, just an unstable test.

If we write less than 113 bytes with compression enabled, we get an
embedded block, which then fails the check for the number of cloned
blocks in bclone_test.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
We are doing exactly the same checks around all brt_pending_add() calls.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
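
For context on the two commit messages above: holes and embedded blocks never enter the BRT, which is why a sub-113-byte compressed write yields no cloned-block count for bclone_test to verify. A hedged sketch of the check pattern around brt_pending_add() (BP_IS_HOLE()/BP_IS_EMBEDDED() are the standard ZFS block-pointer macros; the surrounding call shape is illustrative):

```c
/* Holes and embedded blocks are never tracked by the BRT. */
if (!BP_IS_HOLE(bp) && !BP_IS_EMBEDDED(bp))
	brt_pending_add(spa, bp, tx);
```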
amotin (Member Author) commented Nov 15, 2024

After more profiling, I decided to further optimize brt_entry_addref(): pass it the already-available brtvd to avoid another brt_vdev() call with its locks, and pass it the reference count so it is not called multiple times when a block is cloned several times within one TXG.
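
In sketch form, the call-shape change being described (names and parameters are assumptions based on the comment above, not the exact diff):

```c
/* Before: every reference resolved the vdev and took its locks. */
brt_entry_addref(spa, bp);		/* one call per clone */

/*
 * After: the caller looks up brtvd once via brt_vdev() and passes an
 * accumulated count, so a block cloned N times within one TXG costs
 * a single call.
 */
brt_entry_addref(brtvd, bp, count);
```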

While the block cloning operation was made per-vdev from the
beginning, before this change most of its data was protected by
two pool-wide locks, which created a lot of lock contention in
many workloads.

This change makes most of the block cloning data structures
per-vdev, which allows them to be locked separately.  The only
remaining pool-wide lock is spa_brt_lock, protecting the array of
per-vdev pointers and in most cases taken as reader.  This change
also splits the per-vdev locks into three: bv_pending_lock protects
the AVL tree of pending operations in open context,
bv_mos_entries_lock protects the BRT ZAP object from destruction
while it is being prefetched, and bv_lock protects the rest of the
per-vdev state during the TXG commit process.  There should be no
functional difference aside from some optimizations.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
behlendorf added the "Status: Accepted" (ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (ready for review and testing) label Nov 15, 2024
behlendorf pushed a commit that referenced this pull request Nov 15, 2024
We are doing exactly the same checks around all brt_pending_add() calls.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
behlendorf pushed a commit that referenced this pull request Nov 15, 2024
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
behlendorf pushed a commit that referenced this pull request Nov 15, 2024
While the block cloning operation was made per-vdev from the
beginning, before this change most of its data was protected by
two pool-wide locks, which created a lot of lock contention in
many workloads.

This change makes most of the block cloning data structures
per-vdev, which allows them to be locked separately.  The only
remaining pool-wide lock is spa_brt_lock, protecting the array of
per-vdev pointers and in most cases taken as reader.  This change
also splits the per-vdev locks into three: bv_pending_lock protects
the AVL tree of pending operations in open context,
bv_mos_entries_lock protects the BRT ZAP object from destruction
while it is being prefetched, and bv_lock protects the rest of the
per-vdev state during the TXG commit process.  There should be no
functional difference aside from some optimizations.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
amotin deleted the brt_lock branch November 15, 2024 23:09
behlendorf pushed 4 commits to behlendorf/zfs that referenced this pull request Nov 15, 2024

markjdb (Contributor) commented Nov 19, 2024

After this commit I get a panic while running the ZTS on FreeBSD: brt_unload() may destroy spa->spa_brt_lock before the lock is initialized:

```
panic: sx lock still held
cpuid = 1
time = 1732035378
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d90e88b0
vpanic() at vpanic+0x136/frame 0xfffffe00d90e89e0
panic() at panic+0x43/frame 0xfffffe00d90e8a40
sx_destroy() at sx_destroy+0x30/frame 0xfffffe00d90e8a50
spa_unload() at spa_unload+0x2fd/frame 0xfffffe00d90e8a90
spa_tryimport() at spa_tryimport+0x378/frame 0xfffffe00d90e8b00
zfs_ioc_pool_tryimport() at zfs_ioc_pool_tryimport+0x3a/frame 0xfffffe00d90e8b30
zfsdev_ioctl_common() at zfsdev_ioctl_common+0x48d/frame 0xfffffe00d90e8bd0
zfsdev_ioctl() at zfsdev_ioctl+0xee/frame 0xfffffe00d90e8c00
devfs_ioctl() at devfs_ioctl+0xd1/frame 0xfffffe00d90e8c50
vn_ioctl() at vn_ioctl+0xb6/frame 0xfffffe00d90e8cc0
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe00d90e8ce0
kern_ioctl() at kern_ioctl+0x286/frame 0xfffffe00d90e8d40
sys_ioctl() at sys_ioctl+0x12f/frame 0xfffffe00d90e8e00
amd64_syscall() at amd64_syscall+0x158/frame 0xfffffe00d90e8f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d90e8f30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x82b79eb6a, rsp = 0x82058e848, rbp = 0x82058e8b0 ---
```

gdb says:

```
(gdb) frame 18
#18 0xffffffff82c8dafa in brt_unload (spa=spa@entry=0xfffffe00ff928000) at /home/markj/src/zfs/module/zfs/brt.c:1544
1544            rw_destroy(&spa->spa_brt_lock);
(gdb) p spa->spa_brt_lock
$1 = {
  lock_object = {
    lo_name = 0x0,
    lo_flags = 0,
    lo_data = 0,
    lo_witness = 0x0
  },
  sx_lock = 0
}
```

In particular, brt_unload() isn't idempotent; it seems we want some kind of guard in case brt_alloc() was never called.

amotin (Member Author) commented Nov 19, 2024

@markjdb Yes, I hit it myself just last night. The last chunk of #16773 should fix it.
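
For reference, the kind of guard being discussed looks roughly like this; whether the fix in #16773 keys off this exact field is an assumption:

```c
/*
 * Sketch: make brt_unload() a no-op when brt_alloc() never ran, so
 * spa_unload() on a tryimport/failed-import path does not destroy an
 * uninitialized spa_brt_lock.
 */
void
brt_unload(spa_t *spa)
{
	if (spa->spa_brt_rangesize == 0)
		return;			/* brt_alloc() was never called */

	/* ... tear down per-vdev BRT state ... */

	rw_destroy(&spa->spa_brt_lock);
	spa->spa_brt_rangesize = 0;
}
```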

behlendorf pushed 4 commits to behlendorf/zfs that referenced this pull request Nov 21, 2024

mmatuska (Contributor) commented

@amotin is this something for 2.2.7?

amotin (Member Author) commented Nov 24, 2024

is this something for 2.2.7?

@mmatuska I don't plan to run 2.2.7 in production myself, but it could be, together with #16773 and #16791. The patches are very new, though, so maybe give them time to soak until 2.2.8? It would definitely improve cloning performance, but formally it is not a bug fix. @tonyhutter What do you think?

ixhamza pushed 4 commits to truenas/zfs that referenced this pull request Dec 2, 2024

arter97 pushed 2 commits to arter97/zfs that referenced this pull request Dec 9, 2024