
BRT: Rework structures and locks to be per-vdev #16740

Closed
amotin wants to merge 4 commits

Conversation

amotin (Member) commented Nov 10, 2024

While the block cloning operation was made per-vdev from the beginning, before this change most of its data was protected by two pool-wide locks, which created a lot of lock contention in many workloads.

This change makes most of the block cloning data structures per-vdev, which allows them to be locked separately. The only remaining pool-wide lock is spa_brt_lock, protecting the array of per-vdev pointers and in most cases taken as reader. This change also splits the per-vdev locks into three: bv_pending_lock protects the AVL tree of pending operations in open context, bv_mos_entries_lock protects the BRT ZAP object from destruction while it is being prefetched from open context, and bv_lock protects the rest of the per-vdev state during the TXG commit process. There should be no functional difference aside from some optimizations.
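
For orientation, here is a minimal sketch of the layout described above. Only the lock names (bv_pending_lock, bv_mos_entries_lock, bv_lock, spa_brt_lock) come from this change; the remaining members, their exact types, and the headers are illustrative assumptions, not the actual brt.h.

```c
/*
 * Hedged sketch of the reworked per-vdev BRT state described above.
 * The lock names come from this PR; everything else is illustrative.
 */
#include <sys/zfs_context.h>	/* kmutex_t, krwlock_t, TXG_SIZE */
#include <sys/avl.h>		/* avl_tree_t */

typedef struct brt_vdev {
	/* Open context: AVL tree of not-yet-synced clone operations. */
	kmutex_t	bv_pending_lock;
	avl_tree_t	bv_pending_tree[TXG_SIZE];

	/*
	 * Open context: keeps the BRT ZAP object from being destroyed
	 * underneath a concurrent prefetch.
	 */
	krwlock_t	bv_mos_entries_lock;
	uint64_t	bv_mos_entries;		/* on-disk BRT ZAP object */

	/* TXG commit (syncing) context: everything else. */
	kmutex_t	bv_lock;
} brt_vdev_t;

/*
 * The one remaining pool-wide lock: an rwlock over the per-vdev
 * pointer array, taken as reader on the hot paths.
 *
 *	krwlock_t	  spa_brt_lock;
 *	brt_vdev_t	**spa_brt_vdevs;
 */
```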

While there, add by_dnode variants to lookup/prefetch_uint64 ZAP calls.
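
The by_dnode variants follow the pattern of the existing by-dnode ZAP interfaces: they take an already-held dnode instead of an (objset, object) pair, saving a dnode lookup on each BRT ZAP access. A sketch of the declarations, with the exact signatures treated as assumptions modeled on zap_lookup_uint64()/zap_prefetch_uint64():

```c
#include <sys/zap.h>

/*
 * Sketch of the by_dnode variants added here; exact signatures are
 * assumptions based on the existing uint64 ZAP interfaces.
 */
int zap_lookup_uint64_by_dnode(dnode_t *dn, const uint64_t *key,
    int key_numints, uint64_t integer_size, uint64_t num_integers,
    void *buf);
int zap_prefetch_uint64_by_dnode(dnode_t *dn, const uint64_t *key,
    int key_numints);
```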

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


amotin (Member Author) commented Nov 12, 2024

Added a fix for the bclone/bclone_prop_sync ZTS test. Not my fault, just an unstable test.

If we write less than 113 bytes with compression enabled, we get an
embedded block, which then fails the check for the number of cloned
blocks in bclone_test.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
We are doing exactly the same checks around all brt_pending_add() calls.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
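
For context on the two commit messages above: holes and embedded blocks never enter the BRT, which is why a sub-113-byte compressed write yields no cloned-block count for bclone_test to verify. A hedged sketch of the check pattern around brt_pending_add() (BP_IS_HOLE()/BP_IS_EMBEDDED() are the standard ZFS block-pointer macros; the surrounding call shape is illustrative):

```c
/* Holes and embedded blocks are never tracked by the BRT. */
if (!BP_IS_HOLE(bp) && !BP_IS_EMBEDDED(bp))
	brt_pending_add(spa, bp, tx);
```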
amotin (Member Author) commented Nov 15, 2024

After more profiling, I decided to further optimize brt_entry_addref(): pass it the already-available brtvd to avoid another brt_vdev() call with its locks, and pass it the reference count so it is not called multiple times when a block is cloned several times within one TXG.
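
In sketch form, the call-shape change being described (names and parameters are assumptions based on the comment above, not the exact diff):

```c
/* Before: every reference resolved the vdev and took its locks. */
brt_entry_addref(spa, bp);		/* one call per clone */

/*
 * After: the caller looks up brtvd once via brt_vdev() and passes an
 * accumulated count, so a block cloned N times within one TXG costs
 * a single call.
 */
brt_entry_addref(brtvd, bp, count);
```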

While the block cloning operation was made per-vdev from the
beginning, before this change most of its data was protected by
two pool-wide locks, which created a lot of lock contention in
many workloads.

This change makes most of the block cloning data structures
per-vdev, which allows them to be locked separately.  The only
remaining pool-wide lock is spa_brt_lock, protecting the array of
per-vdev pointers and in most cases taken as reader.  This change
also splits the per-vdev locks into three: bv_pending_lock protects
the AVL tree of pending operations in open context,
bv_mos_entries_lock protects the BRT ZAP object from destruction
while it is being prefetched, and bv_lock protects the rest of the
per-vdev state during the TXG commit process.  There should be no
functional difference aside from some optimizations.

Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
behlendorf added the "Status: Accepted" (ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (ready for review and testing) label Nov 15, 2024
behlendorf pushed a commit that referenced this pull request Nov 15, 2024
We are doing exactly the same checks around all brt_pending_add() calls.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
behlendorf pushed a commit that referenced this pull request Nov 15, 2024
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
behlendorf pushed a commit that referenced this pull request Nov 15, 2024
While the block cloning operation was made per-vdev from the
beginning, before this change most of its data was protected by
two pool-wide locks, which created a lot of lock contention in
many workloads.

This change makes most of the block cloning data structures
per-vdev, which allows them to be locked separately.  The only
remaining pool-wide lock is spa_brt_lock, protecting the array of
per-vdev pointers and in most cases taken as reader.  This change
also splits the per-vdev locks into three: bv_pending_lock protects
the AVL tree of pending operations in open context,
bv_mos_entries_lock protects the BRT ZAP object from destruction
while it is being prefetched, and bv_lock protects the rest of the
per-vdev state during the TXG commit process.  There should be no
functional difference aside from some optimizations.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #16740
amotin deleted the brt_lock branch November 15, 2024 23:09
behlendorf pushed 4 commits to behlendorf/zfs that referenced this pull request Nov 15, 2024

markjdb (Contributor) commented Nov 19, 2024

After this commit I get a panic while running the ZTS on FreeBSD: brt_unload() may destroy spa->spa_brt_lock before the lock is initialized:

```
panic: sx lock still held
cpuid = 1
time = 1732035378
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00d90e88b0
vpanic() at vpanic+0x136/frame 0xfffffe00d90e89e0
panic() at panic+0x43/frame 0xfffffe00d90e8a40
sx_destroy() at sx_destroy+0x30/frame 0xfffffe00d90e8a50
spa_unload() at spa_unload+0x2fd/frame 0xfffffe00d90e8a90
spa_tryimport() at spa_tryimport+0x378/frame 0xfffffe00d90e8b00
zfs_ioc_pool_tryimport() at zfs_ioc_pool_tryimport+0x3a/frame 0xfffffe00d90e8b30
zfsdev_ioctl_common() at zfsdev_ioctl_common+0x48d/frame 0xfffffe00d90e8bd0
zfsdev_ioctl() at zfsdev_ioctl+0xee/frame 0xfffffe00d90e8c00
devfs_ioctl() at devfs_ioctl+0xd1/frame 0xfffffe00d90e8c50
vn_ioctl() at vn_ioctl+0xb6/frame 0xfffffe00d90e8cc0
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe00d90e8ce0
kern_ioctl() at kern_ioctl+0x286/frame 0xfffffe00d90e8d40
sys_ioctl() at sys_ioctl+0x12f/frame 0xfffffe00d90e8e00
amd64_syscall() at amd64_syscall+0x158/frame 0xfffffe00d90e8f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe00d90e8f30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x82b79eb6a, rsp = 0x82058e848, rbp = 0x82058e8b0 ---
```

gdb says:

```
(gdb) frame 18
#18 0xffffffff82c8dafa in brt_unload (spa=spa@entry=0xfffffe00ff928000) at /home/markj/src/zfs/module/zfs/brt.c:1544
1544            rw_destroy(&spa->spa_brt_lock);
(gdb) p spa->spa_brt_lock
$1 = {
  lock_object = {
    lo_name = 0x0,
    lo_flags = 0,
    lo_data = 0,
    lo_witness = 0x0
  },
  sx_lock = 0
}
```

In particular, brt_unload() isn't idempotent; it seems we want some kind of guard in case brt_alloc() was never called.

amotin (Member Author) commented Nov 19, 2024

@markjdb Yes, I hit it myself just last night. The last chunk of #16773 should fix it.
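
For reference, the kind of guard being discussed looks roughly like this; whether the fix in #16773 keys off this exact field is an assumption:

```c
/*
 * Sketch: make brt_unload() a no-op when brt_alloc() never ran, so
 * spa_unload() on a tryimport/failed-import path does not destroy an
 * uninitialized spa_brt_lock.
 */
void
brt_unload(spa_t *spa)
{
	if (spa->spa_brt_rangesize == 0)
		return;			/* brt_alloc() was never called */

	/* ... tear down per-vdev BRT state ... */

	rw_destroy(&spa->spa_brt_lock);
	spa->spa_brt_rangesize = 0;
}
```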

behlendorf pushed 4 commits to behlendorf/zfs that referenced this pull request Nov 21, 2024

mmatuska (Contributor) commented

@amotin is this something for 2.2.7?

amotin (Member Author) commented Nov 24, 2024

is this something for 2.2.7?

@mmatuska I don't plan to run 2.2.7 in production myself, but it could be, together with #16773 and #16791. The patches are very new, though, so maybe give them time to soak until 2.2.8? It would definitely improve cloning performance, but formally it is not a bug fix. @tonyhutter What do you think?

ixhamza pushed 4 commits to truenas/zfs that referenced this pull request Dec 2, 2024

arter97 pushed 2 commits to arter97/zfs that referenced this pull request Dec 9, 2024