zfs-2.3.0-rc4 patchset #16760 #16794

behlendorf · 2024-11-21T15:51:50Z

Motivation and Context

Initial proposed patchset for zfs-2.3.0-rc4.

Description

Bug fixes, build fixes, ZTS updates.

How Has This Been Tested?

Clean backports from master. Will be retested by the CI.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

This commit fixes JSON output for zpool list when user properties are requested with -o flag. This case needed to be handled specifically since zpool_prop_to_name does not return property name for user properties, instead it is stored in pl->pl_user_prop. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes openzfs#16734

In zpool_get_user_prop, when called from zpool_expand_proplist and collect_pool, we often have zpool_props present in zpool_handle_t equal to NULL. This mostly happens when only one user property is requested using zpool list -o <user_property>. Checking for this case and correctly initializing the zpool_props field in zpool_handle_t fixes this issue. Interestingly, this issue does not occur if we query any other property like name or guid along with a user property with -o flag because while accessing properties like guid, zpool_prop_get_int is called which checks for this case specifically and calls zpool_get_all_props. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Umer Saleem <usaleem@ixsystems.com> Closes openzfs#16734

mappedread_sf() may allocate pages; if it fails to populate a page can't free it, it needs to ensure that it's placed into a page queue, otherwise it can't be reclaimed until the vnode is destroyed. I think this is quite unlikely to happen in practice, it was noticed by code inspection. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes openzfs#16643

As a deadlock avoidance measure, zfs_getpages() would only try to acquire a rangelock, falling back to a single-page read if this was not possible. However, this is incompatible with direct I/O. Instead, release the busy lock before trying to acquire the rangelock in blocking mode. This means that it's possible for the page to be replaced, so we have to re-lookup. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes openzfs#16643

..., before we make the header or the log block visible to others. It should fix assertion on allocated space going negative if the header is freed once the lock is dropped, while the write is still going. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16040 Closes openzfs#16743

dsl_free() calls zio_free() to free the block. For most blocks, this simply calls metaslab_free() without doing any IO or putting anything on the IO pipeline. Some blocks however require additional IO to free. This at least includes gang, dedup and cloned blocks. For those, zio_free() will issue a ZIO_TYPE_FREE IO and return. If a huge number of blocks are being freed all at once, it's possible for dsl_dataset_block_kill() to be called millions of time on a single transaction (eg a 2T object of 128K blocks is 16M blocks). If those are all IO-inducing frees, that then becomes 16M FREE IOs placed on the pipeline. At time of writing, a zio_t is 1280 bytes, so for just one 2T object that requires a 20G allocation of resident memory from the zio_cache. If that can't be satisfied by the kernel, an out-of-memory condition is raised. This would be better handled by improving the cases that the dmu_tx_assign() throttle will handle, or by reducing the overheads required by the IO pipeline, or with a better central facility for freeing blocks. For now, we simply check for the cases that would cause zio_free() to create a FREE IO, and instead put the block on the pool's freelist. This is the same place that blocks from destroyed datasets go, and the async destroy machinery will automatically see them and trickle them out as normal. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#6783 Closes openzfs#16708 Closes openzfs#16722 Closes openzfs#16697

- If we don't want dmu_read_pages() to perform extra readahead/behind, pass a pointer to 0 instead of a null pointer, as dum_read_pages() expects rahead and rbehind to be non-null. - Avoid unneeded iterations in a loop. Sponsored-by: Klara, Inc. Reported-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes openzfs#16758

Since zvol read and write can process up to (DMU_MAX_ACCESS / 2) bytes in a single operation, the current optimal I/O size is too low. SCST directly reports this value as the optimal transfer length for the target SCSI device. Increasing it from the previous volblocksize results in performance improvement for large block parallel I/O workloads. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes openzfs#16750

This patch fixes compilation with uClibc by applying the same fallback as commit e12d761 to the `getversion.c` file, which was previously overlooked. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: José Luis Salvador Rufo <salvador.joseluis@gmail.com> Closes openzfs#16735 Closes openzfs#16741

Welcome to the party 🎉 Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes openzfs#16762

When an OFFLINE device is physically removed, a spare is automatically activated. However, this behavior differs in FreeBSD, where we do not transition from OFFLINE state to REMOVED. Our support team has encountered cases where customers experienced unexpected behavior during drive replacements, with multiple spares activating for the same VDEV due to a single disk replacement. This patch ensures that a drive in an OFFLINE state remains in that state, preventing it from transitioning to REMOVED and being automatically replaced by a spare. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes openzfs#16751

It should be __VA_ARGS__, not __VA_ARGS. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16780

Also fix comment cross-referencing to zpool.8. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Steve Mokris <smokris@softpixel.com> Closes openzfs#16777

Compression names actually aren't used in dedup table names, but checksum names are. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#16776

Those values require global atomics to get current hash_elements values in few of the hottest code paths, while in all the years I never cared about it. If somebody wants, it should be easy to get it by periodic sampling, since neither ARC header nor DBUF counts change so fast that it would be difficult to catch. For now I've left hash_elements_max kstat for ARC, since it was used/reported by arc_summary and it would break older versions, but now it just reports the current value. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16759

Without doing that there is a race window on export when history log write by completed rebuild dirties transaction beyond final, triggering assertion. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16714 Closes openzfs#16782

Increase the injected delay to 1000ms and the ZIO_SLOW_IO_MS threshold to 750ms to avoid false positives due to unrelated slow IOs which may occur in the CI environment. Additionally, clear the fault injection as soon as it is no longer required for the test case. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#16769

zio_delay_interrupt(), apparently used for fault injection, is executed in the I/O pipeline. It can cause the calling thread to go to sleep, which is not allowed on FreeBSD. This happens only for small delays, though, and there's no apparent reason to avoid deferring to a taskqueue in that case, as it already does otherwise. Simply go to sleep unconditionally. This fixes an occasional panic I see when running the ZTS on FreeBSD. Also remove an unhelpful comment referencing the non-existent timeout_generic(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes openzfs#16785

The current "Requires" lines only ensure the old kernel is available on the system but it does not prevent fedora from updating to an incompatible and breaking user's system. Set Conflicts to block incompatible kernels from being installed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: tleydxdy <shironeko.github@tesaguri.club> Closes openzfs#16139

If we write less than 113 bytes with enabled compression we get embeded block, which then fails check for number of cloned blocks in bclone_test. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16740

We are doing exactly the same checks around all brt_pending_add(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16740

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16740

While block cloning operation from the beginning was made per-vdev, before this change most of its data were protected by two pool- wide locks. It created lots of lock contention in many workload. This change makes most of block cloning data structures per-vdev, which allows to lock them separately. The only pool-wide lock now it spa_brt_lock, protecting array of per-vdev pointers and in most cases taken as reader. Also this splits per-vdev locks into three different ones: bv_pending_lock protects the AVL-tree of pending operations in open context, bv_mos_entries_lock protects BRT ZAP object from while being prefetched, and bv_lock protects the rest of per-vdev context during TXG commit process. There should be no functional difference aside of some optimizations. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Pawel Jakub Dawidek <pjd@FreeBSD.org> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16740

- With both pending and current AVL-trees being per-vdev and having effectively identical comparison functions (pending tree compared also birth time, but I don't believe it is possible for them to be different for the same offset within one transaction group), it makes no sense to move entries from one to another. Instead inline dramatically simplified brt_entry_addref() into brt_pending_apply(). It no longer requires bv_lock, since there is nothing concurrent to it at the time. And it does not need to search the tree for the previous entries, since it is the same tree, we already have the entry and we know it is unique. - Put brt_vdev_lookup() and brt_vdev_addref() into different tree traversals to avoid false positives in the first due to the second entcount modifications. It saves dramatic amount of time when a file cloned first time by not looking for non-existent ZAP entries. - Remove avl_is_empty(bv_tree) check from brt_maybe_exists(). I don't think it is needed, since by the time all added entries are already accounted in bv_entcount. The extra check must be producing too many false positives for no reason. Also we don't need bv_lock there, since bv_entcount pointer must be table at this point, and we don't care about false positive races here, while false negative should be impossible, since all brt_vdev_addref() have already completed by this point. This dramatically reduces lock contention on massive deletes of cloned blocks. The only remaining one is between multiple parallel free threads calling brt_entry_decref(). - Do not update ZAP if net change for a block over the TXG was 0. In combination with above it makes file move between datasets as cheap operation as originally intended if it fits into one TXG. - Do not allocate vdevs on pool creation or import if it did not have active block cloning. This allows to save a bit in few cases. - While here, add proper error handling in brt_load() on pool import instead of assertions. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16773

This fixes assertion in brt_sync_table() on debug builds when last cloned block on the vdev is freed and bv_meta_dirty is cleared, while bv_entcount_dirty is not. Should not matter in production. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16791

Update the META file to reflect compatibility with the 6.12 kernel. Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#16793

Without them the order of operations might get unexpected. Reviewed-by: Adam Moss <c@yotes.com> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16826

Direct I/O implementation added condition to call dbuf_undirty() only in case of block cloning. But the condition is not right if the block is no longer dirty in this TXG, but still in DB_NOFILL state. It resulted in block not reverting to DB_UNCACHED and following NULL de-reference on attempt to access absent db_data. While there, add assertions for db_data to make debugging easier. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16829

Doc bug missed in d7605ae. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <rob.norris@klarasystems.com> Closes openzfs#16827

FreeBSD's libprocstat seems to build kernel code in user space, which does not work here due to undefined vnode_t. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Martin Matuska <mm@FreeBSD.org> Signed-off-by:Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16808

Should make no difference, just some dead code cleanup. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Martin Matuska <mm@FreeBSD.org> Signed-off-by:Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16808

When replacing a disk, a child process is forked to run a script called zfs_prepare_disk (which can be useful for disk firmware update or health check). The parent than calls waitpid and checks the child error/status code. However, the _reap_children thread (created from zed_exec_process to manage zedlets) also waits for all children with the same PGID and can stole the signal, causing the replace operation to be aborted. As waitpid returns -1, the parent incorrectly assume that the child process had an error or was killed. This, in turn, leaves the newly added disk in REMOVED or UNAVAIL status rather than completing the replace process. This patch changes the PGID of the child process execuing the prepare script, shielding it from the _reap_children thread. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Gionatan Danti <g.danti@assyoma.it> Closes openzfs#16801

Some users might want to scrub only new data because they would like to know if the new write wasn't corrupted. This PR adds possibility scrub only newly written data. This introduces new `last_scrubbed_txg` property, indicating the transaction group (TXG) up to which the most recent scrub operation has checked and repaired the dataset, so users can run scrub only from the last saved point. We use a scn_max_txg and scn_min_txg which are already built into scrub, to accomplish that. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-By: Wasabi Technology, Inc. Sponsored-By: Klara Inc. Closes openzfs#16301

The pages in the array may become valid after this initial unbusying, so the assertion only holds during the first iteration of the outer loop. Later in zfs_getpages(), the dmu_read_pages() loop handles already-valid pages. Just drop the assertion, it's not terribly useful. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reported-by: Peter Holm <pho@FreeBSD.org> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Sponsored-by: Klara, Inc. Closes openzfs#16810 Closes openzfs#16834

In some cases like dsl_dataset_hold_obj() it is possible to handle those errors, so failure to hold dataset should be better than kernel panic. Some other places where these errors are still not handled but asserted should be less dangerous just as unreachable. We have a user report about pool corruption leading to assertions on these errors. Hopefully this will make behavior a bit nicer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16836

- Issue prescient prefetches for demand indirect blocks after the first one. It should be quite rare for reads/writes, but much more useful for cloning due to much bigger (up to 1022 blocks) accesses. It covers the gap during the first couple accesses when we can not speculate yet, but we know what is needed right now. It reduces dbuf_hold() sync read delays in dmu_buf_hold_array_by_dnode(). - Increase maximum prefetch distance for indirect blocks from 64 to 128MB. It should cover the maximum 1022 blocks of block cloning access size in case of default 128KB recordsize used. In case of bigger recordsize the above prescient prefetch should also help. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16814

In 6f50f8e we added flex arrays to lr_XX_t structs to silence kernel bounds check warnings. Userspace code was mostly not updated to use them though. It seems that in the right circumstances, compilers can get confused about sizes in the same way, and throw warnings. This commits switch those uses over to use the flex array fields also. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Rob Norris <robn@despairlabs.com> Sponsored-by: https://despairlabs.com/sponsor/ Closes openzfs#16832

Same as writes block cloning can increase block size and number of indirection levels. That means it can dirty block 0 at level 0 or at new top indirection level without explicitly holding them. A block cloning test case for large offsets has been added. Reviewed-by: Rob Norris <robn@despairlabs.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Ameer Hamza <ahamza@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16825

- Instead of copying one ashift-sized block per ZIO, copy as much as we have contiguous data up to 16MB per old vdev. To avoid data moves use gang ABDs, so that read ZIOs can directly fill buffers for write ZIOs. ABDs have much smaller overhead than ZIOs in both memory usage and processing time, plus big I/Os do not depend on I/O aggregation and scheduling to reach decent performance on HDDs. - Reduce raidz_expand_max_copy_bytes to 16MB on 32bit platforms. - Use 32bit range tree when possible (practically always now) to slightly reduce memory usage. - Use ZIO_PRIORITY_REMOVAL for early stages of expansion, same as for main ones. - Fix rate overflows in `zpool status` reporting. With these changes expanding RAIDZ1 from 4 to 5 children I am able to reach 6-12GB/s rate on SSDs and ~500MB/s on HDDs, both are limited by devices instead of CPU. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#15680 Closes openzfs#16819

We quite often invoke macros outside of functions, usually to generate functions or data. cstyle inteprets these as function headers, which at least have opinions for indenting. This introduces a separate state for top-level macro invocations, and excludes it from matching functions. For the moment, most of the existing rules will continue to apply, but this gives us a way to add or removes rules targeting macros specifically. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes openzfs#16840

It's not uncommon to have empty parameters in code generator macros, usually when multiple parameters are concatenated or stringified into a single token or literal. So, exclude the space-before-comma check, which will allow construction like `MACRO_CALL(foo, , baz)`. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes openzfs#16840

In code generation macros, we often use names like `uint` when constructing handler functions. These are not being used as types, so exclude them from the admonishment to use POSIX type names. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes openzfs#16840

cstyle can handle these cases now, so we don't need to disable it. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes openzfs#16840

When vdev first sees some block cloning, there is a window when brt_maybe_exists() might already return true since something was cloned, but bv_mos_entries is still 0 since BRT ZAP was not yet created. In such case we should not try to look into the ZAP and dereference NULL bv_mos_entries_dnode. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16851

In dbuf_sync_leaf, we clone the arc_buf in dr if we share it with db except for overridden case. However, this exception causes a race where dbuf_new_size could free the arc_buf after the last dereference of *datap and causes use-after-free. We fix this by cloning the buf regardless if it's overridden. The race: -- P0 P1 dbuf_hold_impl() // dbuf_hold_copy passed // because db_data_pending NULL dbuf_sync_leaf() // doesn't clone *datap // *datap derefed to db_buf dbuf_write(*datap) dbuf_new_size() dmu_buf_will_dirty() dbuf_fix_old_data() // alloc new buf for P0 dr // but can't change *datap arc_alloc_buf() arc_buf_destroy() // alloc new buf for db_buf // and destroy old buf dbuf_write() // continue abd_get_from_buf(data->b_data, arc_buf_size(data)) // use-after-free -- Here's an example when it happens: BUG: kernel NULL pointer dereference, address: 000000000000002e RIP: 0010:arc_buf_size+0x1c/0x30 [zfs] Call Trace: dbuf_write+0x3ff/0x580 [zfs] dbuf_sync_leaf+0x13c/0x530 [zfs] dbuf_sync_list+0xbf/0x120 [zfs] dnode_sync+0x3ea/0x7a0 [zfs] sync_dnodes_task+0x71/0xa0 [zfs] taskq_thread+0x2b8/0x4e0 [spl] kthread+0x112/0x130 ret_from_fork+0x1f/0x30 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Co-authored-by: Chunwei Chen <david.chen@nutanix.com> Closes openzfs#16854

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

usaleem-ix and others added 10 commits November 14, 2024 16:51

behlendorf added the Status: Code Review Needed Ready for review and testing label Nov 21, 2024

behlendorf changed the base branch from master to zfs-2.3-release November 21, 2024 15:52

behlendorf mentioned this pull request Nov 21, 2024

zfs-2.3.0-rc4 patchset #16760

Closed

7 tasks

behlendorf force-pushed the zfs-2.3.0-rc4-staging branch from cefff1e to ec407c3 Compare November 21, 2024 15:57

ixhamza and others added 16 commits November 21, 2024 08:24

Fix few __VA_ARGS typos in assertions

8023d9d

It should be __VA_ARGS__, not __VA_ARGS. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes openzfs#16780

Expand zpool-remove.8 manpage with example results

9a4b2f0

Also fix comment cross-referencing to zpool.8. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Steve Mokris <smokris@softpixel.com> Closes openzfs#16777

Linux 6.12 compat: META

3ed1d60

Update the META file to reflect compatibility with the 6.12 kernel. Reviewed-by: Umer Saleem <usaleem@ixsystems.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#16793

amotin and others added 3 commits December 2, 2024 18:14

behlendorf force-pushed the zfs-2.3.0-rc4-staging branch from f30e11f to bb516ee Compare December 3, 2024 02:15

amotin and others added 8 commits December 5, 2024 09:33

behlendorf force-pushed the zfs-2.3.0-rc4-staging branch from bb516ee to 75205db Compare December 5, 2024 17:34

behlendorf force-pushed the zfs-2.3.0-rc4-staging branch from 75205db to 87ff290 Compare December 5, 2024 19:49

amotin and others added 5 commits December 6, 2024 09:05

behlendorf force-pushed the zfs-2.3.0-rc4-staging branch from 87ff290 to e795b6b Compare December 6, 2024 17:05

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Work in Progress Not yet ready for general review Status: Code Review Needed Ready for review and testing labels Dec 6, 2024

behlendorf marked this pull request as ready for review December 6, 2024 17:05

LunarLambda mentioned this pull request Dec 10, 2024

Linux 6.12 LTS support for zfs 2.3.0 #16850

Open

amotin and others added 3 commits December 12, 2024 16:20

Tag 2.3.0-rc4

7cbe7bb

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

behlendorf force-pushed the zfs-2.3.0-rc4-staging branch from e795b6b to 7cbe7bb Compare December 13, 2024 00:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

zfs-2.3.0-rc4 patchset #16760 #16794

zfs-2.3.0-rc4 patchset #16760 #16794

behlendorf commented Nov 21, 2024

zfs-2.3.0-rc4 patchset #16760 #16794

Are you sure you want to change the base?

zfs-2.3.0-rc4 patchset #16760 #16794

Conversation

behlendorf commented Nov 21, 2024

Motivation and Context

Description

How Has This Been Tested?

Types of changes