-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zfs-2.1.3 patchset #13063
zfs-2.1.3 patchset #13063
Conversation
va_seq was actually a thin veil over va_gen, so z_gen is a more appropriate value than z_seq to populate the field with. Drop the unnecessary compat obfuscation and provide the correct file generation number. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <freqlabs@freebsd.org> Closes openzfs#12851
On FreeBSD vnode reclamation is single-threaded, protected by single global lock. Linux seems to be able to use a thread per mount point, but at this time it creates more harm than good. Reduce number of threads to 1, adding tunable in case somebody wants to try more. Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes openzfs#12896 Issue openzfs#9966
A recent commit to FreeBSD changed the type of vop_readdir_args.a_cookies to a uint64_t**. There is no functional impact to ZFS because ZFS only uses 32-bit cookies, which will be zero-extended to 64-bits by the existing code. freebsd/freebsd-src@b214fcc Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Alan Somers <asomers@gmail.com> Closes openzfs#12874
When restructuring the zvol_open() logic for the Linux 5.13 kernel a lock inversion was accidentally introduced. In the updated code the spa_namespace_lock is now taken before the zv_suspend_lock allowing the following scenario to occur: down_read <=== waiting for zv_suspend_lock zvol_open <=== holds spa_namespace_lock __blkdev_get blkdev_get_by_dev blkdev_open ... mutex_lock <== waiting for spa_namespace_lock spa_open_common spa_open dsl_pool_hold dmu_objset_hold_flags dmu_objset_hold dsl_prop_get dsl_prop_get_integer zvol_create_minor dmu_recv_end zfs_ioc_recv_impl <=== holds zv_suspend_lock via zvol_suspend() zfs_ioc_recv ... This commit resolves the issue by moving the acquisition of the spa_namespace_lock back to after the zv_suspend_lock which restores the original ordering. Additionally, as part of this change the error exit paths were simplified where possible. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#12863
pidfile_open() sets *pidptr to -1 if the process currently holding the lock is between pidfile_open() and pidfile_write(), the subsequent kill(mountdpid) would potentially SIGHUP all non-system processes except init: just sleep for half a millisecond and try again in that case Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#12067
Any error from lzc_send_redacted is overwritten by the error of send_conclusion_record; skip writing the conclusion record if there was an earlier error. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Philipp Riederer <philipp@riederer.email> Closes openzfs#12766
Do not redefine the fallthrough macro when building with libcpp. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Martin Matuska <mm@FreeBSD.org> Closes openzfs#12880
When performing I/O on FreeBSD using a file based vdev ensure all errors encountered when reading/writing are propagated through the zio pipeline. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#12904
Verify that all empty sectors are zero filled before using them to calculate parity. Failure to do so can result in incorrect parity columns being generated and written to disk if the contents of an empty sector are non-zero. This was possible because the checksum only protects the data portions of the buffer, not the empty sector padding. This issue has been addressed by updating raidz_parity_verify() to check that all dRAID empty sectors are zero filled. Any sectors which are non-zero will be fixed, repair IO issued, and a checksum error logged. They can then be safely used to verify the parity. This specific type of damage is unlikely to occur since it requires a disk to have silently returned bad data, for an empty sector, while performing a scrub. However, if a pool were to have been damaged in this way, scrubbing the pool with this change applied will repair both the empty sector and parity columns as long as the data checksum is valid. Checksum errors will be reported in the `zpool status` output for any repairs which are made. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#12857
sc.nr_to_scan is an input to super_cache_clean (via shrinker->scan_objects), used to set the number of objects to scan in the various caches. However super_cache_scan also modifies sc.nr_to_scan, so when used in a loop we need to reset sc.nr_to_scan back to our desired nr_to_scan for the next iteration. Issue discovered and solution suggested by Tenzin Lhakhang @tlhakhan. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue openzfs#12433 Closes openzfs#12908
Fix from openzfs#12844 (comment) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#12905
They're later |=d with constants, but never reset Caught by valgrind while investigating openzfs#12928 (comment) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#12954
The FreeBSD implementations of various libspl functions for getting mounted device information were found to leak several strings which were being allocated in statfs2mnttab but never freed. The Solaris getmntany(3C) and related interfaces are expected to return strings residing in static buffers that need to be copied rather than freed by the caller. Use static thread-local storage to stash the mnttab structure strings from FreeBSD's statfs info rather than strings allocated on the heap by strdup(3). While here, remove some stray commented out lines. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes openzfs#12961
These are the changes for FreeBSD corresponding to the changes made for Linux in openzfs#12863, see that PR for details. Changes from openzfs#12863 are applied for zvol_geom_open and zvol_cdev_open on FreeBSD. This also adds a check for the zvol dying which we had in zvol_geom_open but was missing in zvol_cdev_open. The check causes the open to fail early with ENXIO when we are in the middle of changing volmode. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes openzfs#12934
First open locking changes were correctly applied to zvol_geom_open but incorrectly applied to zvol_cdev_open, causing spa_namespace_lock to be held indefinitely. Make the first open locking in zvol_cdev_open match zvol_geom_open. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <ryan@iXsystems.com> Closes openzfs#13016
avl_add does avl_find internally, then avl_insert. We're already doing the avl_find, so using avl_insert directly avoids repeating the search. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes openzfs#12967
There is no need to allocate a holds nvlist. lzc_get_holds does that for us. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes openzfs#12967
When the eviction thread goes to shrink an ARC state, it allocates a set of marker buffers used to hold its place in the state's sublists. This can be problematic in low memory conditions, since 1) the allocation can be substantial, as we allocate NCPU markers; 2) on at least FreeBSD, page reclamation can block in arc_wait_for_eviction() In particular, in stress tests it's possible to hit a deadlock on FreeBSD when the number of free pages is very low, wherein the system is waiting for the page daemon to reclaim memory, the page daemon is waiting for the ARC eviction thread to finish, and the ARC eviction thread is blocked waiting for more memory. Try to reduce the likelihood of such deadlocks by pre-allocating markers for the eviction thread at ARC initialization time. When evicting buffers from an ARC state, check to see if the current thread is the ARC eviction thread, and use the pre-allocated markers for that purpose rather than dynamically allocating them. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: George Amanakis <gamanakis@gmail.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes openzfs#12985
FreeBSD's implementation of zfs_uio_fault_move() returns EFAULT when a page fault occurs while copying data in or out of user buffers. The VFS treats such errors specially and will retry the I/O operation (which may have made some partial progress). When the FreeBSD and Linux implementations of zfs_write() were merged, the handling of errors from dmu_write_uio_dbuf() changed such that EFAULT is not handled as a partial write. For example, when appending to a file, the z_size field of the znode is not updated after a partial write resulting in EFAULT. Restore the old handling of errors from dmu_write_uio_dbuf() to fix this. This should have no impact on Linux, which has special handling for EFAULT already. Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Mark Johnston <markj@FreeBSD.org> Closes openzfs#12964
As it says on the tin - the folio work moved a bunch out of mm.h. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes openzfs#12975
add_disk went from void to must-check int return. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes openzfs#12975
Linux decided to rename this for some reason. At some point, we should probably invert this mapping, but for now... Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes openzfs#12975
For us, I think it's always just FALLOC_FL_PUNCH_HOLE with a fake mustache on. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Coleman Kane <ckane@colemankane.org> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes openzfs#12975
Linux 5.17 sees a rename from complete_and_exit() to kthread complete_and_exit() Upstream commit cead18552660702a4a46f58e65188fe5f36e9dfe ("exit: Rename complete_and_exit to kthread_complete_and_exit") Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#12989
Linux 5.17's dequeue_signal() takes an additional enum pid_type * output argument Upstream commit 5768d8906bc23d512b1a736c1e198aa833a6daa4 ("signal: Requeue signals in the appropriate queue") Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#12989
Upstream commit 359745d78351c6f5442435f81549f0207ece28aa ("proc: remove PDE_DATA() completely") Link: https://lore.kernel.org/all/20211124081956.87711-2-songmuchun@bytedance.com/T/#u Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13004 Closes openzfs#12989
When using the two argument version of submit_bio() in kernel's prior to 4.8 the first argument should be specified. It's used by block dump to report the bio direction. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Finix Yan <yancw@info2soft.com> Closes openzfs#13006
Raw receiving a snapshot back to the originating dataset is currently impossible because of user accounting being present in the originating dataset. One solution would be resetting user accounting when raw receiving on the receiving dataset. However, to recalculate it we would have to dirty all dnodes, which may not be preferable on big datasets. Instead, we rely on the os_phys flag OBJSET_FLAG_USERACCOUNTING_COMPLETE to indicate that user accounting is incomplete when raw receiving. Thus, on the next mount of the receiving dataset the local mac protecting user accounting is zeroed out. The flag is then cleared when user accounting of the raw received snapshot is calculated. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes openzfs#12981 Closes openzfs#10523 Closes openzfs#11221 Closes openzfs#11294 Closes openzfs#12594 Issue openzfs#11300
On FreeBSD pools are not allowed to be created using vdevs which are backed by ZFS volumes. This configuration is not recommended for any supported platform, nevertheless the largest_pool_001_pos.ksh test case makes use of it as a convenience. This causes the test case to fail reliably on FreeBSD. The layout is still tolerated on Linux so only perform this test on Linux. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#13166
As previously noted in openzfs#12272 the receive-o-x_props_override.ksh test reliably fails on FreeBSD. Since we don't expect this test to pass move the exception from the "maybe" to "known" section. This way we don't retry the FAILED test when it is not expected to pass. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#13167
There seems to be some unexpected checksum errors that are causing a number of the FreeBSD tests to fail:
I'm going to look into those.. |
As explained by the disclaimer in the test case, "This test can fail since nothing guarantees that old MOS blocks aren't overwritten." This behavior is expected and correct, but results in a flaky test case which is problematic for the CI. The best we can do to resolve this is to retry the sub-test which failed when the MOS blocks have clearly been overwritten. When testing failures were rare enough that a single retry should normally be sufficient. However, we allow up to five for good measure. Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#13119
Related to commit 90b77a0. Retry the `zpool export` if the pool is "busy" indicating there is a process accessing the mount point. This can happen after an import, allowing it to be retried will avoid spurious test failures. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#13169
While switching abd_zero_buf allocation KPI I've missed the fact that kmem_zalloc() zeroed the allocation, while kmem_cache_alloc() does not. Add explicit bzero() after it. I don't think it should have caused real problems, but leaking one memory page content all over the pool is not good. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Ryan Moeller <ryan@ixsystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes openzfs#12569
When rolling back a dataset, ZFS has to purge file data resident in the system page cache. To do this, it loops over all vnodes for the mountpoint and calls vn_pages_remove() to purge pages associated with the vnode's VM object. Each page is thus exclusively busied while the dataset's teardown write lock is held. When handling a page fault on a mapped ZFS file, FreeBSD's page fault handler busies newly allocated pages and then uses VOP_GETPAGES to fill them. The ZFS getpages VOP acquires the teardown read lock with vnode pages already busied. This represents a lock order reversal which can lead to deadlock. To break the deadlock, observe that zfs_rezget() need only purge those pages marked valid, and that pages busied by the page fault handler are, by definition, invalid. Furthermore, ZFS pages always transition from invalid to valid with the teardown lock held, and ZFS never creates partially valid pages. Thus, zfs_rezget() can use the new vn_pages_remove_valid() to skip over pages busied by the fault handler. PR: 258208 Tested by: pho Reviewed by: avg, sef, kib MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D32931 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Ryan Moeller <freqlabs@FreeBSD.org> Closes openzfs#12828
For deadman_sync: 72f06d0 |
In the CI environment it's possible for events to be slightly delayed resulting in 4, instead of 5, events appearing in the log file. This isn't a problem and should be considered a success to avoid false positive test results. Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#12625
@freqlabs thanks, included in my latest push |
It's odd that inheritance/inherit_001_pos failing on FreeBSD was considered unexpected. It's on the maybe list to fail for FreeBSD. |
When unlinking multiple files from a pool at 100% capacity, it was possible for ENOSPC to be returned after the first unlink. e.g. rm -f /mnt/fs/test1.0.0 /mnt/fs/test1.1.0 /mnt/fs/test1.2.0 rm: cannot remove '/mnt/fs/test1.1.0': No space left on device rm: cannot remove '/mnt/fs/test1.2.0': No space left on device After waiting for the pending deferred frees from the first unlink to be processed the remaining files can then be unlinked. This is caused by the quota limit in dsl_dir_tempreserve_impl() being temporarily decreased to the allocatable pool capacity less any deferred free space. This is resolved using the existing mechanism of returning ERESTART when over quota as long as we know enough space will shortly be available after processing the pending deferred frees. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Ryan Moeller <freqlabs@FreeBSD.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#13172
@freqlabs I'm not sure I follow. I don't have the test run link handy anymore, but what I recall from the log was the |
Correct me if I'm wrong, but is this release supposed to still not be compatible with kernels newer than 5.15, despite all the fixes cherry-picked into this branch? At least I don't see anything that bumps the relevant field in the metadata file. Does this mean that we'll have to wait another couple months before being able to use ZFS with 5.16 or 5.17? |
META file and changelog updated. Signed-off-by: Tony Hutter <hutter2@llnl.gov>
e859ab9
to
ef83e07
Compare
@jaen you are correct, it works with 5.16 and at least 5.17-rc6. I just bumped the kernel version to 5.16 in the META file. |
@behlendorf okay, I had not realized that the maybe list is what drives the retry mechanism. The output does not show the expected failures anywhere when there is a retry, so I assumed it was broken. |
@jaen I just rebuilt from this PR and it also works with 5.17-rc7. |
Hmm, maybe it would make sense to delay it for 5.17 proper then and be able to bump two versions at once, since we're already waited a bit for the bump? What is one week more ;' ) (unless you are willing to issue a release for the supported version bump only) |
5.17 should be released this weekend unless something goes wrong. Kernel releases are usually on Sundays... |
It would be nice to have official 5.17 support, and then ask the debian maintainers to release 2.1.3 and then ask the Ubuntu maintainers to try to get it into 22.04.... |
Release is out: @satmandu yea I understand - unfortunately the kernel is always a moving target and it's a always a problem to get the timing right. We really couldn't keep up the release any longer (this PR has been out for over a month). I know FreeBSD has been hoping for zfs-2.1.3 for their 13.1 release, and hopefully we were able to make their deadline. |
Many thanks. I have installed it on a few test systems and see the new,
Many thanks for fixing this particularly painful issue, along with fixes for a few others I reported. The Open ZFS community rocks! |
@tonyhutter in that case, would you be willing to make a release with just the meta bump faster if it turns out that ZFS is already compatible on 5.17 release? Sorry if that's unreasonable, I'm not too familiar with OpenZFS release process. |
It might also be a good idea to have a rolling point release for every kernel release, even if just to bump the supported kernel version. If zfs is being tested against all the RC kernels, (ideally there would be some level of CI for this—ubuntu'd mainline kernel builds might make this relatively easy), then a point release just after final kernel release would help people who test the mainline kernels and people who test zfs, helping both projects. |
Motivation and Context
Proposed zfs-2.1.3 patchset.
Description
Mostly kernel fixes
How Has This Been Tested?
Awaiting buildbot results
Types of changes
Checklist:
Signed-off-by
.