-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix synchronicity for ZVOLs #388
Conversation
Currently the "sync=always" property works for regular ZFS datasets, but not for ZVOLs. This patch remedies that. Fixes openzfs#374.
This operation allows "hole punching" in ZFS files. On Solaris this is done via the vop_space() system call, which maps to the zfs_space() function. So we just need to write zpl_truncate_range() as a wrapper around zfs_space(). Note that this only works for regular files, not ZVOLs. This is currently an insecure implementation without permission checking, although this isn't that big of a deal since truncate_range() isn't even callable from userspace. This is related to issue openzfs#334.
This isn't done on Solaris because on this OS zfs_space() can only be called with an opened file handle. Since the addition of zpl_truncate_range() this isn't the case anymore, so we need to enforce access rights. This is related to issue openzfs#334.
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is supported, since it's the only one that matches the behavior of zfs_space(). This makes it pretty much useless in its current form, but it's a start. To support other flag combinations we would need to modify zfs_space() to make it more flexible, or emulate the desired functionality in zpl_fallocate(). This is related to issue openzfs#334.
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning. It allows ZVOL clients to discard (unmap, trim) block ranges from a ZVOL, thus optimizing disk space usage by allowing a ZVOL to shrink instead of just grow. We can't use zfs_space() or zfs_freesp() here, since these functions only work on regular files, not volumes. Fortunately we can use the low-level function dmu_free_long_range() which does exactly what we want. Currently the discard operation is not added to the log. That's not a big deal since losing discard requests cannot result in data corruption. It would however result in disk space usage higher than it should be. Thus adding log support to zvol_discard() is probably a good idea for a future improvement.
zvol_write() assumes that the write request must be written to stable storage if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed, "sync" does *not* mean what we think it means in the context of the Linux block layer. This is well explained in linux/fs.h: WRITE: A normal async write. Device will be plugged. WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down the hint that someone will be waiting on this IO shortly. WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush. WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on non-volatile media on completion. In other words, SYNC does not *mean* that the write must be on stable storage on completion. It just means that someone is waiting on us to complete the write request. Thus triggering a ZIL commit for each SYNC write request on a ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL users have no way to express that they actually want data to be written to stable storage, which means the ZIL is broken for ZVOLs. The request for stable storage is expressed by the FUA flag, so we must commit the ZIL after the write if the FUA flag is set. In addition, we must commit the ZIL before the write if the FLUSH flag is set. Also, we must inform the block layer that we actually support FLUSH and FUA.
The Linux block device queue subsystem exposes a number of configurable settings described in Linux block/blk-settings.c. The defaults for these settings are tuned for hard drives, and are not optimized for ZVOLs. Proper configuration of these options would allow upper layers (I/O scheduler) to take better decisions about write merging and ordering. Detailed rationale: - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to handle writes of any size, so there's no reason to impose a limit. Let the upper layer decide. - max_segments and max_segment_size are set to unlimited. zvol_write() will copy the requests' contents into a dbuf anyway, so the number and size of the segments are irrelevant. Let the upper layer decide. - physical_block_size and io_opt are set to the ZVOL's block size. This has the potential to somewhat alleviate issue openzfs#361 for ZVOLs, by warning the upper layers that writes smaller than the volume's block size will be slow. - The NONROT flag is set to indicate this isn't a rotational device. Although the backing zpool might be composed of rotational devices, the resulting ZVOL often doesn't exhibit the same behavior due to the COW mechanisms used by ZFS. Setting this flag will prevent upper layers from making useless decisions (such as reordering writes) based on incorrect assumptions about the behavior of the ZVOL.
Nice work on the zvol changes, my plan is to start reviewing them soon so they can be merged. However, if you had some performance comparisons/results to go along with tweaks that would be helpful. |
This issue is not about performance, it's about the ZIL and data security. I assume you're talking about #389? If that's the case, then providing performance comparisons would be difficult because it really depends on the I/O scheduler used and the specific use case. For example, I benefit from #389 because I'm doing unaligned writes (see #361) and setting unlimited limits for the write size allows for bigger writes which spans more blocks, which alleviates (but does not eliminate, unfortunately) the need for ZFS to read blocks to dirty. Other people might benefit from #389 some other way. The important thing here is that we're handing better information to the I/O scheduler, which is A Good Thing (TM). |
Yes, I meant in general for the zvol changes you've proposed. I was just curious, they all look very sensible and correct on the surface. I'll start carefully reviewing each one and making some they don't break anything (older kernels for example) today. |
While I completely agree with what this patch is trying to do it needs some more work before it can be included. The WRITE_FUA/WRITE_FLUSH semantics were introduced recently in 2.6.36 as a replacement for WRITE_BARRIER, see torvalds/linux@4fed947 . We are going to need to add compatibility code to handle both cases correctly. This can probably be done with a simple '#ifdef WRITE_FLUSH_FUA' without the need for extensive autoconf changes. This was already in fact done once in include/linux/blkdev_compat.h to ensure we pass the write barrier in the vdev disk code. Adding rq_is_fua() and rq_is_flush() compatibly functions to include/linux/blkdev_compat.h is probably the cleanest way to go about this. Additionally, I noticed that this change didn't depend on the ZFS_SYNC_ALWAYS tweak in issue #374. Was that intentional? It seems to me that even with the REQ_FUA change we need to honor ZFS_SYNC_ALWAYS as you described previously. |
I agree. Should I add these functions myself or are you taking over from here? You seem to be better equipped to do these boilerplate adjustments.
I didn't see a reason to introduce a dependency between the two changes. However, both changes are of course useful, and both should be merged. In the end the ZIL should be commited if the request is flagged FUA or if |
error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, offset, size); | ||
|
||
/* | ||
* TODO: maybe we should add the operation to the log. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm… in fact this line is supposed to be in pull request #384. Seems like I seriously screwed up while doing the merges: all my pull requests show the same commits. What a mess… git is new to me, I guess this was bound to happen. There should be only one commit in this pull request: 90e1b21, so if you're just interested in ZVOL synchronicity, you should check it out. I'm not sure how to fix this, I guess I'll have to recreate the pull requests.
FYI, in the context of #384, this comment means that maybe the discard operation should be added to the log. This is not very important since losing discard operations cannot result in data corruption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've found this approach works best: Start from a checkout of upstream trunk. Then git branch TOPIC; git checkout TOPIC
. Make your changes and commit. Push that branch to github. Repeat as necessary for the other features, starting from a checkout of upstream trunk each time. Then, do a pull request for each branch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I did, basically; the problem is, when updating my master branch from upstream I guessed it would be a good idea to also update individual pull request branches from my master branch. Alas, it was a very bad idea, because my master branch add commits from all my pull requests, so by merging master into each pull request, I merged all commits from all pull requests into each pull request, hence the mess. In the future I'll just let my pull requests alone when I'm done with them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The easiest solution is probably to leave your master tracking upstream master (i.e. you should not commit anything to master). Use a separate branch to combine your topic branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I never commited anything to my master branch, just merges. I wrote the pull request's code into the appropriate pull request branches, as I should. The issue is, I was merging master into my pull requests without realizing what I was doing. The solution is to stop doing that. I just emailed Brian so that we decide what to do about the already messed up pull requests.
I made a new, clean pull request to fix this mess. See #552. |
* Improve the inline descriptions of the ARC module parameters These are displayed as the descriptions of the sysctl's on FreeBSD Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes openzfs#13334 * linux: module: weld all but spl.ko into zfs.ko Originally it was thought it would be useful to split up the kmods by functionality. This would allow external consumers to only load what was needed. However, in practice we've never had a case where this functionality would be needed, and conversely managing multiple kmods can be awkward. Therefore, this change merges all but the spl.ko kmod in to a single zfs.ko kmod. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13274 * scripts: zfs.sh: remove cat Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13274 * scripts: zfs.sh: make usage make sense We don't pass the arguments as arguments Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13274 * scripts: zfs.sh: unload zfs with dependencies Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13274 * linux: module: uninstall legacy modules on (un)installation This can be reverted once we're sure nobody's using them anymore (post-3.0 release?) Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13274 * zpool_history_unpack: return correct errno on nvlist_unpack failure Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Signed-off-by: WHR <msl0000023508@gmail.com> Closes openzfs#13321 * Document zfs inherit -S's interaction with noninheritable properties Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#11894 Closes openzfs#13335 * rpm -> deb doesn't fail when optional packages are missing Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes openzfs#13331 Closes openzfs#13336 * man: ... -> … again zfs-program.8 is left, but that's literal Lua syntax Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13255 * FreeBSD: Fix translation from ABD to physical pages In hypothetical case of non-linear ABD with single segment, multiple to page size but not aligned to it, vdev_geom_fill_unmap_cb() could fill one page less into bio_ma array. I am not sure it is exploitable, but better to be safe than sorry. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reported-by: Mark Johnston <markj@FreeBSD.org> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes openzfs#13345 * Corrected oversight in ZERO_RANGE behavior It turns out, no, in fact, ZERO_RANGE and PUNCH_HOLE do have differing semantics in some ways - in particular, one requires KEEP_SIZE, and the other does not. Also added a zero-range test to catch this, corrected a flaw that made the punch-hole test succeed vacuously, and a typo in file_write. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes openzfs#13329 Closes openzfs#13338 * contrib: dracut: parse-zfs: drop initqueue-finished for i/f The switch was released in dracut 009 in March 2011, we can safely get rid of the compatibility hook Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * contrib: dracut: parse-zfs: stop pretending we support FILESYSTEM= It was added in the original ae26d04 ("Add dracut support") commit in 2011, and was then broken a bit later with the advent of dracut-zfs-generator, or maybe earlier as part of other churn Either way, it's broken, and has been in 2.0+ as well, and no-one complained. Stop pretending we support it at all Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * contrib; dracut: centralise root= parsing, actually support root=s So far, everything parsed root= manually, which meant that while zfs-parse.sh was updated, and supposedly supported + -> ' ' conversion, it meant nothing Instead, centralise parsing, and allow: root= root=zfs root=zfs: root=zfs:AUTO root=ZFS=data/set root=zfs:data/set root=zfs:ZFS=data/set (as a side-effect; allowed but undocumented) rootfstype=zfs AND root=data/set <=> root=data/set rootfstype=zfs AND root= <=> root=zfs:AUTO So rootfstype=zfs /also/ behaves as expected, and + decoding works Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * contrib; dracut: flatten zfs-load-key, simplify zfs-env-bootfs Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * contrib: dracut: zfs-lib: simplify ask_for_password The only user is mount-zfs.sh (non-systemd systems), so reduce it to what it needs Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * contrib: dracut: zfs-lib: remove find_bootfs Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * contrib: dracut: inline single-use import_pool, move single-use ask_for_password Also don't set ROOTFS_MOUNTED; the final mention was removed in dracut 011 from July 2011 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * contrib: dracut: don't require essentials to be under the same encroot Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * contrib: dracut: zfs-{rollback,snapshot}-bootfs: order after key loading This fixes at least one race I got with an encrypted root Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * contrib: dracut: zfs-needshutdown: don't list Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * Add dracut.zfs.7 Thorough documentation with a dracut.bootup(7)-style flowchart, dracut.cmdline(7)-style cmdline listing, and per-file docs like the old README Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * contrib: dracut: remove getargbool polyfill It was originally released in dracut 008 in February 2011; we can probably drop it now Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13291 * Strengthen Linux kernel capabilities detection - Add `CONFIG_BLOCK` Linux config requirement to `ZFS_AC_KERNEL_CONFIG_DEFINED`. OpenZFS won't compile without that block device support due to large amount of functional dependencies on it. - Remove dependency on `groups_alloc()` in `ZFS_AC_KERNEL_SRC_GROUP_INFO_GID` to circumvent the missing stub in Linux 4.X kernel headers. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes openzfs#13351 * scripts: zfs.sh: explicitly ignore unloaded modules when unloading Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13356 * scripts: zfs.sh: explicitly unload all modules via rmmod modprobe -r only works for depmodded modules, but this also means we have to re-iterate legacy modules, and in the right order Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13356 * linux: module: zfs: sysfs: constify types and attrs Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13357 * Linux 5.18 compat: kobj_type.default_attrs replaced with default_groups Upstream-commit: cdb4f26a63c391317e335e6e683a614358e70aeb ("kobject: kobj_type: remove default_attrs") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13357 * zvol_wait: Ignore locked zvols "When an encrypted zvol is locked the zfs-volume-wait service does not start. The /sbin/zvol_wait should not wait for links when the volume has property keystatus=unavailable." -- https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1888405 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com> Thanks: James Dingwall <james-launchpad@dingwall.me.uk> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes openzfs#10662 * man: zfs-send.8: fix -X synopses and description Also clean up the horrendously verbose -X handling in zfs_main() Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13352 * tests: cli_user: zfs_001_neg: print the problematic lines Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13352 * Improve zpool status output, list all affected datasets Currently, determining which datasets are affected by corruption is a manual process. The primary difficulty in reporting the list of affected snapshots is that since the error was initially found, the snapshot where the error originally occurred in, may have been deleted. To solve this issue, we add the ID of the head dataset of the original snapshot which the error was detected in, to the stored error report. Then any time a filesystem is deleted, the errors associated with it are deleted as well. Any time a clone promote occurs, we modify reports associated with the original head to refer to the new head. The stored error reports are identified by this head ID, the birth time of the block which the error occurred in, as well as some information about the error itself are also stored. Once this information is stored, we can find the set of datasets affected by an error by walking back the list of snapshots in the given head until we find one with the appropriate birth txg, and then traverse through the snapshots of the clone family, terminating a branch if the block was replaced in a given snapshot. Then we report this information back to libzfs, and to the zpool status command, where it is displayed as follows: pool: test state: ONLINE status: One or more devices has experienced an error resulting in data corruption. Applications may be affected. action: Restore the file in question if possible. Otherwise restore the entire pool from backup. see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec 3 08:27:57 2021 config: NAME STATE READ WRITE CKSUM test ONLINE 0 0 0 sdb ONLINE 0 0 1.58K errors: Permanent errors have been detected in the following files: test@1:/test.0.0 /test/test.0.0 /test/1clone/test.0.0 A new feature flag is introduced to mark the presence of this change, as well as promotion and backwards compatibility logic. This is an updated version of openzfs#9175. Rebase required fixing the tests, updating the ABI of libzfs, updating the man pages, fixing bugs, fixing the error returns, and updating the old on-disk error logs to the new format when activating the feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Co-authored-by: TulsiJain <tulsi.jain@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes openzfs#9175 Closes openzfs#12812 * Improve log spacemap load time Previous flushing algorithm limited only total number of log blocks to the minimum of 256K and 4x number of metaslabs in the pool. As result, system with 1500 disks with 1000 metaslabs each, touching several new metaslabs each TXG could grow spacemap log to huge size without much benefits. We've observed one of such systems importing pool for about 45 minutes. This patch improves the situation from five sides: - By limiting maximum period for each metaslab to be flushed to 1000 TXGs, that effectively limits maximum number of per-TXG spacemap logs to load to the same number. - By making flushing more smooth via accounting number of metaslabs that were touched after the last flush and actually need another flush, not just ms_unflushed_txg bump. - By applying zfs_unflushed_log_block_pct to the number of metaslabs that were touched after the last flush, not all metaslabs in the pool. - By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in advance, making log spacemap load process for wide HDD pool CPU-bound, accelerating it by many times. - By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing single-threaded by nature log processing time from ~10 to ~5 minutes. As further optimization we could skip bumping ms_unflushed_txg for metaslabs not touched since the last flush, but that would be an incompatible change, requiring new pool feature. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored-By: iXsystems, Inc. Closes openzfs#12789 * autoconf: Pretend `CONFIG_MODULES` is always on - Unconditionally inject `CONFIG_MODULES` make variable and `#define CONFIG_MODULES` to Kbuild in `ZFS_LINUX_COMPILE` autoconf function to emulate loadable kernel modules support. This allows OpenZFS to perform Linux checks despite `CONFIG_MODULES=n` in the actual Linux config. - Add `ZFS_AC_KERNEL_CONFIG_MODULES` check which encompasses the logic from `ZFS_AC_KERNEL_TEST_MODULE` with additional diagnostic messages to the user - Removed `ZFS_AC_KERNEL_TEST_MODULE` as it merely duplicates every check in `ZFS_AC_KERNEL_CONFIG_DEFINED` - Moved `ZFS_AC_MODULE_SYMVERS` after `ZFS_AC_KERNEL_CONFIG_DEFINED` so the user has a chance to see the proper diagnostic from the steps before. A workaround for Linux's ``` commit 3e3005df73b535cb849cf4ec8075d6aa3c460f68 Author: Masahiro Yamada <masahiroy@kernel.org> Date: Wed Mar 31 22:38:03 2021 +0900 kbuild: unify modules(_install) for in-tree and external modules If you attempt to build or install modules ('make modules(_install)' with CONFIG_MODULES disabled, you will get a clear error message, but nothing for external module builds. Factor out the modules and modules_install rules into the common part, so you will get the same error message when you try to build external modules with CONFIG_MODULES=n. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> ``` Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes openzfs#10832 Closes openzfs#13361 * PPC get_user workaround Linux 5.12 PPC 5.12 get_user() and __copy_from_user_inatomic() inline helpers very indirectly include a reference to the GPL'd array mmu_feature_keys[] and fails to build. Workaround this by using copy_from_user() and throwing EFAULT for any calls to __copy_from_user_inatomic(). This is a workaround until a fix for Linux commit 7613f5a66becfd0e43a0f34de8518695888f5458 "powerpc/64s/kuap: Use mmu_has_feature()" is fully addressed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Authored-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: szubersk <szuberskidamian@gmail.com> Closes openzfs#11958 Closes openzfs#12590 Closes openzfs#13367 * zfs: holds: general cleanup Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13373 * zfs: holds: dequadratify Before: 15 0m0.177s 30 0m0.653s 45 0m1.289s 60 0m2.129s 75 0m3.264s 90 0m4.397s 100 0m5.996s 117 0m8.552s After: 30 0m0.053s 117 0m0.125s Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Closes openzfs#13372 Closes openzfs#13373 * Linux 5.18 compat: replace __set_page_dirty_nobuffers Replace __set_page_dirty_nobuffers with filemap_dirty_folio. Upstream-commit: 6b1f86f8e9c7f9de7ca1cb987b2cf25e99b1ae3a ("Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache ") Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Authored-by: Satadru Pramanik <satadru@gmail.com> Signed-off-by: Satadru Pramanik <satadru@gmail.com> Closes openzfs#13325 Closes openzfs#13380 * Fix O_APPEND for Linux 3.15 and older kernels When using a Linux kernel which predates the iov_iter interface the O_APPEND flag should be applied in zpl_aio_write() via the call to generic_write_checks(). The updated pos variable was incorrectly ignored resulting in the current offset being used. This issue should only realistically impact the RHEL/CentOS 7.x kernels which are based on Linux 3.10. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes openzfs#13370 Closes openzfs#13377 Co-authored-by: Allan Jude <allan@klarasystems.com> Co-authored-by: наб <nabijaczleweli@nabijaczleweli.xyz> Co-authored-by: Low-power <msl0000023508@gmail.com> Co-authored-by: Damian Szuberski <szuberskidamian@gmail.com> Co-authored-by: Alexander Motin <mav@FreeBSD.org> Co-authored-by: Rich Ercolani <214141+rincebrain@users.noreply.github.com> Co-authored-by: Richard Laager <rlaager@wiktel.com> Co-authored-by: George Amanakis <gamanakis@gmail.com> Co-authored-by: Satadru Pramanik <satadru@gmail.com> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
zvol_write()
assumes that the write request must be written to stable storage ifrq_is_sync()
is true. Unfortunately, this assumption is incorrect. Indeed, "sync" does not mean what we think it means in the context of the Linux block layer. This is well explained inlinux/fs.h
:WRITE
: A normal async write. Device will be plugged.WRITE_SYNC
: Synchronous write. Identical toWRITE
, but passes down the hint that someone will be waiting on this IO shortly.WRITE_FLUSH
: LikeWRITE_SYNC
but with preceding cache flush.WRITE_FUA
: LikeWRITE_SYNC
but data is guaranteed to be on non-volatile media on completion.In other words,
SYNC
does not mean that the write must be on stable storage on completion. It just means that someone is waiting on us to complete the write request. Thus triggering a ZIL commit for each SYNC write request on a ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL users have no way to express that they actually want data to be written to stable storage, which means the ZIL is broken for ZVOLs.The request for stable storage is expressed by the
FUA
flag, so we must commit the ZIL after the write if theFUA
flag is set. In addition, we must commit the ZIL before the write if theFLUSH
flag is set.Also, we must inform the block layer that we actually support
FLUSH
andFUA
.