Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix synchronicity for ZVOLs #388

Closed
wants to merge 17 commits into from
Closed

Conversation

dechamps
Copy link
Contributor

@dechamps dechamps commented Sep 5, 2011

zvol_write() assumes that the write request must be written to stable storage if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed, "sync" does not mean what we think it means in the context of the Linux block layer. This is well explained in linux/fs.h:

  • WRITE: A normal async write. Device will be plugged.
  • WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down the hint that someone will be waiting on this IO shortly.
  • WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
  • WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on non-volatile media on completion.

In other words, SYNC does not mean that the write must be on stable storage on completion. It just means that someone is waiting on us to complete the write request. Thus triggering a ZIL commit for each SYNC write request on a ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL users have no way to express that they actually want data to be written to stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must commit the ZIL after the write if the FUA flag is set. In addition, we must commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.

Currently the "sync=always" property works for regular ZFS datasets, but not
for ZVOLs. This patch remedies that.

Fixes openzfs#374.
This operation allows "hole punching" in ZFS files. On Solaris this is done
via the vop_space() system call, which maps to the zfs_space() function. So
we just need to write zpl_truncate_range() as a wrapper around zfs_space().

Note that this only works for regular files, not ZVOLs.

This is currently an insecure implementation without permission checking,
although this isn't that big of a deal since truncate_range() isn't even
callable from userspace.

This is related to issue openzfs#334.
This isn't done on Solaris because on this OS zfs_space() can only be called
with an opened file handle. Since the addition of zpl_truncate_range() this
isn't the case anymore, so we need to enforce access rights.

This is related to issue openzfs#334.
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is supported,
since it's the only one that matches the behavior of zfs_space(). This
makes it pretty much useless in its current form, but it's a start.

To support other flag combinations we would need to modify zfs_space()
to make it more flexible, or emulate the desired functionality in
zpl_fallocate().

This is related to issue openzfs#334.
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning. It allows
ZVOL clients to discard (unmap, trim) block ranges from a ZVOL, thus
optimizing disk space usage by allowing a ZVOL to shrink instead of just
grow.

We can't use zfs_space() or zfs_freesp() here, since these functions only
work on regular files, not volumes. Fortunately we can use the low-level
function dmu_free_long_range() which does exactly what we want.

Currently the discard operation is not added to the log. That's not a big
deal since losing discard requests cannot result in data corruption. It would
however result in disk space usage higher than it should be. Thus adding
log support to zvol_discard() is probably a good idea for a future
improvement.
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:

    WRITE:       A normal async write. Device will be plugged.
    WRITE_SYNC:  Synchronous write. Identical to WRITE, but passes down
                 the hint that someone will be waiting on this IO
                 shortly.
    WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
    WRITE_FUA:   Like WRITE_SYNC but data is guaranteed to be on
                 non-volatile media on completion.

In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.

The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.

Also, we must inform the block layer that we actually support FLUSH and FUA.
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.

Detailed rationale:

 - max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
   handle writes of any size, so there's no reason to impose a limit. Let the
   upper layer decide.

 - max_segments and max_segment_size are set to unlimited. zvol_write() will
   copy the requests' contents into a dbuf anyway, so the number and size of
   the segments are irrelevant. Let the upper layer decide.

 - physical_block_size and io_opt are set to the ZVOL's block size. This
   has the potential to somewhat alleviate issue openzfs#361 for ZVOLs, by warning
   the upper layers that writes smaller than the volume's block size will be
   slow.

 - The NONROT flag is set to indicate this isn't a rotational device.
   Although the backing zpool might be composed of rotational devices, the
   resulting ZVOL often doesn't exhibit the same behavior due to the COW
   mechanisms used by ZFS. Setting this flag will prevent upper layers from
   making useless decisions (such as reordering writes) based on incorrect
   assumptions about the behavior of the ZVOL.
@behlendorf
Copy link
Contributor

Nice work on the zvol changes, my plan is to start reviewing them soon so they can be merged. However, if you had some performance comparisons/results to go along with tweaks that would be helpful.

@dechamps
Copy link
Contributor Author

dechamps commented Sep 6, 2011

This issue is not about performance, it's about the ZIL and data security. I assume you're talking about #389? If that's the case, then providing performance comparisons would be difficult because it really depends on the I/O scheduler used and the specific use case. For example, I benefit from #389 because I'm doing unaligned writes (see #361) and setting unlimited limits for the write size allows for bigger writes which spans more blocks, which alleviates (but does not eliminate, unfortunately) the need for ZFS to read blocks to dirty. Other people might benefit from #389 some other way. The important thing here is that we're handing better information to the I/O scheduler, which is A Good Thing (TM).

@behlendorf
Copy link
Contributor

Yes, I meant in general for the zvol changes you've proposed. I was just curious, they all look very sensible and correct on the surface. I'll start carefully reviewing each one and making some they don't break anything (older kernels for example) today.

@behlendorf
Copy link
Contributor

While I completely agree with what this patch is trying to do it needs some more work before it can be included. The WRITE_FUA/WRITE_FLUSH semantics were introduced recently in 2.6.36 as a replacement for WRITE_BARRIER, see torvalds/linux@4fed947 .

We are going to need to add compatibility code to handle both cases correctly. This can probably be done with a simple '#ifdef WRITE_FLUSH_FUA' without the need for extensive autoconf changes. This was already in fact done once in include/linux/blkdev_compat.h to ensure we pass the write barrier in the vdev disk code. Adding rq_is_fua() and rq_is_flush() compatibly functions to include/linux/blkdev_compat.h is probably the cleanest way to go about this.

Additionally, I noticed that this change didn't depend on the ZFS_SYNC_ALWAYS tweak in issue #374. Was that intentional? It seems to me that even with the REQ_FUA change we need to honor ZFS_SYNC_ALWAYS as you described previously.

@dechamps
Copy link
Contributor Author

dechamps commented Sep 7, 2011

We are going to need to add compatibility code to handle both cases correctly. This can probably be done with a simple '#ifdef WRITE_FLUSH_FUA' without the need for extensive autoconf changes. This was already in fact done once in include/linux/blkdev_compat.h to ensure we pass the write barrier in the vdev disk code. Adding rq_is_fua() and rq_is_flush() compatibly functions to include/linux/blkdev_compat.h is probably the cleanest way to go about this.

I agree. Should I add these functions myself or are you taking over from here? You seem to be better equipped to do these boilerplate adjustments.

Additionally, I noticed that this change didn't depend on the ZFS_SYNC_ALWAYS tweak in issue #374. Was that intentional? It seems to me that even with the REQ_FUA change we need to honor ZFS_SYNC_ALWAYS as you described previously.

I didn't see a reason to introduce a dependency between the two changes. However, both changes are of course useful, and both should be merged. In the end the ZIL should be commited if the request is flagged FUA or if sync is set to always.

error = dmu_free_long_range(zv->zv_objset, ZVOL_OBJ, offset, size);

/*
* TODO: maybe we should add the operation to the log.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm… in fact this line is supposed to be in pull request #384. Seems like I seriously screwed up while doing the merges: all my pull requests show the same commits. What a mess… git is new to me, I guess this was bound to happen. There should be only one commit in this pull request: 90e1b21, so if you're just interested in ZVOL synchronicity, you should check it out. I'm not sure how to fix this, I guess I'll have to recreate the pull requests.

FYI, in the context of #384, this comment means that maybe the discard operation should be added to the log. This is not very important since losing discard operations cannot result in data corruption.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've found this approach works best: Start from a checkout of upstream trunk. Then git branch TOPIC; git checkout TOPIC. Make your changes and commit. Push that branch to github. Repeat as necessary for the other features, starting from a checkout of upstream trunk each time. Then, do a pull request for each branch.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I did, basically; the problem is, when updating my master branch from upstream I guessed it would be a good idea to also update individual pull request branches from my master branch. Alas, it was a very bad idea, because my master branch add commits from all my pull requests, so by merging master into each pull request, I merged all commits from all pull requests into each pull request, hence the mess. In the future I'll just let my pull requests alone when I'm done with them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The easiest solution is probably to leave your master tracking upstream master (i.e. you should not commit anything to master). Use a separate branch to combine your topic branches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I never commited anything to my master branch, just merges. I wrote the pull request's code into the appropriate pull request branches, as I should. The issue is, I was merging master into my pull requests without realizing what I was doing. The solution is to stop doing that. I just emailed Brian so that we decide what to do about the already messed up pull requests.

@dechamps
Copy link
Contributor Author

dechamps commented Feb 3, 2012

I made a new, clean pull request to fix this mess. See #552.

@dechamps dechamps closed this Feb 3, 2012
sdimitro pushed a commit to sdimitro/zfs that referenced this pull request May 23, 2022
* Improve the inline descriptions of the ARC module parameters

These are displayed as the descriptions of the sysctl's on FreeBSD

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes openzfs#13334

* linux: module: weld all but spl.ko into zfs.ko

Originally it was thought it would be useful to split up the kmods
by functionality.  This would allow external consumers to only load
what was needed.  However, in practice we've never had a case where
this functionality would be needed, and conversely managing multiple
kmods can be awkward.  Therefore, this change merges all but the
spl.ko kmod in to a single zfs.ko kmod.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13274

* scripts: zfs.sh: remove cat

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13274

* scripts: zfs.sh: make usage make sense

We don't pass the arguments as arguments

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13274

* scripts: zfs.sh: unload zfs with dependencies

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13274

* linux: module: uninstall legacy modules on (un)installation

This can be reverted once we're sure nobody's using them anymore
(post-3.0 release?)

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13274

* zpool_history_unpack: return correct errno on nvlist_unpack failure

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes openzfs#13321

* Document zfs inherit -S's interaction with noninheritable properties

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#11894
Closes openzfs#13335

* rpm -> deb doesn't fail when optional packages are missing

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes openzfs#13331
Closes openzfs#13336

* man: ... -> … again

zfs-program.8 is left, but that's literal Lua syntax

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13255

* FreeBSD: Fix translation from ABD to physical pages

In hypothetical case of non-linear ABD with single segment, multiple
to page size but not aligned to it, vdev_geom_fill_unmap_cb() could
fill one page less into bio_ma array.

I am not sure it is exploitable, but better to be safe than sorry.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reported-by: Mark Johnston <markj@FreeBSD.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes openzfs#13345

* Corrected oversight in ZERO_RANGE behavior

It turns out, no, in fact, ZERO_RANGE and PUNCH_HOLE do
have differing semantics in some ways - in particular,
one requires KEEP_SIZE, and the other does not.

Also added a zero-range test to catch this, corrected a flaw
that made the punch-hole test succeed vacuously, and a typo
in file_write.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes openzfs#13329 
Closes openzfs#13338

* contrib: dracut: parse-zfs: drop initqueue-finished for i/f

The switch was released in dracut 009 in March 2011,
we can safely get rid of the compatibility hook

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* contrib: dracut: parse-zfs: stop pretending we support FILESYSTEM=

It was added in the original ae26d04 ("Add dracut support") commit
in 2011, and was then broken a bit later with the advent of
dracut-zfs-generator, or maybe earlier as part of other churn

Either way, it's broken, and has been in 2.0+ as well, and no-one
complained. Stop pretending we support it at all

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* contrib; dracut: centralise root= parsing, actually support root=s

So far, everything parsed root= manually, which meant that while
zfs-parse.sh was updated, and supposedly supported + -> ' ' conversion,
it meant nothing

Instead, centralise parsing, and allow:
  root=
  root=zfs
  root=zfs:
  root=zfs:AUTO

  root=ZFS=data/set
  root=zfs:data/set
  root=zfs:ZFS=data/set (as a side-effect; allowed but undocumented)

  rootfstype=zfs AND root=data/set <=> root=data/set
  rootfstype=zfs AND root=         <=> root=zfs:AUTO

So rootfstype=zfs /also/ behaves as expected, and + decoding works

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* contrib; dracut: flatten zfs-load-key, simplify zfs-env-bootfs

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* contrib: dracut: zfs-lib: simplify ask_for_password

The only user is mount-zfs.sh (non-systemd systems),
so reduce it to what it needs

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* contrib: dracut: zfs-lib: remove find_bootfs

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* contrib: dracut: inline single-use import_pool, move single-use ask_for_password

Also don't set ROOTFS_MOUNTED; the final mention was removed in dracut
011 from July 2011

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* contrib: dracut: don't require essentials to be under the same encroot

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* contrib: dracut: zfs-{rollback,snapshot}-bootfs: order after key loading

This fixes at least one race I got with an encrypted root

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* contrib: dracut: zfs-needshutdown: don't list

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* Add dracut.zfs.7

Thorough documentation with a dracut.bootup(7)-style flowchart,
dracut.cmdline(7)-style cmdline listing,
and per-file docs like the old README

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* contrib: dracut: remove getargbool polyfill

It was originally released in dracut 008 in February 2011;
we can probably drop it now

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13291

* Strengthen Linux kernel capabilities detection

- Add `CONFIG_BLOCK` Linux config requirement to
  `ZFS_AC_KERNEL_CONFIG_DEFINED`. OpenZFS won't compile without
  that block device support due to large amount of functional
  dependencies on it.

- Remove dependency on `groups_alloc()` in
  `ZFS_AC_KERNEL_SRC_GROUP_INFO_GID` to circumvent the missing stub
  in Linux 4.X kernel headers.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes openzfs#13351

* scripts: zfs.sh: explicitly ignore unloaded modules when unloading

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13356

* scripts: zfs.sh: explicitly unload all modules via rmmod

modprobe -r only works for depmodded modules, but this also means we
have to re-iterate legacy modules, and in the right order

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13356

* linux: module: zfs: sysfs: constify types and attrs

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13357

* Linux 5.18 compat: kobj_type.default_attrs replaced with default_groups

Upstream-commit: cdb4f26a63c391317e335e6e683a614358e70aeb ("kobject:
 kobj_type: remove default_attrs")
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13357

* zvol_wait: Ignore locked zvols

"When an encrypted zvol is locked the zfs-volume-wait service does not
start.  The /sbin/zvol_wait should not wait for links when the volume
has property keystatus=unavailable."
-- https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/1888405

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Thanks: James Dingwall <james-launchpad@dingwall.me.uk>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes openzfs#10662

* man: zfs-send.8: fix -X synopses and description

Also clean up the horrendously verbose -X handling in zfs_main()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13352

* tests: cli_user: zfs_001_neg: print the problematic lines

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13352

* Improve zpool status output, list all affected datasets

Currently, determining which datasets are affected by corruption is
a manual process.

The primary difficulty in reporting the list of affected snapshots is
that since the error was initially found, the snapshot where the error
originally occurred in, may have been deleted. To solve this issue, we
add the ID of the head dataset of the original snapshot which the error
was detected in, to the stored error report. Then any time a filesystem
is deleted, the errors associated with it are deleted as well. Any time
a clone promote occurs, we modify reports associated with the original
head to refer to the new head. The stored error reports are identified
by this head ID, the birth time of the block which the error occurred
in, as well as some information about the error itself are also stored.

Once this information is stored, we can find the set of datasets
affected by an error by walking back the list of snapshots in the given
head until we find one with the appropriate birth txg, and then traverse
through the snapshots of the clone family, terminating a branch if the
block was replaced in a given snapshot. Then we report this information
back to libzfs, and to the zpool status command, where it is displayed
as follows:

 pool: test
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:00 with 800 errors on Fri Dec  3
08:27:57 2021
config:

        NAME        STATE     READ WRITE CKSUM
        test        ONLINE       0     0     0
          sdb       ONLINE       0     0 1.58K

errors: Permanent errors have been detected in the following files:

        test@1:/test.0.0
        /test/test.0.0
        /test/1clone/test.0.0

A new feature flag is introduced to mark the presence of this change, as
well as promotion and backwards compatibility logic. This is an updated
version of openzfs#9175. Rebase required fixing the tests, updating the ABI of
libzfs, updating the man pages, fixing bugs, fixing the error returns,
and updating the old on-disk error logs to the new format when
activating the feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Co-authored-by: TulsiJain <tulsi.jain@delphix.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes openzfs#9175
Closes openzfs#12812

* Improve log spacemap load time

Previous flushing algorithm limited only total number of log blocks to
the minimum of 256K and 4x number of metaslabs in the pool.  As result,
system with 1500 disks with 1000 metaslabs each, touching several new
metaslabs each TXG could grow spacemap log to huge size without much
benefits.  We've observed one of such systems importing pool for about
45 minutes.

This patch improves the situation from five sides:
 - By limiting maximum period for each metaslab to be flushed to 1000
TXGs, that effectively limits maximum number of per-TXG spacemap logs
to load to the same number.
 - By making flushing more smooth via accounting number of metaslabs
that were touched after the last flush and actually need another flush,
not just ms_unflushed_txg bump.
 - By applying zfs_unflushed_log_block_pct to the number of metaslabs
that were touched after the last flush, not all metaslabs in the pool.
 - By aggressively prefetching per-TXG spacemap logs up to 16 TXGs in
advance, making log spacemap load process for wide HDD pool CPU-bound,
accelerating it by many times.
 - By reducing zfs_unflushed_log_block_max from 256K to 128K, reducing
single-threaded by nature log processing time from ~10 to ~5 minutes.

As further optimization we could skip bumping ms_unflushed_txg for
metaslabs not touched since the last flush, but that would be an
incompatible change, requiring new pool feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes openzfs#12789

* autoconf: Pretend `CONFIG_MODULES` is always on

- Unconditionally inject `CONFIG_MODULES` make variable
  and `#define CONFIG_MODULES` to Kbuild in `ZFS_LINUX_COMPILE`
  autoconf function to emulate loadable kernel modules support.
  This allows OpenZFS to perform Linux checks despite
  `CONFIG_MODULES=n` in the actual Linux config.

- Add `ZFS_AC_KERNEL_CONFIG_MODULES` check which encompasses
  the logic from `ZFS_AC_KERNEL_TEST_MODULE` with additional
  diagnostic messages to the user

- Removed `ZFS_AC_KERNEL_TEST_MODULE` as it merely duplicates
  every check in `ZFS_AC_KERNEL_CONFIG_DEFINED`

- Moved `ZFS_AC_MODULE_SYMVERS` after `ZFS_AC_KERNEL_CONFIG_DEFINED`
  so the user has a chance to see the proper diagnostic from the
  steps before.

A workaround for Linux's

```
commit 3e3005df73b535cb849cf4ec8075d6aa3c460f68
Author: Masahiro Yamada <masahiroy@kernel.org>
Date:   Wed Mar 31 22:38:03 2021 +0900

kbuild: unify modules(_install) for in-tree and external modules

If you attempt to build or install modules ('make modules(_install)'
with CONFIG_MODULES disabled, you will get a clear error message, but
nothing for external module builds.

Factor out the modules and modules_install rules into the common part,
so you will get the same error message when you try to build external
modules with CONFIG_MODULES=n.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
```

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes openzfs#10832
Closes openzfs#13361

* PPC get_user workaround

Linux 5.12 PPC 5.12 get_user() and __copy_from_user_inatomic()
inline helpers very indirectly include a reference to the GPL'd
array mmu_feature_keys[] and fails to build. Workaround this by
using copy_from_user() and throwing EFAULT for any calls to
__copy_from_user_inatomic(). This is a workaround until a fix
for Linux commit 7613f5a66becfd0e43a0f34de8518695888f5458
"powerpc/64s/kuap: Use mmu_has_feature()" is fully addressed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes openzfs#11958
Closes openzfs#12590
Closes openzfs#13367

* zfs: holds: general cleanup

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13373

* zfs: holds: dequadratify

Before:
  15  0m0.177s
  30  0m0.653s
  45  0m1.289s
  60  0m2.129s
  75  0m3.264s
  90  0m4.397s
  100 0m5.996s
  117 0m8.552s

After:
  30  0m0.053s
  117 0m0.125s

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes openzfs#13372
Closes openzfs#13373

* Linux 5.18 compat: replace __set_page_dirty_nobuffers

Replace __set_page_dirty_nobuffers with filemap_dirty_folio.

Upstream-commit: 6b1f86f8e9c7f9de7ca1cb987b2cf25e99b1ae3a
("Merge tag 'folio-5.18b' of
git://git.infradead.org/users/willy/pagecache ")

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Authored-by: Satadru Pramanik <satadru@gmail.com>
Signed-off-by: Satadru Pramanik <satadru@gmail.com>
Closes openzfs#13325
Closes openzfs#13380

* Fix O_APPEND for Linux 3.15 and older kernels

When using a Linux kernel which predates the iov_iter interface the
O_APPEND flag should be applied in zpl_aio_write() via the call to
generic_write_checks().  The updated pos variable  was incorrectly
ignored resulting in the current offset being used.

This issue should only realistically impact the RHEL/CentOS 7.x
kernels which are based on Linux 3.10.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#13370 
Closes openzfs#13377

Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: наб <nabijaczleweli@nabijaczleweli.xyz>
Co-authored-by: Low-power <msl0000023508@gmail.com>
Co-authored-by: Damian Szuberski <szuberskidamian@gmail.com>
Co-authored-by: Alexander Motin <mav@FreeBSD.org>
Co-authored-by: Rich Ercolani <214141+rincebrain@users.noreply.github.com>
Co-authored-by: Richard Laager <rlaager@wiktel.com>
Co-authored-by: George Amanakis <gamanakis@gmail.com>
Co-authored-by: Satadru Pramanik <satadru@gmail.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants