Linux 5.13 compat: retry zvol_open() when contended #12759

behlendorf · 2021-11-13T00:43:42Z

Motivation and Context

Description

Due to a possible lock inversion in the zvol open call path, on Linux we
need to be able to retry in the case where the spa_namespace_lock
cannot be acquired.

For Linux 5.12 and older kernels this was accomplished by returning
-ERESTARTSYS from zvol_open() to request that blkdev_get() drop
the bdev->bd_mutex lock, reacquire it, then call the open callback
again. However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing
retry logic, but for 5.13 and newer kernels we retry internally in
zvol_open(). This should always succeed except in the case where
a pool's vdevs are layered on zvols, in which case it may fail. To
handle this case vdev_disk_open() has been updated to retry the
open when -ERESTARTSYS is returned.
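
A minimal userspace sketch of the two retry strategies described above, not the code from this patch. Everything prefixed `sketch_` or `SKETCH_` is a hypothetical stand-in: a C11 atomic flag plays the role of the spa_namespace_lock, and the retry count and delay are left as parameters. When HAVE_BLKDEV_GET_ERESTARTSYS is defined the function makes a single attempt and leaves the retry to its caller; otherwise it retries internally.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* ERESTARTSYS is kernel-internal (512); userspace <errno.h> lacks it. */
#define SKETCH_ERESTARTSYS 512

/* Hypothetical stand-in for a contended lock such as spa_namespace_lock. */
typedef struct { atomic_flag held; } sketch_lock_t;

/* Non-blocking acquire: returns true only if we took the lock. */
static bool sketch_tryenter(sketch_lock_t *l)
{
	/* test_and_set returns the previous value; false means it was free. */
	return (!atomic_flag_test_and_set_explicit(&l->held,
	    memory_order_acquire));
}

static void sketch_exit(sketch_lock_t *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

static void sketch_sleep_ms(long ms)
{
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L
	};
	nanosleep(&ts, NULL);
}

/*
 * Returns 0 with the lock held, or -SKETCH_ERESTARTSYS if the lock
 * could not be taken.
 */
static int sketch_open_retry(sketch_lock_t *l, int max_tries, long delay_ms)
{
	assert(max_tries > 0);
#ifdef HAVE_BLKDEV_GET_ERESTARTSYS
	/* Older kernels: one attempt, the caller re-invokes the open. */
	(void) max_tries;
	(void) delay_ms;
	return (sketch_tryenter(l) ? 0 : -SKETCH_ERESTARTSYS);
#else
	/* Newer kernels: nobody retries for us, so loop here. */
	for (int i = 0; i < max_tries; i++) {
		if (sketch_tryenter(l))
			return (0);
		if (i + 1 < max_tries)
			sleep_again: sketch_sleep_ms(delay_ms);
	}
	return (-SKETCH_ERESTARTSYS);
#endif
}
```

An uncontended call returns 0 on the first attempt. A call made while the lock is already held exhausts its attempts and returns -SKETCH_ERESTARTSYS, which is the case the caller still has to handle.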

How Has This Been Tested?

Locally by running the zfs_copies_003_pos test in a loop on Fedora
with the 5.14.16-301.fc35.x86_64 kernel. This test would often
fail with the new kernel because the block device could
not be opened and would return an error (ERESTARTSYS). With
this change applied the test now runs reliably.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

lundman · 2021-11-13T01:07:06Z

In macOS we had some locking issues with zvol_open, and with requests coming in from diskarbitration. We had to do a fairly ugly thing for it:
https://github.com/openzfsonosx/openzfs/blob/development/module/os/macos/zfs/zvol_os.c#L172-L248

Do we want to think about a shared code solution?

behlendorf · 2021-11-13T01:14:15Z

> Do we want to think about a shared code solution?

That would be grand. Unfortunately, after looking at this code for a while I think that the requirements for each platform are just different enough to make it not really worthwhile. But if you see a way to nicely unify the FreeBSD, Linux, and MacOS implementations I'm all ears.

jgottula · 2021-11-13T03:37:38Z

Glad to see this; will test it out soon. 👍

Also, just double checking to make sure I have this right:

For "normal" zvol_open() situations, this should resolve the lock issues 100% of the time; essentially back to pre-v5.13 behavior where things "just worked".
For vdev-on-zvol situations, it will do up to 100-ish retries (every ~10ms for 1000ms); and the open isn't 100% guaranteed to succeed every time.
- However, if the timeout is hit and failure occurs, the open() caller will at the very least get a sane, userspace-friendly errno value. (Er, wait, I guess only the internal zfs vdev code would even see that anyway...?)

Does that all sound correct?

behlendorf · 2021-11-15T19:27:00Z

Not quite. In the normal zvol_open() case we'll retry up to 100 times at 10ms intervals. Since contention on the lock is relatively unlikely the retries should prevent failures during open. However, a failure is still technically possible, in which case an error will be returned.

When using a zvol as a vdev we'll do additional retries when an open fails with ERESTART. This handles the possible deadlock case mentioned in the comments.
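
A rough sketch of that outer retry, assuming a caller that simply re-issues the open while it keeps failing with the restart error. This is not the vdev_disk_open() code from the patch: `sketch_open_fn`, `sketch_open_device()`, `sketch_flaky_open()` and `SKETCH_ERESTARTSYS` are hypothetical names, and the attempt limit is a parameter rather than the value used in ZFS.

```c
#include <assert.h>
#include <stddef.h>

/* ERESTARTSYS is kernel-internal (512); userspace <errno.h> lacks it. */
#define SKETCH_ERESTARTSYS 512

/* Hypothetical open callback: returns 0 or a negative errno. */
typedef int (*sketch_open_fn)(void *ctx);

/*
 * Re-issue the open while it reports -SKETCH_ERESTARTSYS, up to
 * max_attempts times. Any other result, success or a different
 * error, ends the loop and is passed straight back.
 */
static int sketch_open_device(sketch_open_fn open_fn, void *ctx,
    int max_attempts)
{
	int err = -SKETCH_ERESTARTSYS;

	assert(open_fn != NULL);
	assert(max_attempts > 0);
	for (int i = 0; i < max_attempts && err == -SKETCH_ERESTARTSYS; i++)
		err = open_fn(ctx);
	return (err);
}

/*
 * Fake device for illustration: ctx points at an int holding how many
 * more opens should fail with the restart error before one succeeds.
 */
static int sketch_flaky_open(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return (-SKETCH_ERESTARTSYS);
	}
	return (0);
}
```

With a device that fails twice and a limit of five attempts the open succeeds on the third call. With a limit lower than the number of failures, the restart error is returned to the caller unchanged.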

jgottula · 2021-11-15T19:28:30Z

Okay; thanks for the clarification!

Bronek · 2021-11-24T10:24:07Z

Since I am affected by this as well (ZVOLs randomly not showing up under /dev/zvol when running ZFS with kernel 5.14), I have applied this patch on top of the 2.1.1 release; I only had to drop the change in tests/test-runner/bin/zts-report.py.in, and the other changes applied cleanly. The system does not show any signs of regression and my problem with missing /dev/zvol is fixed. Can we please have it included in release 2.1.2 #12718?

Bronek · 2021-11-24T10:30:13Z

As for the design of this change, would it be viable to have an r/w lock instead? This way read accesses to a ZVOL would not block each other; failures or blocking would happen only when a shared lock needs to be promoted to exclusive (or when an exclusive lock is already held when read access is requested). Even better if reads could access "old" data concurrently, even while the data is being written, exploiting the COW nature of ZFS.

Admittedly I do not know what I am talking about, since I never worked on ZFS internals, and won't be at all offended if this suggestion is dismissed 🤡

KevinBuettner · 2021-11-24T22:22:06Z

FWIW, this PR seems to fix most of the problems that I reported in #12764 in addition to the problem with zvol links going missing after VM shutdown in #12712.

I, too, would like to see this fix included in 2.1.2.

jgottula · 2021-11-24T22:37:25Z

I wanted to mention that I've been running a system with this patchset, on top of 8ac58acf56, for about 7 days now. (Arch Linux w/ standard Arch kernel config, version 5.15.2.)

Haven't noticed any particularly egregious problems or anything notable.

On the other hand, I haven't gone out of my way to do any particular tests or stressing.

(Also this is not-my-main-system; so there are only a few zvols, rather than literally dozens-to-hundreds of them. 😝 Might be a good idea to stress test on a system with hundreds or thousands of zvols, to see whether it ever does timeout or not... 🤔)

sempervictus · 2021-11-30T00:05:29Z

This retry mechanism seems like a code path to which someone will one day return and potentially facepalm under 100s of delegated namespaces contending on that lock or some other strange condition producing thrash. Do we know what happened in 5.13 to cause this, and if so, is it feasible to push a fix upstream instead of implementing this workaround?

behlendorf · 2021-11-30T01:05:48Z

> Do we know what happened in 5.13 to cause this

We do. This change to the 5.13 kernel removed the kernel provided mechanism for retrying the open which we were depending on. Since it was explicitly removed I doubt upstream would be receptive to putting it back for us, but you could re-apply the change to a custom kernel.

Longer term I agree we're going to want to find a way to restructure the locking in ZFS to remove the need for this entirely. However, that's going to be a more disruptive change so it's something we're going to want to tackle in a different PR.

sempervictus · 2021-11-30T08:41:15Z

Thank you sir, looks like they were doing just about the same thing internally though so not much of a "fix" in there either :-.

config/kernel-blkdev.m4

jgottula · 2021-12-01T21:38:56Z

@behlendorf @tonyhutter Oops, I meant to submit my code review comment a few days back; but only just now realized it was sitting in "pending" state, and therefore not actually visible AFAIK. 🤦‍♂️

Pretty much just poking around at possible edge cases to get your opinion on whether anyone who ends up having both the belt (pre-5.13 ERESTARTSYS goto retry path) and suspenders (new zfs #ifndef HAVE_BLKDEV_GET_ERESTARTSYS code) engaged at the same time in a >=5.13 kernel would hypothetically have anything break or if it wouldn't be a problem.

module/os/linux/zfs/zvol_os.c

Due to a possible lock inversion the zvol open call path on Linux needs to be able to retry in the case where the spa_namespace_lock cannot be acquired.

For Linux 5.12 and older kernels this was accomplished by returning -ERESTARTSYS from zvol_open() to request that blkdev_get() drop the bdev->bd_mutex lock, reacquire it, then call the open callback again. However, as of the 5.13 kernel this behavior was removed.

Therefore, for 5.12 and older kernels we preserved the existing retry logic, but for 5.13 and newer kernels we retry internally in zvol_open(). This should always succeed except in the case where a pool's vdevs are layered on zvols, in which case it may fail. To handle this case vdev_disk_open() has been updated to retry the open when -ERESTARTSYS is returned.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs#12301
Closes openzfs#12759

behlendorf added Component: ZVOL ZFS Volumes Status: Code Review Needed Ready for review and testing labels Nov 13, 2021

behlendorf requested a review from tonyhutter November 13, 2021 00:43

behlendorf mentioned this pull request Nov 13, 2021

Linux v5.13 broke zvol_first_open lock retry logic, so open() on a zvol dev node now sometimes fails with errno=ERESTARTSYS in userspace (particularly noticeable in udev) #12301

Closed

rincebrain mentioned this pull request Nov 15, 2021

zvol links not updated correctly when containing dataset is renamed #12764

Closed

satmandu mentioned this pull request Nov 16, 2021

zfs-2.1.2 proposed patch set #12718

Merged

KevinBuettner mentioned this pull request Nov 24, 2021

Zvols get lost upon host startup and VM shut down #12712

Closed

behlendorf force-pushed the issue-12301 branch from 6c4c9a1 to d25cd3b Compare November 30, 2021 00:48

tonyhutter approved these changes Dec 1, 2021

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Dec 1, 2021

jgottula reviewed Dec 1, 2021

config/kernel-blkdev.m4

tonynguien reviewed Dec 1, 2021

module/os/linux/zfs/zvol_os.c

tonynguien self-assigned this Dec 1, 2021

tonynguien approved these changes Dec 1, 2021

tonynguien merged commit 77e2756 into openzfs:master Dec 2, 2021

This was referenced Dec 2, 2021

zfs-2.0.7 proposed patch set #12719

Merged

Official support for linux kernel 5.15 #12786

Closed

Exclude zvol_misc_volmode for now #12733

Merged
