zvol_wait logic may terminate prematurely #13998

Merged
1 commit merged into openzfs:master on Oct 11, 2022

Conversation

@sdimitro (Contributor) commented Oct 6, 2022

Setups that have a lot of zvols may see zvol_wait terminate prematurely even though the script is still making progress. For example, one customer ran zvol_wait for ~7100 zvols, and by the script's last iteration it was still waiting on ~2900. Another ran it for 2200 zvols, and by the time the script terminated only 50 were left.

This patch adjusts the logic so that the script stays in its outer loop as long as any progress at all is being made.
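
To illustrate the intent, here is a rough sketch only, not the actual zvol_wait diff: the outer loop keeps waiting as long as the number of outstanding links keeps dropping, and only times out after several consecutive passes with no progress. The count_missing_links helper and the pass limit are placeholders, not names from the script.

#!/bin/sh
# Sketch of the idea: keep looping while progress is being made.
outstanding=$(count_missing_links)   # hypothetical helper, not from zvol_wait
idle=0
max_idle=10                          # give up after 10 passes with no progress

while [ "$outstanding" -gt 0 ] && [ "$idle" -lt "$max_idle" ]; do
        sleep 1
        remaining=$(count_missing_links)
        if [ "$remaining" -lt "$outstanding" ]; then
                idle=0               # made progress; reset the no-progress counter
        else
                idle=$((idle + 1))   # no progress this pass
        fi
        outstanding=$remaining
done

if [ "$outstanding" -ne 0 ]; then
        echo "Timed out waiting on zvol links"
        exit 1
fi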

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>

Testing

With @don-brady's help I managed to reproduce the issue by adding a 20-second sleep to the udev rule that creates those links. Here is the rule with the sleep statement:

$ sudo cat /lib/udev/rules.d/60-zvol.rules
# Persistent links for zvol
#
# persistent disk links: /dev/zvol/dataset_name
#
# NOTE: We used to also create an additional tree of zvol symlinks located at
#       /dev/dataset_name (i.e. without the 'zvol' path component) for
#       compatibility reasons. These are no longer created anymore, and should
#       not be relied upon.
#

KERNEL=="zd*", SUBSYSTEM=="block", ACTION=="add|change", PROGRAM+="/usr/bin/sleep 20", PROGRAM+="/lib/udev/zvol_id $devnode", SYMLINK+="zvol/%c"

And here is the reproduced error as seen by journalctl:

-- Reboot --
Oct 08 17:28:47 ip-10-110-252-123 systemd[1]: Starting Wait for ZFS Volume (zvol) links in /dev...
Oct 08 17:29:07 ip-10-110-252-123 zvol_wait[14121]: Testing 10002 zvol links
Oct 08 17:29:41 ip-10-110-252-123 zvol_wait[14121]: Still waiting on 9841 zvol links ...
Oct 08 17:30:14 ip-10-110-252-123 zvol_wait[14121]: Still waiting on 9809 zvol links ...
...cropped...
Oct 08 17:40:15 ip-10-110-252-123 zvol_wait[14121]: Still waiting on 8861 zvol links ...
Oct 08 17:40:15 ip-10-110-252-123 zvol_wait[14121]: Timed out waiting on zvol links
Oct 08 17:40:15 ip-10-110-252-123 systemd[1]: zfs-volume-wait.service: Main process exited, code=exited, status=1/F>
Oct 08 17:40:15 ip-10-110-252-123 systemd[1]: zfs-volume-wait.service: Failed with result 'exit-code'.
Oct 08 17:40:15 ip-10-110-252-123 systemd[1]: Failed to start Wait for ZFS Volume (zvol) links in /dev.

With this patch I verified that the service stays around and eventually finishes successfully.
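
One way to confirm that is to check the unit from the journal output above once boot completes; the unit name is taken from the log, and the commands are only illustrative:

$ systemctl status zfs-volume-wait.service
$ journalctl -b -u zfs-volume-wait.service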

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Oct 7, 2022
@behlendorf behlendorf requested a review from pzakha October 7, 2022 20:35
@sdimitro sdimitro marked this pull request as ready for review October 7, 2022 21:41
@sdimitro sdimitro marked this pull request as draft October 7, 2022 22:39
@sdimitro sdimitro marked this pull request as ready for review October 9, 2022 01:14
@sdimitro sdimitro requested review from don-brady and grwilson October 9, 2022 01:14
@sdimitro sdimitro added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Oct 11, 2022
@pzakha (Contributor) left a comment

LGTM

@behlendorf behlendorf merged commit e5646c5 into openzfs:master Oct 11, 2022
andrewc12 pushed a commit to andrewc12/openzfs that referenced this pull request Oct 12, 2022
Setups that have a lot of zvols may see zvol_wait terminate prematurely
even though the script is still making progress.  For example, we have a
customer that called zvol_wait for ~7100 zvols and by the last iteration
of that script it was still waiting on ~2900. Similarly another one
called zvol_wait for 2200 and by the time the script terminated there
were only 50 left.

This patch adjusts the logic to stay within the outer loop of the script
if we are making any progress whatsoever.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes openzfs#13998
behlendorf pushed a commit that referenced this pull request Nov 1, 2022