-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: honor rd.timeout for nvme ctrl_loss_tmo #11
base: timberland_final
Are you sure you want to change the base?
Conversation
Add code to parse the Nvme-oF Boot Firmware Table (NBFT) according to the NVM Express Boot Specification 1.0 [1]. The implementation in dracut follows a similar general approach as iBFT support in the iscsi module. NBFT support requires two steps: (1) Setting up the network and routing according to the HFI ("Host Fabric Interface") records in the NBFT, (2) Establishing the actual NVMe-oF connection. (1) is accomplished by reading the NBFT using JSON output from the "nvme nbft show" command, and transforming it into command line options ("ip=", "rd.neednet", etc.) understood by dracut's network module and its backends. The resulting network setup code is backend-agnostic. It has been tested with the "network-legacy" and "network-manager" network backend modules. The network setup code supports IPv4 and IPv6 with static, RA, or DHCP configurations, 802.1q VLANs, and simple routing / gateway setup. (2) is done using the "nvme connect-all" command [2] in the netroot handler, which is invoked by networking backends when an interface gets fully configured. This patch adds support for "netboot=nbft". The "nbftroot" handler calls nvmf-autoconnect.sh, which contains the actual connect logic. nvmf-autoconnect.sh itself is preserved, because there are other NVMe-oF setups like NVMe over FC which don't depend on the network. The various ways to configure NVMe-oF are prioritized like this: 1 FC autoconnect from kernel commandline (rd.nvmf.discover=fc,auto) 2 NBFT, if present 3 discovery.conf or config.json, if present, and cmdline.d parameters, if present (rd.nvmf.discovery=...) 4 FC autoconnect (without kernel command line) The reason for this priorization is that in the initial RAM fs, we try to activate only those connections that are necessary to mount the root file system. This avoids confusion, possible contradicting or ambiguous configuration, and timeouts from unavailable targets. A retry logic is implemented for enabling the NVMe-oF connections, using the "settled" initqueue, the netroot handler, and eventually, the "timeout" initqueue. This is similar to the retry logic of the iscsi module. In the "timeout" case, connection to all possible NVMe-oF subsystems is attempted. Two new command line parameters are introduced to make it possible to change the priorities above: - "rd.nvmf.nonbft" causes the NBFT to be ignored, - "rd.nvmf.nostatic" causes any statically configured NVMe-oF targets (config.json, discovery.conf, and cmdline.d) to be ignored. These parameters may be helpful to skip attempts to set up broken configurations. At initramfs build time, the nvmf module is now enabled if an NBFT table is detected in the system. [1] https://nvmexpress.org/wp-content/uploads/NVM-Express-Boot-Specification-2022.11.15-Ratified.pdf [2] NBFT support in nvme-cli requires the latest upstream code (> v2.4). Signed-off-by: Martin Wilck <mwilck@suse.com> Co-authored-by: John Meneghini <jmeneghi@redhat.com> Co-authored-by: Charles Rose <charles.rose@dell.com>
Connection issues during boot may cause the kernel to give up retrying the NVMe connect command prematurely. Use "rd.timeout" to set the ctrl_loss_tmo value for the NVMe subsystem. This means that the default timeout will be infinite. We use an udev rule to set the timeout. We want to do this only for controllers that are brought up in the initramfs. Therefore write the udev rule to /etc/udev/rules.d rather than /run/udev/rules.d. Strictly speaking, we'd want to set the timeouot only for those controllers that are necessary for booting, but it would be very difficult to identify these reliably, so just use this simple approach for now. Signed-off-by: Martin Wilck <mwilck@suse.com>
esac | ||
mkdir -p /etc/udev/rules.d | ||
printf -- 'ACTION=="add|change", SUBSYSTEM=="nvme", KERNEL=="nvme*", ATTR{ctrl_loss_tmo}="%d"\n' "$timeout" \ | ||
> /etc/udev/rules.d/90-nvmf-timeout.rules |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, this rule will only be present in the initramfs, not in the real rootfs, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's the idea. If we establish a connection from the real root, this connection is obviously not required to mount the real root.
People may still want to use a different timeout than the default, but in the real root we have other means to configure that. We certainly should not use rd.timeout
in the real root.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining. I was not sure where this part of the script is executed. (no dracut hacker here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a clean solution to me.
@@ -297,6 +297,22 @@ parse_nvmf_discover() { | |||
return 0 | |||
} | |||
|
|||
nvmf_timeout_udev_rule() { | |||
local timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my edification, does this need to be posix compliant? I guess local
it is used elsewhere, but I thought this was a bash-ism really.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right but local
is used a lot in the dracut code, also in scripts that pretend to be POSIX (#!/bin/sh
).
According to this stackoverflow article, all practically relevant shells support it, so it should be ok. the dracut checkers shellcheck
and shfmt
also don't take offense from it.
9fd4901
to
0a4d7f9
Compare
Connection issues during boot may cause the kernel to give up retrying the NVMe connect command prematurely. Use "rd.timeout" to set the ctrl_loss_tmo value for the NVMe subsystem. This means that the default timeout will be infinite.
We use an udev rule to set the timeout. We want to do this only for controllers that are brought up in the initramfs. Therefore write the udev rule to /etc/udev/rules.d rather than /run/udev/rules.d. Strictly speaking, we'd want to set the timeouot only for those controllers that are necessary for booting, but it would be very difficult to identify these reliably, so just use this simple approach for now.
Changes
Checklist
Fixes timberland-sig/nvme-cli#7