RFC: honor rd.timeout for nvme ctrl_loss_tmo #11

mwilck · 2023-06-01T15:09:42Z

Connection issues during boot may cause the kernel to give up retrying the NVMe connect command prematurely. Use "rd.timeout" to set the ctrl_loss_tmo value for the NVMe subsystem. This means that the default timeout will be infinite.

We use an udev rule to set the timeout. We want to do this only for controllers that are brought up in the initramfs. Therefore write the udev rule to /etc/udev/rules.d rather than /run/udev/rules.d. Strictly speaking, we'd want to set the timeouot only for those controllers that are necessary for booting, but it would be very difficult to identify these reliably, so just use this simple approach for now.

Changes

Checklist

I have tested it locally
I have reviewed and updated any documentation if relevant
I am providing new code and test(s) for it

Fixes timberland-sig/nvme-cli#7

Add code to parse the Nvme-oF Boot Firmware Table (NBFT) according to the NVM Express Boot Specification 1.0 [1]. The implementation in dracut follows a similar general approach as iBFT support in the iscsi module. NBFT support requires two steps: (1) Setting up the network and routing according to the HFI ("Host Fabric Interface") records in the NBFT, (2) Establishing the actual NVMe-oF connection. (1) is accomplished by reading the NBFT using JSON output from the "nvme nbft show" command, and transforming it into command line options ("ip=", "rd.neednet", etc.) understood by dracut's network module and its backends. The resulting network setup code is backend-agnostic. It has been tested with the "network-legacy" and "network-manager" network backend modules. The network setup code supports IPv4 and IPv6 with static, RA, or DHCP configurations, 802.1q VLANs, and simple routing / gateway setup. (2) is done using the "nvme connect-all" command [2] in the netroot handler, which is invoked by networking backends when an interface gets fully configured. This patch adds support for "netboot=nbft". The "nbftroot" handler calls nvmf-autoconnect.sh, which contains the actual connect logic. nvmf-autoconnect.sh itself is preserved, because there are other NVMe-oF setups like NVMe over FC which don't depend on the network. The various ways to configure NVMe-oF are prioritized like this: 1 FC autoconnect from kernel commandline (rd.nvmf.discover=fc,auto) 2 NBFT, if present 3 discovery.conf or config.json, if present, and cmdline.d parameters, if present (rd.nvmf.discovery=...) 4 FC autoconnect (without kernel command line) The reason for this priorization is that in the initial RAM fs, we try to activate only those connections that are necessary to mount the root file system. This avoids confusion, possible contradicting or ambiguous configuration, and timeouts from unavailable targets. A retry logic is implemented for enabling the NVMe-oF connections, using the "settled" initqueue, the netroot handler, and eventually, the "timeout" initqueue. This is similar to the retry logic of the iscsi module. In the "timeout" case, connection to all possible NVMe-oF subsystems is attempted. Two new command line parameters are introduced to make it possible to change the priorities above: - "rd.nvmf.nonbft" causes the NBFT to be ignored, - "rd.nvmf.nostatic" causes any statically configured NVMe-oF targets (config.json, discovery.conf, and cmdline.d) to be ignored. These parameters may be helpful to skip attempts to set up broken configurations. At initramfs build time, the nvmf module is now enabled if an NBFT table is detected in the system. [1] https://nvmexpress.org/wp-content/uploads/NVM-Express-Boot-Specification-2022.11.15-Ratified.pdf [2] NBFT support in nvme-cli requires the latest upstream code (> v2.4). Signed-off-by: Martin Wilck <mwilck@suse.com> Co-authored-by: John Meneghini <jmeneghi@redhat.com> Co-authored-by: Charles Rose <charles.rose@dell.com>

Connection issues during boot may cause the kernel to give up retrying the NVMe connect command prematurely. Use "rd.timeout" to set the ctrl_loss_tmo value for the NVMe subsystem. This means that the default timeout will be infinite. We use an udev rule to set the timeout. We want to do this only for controllers that are brought up in the initramfs. Therefore write the udev rule to /etc/udev/rules.d rather than /run/udev/rules.d. Strictly speaking, we'd want to set the timeouot only for those controllers that are necessary for booting, but it would be very difficult to identify these reliably, so just use this simple approach for now. Signed-off-by: Martin Wilck <mwilck@suse.com>

igaw · 2023-06-02T06:59:46Z

modules.d/95nvmf/parse-nvmf-boot-connections.sh

+    esac
+    mkdir -p /etc/udev/rules.d
+    printf -- 'ACTION=="add|change", SUBSYSTEM=="nvme", KERNEL=="nvme*", ATTR{ctrl_loss_tmo}="%d"\n' "$timeout" \
+        > /etc/udev/rules.d/90-nvmf-timeout.rules


Just to clarify, this rule will only be present in the initramfs, not in the real rootfs, right?

Yes, that's the idea. If we establish a connection from the real root, this connection is obviously not required to mount the real root.

People may still want to use a different timeout than the default, but in the real root we have other means to configure that. We certainly should not use rd.timeout in the real root.

Thanks for explaining. I was not sure where this part of the script is executed. (no dracut hacker here)

Douglas-Farley

Looks like a clean solution to me.

Douglas-Farley · 2023-06-02T11:28:10Z

modules.d/95nvmf/parse-nvmf-boot-connections.sh

@@ -297,6 +297,22 @@ parse_nvmf_discover() {
    return 0
 }

+nvmf_timeout_udev_rule() {
+    local timeout


For my edification, does this need to be posix compliant? I guess local it is used elsewhere, but I thought this was a bash-ism really.

Right but local is used a lot in the dracut code, also in scripts that pretend to be POSIX (#!/bin/sh).

According to this stackoverflow article, all practically relevant shells support it, so it should be ok. the dracut checkers shellcheck and shfmt also don't take offense from it.

mwilck and others added 2 commits May 26, 2023 16:47

mwilck requested review from igaw, charles-rose, johnmeneghini and Douglas-Farley June 1, 2023 15:09

github-actions bot added modules nvmf labels Jun 1, 2023

mwilck mentioned this pull request Jun 1, 2023

Delete unused branches in timberland-sig/dracut #12

Open

igaw reviewed Jun 2, 2023

View reviewed changes

Douglas-Farley approved these changes Jun 2, 2023

View reviewed changes

mwilck force-pushed the timberland_final branch 2 times, most recently from 9fd4901 to 0a4d7f9 Compare June 14, 2023 14:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: honor rd.timeout for nvme ctrl_loss_tmo #11

RFC: honor rd.timeout for nvme ctrl_loss_tmo #11

mwilck commented Jun 1, 2023

igaw Jun 2, 2023

mwilck Jun 2, 2023

igaw Jun 2, 2023

Douglas-Farley left a comment

Douglas-Farley Jun 2, 2023

mwilck Jun 2, 2023

RFC: honor rd.timeout for nvme ctrl_loss_tmo #11

Are you sure you want to change the base?

RFC: honor rd.timeout for nvme ctrl_loss_tmo #11

Conversation

mwilck commented Jun 1, 2023

Changes

Checklist

igaw Jun 2, 2023

Choose a reason for hiding this comment

mwilck Jun 2, 2023

Choose a reason for hiding this comment

igaw Jun 2, 2023

Choose a reason for hiding this comment

Douglas-Farley left a comment

Choose a reason for hiding this comment

Douglas-Farley Jun 2, 2023

Choose a reason for hiding this comment

mwilck Jun 2, 2023

Choose a reason for hiding this comment