Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: honor rd.timeout for nvme ctrl_loss_tmo #11

Open
wants to merge 2 commits into
base: timberland_final
Choose a base branch
from

Conversation

mwilck
Copy link
Collaborator

@mwilck mwilck commented Jun 1, 2023

Connection issues during boot may cause the kernel to give up retrying the NVMe connect command prematurely. Use "rd.timeout" to set the ctrl_loss_tmo value for the NVMe subsystem. This means that the default timeout will be infinite.

We use an udev rule to set the timeout. We want to do this only for controllers that are brought up in the initramfs. Therefore write the udev rule to /etc/udev/rules.d rather than /run/udev/rules.d. Strictly speaking, we'd want to set the timeouot only for those controllers that are necessary for booting, but it would be very difficult to identify these reliably, so just use this simple approach for now.

Changes

Checklist

  • I have tested it locally
  • I have reviewed and updated any documentation if relevant
  • I am providing new code and test(s) for it

Fixes timberland-sig/nvme-cli#7

mwilck and others added 2 commits May 26, 2023 16:47
Add code to parse the Nvme-oF Boot Firmware Table (NBFT) according
to the NVM Express Boot Specification 1.0 [1]. The implementation in
dracut follows a similar general approach as iBFT support in the
iscsi module.

NBFT support requires two steps:

(1) Setting up the network and routing according to the
    HFI ("Host Fabric Interface") records in the NBFT,
(2) Establishing the actual NVMe-oF connection.

(1) is accomplished by reading the NBFT using JSON output from
the "nvme nbft show" command, and transforming it into command
line options ("ip=", "rd.neednet", etc.) understood by dracut's
network module and its backends. The resulting network setup code
is backend-agnostic. It has been tested with the "network-legacy"
and "network-manager" network backend modules. The network setup
code supports IPv4 and IPv6 with static, RA, or DHCP configurations,
802.1q VLANs, and simple routing / gateway setup.

(2) is done using the "nvme connect-all" command [2] in the netroot handler,
which is invoked by networking backends when an interface gets fully
configured. This patch adds support for "netboot=nbft". The "nbftroot"
handler calls nvmf-autoconnect.sh, which contains the actual connect
logic. nvmf-autoconnect.sh itself is preserved, because there are
other NVMe-oF setups like NVMe over FC which don't depend on the
network.

The various ways to configure NVMe-oF are prioritized like this:

 1 FC autoconnect from kernel commandline (rd.nvmf.discover=fc,auto)
 2 NBFT, if present
 3 discovery.conf or config.json, if present, and cmdline.d parameters,
   if present (rd.nvmf.discovery=...)
 4 FC autoconnect (without kernel command line)

The reason for this priorization is that in the initial RAM fs, we try
to activate only those connections that are necessary to mount the root
file system. This avoids confusion, possible contradicting or ambiguous
configuration, and timeouts from unavailable targets.

A retry logic is implemented for enabling the NVMe-oF connections,
using the "settled" initqueue, the netroot handler, and eventually, the
"timeout" initqueue. This is similar to the retry logic of the iscsi module.
In the "timeout" case, connection to all possible NVMe-oF subsystems
is attempted.

Two new command line parameters are introduced to make it possible to
change the priorities above:

 - "rd.nvmf.nonbft" causes the NBFT to be ignored,
 - "rd.nvmf.nostatic" causes any statically configured NVMe-oF targets
   (config.json, discovery.conf, and cmdline.d) to be ignored.

These parameters may be helpful to skip attempts to set up broken
configurations.

At initramfs build time, the nvmf module is now enabled if an NBFT
table is detected in the system.

[1] https://nvmexpress.org/wp-content/uploads/NVM-Express-Boot-Specification-2022.11.15-Ratified.pdf
[2] NBFT support in nvme-cli requires the latest upstream code (> v2.4).

Signed-off-by: Martin Wilck <mwilck@suse.com>
Co-authored-by: John Meneghini <jmeneghi@redhat.com>
Co-authored-by: Charles Rose <charles.rose@dell.com>
Connection issues during boot may cause the kernel to give up retrying
the NVMe connect command prematurely. Use "rd.timeout" to set the
ctrl_loss_tmo value for the NVMe subsystem. This means that the default
timeout will be infinite.

We use an udev rule to set the timeout. We want to do this only for controllers
that are brought up in the initramfs. Therefore write the udev rule
to /etc/udev/rules.d rather than /run/udev/rules.d.
Strictly speaking, we'd want to set the timeouot only for those controllers
that are necessary for booting, but it would be very difficult to identify
these reliably, so just use this simple approach for now.

Signed-off-by: Martin Wilck <mwilck@suse.com>
esac
mkdir -p /etc/udev/rules.d
printf -- 'ACTION=="add|change", SUBSYSTEM=="nvme", KERNEL=="nvme*", ATTR{ctrl_loss_tmo}="%d"\n' "$timeout" \
> /etc/udev/rules.d/90-nvmf-timeout.rules
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, this rule will only be present in the initramfs, not in the real rootfs, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the idea. If we establish a connection from the real root, this connection is obviously not required to mount the real root.

People may still want to use a different timeout than the default, but in the real root we have other means to configure that. We certainly should not use rd.timeout in the real root.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining. I was not sure where this part of the script is executed. (no dracut hacker here)

Copy link

@Douglas-Farley Douglas-Farley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a clean solution to me.

@@ -297,6 +297,22 @@ parse_nvmf_discover() {
return 0
}

nvmf_timeout_udev_rule() {
local timeout

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my edification, does this need to be posix compliant? I guess local it is used elsewhere, but I thought this was a bash-ism really.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right but local is used a lot in the dracut code, also in scripts that pretend to be POSIX (#!/bin/sh).

According to this stackoverflow article, all practically relevant shells support it, so it should be ok. the dracut checkers shellcheck and shfmt also don't take offense from it.

@mwilck mwilck force-pushed the timberland_final branch 2 times, most recently from 9fd4901 to 0a4d7f9 Compare June 14, 2023 14:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants