-
Notifications
You must be signed in to change notification settings - Fork 401
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NVMeoF / TCP boot support #2184
NVMeoF / TCP boot support #2184
Conversation
Adding @aafeijoo-suse , @johnmeneghini, @igaw, @hreinecke for information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far it looks good, as I've followed your development process and already reviewed it internally over the past months. Thanks for your work here.
Just two comments:
- As you said, we must wait until
nvme-cli
consolidates the syntax of the commands used here. - It'd be great to add a new test for the
nvmf
module in the future (similar toTEST-30-ISCSI
), but that shouldn't block this PR, as we don't have a test for the current implementation at this time, and thenvme-cli
changes need to be available from our CI.
Updated with some minor IPv6 fixes (again, one for iSCSI, too). I will squash these fixes before this is merged. For now I guess keeping them separate makes the review easier. |
@LaszloGombos, I have created #2192 now, which separates out the non-nvmf related changes from this PR. |
96d67c6
to
91e1ee9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Martin, you need to fix the shfmt
issues in nvmf-autoconnect.sh
and parse-nvmf-boot-connections.sh
:
https://github.com/dracutdevs/dracut/actions/runs/4136435631/jobs/7150292822#step:4:56
Also, I assume the part of commit 91e1ee9 message "Also, make sure that the initqueue script doesn't call exit()." is outdated, right?
I see the nvmf-autoconnect.sh
never checks if the connection is already established, right? What if it runs again in this case, have you tried that? I'm just pointing this out, because initqueue --online
without --onetime
will run the nvmf-autoconnect.sh
script for every network interface configured on the kernel command line, and also initqueue --settled
and --timeout
can run after the connection is set.
Other than that, the changes look good.
Argh. Why do I keep forgetting to run it? Sorry.
No, that's currently impossible. It would be extremely hard and error-prone to get such a test right in a shell script.
Yes, it does nothing but prints a (currrently misleading) error message and exits with error status. Anyway, note that the NBFT can include multiple connections and it's possible that some are established while others are not.
If this happens we'll see a lot of warnings from nvme-cli. But we will work on fixing that on the nvme-cli side. The thing is, the single attempt that the original code is doing won't be sufficient in all cases, and it's better to see a few error messages than to fail booting. As this code evolves, we may be able to do this more elegantly in the future, without the need to retry blindly. But as I said, iSCSI does it the same way, and it has been around for more than a decade. |
91e1ee9
to
4dd4460
Compare
Pushed shfmt fixes. |
Thanks for clarifying.
Sorry to bother you again with this minor format requirement, but it's still failing for https://github.com/dracutdevs/dracut/actions/runs/4143068641/jobs/7164486549#step:4:15 |
4dd4460
to
26c0fab
Compare
We probably want to have this in a separated new NVMeoF module also @mwilck I notice that you are poking the network legacy module which we are planning on dropping so my question to you is how much do you think it's being used in the wild since downstream should have switched to iwd, connman,networkd,nm by now? |
I don't think that's a good idea. It would cause the nvmf and nvmeof modules to possibly conflict. There's is really just nvmeof (NVMe over Fabrics). Fibre Channel (which is what the current I agree that the name
The only changes to
My distro (SUSE/openSUSE) is still using network-legacy, and will continue to do so for the life time of SLE15 at least (EOL currently planned for 2031). At least that's what I expect. I designed the NVMeoF changes such that they just set cmdline parameters like |
Yes, that's a recurring discussion. We (SUSE) will not be removing the While the |
Yeah sure I just want to keep invested party aware that it's eventually going away. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In NBFT setups, VLAN can be configured in the firmware. Add the 8021q module in hostonly mode even if VLAN is currently not used to be prepared for such configuration change.
Since nvme-cli 2.0, configuration of subsystems to connect to is stored under `/etc/nvme` in either `discovery.conf` or `config.json`. Attempt discovery also if the latter exists, but not the former. Also, install "config.json" if it's present on the root FS. As before, "rd.nvmf.discover=fc,auto" will force either file to be ignored, and NBFT-defined targets take precedence if found.
ac66c00
to
f58e1d5
Compare
Merged and squashed timberland-sig#10, and rebased. This is ready for review now. |
@aafeijoo-suse, your issues should be addressed now. |
Add code to parse the Nvme-oF Boot Firmware Table (NBFT) according to the NVM Express Boot Specification 1.0 [1]. The implementation in dracut follows a similar general approach as iBFT support in the iscsi module. NBFT support requires two steps: (1) Setting up the network and routing according to the HFI ("Host Fabric Interface") records in the NBFT, (2) Establishing the actual NVMe-oF connection. (1) is accomplished by reading the NBFT using JSON output from the "nvme nbft show" command, and transforming it into command line options ("ip=", "rd.neednet", etc.) understood by dracut's network module and its backends. The resulting network setup code is backend-agnostic. It has been tested with the "network-legacy" and "network-manager" network backend modules. The network setup code supports IPv4 and IPv6 with static, RA, or DHCP configurations, 802.1q VLANs, and simple routing / gateway setup. (2) is done using the "nvme connect-all" command [2] in the netroot handler, which is invoked by networking backends when an interface gets fully configured. This patch adds support for "netboot=nbft". The "nbftroot" handler calls nvmf-autoconnect.sh, which contains the actual connect logic. nvmf-autoconnect.sh itself is preserved, because there are other NVMe-oF setups like NVMe over FC which don't depend on the network. The various ways to configure NVMe-oF are prioritized like this: 1 FC autoconnect from kernel commandline (rd.nvmf.discover=fc,auto) 2 NBFT, if present 3 discovery.conf or config.json, if present, and cmdline.d parameters, if present (rd.nvmf.discovery=...) 4 FC autoconnect (without kernel command line) The reason for this priorization is that in the initial RAM fs, we try to activate only those connections that are necessary to mount the root file system. This avoids confusion, possible contradicting or ambiguous configuration, and timeouts from unavailable targets. A retry logic is implemented for enabling the NVMe-oF connections, using the "settled" initqueue, the netroot handler, and eventually, the "timeout" initqueue. This is similar to the retry logic of the iscsi module. In the "timeout" case, connection to all possible NVMe-oF subsystems is attempted. Two new command line parameters are introduced to make it possible to change the priorities above: - "rd.nvmf.nonbft" causes the NBFT to be ignored, - "rd.nvmf.nostatic" causes any statically configured NVMe-oF targets (config.json, discovery.conf, and cmdline.d) to be ignored. These parameters may be helpful to skip attempts to set up broken configurations. At initramfs build time, the nvmf module is now enabled if an NBFT table is detected in the system. [1] https://nvmexpress.org/wp-content/uploads/NVM-Express-Boot-Specification-2022.11.15-Ratified.pdf [2] NBFT support in nvme-cli requires the latest upstream code (> v2.4). Signed-off-by: Martin Wilck <mwilck@suse.com> Co-authored-by: John Meneghini <jmeneghi@redhat.com> Co-authored-by: Charles Rose <charles.rose@dell.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
It would be good to add jq
package to at least to the Fedora test container. Even if we do not have test ready for the nvmf
dracut module, we can at least test to make sure that including this module does not interfere and does not have unexpected side effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just minor nitpicks, otherwise LGTM.
has_nbft() { | ||
local f found= | ||
for f in /sys/firmware/acpi/tables/NBFT*; do | ||
[ -f "$f" ] || continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This checks only for regular files. Is that intentional?
You could check for -e
(existence) instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok.
Actually, no. Technically, this check serves to avoid a case where no NBFT*
file exists, in which case $f
would equal /sys/firmware/acpi/tables/NBFT*
(as we don't use nullglob
). But in the (extremely unlikely) case that a directory entry NBFT*
existed in sysfs, and was not a regular file, we'd definitely want to skip it, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, folders, I didn't think of that - but It might be symlink or other file-like type as well f.e.; but if only regular files are expected/accepted, I'm fine with that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, only regular files should be accepted.
Note that [
(aka test
) dereferences symlinks for all flags except -h
and -L
. Thus if /sys/firmware/acpi/tables/NBFT*
was a symlink to some existing regular file, this test wouldn't skip it, which looks wrong. But this case would indicate a severely broken kernel. I believe we can ignore it. In the worst case, we'd include the NBFT code in the initramfs on a broken system.
done | ||
[[ $found ]] | ||
} | ||
|
||
[[ $hostonly ]] || [[ $mount_needs ]] && { | ||
pushd . > /dev/null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is not part of the PR, but does anyone know why pushd .
is here? I think this is basically a no-op.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it just checking that for_each_host_dev_and_slaves
does not run popd
? I'd love to know the reasoning behind this implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed not part of this PR, therefore I'd like to close this conversation.
AFAICS, the pushd/popd
code (like lots of the initial nvmf code) has been copied from 95iscsi/module-setup.sh
, where it had been added by @haraldh in commit b093aa2 ("beautified shell code") 10y ago, without further explanation. Perhaps at that time for_each_host_dev_and_slaves
could change the working directory? I don't know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, good to know the origin of it. Thanks for the response!
fi | ||
if [ ! -f /sys/class/fc/fc_udev_device/nvme_discovery ] \ | ||
&& [ ! -f /etc/nvme/discovery.conf ] \ | ||
&& [ ! -f /etc/nvme/config.json ] && ! has_nbft; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note: same usage of -f
vs -e
as stated above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-f
is correct here. We wouldn't want to account for non-regular files by the respective names.
[ "$RD_DEBUG" != yes ] || set -x | ||
|
||
if [ "$1" = timeout ]; then | ||
[ ! -f /sys/class/fc/fc_udev_device/nvme_discovery ] \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note: same usage of -f
vs -e
as stated above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, -f
is correct here. If this existed and was anything else but a regular file, we'd have to ignore it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack, thanks!
pushd/popd are bash builtin commands that allow the user to change directories (cd) by pushing and popping them as though they
were on a stack. I use this bash command all of the time.
Say `man pushd`
/John
…On 6/22/23 14:32, Pavel Valena wrote:
***@***.**** approved this pull request.
Just minor nitpicks, otherwise LGTM.
--------------------------------------------------------------------------------------------------------------------------------
In modules.d/95nvmf/module-setup.sh <#2184 (comment)>:
> @@ -25,17 +25,27 @@ check() {
[[ $trtype == "fc" ]] || [[ $trtype == "tcp" ]] || [[ $trtype == "rdma" ]]
}
+ has_nbft() {
+ local f found=
+ for f in /sys/firmware/acpi/tables/NBFT*; do
+ [ -f "$f" ] || continue
This checks only for regular files. Is that intentional?
You could check for |-e| (existence) instead.
--------------------------------------------------------------------------------------------------------------------------------
In modules.d/95nvmf/module-setup.sh <#2184 (comment)>:
> @@ -25,17 +25,27 @@ check() {
[[ $trtype == "fc" ]] || [[ $trtype == "tcp" ]] || [[ $trtype == "rdma" ]]
}
+ has_nbft() {
+ local f found=
+ for f in /sys/firmware/acpi/tables/NBFT*; do
+ [ -f "$f" ] || continue
+ found=1
+ break
+ done
+ [[ $found ]]
+ }
+
[[ $hostonly ]] || [[ $mount_needs ]] && {
pushd . > /dev/null
I know this is not part of the PR, but does anyone know why |pushd .| is here? I think this is basically a no-op.
--------------------------------------------------------------------------------------------------------------------------------
In modules.d/95nvmf/module-setup.sh <#2184 (comment)>:
> [[ $hostonly ]] || [[ $mount_needs ]] && {
pushd . > /dev/null
for_each_host_dev_and_slaves is_nvmf
local _is_nvmf=$?
popd > /dev/null || exit
[[ $_is_nvmf == 0 ]] || return 255
- if [ ! -f /sys/class/fc/fc_udev_device/nvme_discovery ]; then
- if [ ! -f /etc/nvme/discovery.conf ]; then
- echo "No discovery arguments present"
- return 255
- fi
+ if [ ! -f /sys/class/fc/fc_udev_device/nvme_discovery ] \
+ && [ ! -f /etc/nvme/discovery.conf ] \
+ && [ ! -f /etc/nvme/config.json ] && ! has_nbft; then
Just a note: same usage of |-f| vs |-e| as stated above.
--------------------------------------------------------------------------------------------------------------------------------
In modules.d/95nvmf/nvmf-autoconnect.sh <#2184 (comment)>:
>
-[ -f /sys/class/fc/fc_udev_device/nvme_discovery ] || exit 1
-echo add > /sys/class/fc/fc_udev_device/nvme_discovery
+[ "$RD_DEBUG" != yes ] || set -x
+
+if [ "$1" = timeout ]; then
+ [ ! -f /sys/class/fc/fc_udev_device/nvme_discovery ] \
Just a note: same usage of |-f| vs |-e| as stated above.
--------------------------------------------------------------------------------------------------------------------------------
In modules.d/95nvmf/module-setup.sh <#2184 (comment)>:
> @@ -25,17 +25,27 @@ check() {
[[ $trtype == "fc" ]] || [[ $trtype == "tcp" ]] || [[ $trtype == "rdma" ]]
}
+ has_nbft() {
+ local f found=
+ for f in /sys/firmware/acpi/tables/NBFT*; do
+ [ -f "$f" ] || continue
+ found=1
+ break
+ done
+ [[ $found ]]
+ }
+
[[ $hostonly ]] || [[ $mount_needs ]] && {
pushd . > /dev/null
Is it just checking that |for_each_host_dev_and_slaves| does not run |popd|? I'd love to know the reasoning behind this
implementation.
—
Reply to this email directly, view it on GitHub <#2184 (review)>,
or unsubscribe <https://github.com/notifications/unsubscribe-auth/ADCUXXI2QO4PHY3XCZIM4HDXMSFVXANCNFSM6AAAAAAUQCFMYI>.
You are receiving this because you were mentioned.Message ID: ***@***.***>
|
Note that these changes have been tested on Fedora with the Fedora QEMU based NVMe/TCP boot POC at: https://github.com/timberland-sig/rh-linux-poc I've been testing these patches with Fedora all along. I am able to boot with NVMe/TCP and they appear to work fine with a number of different network configurations, including a dual network nvme/tcp multipathing configuration. Note that Fedora uses Network Manager, so the network manager code path in dracut has been tested with these patches. |
Thank you all for your work here. |
@johnmeneghini Off course, it's not about the command itself, but about that full logic:
I.e. changing to current directory & exiting if popd fails.... what does logic this aim to capture/prevent/achieve? EDIT: nevermind, It was just copied from iscsi code; so there probably are some underlying reasons - not related to this PR. |
Changes
Add support for booting from NVMe over Fabrics according to the NVM Express Boot Specification v1.0 to dracut. Note that this is orthogonal to boot from NVMeoF over Fibre Channel, which is already supported by dracut (using features exclusive of the FC transport). The NVMe-oF Boot Spec is designed to be transport-agnostic (but v1.0 contains support for the TCP transport only).
The spec defines an ACPI table called "NBFT", in which the system firmware stores information about network configuration and NVMe subsystems used for booting (similar to the iBFT table for iSCSI). The code in this PR reads the NBFT table (parsing JSON as printed by nvme-cli), transforms it into
ip=
parameters that dracut's network setup code understands, and connects to the NVMe subsystems after network setup is complete. The code is modeled after the respective code for IBFT boot in the95iscsi
module.This PR depends on linux-nvme/nvme-cli#1791, which implements the nvme-cli subcommands
nvme show-nbft
andnvme connect-nbft
. Therefore this PR should not be merged before the syntax and semantics of the nvme subcommands have been consolidated (i.e. before the nvme-cli PR has been merged). In spite of that, I submit this PR here as RFC, to make the community aware of the feature. While it's expected that the nvme-cli command-line API will change somewhat, the fact that the NBFT content is printed in JSON format will not change, and the structure of the JSON itself will probably change only slightly, if at all. If the API changes in the nvme-cli PR, this PR will be adapted .Checklist
Notes
It is difficult to get one's hands on a sample setup for testing and playing around with boot over NVMeoF. A proof-of-concept setup using the Linux NVMe target (
nvmet
) and qemu is under preparation and will be published soon under https://github.com/timberland-sig.