RFE: Perform actual discovery in discover_from_nbft() #2219
Comments
Thanks for bringing this up. I thought that we already used the discovery records, but indeed we don't 😁 TL;DR: Yes, I think we should do this. The Boot spec is rather vague about the subject of discovery. What does the firmware actually write to the NBFT if it is configured to do discovery, after having found namespace(s) for booting?
Figure 15 of the NVMe boot spec says about the Primary Discovery Controller Index in the SSNS record: "If a Discovery controller was used to establish this record this value shall be set to a non-zero value", and "shall" means mandatory, so the FW is not allowed to omit the discovery controller record(s) (although booting would probably still succeed if it did, and simply didn't set the "discovered" flag). I see nothing in the spec that would forbid 1). What should the OS do? IMO:
I wonder if we need a command line option to tell nvme-cli whether or not it should try to use NBFT discovery records, to differentiate 5. and 6. There's one corner case for 5): connection to the SSNS record(s) listed in the NBFT succeeds, but the root FS is not found on any of these namespaces. Maybe discovery would turn up additional subsystems/namespaces that contain the root FS. This would arguably be a misconfiguration. I wonder if we need to be able to deal with it. Alternatively, we could always do 6) (discovery from NBFT discovery descriptors), but depending on the setup, that might delay and complicate booting if the discovery controllers have lots of target records to connect to. In practice, if NVMeoF boot is used with discovery, I expect that customers won't configure the environment such that a boot environment would see hundreds of namespaces. The firmware would need a long long time to connect to all the subsystems and might actually fail if the list is too long (not sure what EFI can do in this respect). This, in turn, would mean that either dedicated discovery controllers would be used for booting, or that we would use a different host NQN at boot time (and the discovery controller would hand over only a subset of the available records to the "boot" host NQN).
I don't quite get this. If the FW wasn't able to obtain any SSNS records, how did it boot the OS, after all? Also, AFAICS the spec doesn't mandate that the FW write discovery records for non-functional discovery controllers into the NBFT. IMO a more likely scenario is that the SSNS record from the NBFT is inaccessible but discovery still works (and possibly turns up a different SSNS record with an alternative path to the root FS). This would be covered by 5) above.
Thanks for your reply, there are some good points that I haven't really thought about. I really like the current
This is what I've been seeing with the Dell hardware, haven't tried with EDK2. See the sample table I added in linux-nvme/libnvme#781.
Either way, we won't be able to do much with that in userspace.
Totally agree.
during boot = dracut? I would suggest taking the SSNS Primary Discovery Controller Index into account. I.e. if there's a Discovery record but none of the SSNS records point to it, discovery should be performed. If there is at least one associated SSNS record, no re-discovery should be done during the dracut phase (or perhaps it should, in case all SSNS connections from this DC failed).
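The rule suggested in this comment can be sketched as follows. This is only an illustration in Python (nvme-cli itself is written in C); the `DiscoveryRecord` and `SsnsRecord` classes, their fields, and the `retry_if_all_failed` switch are hypothetical simplifications and not the libnvme NBFT structures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DiscoveryRecord:
    """Hypothetical stand-in for an NBFT Discovery Descriptor entry."""
    index: int  # 1-based position in the Discovery Descriptor List
    uri: str


@dataclass
class SsnsRecord:
    """Hypothetical stand-in for an NBFT SSNS record."""
    subsys_nqn: str
    primary_discovery_ctrl_index: int  # 0 means no discovery controller was used
    connected: bool = False            # outcome of the OS-side connection attempt


def needs_discovery(dc: DiscoveryRecord,
                    ssns_records: List[SsnsRecord],
                    retry_if_all_failed: bool = True) -> bool:
    """Decide whether the initrd phase should query this discovery controller."""
    linked = [s for s in ssns_records
              if s.primary_discovery_ctrl_index == dc.index]
    if not linked:
        # No SSNS record points at this DC: the pre-OS driver got nothing from it.
        return True
    if retry_if_all_failed and not any(s.connected for s in linked):
        # Every SSNS connection that originated from this DC failed.
        return True
    return False
```

With `retry_if_all_failed=False` the function implements only the first half of the suggestion, i.e. it never re-discovers from a DC that already produced at least one SSNS record.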
after booting = post-switchroot? We'll need to handle situations where some block devices referenced in fstab may need to be present (and systemd will wait for them, blocking startup). In such case, some service needs to take care of this in early boot phase. Think of placing /var or /srv on a different SSNS than rootfs. Sounds like we may need to split
It would certainly come in handy for debugging, even if not actually used.
Hmm, this may very well be caused by a quirky pre-OS driver (wrt. case 1.)). Of course, a certain responsibility falls on the admin: good practice would be to place the EFI system partition on the same namespace as the rootfs.
I tend to agree with the split. Would be interesting to get more opinions... @johnmeneghini @igaw ?
It already takes 2-3 minutes with our four-DC testbed. This might take much more time during pre-OS connection phase and potentially a similar amount during dracut phase. Might even possibly hit some global timeout. Shall we perhaps move NBFT connection attempts into threads within nvme-cli? :-)
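As a rough illustration of the threading idea, connection attempts could be issued concurrently so that one slow or dead path does not serialize the others. The sketch below is Python rather than nvme-cli's C, and `connect` is a hypothetical callable supplied by the caller; nothing here reflects an existing nvme-cli interface.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def connect_all_parallel(targets: Iterable[str],
                         connect: Callable[[str], bool],
                         max_workers: int = 8) -> Dict[str, bool]:
    """Run connect(target) for every target in worker threads.

    Returns a mapping of target -> success. An exception raised by
    connect() is recorded as a failure instead of aborting the batch.
    """
    results: Dict[str, bool] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(connect, t): t for t in targets}
        for fut in as_completed(futures):
            target = futures[fut]
            try:
                results[target] = bool(fut.result())
            except Exception:
                results[target] = False
    return results
```

Whether this helps in practice depends on where the time is spent; if the delay is dominated by a single global timeout, parallelism only bounds the total wait to roughly the slowest attempt.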
I was trying to describe case 4.), e.g. a secondary path that is down and thus cannot be discovered during the pre-OS phase, with the expectation that re-discovery would happen later once the network is available.
Yes, precisely.
Our original intent was that any namespace that is discovered (directly or via a discovery subsystem) will have an SSNS entry. A compliant pre-OS driver will also have minimally populated an SSNS record for each NID in question. Those namespaces should all have come from something like an attempt structure in the edk2 reference, and those attempts are administratively populated towards either an IO or a Discovery subsystem. If it was a discovery subsystem, then a discovery record should have been created and the SSNS in question should point to that discovery record. Perhaps a pre-OS driver with mDNS might populate the discovery table without there being an SSNS record (i.e. an unprovisioned host), but if there were any discovered namespaces, they should both have an SSNS record and point back to the first discovery controller in the chain. Starting from the discovery controller referenced by an SSNS record is better because that allows for administrative orchestration and re-location of the underlying resource, and for resilience to better routes and topology steering. My point being: from the POV of an OS application trying to re-establish connectivity, I would start from the discovery subsystem indicated in the SSNS record and search for the NID. If you cannot find the NID from that direction, then falling back to connecting directly to the IO subsystem indicated in the SSNS record is reasonable, but if there was a Discovery subsystem then obviously it had to have been administratively specified. Recapping:
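The reconnection order described in this comment (discovery subsystem first, direct connection to the SSNS-indicated IO subsystem as a fallback) might look roughly like the sketch below. It is written in Python for illustration only; the `Ssns` and `LogEntry` classes and the `discover`/`connect` callables are hypothetical and do not correspond to libnvme or nvme-cli functions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set


@dataclass
class Ssns:
    """Hypothetical, reduced view of an SSNS record."""
    nid: str
    subsys_nqn: str
    traddr: str
    discovery_uri: Optional[str]  # None if no discovery controller was used


@dataclass
class LogEntry:
    """Hypothetical, reduced view of a discovery log page entry."""
    subsys_nqn: str
    traddr: str


def reconnect(ssns: Ssns,
              discover: Callable[[str], Iterable[LogEntry]],
              connect: Callable[[str, str], Set[str]]) -> Optional[str]:
    """Try to reach the namespace identified by ssns.nid.

    discover(uri) yields log entries (empty if the DC is unreachable);
    connect(nqn, traddr) returns the NIDs exposed by that subsystem
    (empty on failure). Returns "discovery", "direct", or None.
    """
    if ssns.discovery_uri is not None:
        # Preferred path: ask the administratively configured DC first.
        for entry in discover(ssns.discovery_uri):
            if ssns.nid in connect(entry.subsys_nqn, entry.traddr):
                return "discovery"
    # Fallback: connect straight to the IO subsystem named in the SSNS record.
    if ssns.nid in connect(ssns.subsys_nqn, ssns.traddr):
        return "direct"
    return None
```

The sketch deliberately ignores what to do with subsystems that were connected during the search but did not expose the NID; a real implementation would have to decide whether to keep or drop them.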
What is the status here? I lost the overview...
Working on it, will hopefully publish something next week or the week after. Now that the firmware has gotten related fixes through timberland-sig/edk2#35, I'm going to use it as a testing baseline. This ticket has grown large and could use breaking into smaller parts for the boot phases discussion. Lots of loose ideas though.
Okay, posted #2315 as my first attempt for the actual discovery from NBFT implementation. Test setup described in linux-nvme/libnvme#821, generated by timberland-sig/edk2#35. This is roughly how it's currently set to work:
Good to close?
Well, there are some thoughts in the above conversation that might be worth implementing. We're currently working with @johnmeneghini to identify further requirements and perhaps settle on how NBFT discovery should be done.
Assume the NVMe/TCP boot attempts are configured to point to discovery controllers.
As called e.g. by nvme connect-all --nbft, discover_from_nbft() uses SSNS records for the actual connection. This is okay since performing the actual discovery is primarily the responsibility of the pre-OS (UEFI) driver.
However, if the defined discovery controller is inaccessible for some reason, e.g. because of a broken multipath, the SSNS records obviously won't get populated. Still, the NBFT Discovery Descriptor List will likely contain the original DC entries, and userspace may want to perform additional discovery and connection (e.g. when a path comes back up).
SSNS records carry a link to the Discovery Controller list and a 'discovered' flag, so there is actual evidence of pre-OS discovery attempts.
Mechanisms for calling nvme connect-all --nbft are being discussed in #2179.