nbft: Fix (struct nbft_info_subsystem_ns).num_hfis off-by-one #766
The num_hfis field only reflected the number of Secondary HFI Associations, resulting in the last parsed HFI being ignored by users (nvme-cli). According to the NVM Express Boot Specification, Revision 1.0, the Primary HFI Descriptor Index in the Subsystem Namespace (SSNS) Descriptor contains this note: "If multiple HFIs are associated with this record, subsequent interfaces should be populated in the Secondary HFI Associations field." As both the primary and secondary HFIs are parsed into a single array, it makes sense to reflect the proper number of elements. Signed-off-by: Tomas Bzatek <tbzatek@redhat.com>
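To illustrate the off-by-one described in the commit message, here is a minimal Python sketch. It is not libnvme code: `collect_hfi_indexes` and the two count helpers are hypothetical names that only model how a primary index and a secondary list end up in one array, and how the reported count differs before and after the fix.

```python
def collect_hfi_indexes(primary, secondary):
    """Model of the parser: primary and secondary HFI indexes share one array."""
    return [primary] + list(secondary)


def num_hfis_before_fix(secondary):
    """Old behaviour (as described above): only secondary associations are counted."""
    return len(secondary)


def num_hfis_after_fix(primary, secondary):
    """Fixed behaviour: the count matches the number of array elements."""
    return len(collect_hfi_indexes(primary, secondary))


# Values mirror the sample tables discussed in this PR:
# primary index 1, one secondary association that is also 1.
hfis = collect_hfi_indexes(1, [1])
old_count = num_hfis_before_fix([1])
new_count = num_hfis_after_fix(1, [1])
```

A consumer that iterates `hfis[:old_count]` never reaches the last element, which is the behaviour this PR fixes.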
I've been testing multipath (sort-of) NBFT boot within qemu and found some more issues. Hell, I'm not even sure that my local setup is correct. Let's start with this one and clarify the HFI indexes first.
test/nbft/diffs/NBFT-dhcp-ipv6
@@ -21,7 +21,7 @@ hfi_list[0]->tcp_info.host_name=nvmeof-sles
 hfi_list[0]->tcp_info.this_hfi_is_default_route=1
 hfi_list[0]->tcp_info.dhcp_override=1
 subsystem_ns_list[0]->index=1
-subsystem_ns_list[0]->num_hfis=1
+subsystem_ns_list[0]->num_hfis=2
 subsystem_ns_list[0]->hfis[0]->index=1
 subsystem_ns_list[0]->hfis[1]->index=1
Does it make sense for hfis[0] and hfis[1] to have the same value? AFAICS the spec doesn't explicitly say it's illegal. The wording is: "If multiple HFIs are associated with this record, subsequent interfaces should be populated in the Secondary HFI Associations field". Because an HFI index identifies an interface uniquely, it makes no sense to me to list the same interface multiple times. It has the potential to confuse consumers of these values.
Same comment multiple times below.
I agree with Martin's comment, the HFI shouldn't be duplicated unnecessarily, but I don't think doing so should be bad.
The intent was always that the secondary HFI list would allow you to list other HFIs from, say, a 4-port NIC that all had L2 or L3 access to a common Subsystem/Namespace combo address, rather than rewriting 4 copies of a duplicate SSNS record for each HFI:SSNS association. The wording Because of this, IMO, if we're going to combine Primary and Secondary into one list (
libnvme/tests/test-nbft.py
@@ -46,7 +46,7 @@ def setUp(self):
 "asqsz": 0,
 "controller_id": 5,
 "data_digest_required": False,
-"hfi_indexes": [0],
+"hfi_indexes": [0, 0],
This list confuses me. Why is HFI index 0 repeated here? Was this just a test case?
I am confused, not so much by this PR, but by the data we already have in the repository. It seems that all NBFT tables that we have gathered so far have a non-zero "Secondary HFI Associations Heap Object Reference" field with a single element, duplicating the primary HFI index. I had never realized this, and as I noted above, I don't think it makes a lot of sense. Indeed, in the Timberland EDK2 code, in NvmeOfFillHostFabricInterfaceSections(), we have
(note that
So this does exactly what we observe: it creates a secondary association list of length one, with the same value initialized from

As a matter of fact, if this duplication didn't exist, our code would never have worked with any of the NBFTs (without the current PR). Only because of the duplicate entry, our tools (like "nvme nbft show -o json") set num_hfis = 1, and would then print or use the first entry of the

IMO this is a pretty big mess for consumers. If this PR is merged, with the NBFT tables we currently know (both Dell's and those generated by Timberland's EDK2 code), the primary HFI will be duplicated by the first entry of the secondary HFI list. I can't think of any situation where trying to treat these two entries as "multipath" would make sense. IOW, if this PR is merged, every consumer (NM, wicked, the nvme nbft plugin, and most importantly, the connect code) will need to be "fixed" to check for duplicates in the HFI list and skip them, which would be highly undesirable.

OTOH, without this PR, NBFTs that don't have the duplicate "secondary HFI" entry won't be parsed correctly at all. In particular, connecting to any namespaces in such NBFTs would just not work, because

I think that on top of this PR (or if @tbzatek agrees, in this PR), we must add code to filter out duplicates in libnvme. Maybe it's sufficient to just check for the duplication of the primary HFI in the first element of the secondary HFI list, rather than checking the entire list. I am not sure about that.
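The two filtering options mentioned above can be sketched in Python as follows. This is only an illustration under assumptions: `unique_hfi_indexes` and `drop_leading_duplicate` are hypothetical helper names, not libnvme functions, and plain integers stand in for parsed HFI descriptors.

```python
def unique_hfi_indexes(primary, secondary):
    """Check the entire list: keep the first occurrence of each index, in order."""
    seen = set()
    result = []
    for idx in [primary, *secondary]:
        if idx not in seen:
            seen.add(idx)
            result.append(idx)
    return result


def drop_leading_duplicate(primary, secondary):
    """Cheaper variant: only drop the first secondary entry if it repeats the primary."""
    rest = list(secondary)
    if rest and rest[0] == primary:
        rest = rest[1:]
    return [primary, *rest]
```

For the tables known so far (primary 1, secondary [1]) both variants yield a single entry. They differ only when the secondary list itself repeats an index, for example secondary [1, 2, 2], where the cheaper variant keeps the repeated 2.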
Side note: libnvme's
whereas the other consumers use
I don't quite understand. Why would the driver create a new SSNS record for the same namespace? That makes even less sense than listing the same interface for one SSNS record multiple times. Also, this argument seems to be about multiple distinct HFIs, whereas here we are seeing just one HFI listed repeatedly. IMO the secondary HFI list should be populated if and only if a single SSNS record is reachable via multiple HFI entries, and multiple SSNS records should only be created for multiple subsystems/namespaces (an exception being perhaps an SSNS record which is reachable via multiple different controllers, and thus listed in multiple distinct NBFTs, but that's a corner case). But we are now facing the situation that existing NBFT tables do this differently and that the spec doesn't disallow that, and we need to do something about it.
Precisely. Let me share the second part of my story to make things even more complicated... All the sample NBFT tables we have only contain a single HFI. Things go wrong when a second HFI is present with two SSNSs each supposedly pointing to its own HFI (note that this setup might not be entirely valid, just experimenting with timberland-sig/edk2#24). This scenario results in:
Due to the bug this ticket is about, the Secondary HFI Association never gets used by nvme-cli. It's still a mystery why the Primary HFI index always points to the first HFI while it clearly shouldn't for the second SSNS. It might be a misinterpretation of the spec or an actual bug in the EDK2 code. So that was my question at the start: how were the primary and secondary indexes supposed to be interpreted? Full dumps attached (unpatched libnvme):
Correct.
That's intentional. The libnvme's
Totally agree that we should play nice with existing consumers. Real hardware is out as well and needs to be supported. The current nvme-cli code (https://github.com/linux-nvme/nvme-cli/blob/master/nbft.c#L146-L149) will just ignore already connected duplicates in the success scenario, though in the failure scenario a second connection attempt will be made, leading to delays and duplicate error messages.
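The success and failure scenarios can be modelled with a short Python sketch. This is not the nvme-cli logic from nbft.c; `connect_all` and the `try_connect` callback are hypothetical, and the sketch only assumes that an index which is already connected gets skipped.

```python
def connect_all(hfi_indexes, try_connect):
    """Return the list of connection attempts made for a (possibly duplicated) HFI list."""
    attempts = []
    connected = set()
    for idx in hfi_indexes:
        if idx in connected:
            continue  # duplicate of an existing connection: silently ignored
        attempts.append(idx)
        if try_connect(idx):
            connected.add(idx)
    return attempts


# Duplicated HFI list as seen in the sample tables: primary 1, secondary [1].
attempts_on_success = connect_all([1, 1], lambda idx: True)
attempts_on_failure = connect_all([1, 1], lambda idx: False)
```

When the first attempt succeeds, the duplicate is skipped. When it fails, nothing marks the index as connected, so the same interface is tried again, which corresponds to the delay and the repeated error message described above.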
This was also one of the questions I was curious about. What should multipath actually look like, and what are the supported/correct scenarios? I.e.
(Note that sysadmins are often creative leading to surprising results)
And how would the pre-os driver actually handle these scenarios? I guess we should be consistent in an ideal case. Actual hardware vendor implementations may vary.
Right.
Looking at the EDK2 code linked above, I am pretty sure multiple different HFIs per SSNS (and a non-trivial secondary HFI list in general) is not supported by the current Timberland EDK2 code, and AFAICS the latest code in @trevor-cockrell's PR#25 hasn't changed in this respect. So the OVMF/EDK2 code simply doesn't support this. That's bad, because it will make testing and development much harder for us. I know Dell has run some multipath tests with their server BIOS and our current code, and these worked for some reason. I need to take a closer look at the NBFT tables involved to understand why this worked.
I believe the reason is the fallback code in
The NVMe boot specification does not disallow listing the primary HFI index again in the secondary HFI list, or listing the same index multiple times in the secondary HFI list. But such duplicate entries aren't helpful for consumers of this data. In the worst case, they might lead to confusion and misconfiguration. Suppress them. Signed-off-by: Martin Wilck <mwilck@suse.com>
With commit "nbft: avoid duplicate entries in ssns->hfis" applied, nbft-dump will not see any duplicate HFI indices any more. Fix the reference output for generating the diffs. Signed-off-by: Martin Wilck <mwilck@suse.com> Signed-off-by: Tomas Bzatek <tbzatek@redhat.com>
Covered by the Python tests already, let's include this table in the other NBFT parser tests (which are all QEMU dumps). Signed-off-by: Tomas Bzatek <tbzatek@redhat.com>
(rebased to clean the clutter a little, no functional changes)
I understand that @mwilck is happy with the current version. @Douglas-Farley, are you also happy with this? (I read through the issue and decided to be just the patch monkey)
What's special about this table is the second SSNS record that is roughly equal to the first one except for the 'nsid' and 'nid' values, although only a single subsystem has been configured and enabled in the EFI Setup. Signed-off-by: Tomas Bzatek <tbzatek@redhat.com>
I've added one more table from the R660 machine we have in-house. I can't explain the presence of the second SSNS record (yet). The good news is that nvme-cli would match an existing controller from the first SSNS record and will ignore the second one (i.e. not taking the different
Thanks!
The num_hfis field only reflected the number of Secondary HFI Associations, resulting in the last parsed HFI being ignored by users (nvme-cli).

According to the NVM Express Boot Specification, Revision 1.0, the Primary HFI Descriptor Index in the Subsystem Namespace (SSNS) Descriptor contains this note: "If multiple HFIs are associated with this record, subsequent interfaces should be populated in the Secondary HFI Associations field."

As both the primary and secondary HFIs are parsed into a single array, it makes sense to reflect the proper number of elements.
Now, this may have a significant impact. In the sample NBFT tables we have, all SSNSs typically carry the same HFI index in both the Primary and Secondary HFI Descriptor Indexes, leading to duplicate entries. This may lead to duplicate connection attempts and duplicate failure messages reported by nvme-cli.