Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nbft: Fix (struct nbft_info_subsystem_ns).num_hfis off-by-one #766

Merged
merged 5 commits into from
Feb 5, 2024

Conversation

tbzatek
Copy link
Contributor

@tbzatek tbzatek commented Jan 5, 2024

The num_hfis field only reflected the number of Secondary HFI Associations, resulting in the last parsed HFI being ignored by users (nvme-cli).

According to the NVM Express Boot Specification, Revision 1.0, the Primary HFI Descriptor Index in the Subsystem Namespace (SSNS) Descriptor contains this note:

If multiple HFIs are associated with this record, subsequent interfaces should be populated in the Secondary HFI Associations field.

As both the primary and secondary HFIs are parsed into a single array, it makes sense to reflect the proper number of elements.

Now, this may potentially have great impact. Looking at the sample NBFT tables we have, all SSNSs typically carry the same HFI index in both the Primary and Secondary HFI Descriptor Indexes, leading to duplicate entries. This may potentially lead to duplicate connection attempts and failure messages reported by nvme-cli.

The num_hfis field only reflected the number of Secondary HFI
Associations, resulting in the last parsed HFI being ignored
by users (nvme-cli).

According to the NVM Express Boot Specification, Revision 1.0,
the Primary HFI Descriptor Index in the Subsystem Namespace
(SSNS) Descriptor contains this note:

  "If multiple HFIs are associated with this record, subsequent
   interfaces should be populated in the Secondary HFI
   Associations field."

As both the primary and secondary HFIs are parsed into a single
array, it makes sense to reflect the proper number of elements.

Signed-off-by: Tomas Bzatek <tbzatek@redhat.com>
@tbzatek
Copy link
Contributor Author

tbzatek commented Jan 5, 2024

I've been testing multipath (sort-of) NBFT boot within qemu and found some more issues. Hell, I'm not even sure that my local setup is correct. Let's start with this one and clarify the HFI indexes first.

Cc: @mwilck @stuarthayes @Douglas-Farley @johnmeneghini

@@ -21,7 +21,7 @@ hfi_list[0]->tcp_info.host_name=nvmeof-sles
hfi_list[0]->tcp_info.this_hfi_is_default_route=1
hfi_list[0]->tcp_info.dhcp_override=1
subsystem_ns_list[0]->index=1
subsystem_ns_list[0]->num_hfis=1
subsystem_ns_list[0]->num_hfis=2
subsystem_ns_list[0]->hfis[0]->index=1
subsystem_ns_list[0]->hfis[1]->index=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense for hfis[0] and hfis[1] to have the same value? AFAICS the spec doesn't explicitly say it's illegal. The wording is: "If multiple HFIs are associated with this record, subsequent interfaces should be populated in the Secondary HFI Associations field". Because a HFI index identifies an interface uniquely, it makes no sense to me to list the same interface multiple times. It has the potential to confuse consumers of these values.

Same comment multiple times below.

Copy link

@Douglas-Farley Douglas-Farley Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Martin's comment, the HFI shouldn't be duplicated unnecessarily, but I don't think doing so should be bad.

@Douglas-Farley
Copy link

As both the primary and secondary HFIs are parsed into a single array, it makes sense to reflect the proper number of elements.

The intent was always that the secondary HFI's list would allow you to list out other HFIs from say a 4-port NIC that all had L2 or L3 access to a common Subsystem/Namespace combo address rather than re-write 4-copies of a duplicate SNSS record for each HFI:SNSS association.

The wording subsequent interfaces should be populated in the Secondary HFI Associations field. was intended to say do not duplicate the primary HFI into the secondary HFI list, but should doesn't forbid it. IIRC we went with "should" because the driver could ALSO just create a new SNSS record and NOT use a secondary field at all (style preferences debated during SIG meetings).

Because of this, IMO, if we're going to combine Primary and Secondary into one list (hfi_indexes) you need to dedupe primary and secondary entries in-case the pre-os driver did that to avoid accidently re-processing the connection.

@@ -46,7 +46,7 @@ def setUp(self):
"asqsz": 0,
"controller_id": 5,
"data_digest_required": False,
"hfi_indexes": [0],
"hfi_indexes": [0, 0],

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This list confuses me. Why is HFI index 0 repeated here? Was this just a test case?

@@ -21,7 +21,7 @@ hfi_list[0]->tcp_info.host_name=nvmeof-sles
hfi_list[0]->tcp_info.this_hfi_is_default_route=1
hfi_list[0]->tcp_info.dhcp_override=1
subsystem_ns_list[0]->index=1
subsystem_ns_list[0]->num_hfis=1
subsystem_ns_list[0]->num_hfis=2
subsystem_ns_list[0]->hfis[0]->index=1
subsystem_ns_list[0]->hfis[1]->index=1
Copy link

@Douglas-Farley Douglas-Farley Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Martin's comment, the HFI shouldn't be duplicated unnecessarily, but I don't think doing so should be bad.

@mwilck
Copy link
Contributor

mwilck commented Jan 8, 2024

I am confused, not so much by this PR, but by the data we already have in the repository. It seems that all NBFT tables that we gathered so far have a non-zero "Secondary HFI Associations Heap Object Reference" field with a single element, duplicating the primary HFI index. I have never realized this, and as I noted above, I don't think it makes a lot of sense.

Indeed, in the Timberland EDK2 code, in NvmeOfFillHostFabricInterfaceSections(), we have

   gNvmeOfNbftList[Index].DeviceAdapterIndex = HfiHeader->Index;

(note that gNvmeOfNbftList is just a 1-byte array in the EDK2 code), and later in NvmeOfFillSubsystemNamespaceSection(), there is this code:

    SubsystemNamespace->PrimaryHfiDescriptorIndex = HfiHeader->Index;
    // HFI association
    NvmeOfAddHeapItem (Heap, &(gNvmeOfNbftList[Index].DeviceAdapterIndex), sizeof (UINT8));
    SubsystemNamespace->HfiAssociationLen = sizeof (UINT8);

So this does exactly what we observe, it creates a secondary association list with length one, with same value initialized from HfiHeader->Index, which AFAICS is simply the first HFI that's referenced in the boot attempt at hand.

As a matter of fact, if this duplication didn't exist, our code would never have worked with any of the NBFs (without the current PR). Only because of the duplicate entry, our tools (like "nvme nbft show -o json) set num_hfis = 1, and would then print or use the first entry of the ssns->hfis array, which actually refers to the primary HFI.

IMO this is pretty big mess for consumers. If this PR is merged, with the NBFT tables we currently know (both Dell's and those generated by Timberland's EDK2 code), the primary HFI will be duplicated by the first entry of the secondary HFI list. I can't think of any situation where trying to treat these two entries as "multipath" would make sense. IOW, if this PR is merged, every consumer (NM, wicked, the nvme nbft plugin, and most importantly, the connect code) will need to be "fixed" to check for duplicates in the HFI list and skip them. Which would be highly undesirable.

OTOH, without this PR, NBFTs that don't have the duplicate "secondary HFI" entry won't be parsed correctly at all. In particular, connecting to any namespaces in such NBFTs would just not work, because discover_from_nbft() walks the hfis list.

I think that on top of this PR (or if @tbzatek agrees, in this PR), we must add code to filter out duplicates in libvme. Maybe it's sufficient to just check for the dupication of the primary HFI in the first element of the secondary HFI list, rather than checking the entire list. I am not sure about that.

@mwilck
Copy link
Contributor

mwilck commented Jan 8, 2024

Side note: libnvme's nbft-dump tool is inconsistent with the other consumers (nvme connect, nvme nbft show) as it walks the hfis list like this:

		for (hfi = (*ssns)->hfis, j = 0; hfi && *hfi; hfi++, j++)

whereas the other consumers use i < ssns->num_hfis as loop control statement. Therefore (for NBFTs with duplicate entries) nbft-dump prints 2 HFIs with the same index, while the other tools only see a single HFI.

@mwilck
Copy link
Contributor

mwilck commented Jan 9, 2024

The wording subsequent interfaces should be populated in the Secondary HFI Associations field. was intended to say do not duplicate the primary HFI into the secondary HFI list, but should doesn't forbid it. IIRC we went with "should" because the driver could ALSO just create a new SNSS record and NOT use a secondary field at all (style preferences debated during SIG meetings).

I don't quite understand. Why would the driver create a new SNSS record for the same namespace? That makes even less sense than listing the same interface for one SNSS record multiple times. Also, this argument seems to be about multiple distinct HFIs, where here we are seeing just one HFI listed repeatedly.

IMO the secondary HFI list should be populated if and only if a single SNSS record is reachable via multiple HFI entries, and multiple SNSS records should only be created if for multiple subsystems/namespaces (an exception being perhaps a SNSS record which is reachable via multiple different controllers, and thus listed in multiple distinct NBFTs—but that's a corner case).

But we are now facing the situation that exisisting NBFT tables do this differently and that the spec doesn't disallow that, and we need to do something about it.

@tbzatek
Copy link
Contributor Author

tbzatek commented Jan 10, 2024

I am confused, not so much by this PR, but by the data we already have in the repository. It seems that all NBFT tables that we gathered so far have a non-zero "Secondary HFI Associations Heap Object Reference" field with a single element, duplicating the primary HFI index. I have never realized this, and as I noted above, I don't think it makes a lot of sense.

Precisely. Let me share the second part of my story to make things even more complicated...

All the sample NBFT tables we have only contain a single HFI. Things go wrong when a second HFI is present with two SSNSs each supposedly pointing to its own HFI (note that this setup might not be entirely valid, just experimenting with timberland-sig/edk2#24). This scenario results in:

hfi_list[0]->index=1
hfi_list[0]->tcp_info.mac_addr=5254072c5ae
hfi_list[0]->tcp_info.ipaddr=192.168.122.158
...
hfi_list[1]->index=2
hfi_list[1]->tcp_info.mac_addr=5254072c5af
hfi_list[1]->tcp_info.ipaddr=192.168.123.158
...
subsystem_ns_list[0]->hfis[0]->index=1
subsystem_ns_list[0]->hfis[1]->index=1
...
subsystem_ns_list[1]->hfis[0]->index=1
subsystem_ns_list[1]->hfis[1]->index=2

Which due to the bug this ticket is about, the Secondary HFI Association never gets used by nvme-cli. It's still a mystery why the Primary HFI index always points to the first HFI while it clearly shouldn't for the second SSNS. Might be a misinterpretation of the specs or an actual bug in the EDK2 code.

So that was my question for the start - how the primary and secondary indexes were supposed to be interpreted.

Full dumps attached (unpatched libnvme):
nbft-dump.log
nvme-nbft-show.json

As a matter of fact, if this duplication didn't exist, our code would never have worked with any of the NBFs (without the current PR). Only because of the duplicate entry, our tools (like "nvme nbft show -o json) set num_hfis = 1, and would then print or use the first entry of the ssns->hfis array, which actually refers to the primary HFI.

Correct.

Side note: libnvme's nbft-dump tool is inconsistent with the other consumers (nvme connect, nvme nbft show) as it walks the hfis list like this:

		for (hfi = (*ssns)->hfis, j = 0; hfi && *hfi; hfi++, j++)

whereas the other consumers use i < ssns->num_hfis as loop control statement. Therefore (for NBFTs with duplicate entries) nbft-dump prints 2 HFIs with the same index, while the other tools only see a single HFI.

That's intentional. The libnvme's nbft-dump tool was created (for the purpose of libnvme unit tests) to dump unfiltered data from the parsed NBFT table while nvme nbft show -o json contains some logic and might hide certain data. Still, the (*ssns)->hfis array contains actually parsed data, just the ssns->num_hfis count caused an unreachable last element of the array.

IMO this is pretty big mess for consumers. If this PR is merged, with the NBFT tables we currently know (both Dell's and those generated by Timberland's EDK2 code), the primary HFI will be duplicated by the first entry of the secondary HFI list. I can't think of any situation where trying to treat these two entries as "multipath" would make sense. IOW, if this PR is merged, every consumer (NM, wicked, the nvme nbft plugin, and most importantly, the connect code) will need to be "fixed" to check for duplicates in the HFI list and skip them. Which would be highly undesirable.

OTOH, without this PR, NBFTs that don't have the duplicate "secondary HFI" entry won't be parsed correctly at all. In particular, connecting to any namespaces in such NBFTs would just not work, because discover_from_nbft() walks the hfis list.

I think that on top of this PR (or if @tbzatek agrees, in this PR), we must add code to filter out duplicates in libvme. Maybe it's sufficient to just check for the dupication of the primary HFI in the first element of the secondary HFI list, rather than checking the entire list. I am not sure about that.

Totally agree that we should play nice to existing consumers. A real hardware is out as well and needs to be supported.

https://github.com/linux-nvme/nvme-cli/blob/master/nbft.c#L146-L149

The current nvme-cli code will just ignore already connected duplicates in the success scenario, though in the failure scenario a second connection attempt will be made, leading to delays and duplicate error messages reported.

@tbzatek
Copy link
Contributor Author

tbzatek commented Jan 10, 2024

As both the primary and secondary HFIs are parsed into a single array, it makes sense to reflect the proper number of elements.

The intent was always that the secondary HFI's list would allow you to list out other HFIs from say a 4-port NIC that all had L2 or L3 access to a common Subsystem/Namespace combo address rather than re-write 4-copies of a duplicate SNSS record for each HFI:SNSS association.

The wording subsequent interfaces should be populated in the Secondary HFI Associations field. was intended to say do not duplicate the primary HFI into the secondary HFI list, but should doesn't forbid it. IIRC we went with "should" because the driver could ALSO just create a new SNSS record and NOT use a secondary field at all (style preferences debated during SIG meetings).

This was also one of the questions I was curious what the correct answer is. How does multipath actually should look like and what are the supported/correct scenarios? I.e.

  • a single SSNS record pointing to multiple HFIs --> a common traddr reachable via multiple NIC ports, or
  • several corresponding SSNS:HFI mappings with the target reachable on multiple (isolated) subnets and separate traddrs?

(Note that sysadmins are often creative leading to surprising results)

Because of this, IMO, if we're going to combine Primary and Secondary into one list (hfi_indexes) you need to dedupe primary and secondary entries in-case the pre-os driver did that to avoid accidently re-processing the connection.

And how would the pre-os driver actually handle these scenarios? I guess we should be consistent in an ideal case. Actual hardware vendor implementations may vary.

But we are now facing the situation that exisisting NBFT tables do this differently and that the spec doesn't disallow that, and we need to do something about it.

Right.

@mwilck
Copy link
Contributor

mwilck commented Jan 10, 2024

All the sample NBFT tables we have only contain a single HFI. Things go wrong when a second HFI is present with two SSNSs each supposedly pointing to its own HFI

Looking at the EDK2 code linked above, I am pretty sure multiple different HFIs per SNSS (and a non-trivial secondary HFI list in general) is not supported by the current timberland EDK2 code, and AFAICS the latest code in @trevor-cockrell's PR#25 hasn't changed in this respect. So the OVMF/EDK2 code simply doesn't support this. That's bad, because it will make testing and development much harder for us.

I know Dell has made some multipath tests with their server BIOS with our current code, and these worked for some reason. I need to take a closer look at the NBFT tables involved to understand why this worked.

@mwilck
Copy link
Contributor

mwilck commented Jan 11, 2024

I know Dell has made some multipath tests with their server BIOS with our current code, and these worked for some reason. I need to take a closer look at the NBFT tables involved to understand why this worked.

I believe the reason is the fallback code in discover_from_nbft(). The multipath NBFTs simply contained multiple HFI entries, and the routing was such that the target SSNS records would be reachable through multiple routes. I bet that no real NVMe multipath had been set up. Rather, it was some pseudo-multipath on the IP level, given by the fact that if one link was down, the IP stack would figure out a different route.

@mwilck
Copy link
Contributor

mwilck commented Jan 11, 2024

@tbzatek, I've created tbzatek#1 implementing the removal of duplicates.

mwilck and others added 3 commits January 15, 2024 17:49
The NVMe boot specification does not disallow listing the primary
HFI index again in the secondary HFI list, or listing the same
index multiple times in the secondary HFI list. But such duplicate
entries aren't helpful for consumers of this data. In the worst
case, they might lead to confusion and misconfiguration.
Suppress them.

Signed-off-by: Martin Wilck <mwilck@suse.com>
With commit "nbft: avoid duplicate entries in ssns->hfis" applied,
nbft-dump will not see any duplicate HFI indices any more.
Fix the reference output for generating the diffs.

Signed-off-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Tomas Bzatek <tbzatek@redhat.com>
Covered by the Python tests already, let's include this
table in the other NBFT parser tests (which are all
QEMU dumps).

Signed-off-by: Tomas Bzatek <tbzatek@redhat.com>
@tbzatek
Copy link
Contributor Author

tbzatek commented Jan 15, 2024

(rebased to clean the clutter a little, no functional changes)

@igaw igaw requested a review from Douglas-Farley January 18, 2024 14:50
@igaw
Copy link
Collaborator

igaw commented Jan 18, 2024

I understand that @mwilck is happy with the current version. @Douglas-Farley are also happy with this?

(I read through the issue and decided to be just the patch monkey)

What's special on this table is the second SSNS record that
is roughly equal to the first one except of the 'nsid' and
'nid' values, although only a single subsystem has been
configured and enabled in the EFI Setup.

Signed-off-by: Tomas Bzatek <tbzatek@redhat.com>
@tbzatek
Copy link
Contributor Author

tbzatek commented Jan 23, 2024

I've added one more table from the R660 machine we have inhouse. I can't explain presence of the second SSNS record (yet). The good news is that nvme-cli would match an existing controller from the first SSNS record and will ignore the second one (i.e. not taking the different nid and nsid values in account).

@igaw igaw merged commit cde0754 into linux-nvme:master Feb 5, 2024
14 checks passed
@igaw
Copy link
Collaborator

igaw commented Feb 5, 2024

Thanks!

@Douglas-Farley
Copy link

Sorry, was sick and missed the @! Thanks @igaw / @tbzatek / @mwilck for running the patch to ground. I'll take a look at the edk2 component this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants