Nvmeof_Attempt* variables are only effective on warm boot (not on cold boot) #11

Open
LennySzubowicz opened this issue Mar 16, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@LennySzubowicz

The Nvmeof_Attemptconfig and NvmeofGlobalData variables that are created by NvmeOfCli.efi are persistent EFI boot-time variables. However, it appears that the NVMe-oF/TCP driver stack acts on them only after a warm reset, and not on a cold boot of the EFI system.

The persistent Nvmeof_Attempt variables should be effective on both warm and cold boots.

@LennySzubowicz
Author

LennySzubowicz commented Mar 17, 2023

To clarify, by "cold boot" I mean the start of a qemu process on the host OS to create a VM running OVMF with a previously set up OVMF vars file.

In contrast, by "warm boot" I mean invoking the "reset" command from the EFI shell with no other qualifiers. From EFI's point of view, that might be a "cold reset." The ovmf firmware needs to reboot. But it's doing so in the context of the same qemu process that it was running in before.

Steps to reproduce were:

  1. Boot the target VM and run start-tcp-target.sh.
  2. Boot the host VM and use the EFI Boot Device Selection (BDS) menu to boot to the EFI shell, and allow it to run the built startup.nsh, which runs nvmeofcli.efi with a crafted config file.
  3. Rename the startup.nsh file so it won't get invoked again.
  4. Using dmpstore -b -all nvme*, observe the presence of the expected EFI persistent (NV+BS) nvmeof attempt variables.
  5. Run the EFI shell reset command.
  6. On the reboot, use EFI BDS to boot to the EFI shell again.
  7. Observe that the nvmeof namespace device is now present along with fs1:
  8. dmpstore -b -all nvme* shows that the nvmeof attempt variables are still there.
  9. You could now exit from the EFI shell back to EFI BDS and boot Fedora from the boot entry for the nvmeof device.

The above is all good and demonstrates the working case.

  1. Quit out of the host VM, i.e. shut it down.
  2. Restart the host VM and use BDS to boot the EFI shell.
  3. Observe the continued presence of the expected nvmeof attempt variables.
  4. But the nvmeof device is not present.

This demonstrates the problem. If one now goes back to step 5 of the first list (the EFI shell reset command), then the nvmeof boot of Fedora works. But this reset step should not be necessary if the nvmeof attempt variables were previously defined and are still present.

@LennySzubowicz
Author

The bootlog from: -debugcon file:bootlog -global isa-debugcon.iobase=0x402

Attempt variables are already defined from a prior boot.
First a cold boot, stopping in EFI shell. Then reset (evidence of that at around line 2690). The boot after reset got all the way into grub before I interrupted it with ESC and exited to the EFI shell:

bootlog-cold-and-reset.txt

The output of devices, drivers after cold boot to EFI shell:

devices-cold.txt
drivers-cold.txt

The output of devices, drivers, and map after reset to EFI shell:

devices-reset.txt
drivers-reset.txt
map-reset.txt

@Douglas-Farley
Collaborator

The persistent Nvmeof_Attempt variables should be effective on both warm and cold boots.

We previously attempted to fix this by OR-ing in the non-volatile attribute (| EFI_VARIABLE_NON_VOLATILE), but I suspect that didn't quite cover this case.
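
For readers less familiar with the attribute bits being discussed, here is a minimal sketch of how the NV+BS label reported by dmpstore relates to the flag values. The three constants are the values defined in the UEFI specification; the helper names `describe_attributes` and `is_persistent_boot_variable` are hypothetical and are not part of this repo or of EDK2. This only illustrates the bitmask, it is not the driver's code.

```python
# UEFI variable attribute bits, with the values defined in the UEFI specification.
EFI_VARIABLE_NON_VOLATILE = 0x00000001
EFI_VARIABLE_BOOTSERVICE_ACCESS = 0x00000002
EFI_VARIABLE_RUNTIME_ACCESS = 0x00000004

# Short labels in the style of the "NV+BS" notation quoted earlier in this thread.
_SHORT_NAMES = [
    (EFI_VARIABLE_NON_VOLATILE, "NV"),
    (EFI_VARIABLE_BOOTSERVICE_ACCESS, "BS"),
    (EFI_VARIABLE_RUNTIME_ACCESS, "RT"),
]


def describe_attributes(attrs: int) -> str:
    """Hypothetical helper: render an attribute bitmask as e.g. 'NV+BS'."""
    names = [name for bit, name in _SHORT_NAMES if attrs & bit]
    return "+".join(names) if names else "none"


def is_persistent_boot_variable(attrs: int) -> bool:
    """Hypothetical helper: True when both the NV and BS bits are set."""
    required = EFI_VARIABLE_NON_VOLATILE | EFI_VARIABLE_BOOTSERVICE_ACCESS
    return (attrs & required) == required


if __name__ == "__main__":
    attrs = EFI_VARIABLE_BOOTSERVICE_ACCESS | EFI_VARIABLE_NON_VOLATILE
    print(describe_attributes(attrs), is_persistent_boot_variable(attrs))
```

Since the variables in this issue are already reported as NV+BS after a cold boot, the attribute mask alone would not explain the behaviour described above.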

@Douglas-Farley Douglas-Farley added the bug Something isn't working label Mar 20, 2023
@Douglas-Farley
Collaborator

@Ajay-Khadolia / @swamy-kadaba - Have you observed this by chance?

@amit-jain9

The bootlog from: -debugcon file:bootlog -global isa-debugcon.iobase=0x402

Attempt variables are already defined from a prior boot. First a cold boot, stopping in EFI shell. Then reset (evidence of that at around line 2690). The boot after reset got all the way into grub before I interrupted it with ESC and exited to the EFI shell:

bootlog-cold-and-reset.txt

The output of devices, drivers after cold boot to EFI shell:

devices-cold.txt drivers-cold.txt

The output of devices, drivers, and map after reset to EFI shell:

devices-reset.txt drivers-reset.txt map-reset.txt

From the attached logs for the cold boot, it looks like the NVMe-oF driver does attempt a connection to the target. The socket connection appears to be aborted due to a network transmit failure. The relevant log lines are below:

Lines 1929 to 1934, i.e., before the reset at line 2690:

Probe/Connect NQN: nqn.2014-08.org.nvmexpress:uuid:0c468c4d-a385-47e0-8299-6e95051277db
NVMeOFLog:892:spdk_nvme_probe:NVMe target address: 192.168.101.20
NVMeOFLog:1530:spdk_nvme_probe_async:trid trtype 3
NVMeOFLog:807:nvme_probe_internal:trid trstring TCP
NVMeOFLog:148:nvme_transport_ctrlr_scan:trid trstring TCP
Attaching to 192.168.101.20

Lines 2007 to 2014:

TcpTxCallback: Tx error reported: No mapping
NVMeOFLog:228:edk_sock_connect:TcpIoConnect error: 21

NVMeOFLog:1750:nvme_tcp_ctrlr_connect_qpair:sock connection error of tqpair=7E0CA018 with addr=192.168.101.20, port=4420
NVMeOFLog:1863:nvme_tcp_ctrlr_construct:failed to connect admin qpair
NVMeOFLog:674:nvme_ctrlr_probe:Failed to construct NVMe controller for SSD: 192.168.101.20
NVMeOFLog:818:nvme_probe_internal:NVMe ctrlr scan failed
NVMeOFLog:896:spdk_nvme_probe:Create probe context failed
spdk_nvme_probe() failed for 192.168.101.20

Does this happen always when we try to do a cold boot?
We run QEMU on an Ubuntu machine to boot to the UEFI shell and are unable to reproduce this across resets or multiple invocations of QEMU.
We have not tested this using a host VM to boot to the UEFI shell directly; instead we run QEMU and then boot to the UEFI shell.
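
To compare cold-boot and post-reset logs more easily, something like the following sketch could pull the connect failures out of a debugcon log. It relies only on the two message formats visible in the excerpts above ("NVMe target address: ..." and "TcpIoConnect error: ..."). The function name `summarize_connect_failures` is hypothetical, and the name lookup assumes the printed number is the EFI status code with the error bit stripped, which would make 21 correspond to EFI_ABORTED and 17 to EFI_NO_MAPPING per the UEFI specification. That assumption has not been checked against the driver source.

```python
import re

# Subset of UEFI error status codes (error bit stripped), per the UEFI specification.
# ASSUMPTION: the log prints the status in this stripped decimal form.
EFI_ERROR_NAMES = {
    17: "EFI_NO_MAPPING",
    18: "EFI_TIMEOUT",
    21: "EFI_ABORTED",
}

# Message formats taken from the log excerpts quoted in this thread.
TARGET_RE = re.compile(r"NVMe target address: (\S+)")
CONNECT_ERROR_RE = re.compile(r"TcpIoConnect error: (\d+)")


def summarize_connect_failures(log_text: str) -> list:
    """Return (target_address, code, name) for each TcpIoConnect error found."""
    failures = []
    current_target = None
    for line in log_text.splitlines():
        match = TARGET_RE.search(line)
        if match:
            # Remember the most recent target so later errors can be attributed to it.
            current_target = match.group(1)
            continue
        match = CONNECT_ERROR_RE.search(line)
        if match:
            code = int(match.group(1))
            name = EFI_ERROR_NAMES.get(code, "unknown")
            failures.append((current_target, code, name))
    return failures


if __name__ == "__main__":
    import sys

    # Usage: python summarize.py bootlog
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for target, code, name in summarize_connect_failures(handle.read()):
            print(f"{target}: TcpIoConnect error {code} ({name})")
```

Running it over the part of the log before the reset and the part after it should show whether the connect error appears only on the cold-boot pass.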

@Douglas-Farley
Collaborator

hi @amit-jain9

Does this happen always when we try to do a cold boot?

Yes. During the Timberland call yesterday, it was reported that both the Red Hat and SUSE POCs from this repo were experiencing this, roughly with the pattern:

cold boot to startup.nsh -> set config -> verify connections -> power off -> cold boot -> doesn't create connections -> warm reset -> reads vars and creates connections

@Ajay-Khadolia

We run QEMU on an Ubuntu machine using the command below:

sudo qemu-system-x86_64 --bios bios/OVMF.fd -m 8G -netdev tap,id=mynet0,ifname=tap1,script=no --device virtio-net-pci,netdev=mynet0,id=tap1,mac=52:54:00:12:34:56,romfile=empty.rom -drive file=file.qcow2 -cpu host -debugcon file:debug.log -global isa-debugcon.iobase=0x402 -enable-kvm

We have tested the following scenarios:

  1. QEMU -> Machine -> options -> reset
  2. Killed the QEMU session and restarted it
  3. Ran the reset command from the QEMU command line
  4. Restarted the Ubuntu VM

Attaching the logs for reference.
We are unable to reproduce the issue in any of these scenarios. Please suggest what else to try.
