Change controller loss timeout for NBFT connections? #7

Open · mwilck opened this issue May 24, 2023 · 13 comments

@mwilck (Collaborator) commented May 24, 2023

Problem[^1]

If an NVMe connection to a subsystem is unavailable during boot for more than 10 minutes, the kernel will stop its connection attempts after the default timeout of 600 s (ctrl_loss_tmo). This causes booting to fail, even if the subsystem comes online later.

Should we enforce an infinite timeout for NBFT-specified subsystem connections? If yes, how should it be done?

  1. nvme connect-all supports the --ctrl-loss-tmo option, which can be used to override the setting manually. We could change our dracut code to use this option by default[^2]. dracut could use the standard rd.timeout option[^3] to make sure that the NVMe timeout and the user-defined timeout are consistent (see the sketch after this list).
  2. Alternatively, we could use a udev rule to set the timeout. This can only be done in the initramfs; otherwise we'd change general system behavior in undesirable ways. Even in the initramfs, we might apply the setting to subsystems that aren't necessary for booting, which is probably not what users expect. SUSE is currently testing this for the SLE15-SP5 environment, where connect-all --nbft --ctrl-loss-tmo=$X is not available.
  3. Question: Should we use an infinite timeout for NBFT devices by default? Or if connect-all --nbft is invoked? Or should we instead require the user to pass --ctrl-loss-tmo explicitly?
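
A minimal sketch of option 1, as it might appear in a dracut hook script (illustrative only, not the actual SUSE code). It assumes that an unset rd.timeout means "wait forever", and that --ctrl-loss-tmo=-1 is passed through to the kernel as "never give up reconnecting", which matches the kernel's treatment of negative values:

```sh
#!/bin/sh
# Hypothetical dracut initqueue hook; getarg comes from dracut-lib.sh.
type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh

# rd.timeout is dracut's generic wait timeout in seconds; unset means
# "wait forever", which we map to --ctrl-loss-tmo=-1.
tmo=$(getarg rd.timeout=)
nvme connect-all --nbft --ctrl-loss-tmo="${tmo:--1}"
```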

Summary of discussion (WIP)[^4]

  • The timeouts we set while attempting the initial connection will persist after mounting the root file system, where they will govern the behavior for connection loss. Users may have different preferences for either case.
  • Infinite timeout has pros and cons, during boot as well as during runtime. This is policy; we shouldn't hard-code anything, but we need to set reasonable defaults.
  • Therefore we should make this configurable. @mwilck's first approach used the existing dracut parameter rd.timeout for this. Do we need a separate parameter?
  • The NBFT may contain entries that are not required for booting. We shouldn't set timeouts for these, but it's hard to determine which entries are "required" and which aren't.
  • The same applies for local configuration such as discovery.conf, but that's a minor issue, because we'll normally just call connect-all --nbft from dracut.
  • The Non-bootable Entry flag could be leveraged to determine whether a controller is "required". It remains to be clarified whether that would work, and how. Probably the flag can only be used from inside nvme connect-nbft; otherwise the dracut logic would become exceedingly complex.
  • --ctrl-loss-tmo will apply to both IO controllers and discovery controllers. For the initial connect attempt, it makes sense to use the same value, but at runtime, an infinite timeout for a discovery controller seems weird. Minor issue.
  • Short-term, we will assume that the same timeout must be applied to all NBFT-defined controllers.
  • After switching root, the timeouts which were set in the initramfs can be changed using udev rules (see the sketch after this list).
  • In udev rules, both in the initramfs and in the real root, determining whether a given controller is required for the root FS is hard. A dedicated tool may be necessary for this. This is out of scope for the timberland discussion.
  • In multipath scenarios, different timeout settings might be reasonable, but it's complicated to figure out whether a given controller is part of an NVMe multipath setup. NVMe multipath is beyond the scope of Timberland. Ignore this for now.
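
The udev mechanism mentioned in the bullets above could look like the following rule; shipped only in the initramfs, it would implement option 2 from the problem statement. The file name is hypothetical; the ctrl_loss_tmo sysfs attribute is writable on fabrics controllers in recent kernels, and a negative value means "never give up reconnecting":

```
# /etc/udev/rules.d/70-nvme-nbft-ctrl-loss-tmo.rules (hypothetical name)
# Blanket rule: applies to every NVMe/TCP controller, including ones that
# may not be needed for booting (the caveat discussed above).
ACTION=="add", SUBSYSTEM=="nvme", KERNEL=="nvme[0-9]*", ATTR{transport}=="tcp", ATTR{ctrl_loss_tmo}="-1"
```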

Footnotes

[^1]: This problem was originally reported to SUSE by Dell in private bug 1211080.

[^2]: The connect-nbft command did not support this option.

[^3]: rd.timeout defaults to infinity.

[^4]: I will try to maintain a summary here, lest people must read the entire thread.

@mwilck (Collaborator, Author) commented May 24, 2023

@johnmeneghini @LennySzubowicz @Douglas-Farley @rahlvers @igaw for information.

mwilck self-assigned this May 24, 2023
@Douglas-Farley commented:

> If an NVMe connection to a subsystem is unavailable during boot for more than 10 minutes

It's not just boot, correct? If the rootfs is on that connection and another connection to that NID hasn't been established, the rootfs can hang post-boot too, right?

Conversely, this is only a problem in the all-paths-down case, correct?

> Even in the initramfs, we might apply the setting to subsystems that aren't necessary for booting, which is probably not what users expect.

The NBFT can't even discriminate between those that are necessary and those that aren't. At best, some entries might administratively have the Non-bootable Entry Flag set in their SNSS records; but that flag is optional, because DXE can't know anything about bootability on its own and relies on administrative input.

> Question: Should we use an infinite timeout for NBFT devices by default?

Today this seems like the most reasonable solution in my mind.

A more complex approach might be:

  1. If a controller connection was discovered directly or indirectly by the BIOS (i.e., it has an SNSS record associated), and
  2. the Non-Bootable Entry Flag is cleared on any namespace associated with that controller; then
  3. apply an infinite timeout.

This is complicated, though, because it would require pre-processing all the SNSS records and grouping them. jq might be able to do some of the lifting here (see the sketch below), but it's still going to look messy in bash.
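
For illustration, the pre-processing might start out like this. The show-nbft JSON layout and field names here are guesses, so treat this as a sketch of the shape of the problem rather than working code:

```sh
#!/bin/sh
# Hypothetical: list SNSS records whose Non-Bootable Entry Flag is clear,
# assuming "nvme show-nbft -o json" emits a "subsystem" array with "index"
# and "not_bootable" fields (both field names are guesses).
nvme show-nbft -o json |
    jq -r '.[].subsystem[]? | select(.not_bootable != 1) | .index'
# A wrapper would then connect only those records with --ctrl-loss-tmo=-1
# and leave the default timeout for everything else.
```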

In the future, we might need a policy like "infinite timeout for the named NID of an attempt" or something similarly convoluted; but attempt data is an EDK UI reference construct, not exactly an NBFT-ism. We could propose some longer-term code that only assigns the non-bootable flag to namespaces that are not directly referenced (assuming an attempt defines a namespace NID only).

cc: @charles-rose

@mwilck (Collaborator, Author) commented May 24, 2023

> It's not just boot, correct? If the rootfs is on that connection and another connection to that NID hasn't been established, the rootfs can hang post-boot too, right?

Yes, of course. Sorry for not stating that more clearly.

> Conversely, this is only a problem in the all-paths-down case, correct?

Right, I was considering a single-path setup here. For multipath, you want to set fast_io_fail_tmo to a low value, to allow quick failover. But there may be some complexities with NVMe KATO and TCP timeouts/retries, especially for NVMe over TCP. I'm not an expert in that area; @igaw can perhaps tell us more.
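
For illustration, such a multipath-oriented setup might combine the two knobs like this (the 5 s value is only an example, and this assumes your nvme-cli accepts both options on connect-all):

```sh
# Keep reconnecting to each path indefinitely, but fail pending I/O over
# to a healthy path after 5 seconds:
nvme connect-all --nbft --ctrl-loss-tmo=-1 --fast-io-fail-tmo=5
```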

> > Even in the initramfs, we might apply the setting to subsystems that aren't necessary for booting, which is probably not what users expect.
>
> The NBFT can't even discriminate between those that are necessary and those that aren't. At best, some entries might administratively have the Non-bootable Entry Flag set in their SNSS records; but that flag is optional, because DXE can't know anything about bootability on its own and relies on administrative input.

I think that initially it'd be OK to assume that all NBFT-defined connections are necessary. But the initramfs may also contain config.json or discovery.conf, and these files may contain lots of entries that are totally unrelated to booting.

> > Question: Should we use an infinite timeout for NBFT devices by default?
>
> Today this seems like the most reasonable solution in my mind.
>
> A more complex approach might be:
> [...]
> it's still going to look messy in bash.

Indeed. If we want to go down that route, I think it should be done in nvme-cli itself. The logic in connect-all --nbft can see the SNSS records and the Non-bootable Entry flag, and could implement this timeout logic directly, without complex logic having to be layered on top. I wonder whether it would be OK to do this always for NBFT entries, or only if no explicit --ctrl-loss-tmo option was specified. Or we could define a special value for the option, such as --ctrl-loss-tmo=nbft, meaning that the logic you described should be applied.

@mwilck (Collaborator, Author) commented May 24, 2023

For completeness, in the SUSE bug there was also the idea to add a field for the timeout in the NBFT itself. But IMHO we don't want to do that.

@igaw commented May 25, 2023

> Right, I was considering a single-path setup here. For multipath, you want to set fast_io_fail_tmo to a low value, to allow quick failover. But there may be some complexities with NVMe KATO and TCP timeouts/retries, especially for NVMe over TCP. I'm not an expert in that area; @igaw can perhaps tell us more.

Unfortunately, there is a lot of complexity involved with all these different timeouts. The NVMe subsystem doesn't enforce any policy, so it is possible to set timeout values that can easily lead to unstable setups (see TP4129).

As far as I understood the question here, it's about the ctrl object getting removed after dev_loss_tmo fires. In this context the other timeouts are not relevant, as they only handle the error case when a path fails, not the lifetime behavior of the ctrl object. The main issue I am aware of with setting dev_loss_tmo to infinity for TCP is that it will constantly use resources (reconnect attempts). In contrast to FC, there is currently no way for the target to signal the host "please try to reconnect". @hreinecke might have some more insights here as well.

@mwilck (Collaborator, Author) commented May 25, 2023

> The main issue I am aware of with setting dev_loss_tmo to infinity for TCP is that it will constantly use resources (reconnect attempts)

True, but what else can we do if the controller is necessary for accessing the root FS?

In the multipath case it's suboptimal if there are other healthy paths. But it is out of scope for us here in the Timberland project to discuss generic timeout issues for NVMe multipath.

@igaw commented May 25, 2023

Obviously, I wanted to say ctrl_loss_tmo instead of dev_loss_tmo.

> True, but what else can we do if the controller is necessary for accessing the root FS?

The error case handling depends on the requirements. Is the system expected to retry forever? Without a timeout, the system will behave the same as with an NFS-mounted root filesystem. If this is the expectation, I don't know of any other way to achieve it than disabling the ctrl_loss_tmo handler.

@mwilck (Collaborator, Author) commented May 25, 2023

To avoid confusion (/me had been confused):

For NVMe over TCP (and RDMA), ctrl_loss_tmo sets both the timeout for removing a previously existing controller that went offline, and the timeout for the initial connection attempt.
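
A sketch to make that concrete; the address, NQN, and controller name are illustrative:

```sh
# 1. Initial connect: with --ctrl-loss-tmo=-1 the kernel keeps retrying
#    the first connection indefinitely instead of giving up after 600 s.
nvme connect -t tcp -a 192.0.2.1 -s 4420 \
    -n nqn.2023-05.example:subsys1 --ctrl-loss-tmo=-1
# 2. Runtime: the same value governs when an established controller that
#    went offline is removed; on recent kernels it can be inspected (and
#    changed) via sysfs.
cat /sys/class/nvme/nvme0/ctrl_loss_tmo
```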

@mwilck (Collaborator, Author) commented May 25, 2023

> Without a timeout, the system will behave the same as with an NFS-mounted root filesystem

Right. And I think that's what most users would expect. If a timeout strikes, the root FS will see IO errors and go read-only, and depending on system settings, the system may halt, reboot, or even panic.

It's worth discussing whether the behavior with a timeout is actually more reasonable than the system just getting stalled and unresponsive. This is another question that goes beyond the scope of the Timberland group.

But it makes me realize that we should not hard-code an infinite timeout. Some users will actually prefer something different, even without multipath. My current approach for SUSE was to use the existing rd.timeout parameter. I still think this makes sense, but note that rd.timeout is infinite by default.

Users who want a finite timeout even for controllers that support the root FS could add their own udev rules to change the timeout after the system has come up. I think that's sufficient flexibility.
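
Such a user-supplied rule in the real root might look like this; the file name, the NQN, and the 30-minute value are all illustrative:

```
# /etc/udev/rules.d/71-nvme-finite-ctrl-loss-tmo.rules (hypothetical name)
# Override the infinite timeout set in the initramfs: give up on this
# particular subsystem after 30 minutes.
ACTION=="add", SUBSYSTEM=="nvme", ATTR{subsysnqn}=="nqn.2023-05.example:subsys1", ATTR{ctrl_loss_tmo}="1800"
```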

@Douglas-Farley commented:
> For completeness, in the SUSE bug there was also the idea to add a field for the timeout in the NBFT itself. But IMHO we don't want to do that.

Please take a look at BTG TPAR 8029 then and raise any objections, because it is actually proposing exactly that.

@Douglas-Farley commented May 25, 2023

> Right. And I think that's what most users would expect. If a timeout strikes, the root FS will see IO errors and go read-only, and depending on system settings, the system may halt, reboot, or even panic.

This is certainly what comes to mind when I think of a failure on a single-path system, or an all-paths-down failure on the root FS.

> This is another question that goes beyond the scope of the Timberland group.

Agreed. I view our role in Timberland and BTG as helping the ecosystem align on a common, standard way to converge the administrative domain from pre-OS to OS interop. I think the handling of namespaces attached as part of that is something we can help turn into a better experience via standardization, but choosing the right default policy for a given application is out of our hands here.

@mwilck (Collaborator, Author) commented May 25, 2023

@Douglas-Farley: I re-read the paragraph about the "non-bootable entry flag" in the NVMe boot spec. The wording in the spec says that if this flag is 1, it means that the FW didn't use the SNSS record, but that "an OS runtime may still require the contents of such a namespace to complete later stages of boot". To me that does not sound as if such SNSS records can be assumed to be "not required", or that they should use a shorter timeout value than other SNSS records.

@Douglas-Farley commented:
The typical pre-OS driver doesn't know much about a namespace, only whether it contains an ESP. We could add logic via UEFI HII for administrative flags that in turn influence the NBFT, but we don't have a lot of options yet. Certainly a space for NVMe.org to add features, if we can clearly define them and their utility.
