We have a torture test CI/CD setup in our datacenter that cycles host power on a Talos II at a high rate. Sporadically, the BMC sees the NCSI link drop out during the power on process (presumably due to PERST# assertion by the CPU during IPL), and rarely the entire BMC kernel will lock up / fail to recover from the NCSI link drop.
It looks like there may have been some work in this general area in 2cc234e, but I don't know if this needs tweaking or even applies to the main PCIe reset being asserted.
The BMC just dumps the standard NCSI transmit link lost warning:
It does look like this has been a problem for a very long time, including back on the original proprietary firmware: openbmc/openbmc#2288
The proprietary firmware is much more prone to entering this condition than the open firmware, so it seems something is being handled better in this firmware stack, just not enough to catch 100% of whatever corner case / race condition is in play.
Can you provide a test case / the script that you're using to trigger this?
I could see something like "if powered off, turn on the system" on the BMC side, plus a service in the OS that shuts down immediately on bootup, being a reasonable way to do it, but it would be good to use exactly what you have if possible.
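For reference, the BMC-side half of that guess could be sketched roughly as below. This is only an illustration of the idea, not the reporter's actual script: `is_powered_off` and `power_on` are hypothetical callables standing in for whatever mechanism the BMC uses to query and control host power, and the polling interval is an arbitrary assumption.

```python
import time


def power_cycle_loop(is_powered_off, power_on, cycles,
                     poll_interval_s=1.0, sleep=time.sleep):
    """Power the host on every time it is found off.

    is_powered_off and power_on are hypothetical callables supplied by the
    caller; nothing here talks to real hardware. The host OS is assumed to
    shut itself down right after boot, which is what closes the loop.
    Returns the number of power-on requests issued.
    """
    issued = 0
    while issued < cycles:
        if is_powered_off():
            power_on()
            issued += 1
        else:
            # Host is still up (or booting); wait before polling again.
            sleep(poll_interval_s)
    return issued
```

The callables are injected so the loop can be exercised against a fake host before pointing it at a real machine.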
You're correct that the linked commit was an attempt to improve this (reboot cycles on the OS) - I had noticed the BMC printout late during development and added a fix that definitely improved things.
Looking at that commit, I see that while the reset is in progress we don't respond to BMC packets for what looks like up to 150ms. It would be possible to respond to packets even during a reset, but we would need to be careful to avoid a race between the BMC packets and the reset completing.
So, a 150ms minimum can certainly go over the 200ms boundary and trigger the watchdog, depending on other events in the APE fw or sequencing.
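To make the margin concrete, here is a small arithmetic sketch. The 150ms and 200ms figures are the ones quoted in this thread; the "extra delay" parameter is a hypothetical stand-in for whatever other APE fw events or sequencing add on top, and has not been measured.

```python
RESET_HOLDOFF_MS = 150  # from this thread: time BMC packets go unanswered
WATCHDOG_MS = 200       # from this thread: boundary at which the watchdog fires


def watchdog_margin_ms(extra_delay_ms,
                       holdoff_ms=RESET_HOLDOFF_MS,
                       watchdog_ms=WATCHDOG_MS):
    """Return the slack left before the watchdog boundary (negative = exceeded)."""
    return watchdog_ms - (holdoff_ms + extra_delay_ms)


def trips_watchdog(extra_delay_ms,
                   holdoff_ms=RESET_HOLDOFF_MS,
                   watchdog_ms=WATCHDOG_MS):
    """True when the total unresponsive time goes over the watchdog boundary."""
    return watchdog_margin_ms(extra_delay_ms, holdoff_ms, watchdog_ms) < 0
```

With these numbers only 50ms of slack is left, so any additional delay beyond that would push the unresponsive window past the boundary.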
It may be possible to shorten the reset time; however, I'm not sure how long PERST# is being held low. Looking at the code now, though, it looks like this may be doable without the timer entirely, reducing it to the minimum reset time. The other (better) option is to allow control packets through but not data, or to send a temporary link down message to the BMC. (I don't like the link down message as much, as it's nice having the reset be transparent to the BMC.)
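The "control packets through, but not data" option could be modelled roughly as follows. This is a sketch of the gating decision only, not the firmware's actual code: the function names and the `reset_in_progress` flag are hypothetical, it assumes untagged Ethernet frames (no VLAN header), and it identifies NC-SI control traffic by the NC-SI EtherType 0x88F8. It does not address the race with the reset completing mentioned above.

```python
NCSI_ETHERTYPE = 0x88F8   # EtherType used by NC-SI control packets
ETH_HEADER_LEN = 14       # dst MAC (6) + src MAC (6) + EtherType (2), untagged


def ethertype(frame: bytes) -> int:
    """Return the EtherType of an untagged Ethernet frame."""
    if len(frame) < ETH_HEADER_LEN:
        raise ValueError("frame shorter than an Ethernet header")
    return int.from_bytes(frame[12:14], "big")


def should_forward(frame: bytes, reset_in_progress: bool) -> bool:
    """Decide whether a frame from the BMC is handled right now.

    Outside a reset everything passes. During a reset only NC-SI control
    frames pass; pass-through data frames are dropped until it completes.
    """
    if not reset_in_progress:
        return True
    return ethertype(frame) == NCSI_ETHERTYPE
```

Keeping the control channel answering during the reset is what would let the reset stay transparent to the BMC, at the cost of briefly dropping data frames.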
Note that I won't be able to test this for a couple of weeks at the earliest.