Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Startup behaviour if sensors missing is inconsistent and unreliable #33

Open
Krellan opened this issue Mar 25, 2023 · 1 comment
Open

Comments

@Krellan
Copy link
Member

Krellan commented Mar 25, 2023

Currently, there are a number of bugs in swampd that affect the behavior of the program upon startup, if any of the configured sensors are missing.

This came up in the discussion of https://gerrit.openbmc.org/c/openbmc/phosphor-pid-control/+/61545 and it has been mentioned previous times also, an example being #31 for MissingIsAcceptable (which provides only a partial solution).

Here's some of the misfeatures:

  • The entity-manager D-Bus Inventory JSON path is different, in behavior, than the hardcoded text file JSON path. One of the paths simply ignores missing sensors, as if they never existed, which is wrong behavior. The other path simply does a retry for missing sensors, preventing the program from ever completing its initialization, which is also wrong behavior.

  • The internal retry loop of swampd is disruptive, causing D-Bus queries, such as those sent from busctl commands, to be lost. This is a different bug, but becomes noticeable because of the retry loop that keeps happening if a sensor is missing.

  • If a sensor goes missing during normal operation, that is treated differently than a sensor that was noticed to be missing during initialization. This behavior should be made consistent.

  • The correct behavior has not been specified, so it is difficult to determine what is a bug and what isn't. Also, people might have implementations now where they depend on the current behavior, and changing it to be "better" might just introduce bugs in their already-working systems.

Here's something I propose:

  • For each sensor, if a PID loop specifies it with MissingIsAcceptable true, and the sensor is not found at startup, then ignore it, as if it was never configured to be part of that PID loop.

  • For each sensor, if a PID loop specifies it with MissingIsAcceptable false, and the sensor is not found at startup, then that PID loop should be forced to be in failsafe mode (regardless of what the other sensors say).

  • For each sensor, if it was not found at startup, but then the sensor is found later, re-evaluate all the PID loops with that sensor, so that the sensor can be correctly added to those PID loops. This will help to take a PID loop out of failsafe mode, if that PID loop was earlier missing a sensor that's been found just now.

  • If enough sensors are missing, such that a PID loop ends up containing zero input sensors, then still instantiate that PID loop, but force that PID loop into failsafe mode.

  • However, if all of the sensors in the PID loop have been specified with MissingIsAcceptable true, this changes the behavior, because this means it is acceptable for the entire PID loop to be missing. In this case, do not set the PID loop into failsafe mode. Initialize the PID loop input to be always equal to the setpoint. This will have the effect of making the PID loop a no-op.

  • There should be no difference in behavior between startup and normal operation. Instead of retrying at startup, just take whatever is there during the first pass. During normal operation, monitor D-Bus object creation and deletion, and react accordingly when the sensors get created or deleted.

Are there any other behavioral things that should be considered and more accurately specified here? Once we have agreement on what the program should be doing, instead of the current buggy behavior, then we can go about fixing these bugs.

@Krellan
Copy link
Member Author

Krellan commented May 3, 2023

Here's some more background.

There are known issues with phosphor-pid-control and missing sensors at the start of the run.

The failsafe feature only covers sensors that exist at the start of the run, but then start reporting bad values later during the run.

The failsafe feature also only covers bad values. The sensor must still be there, to report the bad values. If the sensor goes missing during the run, that exposes undefined behavior.

The stability of sensor existence is really a challenge that needs to be handled below the layer of phosphor-pid-control, as it is more of an issue for dbus-sensors and entity-manager to manage.

However, there should be better defined behavior in phosphor-pid-control if a configured sensor is missing at the start of the run, hence this feature request.

As of now, the behavior is different between old config.json and new entity-manager code paths! That's a bug.

The correct behavior of phosphor-pid-control is only defined if the sensors exist at the start of the run, and the sensors remain in existence throughout the entire run.

If sensors are added or deleted during the run, that is UB (undefined behavior). Even if it is just a quick delete-then-add-again, during an initialization or recovery code path, that is still UB. Sometimes it will work correctly, and sometimes it will not, and it cannot be relied upon, because it's UB.

It would be desirable to fix this, I believe, however, are there any users out there now who depend on certain quirks of this undefined behavior? Please make it known now, otherwise this UB is fair game to be changed, in an effort to clean it up and make it better defined.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant