WIP redfishpower: support cray supercomputing ex chassis & power control hierarchy #150
re-pushed with some changes given initial testing against hardware
Now, if an ancestor is off, a power off is considered successful, not an error. A power on is still an error.
"phased" power on requires delays in between levels. The delay between a blade and a node would be > 2 mins, probably 3 minutes to be on the safe side. Adding the delay between cmm & blade, we're probably looking at a 5-7 minute powerman timeout for A) horrible
Sooo ... we made a rule for now. You cannot power on two targets that have a parent/child relationship. You can still power off targets with a parent/child relationship.
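The rule above can be sketched as a check over the requested targets. This is a minimal illustration, not the redfishpower implementation: the `PARENTS` map, the plug names, and the function names are all hypothetical, and like the PR it assumes the parent relationships contain no loops.

```python
# Hypothetical plug -> parent map, for illustration only.
PARENTS = {
    "cmm0": None,
    "blade0": "cmm0",
    "blade1": "cmm0",
    "node0": "blade0",
    "node1": "blade0",
    "node2": "blade1",
}

def ancestors(plug, parents):
    """Return the ancestors of plug, nearest first (assumes no loops)."""
    chain = []
    cur = parents.get(plug)
    while cur is not None:
        chain.append(cur)
        cur = parents.get(cur)
    return chain

def check_power_on(targets, parents):
    """Return (ok, conflicts) for a power on request.

    conflicts lists every (ancestor, descendant) pair where both plugs
    were requested in the same command.
    """
    requested = set(targets)
    conflicts = []
    for plug in targets:
        for anc in ancestors(plug, parents):
            if anc in requested:
                conflicts.append((anc, plug))
    return (not conflicts, conflicts)
```

Under this sketch, requesting `cmm0` and `blade0` together is rejected, while requesting siblings or unrelated nodes is allowed. A power off request would skip the check entirely.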
Problem: A comment about the status polling interval is out of date. Update it to indicate the power on/off wait range is upwards of 50 seconds.
Problem: The status polling interval is hard coded to 1 second long. This can result in an excessive number of polling messages being sent when it is known that some hardware takes 20-50 seconds to complete a power operation. Solution: Support a modified "exponential backoff" of the status polling interval. The modified algorithm is based on observations of how long it typically takes to complete power operations on hardware. The status polling interval begins at one second, but it gets capped at 4 seconds.
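To illustrate why capping matters, here is a minimal sketch of a capped backoff. It assumes plain doubling from 1 second up to the 4 second cap; the commit describes a "modified" schedule tuned from hardware observations, and that exact schedule is not reproduced here. The function names are hypothetical.

```python
from itertools import islice

def poll_intervals(start=1.0, cap=4.0, factor=2.0):
    """Yield successive polling waits: start, start*factor, ... capped at cap."""
    delay = start
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)

def first_intervals(n, **kwargs):
    """Return the first n polling waits as a list."""
    return list(islice(poll_intervals(**kwargs), n))

def polls_needed(duration, **kwargs):
    """Count status polls sent until `duration` seconds have elapsed."""
    elapsed = 0.0
    count = 0
    for wait in poll_intervals(**kwargs):
        elapsed += wait
        count += 1
        if elapsed >= duration:
            return count
```

With these assumptions, a 50 second power operation needs 14 polls instead of the 50 that a fixed 1 second interval (`cap=1.0`) would send.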
Problem: When power control/query to a target fails, there is no way for a user to know why it failed except through the very verbose --telemetry output. Add a new --diag option to powerman that tells powermand to send diagnostic information about why a power operation failed. Common errors from the same host will be collapsed into a hostrange. This only works with setplugstate and the new setresult statement.
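The collapsing step could look roughly like the sketch below. It is an illustration only: the function names, the error strings, and the output layout are assumptions, not powerman's actual --diag output, and the simple range compression here drops leading zeros in numeric suffixes.

```python
import re
from collections import defaultdict

def compress(names):
    """Collapse names such as node0,node1,node3 into node[0-1,3]."""
    groups = defaultdict(list)  # prefix -> numeric suffixes
    plain = []                  # names without a numeric suffix
    for name in names:
        m = re.fullmatch(r"(.*?)(\d+)", name)
        if m:
            groups[m.group(1)].append(int(m.group(2)))
        else:
            plain.append(name)
    parts = list(plain)
    for prefix, nums in groups.items():
        nums = sorted(set(nums))
        ranges = []
        start = prev = nums[0]
        for n in nums[1:]:
            if n != prev + 1:       # gap found, close the current range
                ranges.append((start, prev))
                start = n
            prev = n
        ranges.append((start, prev))
        if len(ranges) == 1 and ranges[0][0] == ranges[0][1]:
            parts.append(f"{prefix}{ranges[0][0]}")
        else:
            body = ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)
            parts.append(f"{prefix}[{body}]")
    return ",".join(parts)

def collapse_errors(failures):
    """Group (target, message) pairs by message and compress the targets."""
    by_message = defaultdict(list)
    for target, message in failures:
        by_message[message].append(target)
    return {msg: compress(targets) for msg, targets in by_message.items()}
```

For example, three nodes failing with the same hypothetical "timeout" message would be reported once against `node[0-1,3]` rather than on three separate lines.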
Problem: The new --diag option is not documented. Add it in powerman(1).
Problem: The --bad-plug option in vpcd cannot be called multiple times to specify multiple bad plugs. Support calling --bad-plug multiple times by putting the bad plugs into an array.
Problem: There is no coverage for the new --diag option. Add tests in new t0036-diag.t tests.
Add device file for a HPE Cray Supercomputing EX Chassis. Fixes chaos#128
Problem: There is no testing for the new HPE Cray Supercomputing EX Chassis device file. Add new tests in t0029-redfish.t.
closing this as almost everything has been divided up
Per discussion in #81, #128, #129
Will split this into multiple PRs later on ... unfortunately all of this had to be done before I even begin to test :P
the series of commits;
what parenting does
If ancestors are all on, then redfishpower can perform power on/off/status on the target.
If ancestor is off, power status is defined as off for all descendants. Power on/off cannot be done.
As a special case, if powering on both ancestors and children (e.g. pm --on cmm,blade0,node[0-1]), descendants will wait until ancestor power ons are completed first. This could increase the runtime of the powerman client b/c of multiple "rounds" of power on, so we need a bigger timeout. I chose 100 seconds for the time being (need to test).
As a special case, if powering off both ancestors and children, descendants will wait until ancestor power offs are complete. Then by definition, descendants are now all off.
(Just to help differentiate the special cases, there is a difference between pm --on blade0 and pm --on cmm0,blade0: the former will check cmm0 first to determine if blade0 can be turned on. The latter will turn on cmm0 first, then, if that is successful, power on blade0.)
If any ancestors have status unknown or get an error, that status carries to descendants and results in errors (for a status query, that results in "unknown" status).
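The parenting rules above can be sketched in a few lines. This is a simulation-style illustration, not the redfishpower code: the plug names, the `query` callback, and both function names are hypothetical, and the parent map is assumed to be loop free.

```python
def ancestor_chain(plug, parents):
    """Return ancestors of plug, topmost first (assumes no loops)."""
    chain = []
    cur = parents.get(plug)
    while cur is not None:
        chain.append(cur)
        cur = parents.get(cur)
    return list(reversed(chain))

def effective_status(plug, parents, query):
    """Resolve status top-down, stopping at the first ancestor that is not on.

    query(plug) stands in for a real status request and returns
    "on", "off", or "unknown".
    """
    for anc in ancestor_chain(plug, parents):
        status = query(anc)
        if status == "off":
            return "off"        # descendants are defined as off
        if status != "on":
            return "unknown"    # unknown/error carries to descendants
    return query(plug)

def power_on_rounds(targets, parents):
    """Group targets by depth so ancestors are powered on in earlier rounds."""
    by_depth = {}
    for plug in targets:
        depth = len(ancestor_chain(plug, parents))
        by_depth.setdefault(depth, []).append(plug)
    return [sorted(by_depth[d]) for d in sorted(by_depth)]
```

Note that when an ancestor is off, `effective_status` never calls `query` on the target itself, which matches the goal of not sending messages to nodes that cannot respond. Each entry returned by `power_on_rounds` corresponds to one "round" of power on, which is where the longer client timeout comes from.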
annoyances
Gotta do "%%p" for plug substitution instead of "%p", b/c we're passing a string into redfishpower, which stores and parses it, whereas most device scripts are probably using '%s' as a "format". This is also one of the reasons I chose "%p" instead of "%s".
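A minimal sketch of the substitution that happens once the string reaches redfishpower, assuming it arrives containing a single "%p" placeholder and needs no "%%" escape handling. The function name and the example path are hypothetical, not taken from the device file.

```python
def substitute_plug(template, plug):
    """Replace a single "%p" placeholder in template with the plug name.

    Assumes at most one "%p" and no escape handling; more than one
    placeholder is rejected rather than guessed at.
    """
    if template.count("%p") > 1:
        raise ValueError("at most one %p placeholder is supported")
    return template.replace("%p", plug, 1)

# Hypothetical path, for illustration only.
print(substitute_plug("redfish/v1/Systems/%p/Actions", "Node0"))
```

A template with no placeholder is returned unchanged, so strings that do not vary by plug can go through the same code path.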
testing
All testing is done in simulation mode; I still need to test on real hardware. Hopefully it works. When running in verbose mode, I see the messages going out in the right order.
assumptions
with parenting, some code simply assumes no loops are possible. I think this is a fair assumption to avoid adding excess code for a rare case.
with "%p" substitution, assumes at max one "%p" and no need for "%%" escaping. I didn't implement a proper "loop through string, look for escape chars" kinda thing. Seemed excessive given what we actually need. But maybe I shouldn't have been so lazy.
if you power on parent, once status of parent is "on", all children can be powered on, no delay is necessary. But delay wouldn't be too hard to add, "delay code" is already there b/c of status polling delay code. (note, this "delay" is not in powerman land but in redfishpower land).
If the user does pm --on cmm0,blade0,node0 and cmm0/blade0 are on but node0 is not, I just assume the redfish protocol will work it out (i.e. sending an "on" will return success because it's already on). But I need to try against real hardware.
mini concerns
Due to a parent being "off", it is assumed all children are "off", including any node that is missing. This may not necessarily be what the user expects powerman to output. I think this is acceptable b/c 99% of the time parents are on (i.e. the chassis management head). It's the nodes down below that are typically turned off. Unfortunately, I think we just have to go with this, as the alternative (sending messages to nodes that won't respond) is what we are trying to avoid. If this is a big deal, we can add some type of "whatsup"-like support, where we ping in the background and identify those targets as gone/missing.