-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
redfishpower: allow timeout to be set by device script #72
Conversation
885a467
to
44a4647
Compare
re-pushed with some minor tweaks on types ... forgot the fields in timeval are signed longs. Also did an integer overflow check. Maybe it's excessive and unnecessary, but with a future change, it felt important for consistency. |
@@ -234,7 +234,8 @@ _prompt_loop(void) | |||
|| !strcmp(buf, "setstatpath") | |||
|| !strcmp(buf, "setonpath") | |||
|| !strcmp(buf, "setoffpath") | |||
|| !strcmp(buf, "setcyclepath")) { | |||
|| !strcmp(buf, "setcyclepath") | |||
|| !strcmp(buf, "settimeout")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on your note in #67 where on and off took very different amounts of time, does it make sense to have a per-command timeout rather than just one?
I can't tell from the code whether "cycle" is covered by this timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In redfishpower land, yes "cycle" is covered by this timeout. But in powerman land, it depends if "cycle" is native or not.
If "cycle" is an "off" and then "on", then the timeout is for a single off/on. powerman's 60 second timeout would encompass the entire "off" and "on" together.
hmmm ... interesting idea. I'm going back and forth on it. It's a good idea, but presumably we only need to cover the worst case timeout. Since powerman would normally return faster for the faster cases anyways.
Also, i like the idea of making powerman's timeout and redfishpower's timeout the same, to make things easier for configuration. But I see your point ... if you're modifying device files, maybe the user is advanced enough already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to think all's fair in the device files as long as we document what's going on.
I think we have to make sure that a redfishpower timeout is interpreted in some way as failed completion by the script rather than failure to produce expected output as the latter would cause the powerman timeout to kick in and redfishpower to be restarted, only after it expires. I know we can fail a status query right away but I'm not so sure about an on/off operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally there would be no timeouts but if we are going to occasionally fail I think we want to do it as fast as possible so powerman can become responsive again...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have to make sure that a redfishpower timeout is interpreted in some way as failed completion by the script rather than failure to produce expected output
Right now it'll output something like node1: timeout
, which is what I did in ipmipower. Would that trigger the scriptlet in the expected way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure we have any way to fail a power operation other than to let the powerman timeout expire. E.g. if you return "timeout" and it is not expected, that will eventually fail. But it won't fail fast. I'll open a bug.
etc/redfishpower-cray-r272z30.dev
Outdated
# - CAUTION: If you intend to use this file as the basis for a different | ||
# Redfish system, please note the following. | ||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thought to avoid the repetition would be to add a section to the redfishpower(8) man page and in each dev file just replace the whole block with a pointer to the man page?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea, i forgot I had a redfishpower manpage :-) (so it needs to be updated with the new prompt additions too)
44a4647
to
922811e
Compare
re-pushed, putting all the information and more "how to find the right paths" info in the manpage. For the time being, i elected to let it just be |
etc/redfishpower-cray-r272z30.dev
Outdated
@@ -33,6 +33,8 @@ specification "redfishpower-cray-r272z30" { | |||
expect "redfishpower> " | |||
send "setcyclepath redfish/v1/Systems/Self/Actions/ComputerSystem.Reset {\"ResetType\":\"ForceRestart\"}\n" | |||
expect "redfishpower> " | |||
send "settimeout 60\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hang on, how does this even work? If the redfish timeout is reached, then the on/off command is not going to return a string that matches what the device script expects, so powerman will sit there until its timeout is reached, and then restart redfishpower. Is there any point to having a timeout for on/off commands in redfishpower if this is the case?
It's a different story with status since that can be made to return unknown
and be matched in the expect part of the script but not assigned as a plug state (leaving it unknown). So it could have its own timeout that is much lower than the powerman timeout to make sure that returns quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hang on, how does this even work? If the redfish timeout is reached, then the on/off command is not going to return a string that matches what the device script expects, so powerman will sit there until its timeout is reached, and then restat redfishpower. Is there any point to having a timeout for on/off commands in redfishpower if this is the case?
Lets say the average time it takes to turn on and verify something is on is 70 seconds. With redfishpower's current 60 second timeout, an "on" will never work. It'll report "timeout" all the time. So we need to have a way to increase it so that it can appropriately wait a little longer and report "ok" when the "on" is done.
So I think in this made up example, we want powerman's timeout and redfishpower's timeout to be (lets say) 90 seconds. powerman's timeout is 90 seconds so it won't give up on the script too quickly. and redfishpower's increased timeout is to make sure it has the right amount of time to wait.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah but why have both timeouts? There is no way for the redfish timeout to occur and not also trigger the powerman timeout. At that point, may as well just use the powerman timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is the redfishpower timeout is independent of the powerman timeout.
Using the 70 second "on" example earlier. If powerman timeout is 90 seconds and redfishpower's timeout stays at the default 60 seconds, then "on" will never work. Redfish will always report "timeout" to powerman. We want to set redfishpower's timeout to (lets say) 90 seconds, so it has the chance to report "success" to powerman.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wait, i think I get what you're saying now. Are you basically saying don't have a timeout in redfishpower
? Just have it spin infinitely for "on" or "off" to complete? And if powerman times out, it'll kill redfishpower?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what we're going to get done in the near term, but I agree #76 should be high on our list.
I got sort of lost here. Right now does a power command succeed (exit 0) when the command times out underneath? If so that's an egregious bug IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, my understanding is that "command completed successfully" is always output.
Example me setting redfishpower timeout to 20 seconds, powerman timeout to 60 seconds, "on" normally takes around 40 seconds to do an "on".
$ time pm -T --off hetchy33; time pm -T --on hetchy33
send(redfishpower-peaks): 'off phetchy33\n'
recv(redfishpower-peaks): 'phetchy33: ok\nredfishpower> '
Command completed successfully
0.000u 0.001s 0:07.97 0.0% 0+0k 0+8io 0pf+0w
send(redfishpower-peaks): 'on phetchy33\n'
recv(redfishpower-peaks): 'phetchy33: timeout\nredfishpower> '
Command completed successfully
0.000u 0.001s 0:21.30 0.0% 0+0k 0+0io 0pf+0w
I suspect it's due to the way the scriptlet is normally written?
script on_ranged {
send "on %s\n"
expect "redfishpower> "
}
b/c "redfishpower > " is eventually seen, it's always "command completed successfully"? Related to discussion in #79??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK IMHO that needs to be fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this issue covered by #79? Or do you feel is different issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll go ahead and open a separate issue. The problem is that the expect script allows failure to be interpreted as success. It should only expect success and fail otherwise (after timeout, unfortunately, but that's what we have to work with right now).
922811e
to
7a1b145
Compare
just rebased, fixing up random conflicts from ongoing work |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think we decided offilne that this can go in just so that it's possible to increase the overall timeout in a dev script beyond 60s (by increasing both the powerman timeout and this timeout).
Problem: The timeout in redfishpower is currently hard coded to 60 seconds. However, some slow systems may require it to be longer than 60 seconds. Support a prompt command that will allow change of the power command timeout. Document new command in redfishpower(8). Update redfishpower test helper to parse settimeout.
Problem: If a user modifies the powerman timeout in a redfishpower device file, they may not be aware that it should also be modified in redfishpower. Add a call to settimeout in the login script of redfishpower devices. This can alert users that any changes to the powerman timeout should also be adjusted in redfishpower.
Problem: Due to the nature of redfishpower being a REST interface, many users are copying and pasting the redfishpower device files and slightly tweaking them for their systems. Users could use more information on how to do this. In addition, there are potential pitfalls on other systems that users need to be aware of. Add a section on modifications to redfishpower(8). Add pointers to this section in all redfishpower device files. Fixes chaos#67
7a1b145
to
b561dd4
Compare
Problem: Due to the nature of redfishpower being a REST interface, many users are copying and pasting the redfishpower device files and slightly tweaking them for their systems. Sometimes those changes may not operate exactly as desired.
Add some additional comments and warnings to all redfishpower device files. In addition, have the timeout of redfishpower configured in the login script section. That way if the timeout is updated by the user, they know to update it in redfishpower as well.
Fixes #67