Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

redfishpower: allow timeout to be set by device script #72

Merged
merged 3 commits into from
Jan 25, 2024

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Jan 21, 2024

Problem: Due to the nature of redfishpower being a REST interface, many users are copying and pasting the redfishpower device files and slightly tweaking them for their systems. Sometimes those changes may not operate exactly as desired.

Add some additional comments and warnings to all redfishpower device files. In addition, have the timeout of redfishpower configured in the login script section. That way if the timeout is updated by the user, they know to update it in redfishpower as well.

Fixes #67

@chu11 chu11 force-pushed the redfishpower_timeout branch 2 times, most recently from 885a467 to 44a4647 Compare January 22, 2024 19:28
@chu11
Copy link
Member Author

chu11 commented Jan 22, 2024

re-pushed with some minor tweaks on types ... forgot the fields in timeval are signed longs. Also did an integer overflow check. Maybe it's excessive and unnecessary, but with a future change, it felt important for consistency.

@@ -234,7 +234,8 @@ _prompt_loop(void)
|| !strcmp(buf, "setstatpath")
|| !strcmp(buf, "setonpath")
|| !strcmp(buf, "setoffpath")
|| !strcmp(buf, "setcyclepath")) {
|| !strcmp(buf, "setcyclepath")
|| !strcmp(buf, "settimeout")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on your note in #67 where on and off took very different amounts of time, does it make sense to have a per-command timeout rather than just one?

I can't tell from the code whether "cycle" is covered by this timeout?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In redfishpower land, yes "cycle" is covered by this timeout. But in powerman land, it depends if "cycle" is native or not.
If "cycle" is an "off" and then "on", then the timeout is for a single off/on. powerman's 60 second timeout would encompass the entire "off" and "on" together.

hmmm ... interesting idea. I'm going back and forth on it. It's a good idea, but presumably we only need to cover the worst case timeout. Since powerman would normally return faster for the faster cases anyways.

Also, i like the idea of making powerman's timeout and redfishpower's timeout the same, to make things easier for configuration. But I see your point ... if you're modifying device files, maybe the user is advanced enough already?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to think all's fair in the device files as long as we document what's going on.

I think we have to make sure that a redfishpower timeout is interpreted in some way as failed completion by the script rather than failure to produce expected output as the latter would cause the powerman timeout to kick in and redfishpower to be restarted, only after it expires. I know we can fail a status query right away but I'm not so sure about an on/off operation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally there would be no timeouts but if we are going to occasionally fail I think we want to do it as fast as possible so powerman can become responsive again...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have to make sure that a redfishpower timeout is interpreted in some way as failed completion by the script rather than failure to produce expected output

Right now it'll output something like node1: timeout, which is what I did in ipmipower. Would that trigger the scriptlet in the expected way?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure we have any way to fail a power operation other than to let the powerman timeout expire. E.g. if you return "timeout" and it is not expected, that will eventually fail. But it won't fail fast. I'll open a bug.

Comment on lines 13 to 15
# - CAUTION: If you intend to use this file as the basis for a different
# Redfish system, please note the following.
#
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thought to avoid the repetition would be to add a section to the redfishpower(8) man page and in each dev file just replace the whole block with a pointer to the man page?

Copy link
Member Author

@chu11 chu11 Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea, i forgot I had a redfishpower manpage :-) (so it needs to be updated with the new prompt additions too)

@chu11 chu11 changed the title etc: add redfishpower modification warnings man: add redfishpower modification info Jan 24, 2024
@chu11
Copy link
Member Author

chu11 commented Jan 24, 2024

re-pushed, putting all the information and more "how to find the right paths" info in the manpage.

For the time being, i elected to let it just be settimeout as a configuration command, since I think that is the common case. If it's necessary in the future, we can create a setOntimeout and setOffTimeout, and settimeout will simply be shorthand command to do both.

@@ -33,6 +33,8 @@ specification "redfishpower-cray-r272z30" {
expect "redfishpower> "
send "setcyclepath redfish/v1/Systems/Self/Actions/ComputerSystem.Reset {\"ResetType\":\"ForceRestart\"}\n"
expect "redfishpower> "
send "settimeout 60\n"
Copy link
Member

@garlick garlick Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hang on, how does this even work? If the redfish timeout is reached, then the on/off command is not going to return a string that matches what the device script expects, so powerman will sit there until its timeout is reached, and then restart redfishpower. Is there any point to having a timeout for on/off commands in redfishpower if this is the case?

It's a different story with status since that can be made to return unknown and be matched in the expect part of the script but not assigned as a plug state (leaving it unknown). So it could have its own timeout that is much lower than the powerman timeout to make sure that returns quickly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hang on, how does this even work? If the redfish timeout is reached, then the on/off command is not going to return a string that matches what the device script expects, so powerman will sit there until its timeout is reached, and then restat redfishpower. Is there any point to having a timeout for on/off commands in redfishpower if this is the case?

Lets say the average time it takes to turn on and verify something is on is 70 seconds. With redfishpower's current 60 second timeout, an "on" will never work. It'll report "timeout" all the time. So we need to have a way to increase it so that it can appropriately wait a little longer and report "ok" when the "on" is done.

So I think in this made up example, we want powerman's timeout and redfishpower's timeout to be (lets say) 90 seconds. powerman's timeout is 90 seconds so it won't give up on the script too quickly. and redfishpower's increased timeout is to make sure it has the right amount of time to wait.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but why have both timeouts? There is no way for the redfish timeout to occur and not also trigger the powerman timeout. At that point, may as well just use the powerman timeout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is the redfishpower timeout is independent of the powerman timeout.

Using the 70 second "on" example earlier. If powerman timeout is 90 seconds and redfishpower's timeout stays at the default 60 seconds, then "on" will never work. Redfish will always report "timeout" to powerman. We want to set redfishpower's timeout to (lets say) 90 seconds, so it has the chance to report "success" to powerman.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wait, i think I get what you're saying now. Are you basically saying don't have a timeout in redfishpower? Just have it spin infinitely for "on" or "off" to complete? And if powerman times out, it'll kill redfishpower?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what we're going to get done in the near term, but I agree #76 should be high on our list.

I got sort of lost here. Right now does a power command succeed (exit 0) when the command times out underneath? If so that's an egregious bug IMHO.

Copy link
Member Author

@chu11 chu11 Jan 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my understanding is that "command completed successfully" is always output.

Example me setting redfishpower timeout to 20 seconds, powerman timeout to 60 seconds, "on" normally takes around 40 seconds to do an "on".

$ time pm -T --off hetchy33; time pm -T --on hetchy33
send(redfishpower-peaks): 'off phetchy33\n'
recv(redfishpower-peaks): 'phetchy33: ok\nredfishpower> '
Command completed successfully
0.000u 0.001s 0:07.97 0.0%	0+0k 0+8io 0pf+0w

send(redfishpower-peaks): 'on phetchy33\n'
recv(redfishpower-peaks): 'phetchy33: timeout\nredfishpower> '
Command completed successfully
0.000u 0.001s 0:21.30 0.0%	0+0k 0+0io 0pf+0w

I suspect it's due to the way the scriptlet is normally written?

	script on_ranged {
		send "on %s\n"
		expect "redfishpower> "
	}

b/c "redfishpower > " is eventually seen, it's always "command completed successfully"? Related to discussion in #79??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK IMHO that needs to be fixed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this issue covered by #79? Or do you feel is different issue?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll go ahead and open a separate issue. The problem is that the expect script allows failure to be interpreted as success. It should only expect success and fail otherwise (after timeout, unfortunately, but that's what we have to work with right now).

@chu11
Copy link
Member Author

chu11 commented Jan 25, 2024

just rebased, fixing up random conflicts from ongoing work

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I think we decided offilne that this can go in just so that it's possible to increase the overall timeout in a dev script beyond 60s (by increasing both the powerman timeout and this timeout).

@garlick garlick changed the title man: add redfishpower modification info redfishpower: allow timeout to be set by device script Jan 25, 2024
Problem: The timeout in redfishpower is currently hard coded to 60
seconds.  However, some slow systems may require it to be longer
than 60 seconds.

Support a prompt command that will allow change of the power command
timeout.  Document new command in redfishpower(8).  Update redfishpower
test helper to parse settimeout.
Problem: If a user modifies the powerman timeout in a redfishpower
device file, they may not be aware that it should also be modified
in redfishpower.

Add a call to settimeout in the login script of redfishpower devices.
This can alert users that any changes to the powerman timeout should
also be adjusted in redfishpower.
Problem: Due to the nature of redfishpower being a REST interface,
many users are copying and pasting the redfishpower device files
and slightly tweaking them for their systems.  Users could use
more information on how to do this.  In addition, there are potential
pitfalls on other systems that users need to be aware of.

Add a section on modifications to redfishpower(8).  Add pointers to
this section in all redfishpower device files.

Fixes chaos#67
@mergify mergify bot merged commit 0e76a2c into chaos:master Jan 25, 2024
2 checks passed
@chu11 chu11 deleted the redfishpower_timeout branch January 25, 2024 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

redfish: add notes about changing device files and increasing timeout
2 participants