-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
powerman: support new setresult directive #168
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is seriously nice!
This is hard to review so I kind of did a skim and expect that we'll shake anything that is wrong out in testing. It would be good to cover this with valgrind.
I'm not super enthusiastic about accumulating many one-off copies of that cray dev script in test. If we could generate simpler dev scripts for testing right in the test with a here-is that would be a little better IMHO.
This one is not a redfish thing so it might actually be nice to see if we could use the more generic vpcd instead of redfsihpower for the testing. Maybe just use this feature in the vpcd.dev and add in a way to simulate failure with a command line option.
Great that you pushed this through though!
@@ -17,11 +17,16 @@ | |||
#define PM_ARGLIST_H |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the commit message:
Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation. The user will always get a
"Command completed successfully" output and exit status 0 after
issuing a power operation.
That's not really correct. Powerman was designed with the idea that you could "expect" a string that indicates success. In the absence of that, there is a timeout and an error. What this does is add a "fast fail" so you can match an error string and avoid the timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this commit message was fixed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did tweak it, but lets try again
Problem: Powerman device files can be scripted to "expect" a string to indicate
success. In the event of a failure (i.e. an unexpected string output), the device
script will ultimately time out. This is undesirable as it can slow down powerman
when simple errors (e.g. "password failure") fail immediately. The lack of the
ability to "fail fast" has lead to many device scripts being scripted to simply "succeed"
most of the time. Powerman will complete with "Command completed successfully"
regardless if the power operation was a success or not.
src/powerman/device.c
Outdated
if (arg->val) | ||
xfree(arg->val); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to check for non-NULL before free
src/powerman/device.c
Outdated
if (str) | ||
xfree(str); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
Yeah, I did look at vcpd and didn't fully grok its output format, so just defaulted to redfishpower. Lemme not be so lazy and look to put it into vcpd. Edit: actually we'd need some vpc dev files to cover singlet, range, and all variants. |
Any reason not to add that to the main vpc.dev? |
i don't think powerman can cover all 3. skimming code real quick, if "ranged" is available, singlet will never be covered. |
Tell me if I'm off base but could we change the powerman logic so if a device script defines singlet and ranged, singlet is preferred when there is one plug? If it would avoid needing to copy and slightly change device scripts in test, it might be a useful change? |
now that you mention, it that would be a minor and probably useful change. |
Can you think of a super obvious way this can be tested? Because we go off of the power control output, there's no obvious way to discern if "all" vs "ranged" were used (or if 1 node was specified, "singlet" vs "ranged"). Edit: off the top of my head, all I could think of doing is have the "on_all" vs "on_ranged" vs "on" device scripts call different commands. But that seems like a lot of work for that testing. |
Maybe have vpc dump a count of the number of times each script was called to stderr and capture it in sharness? |
c486c47
to
050d20d
Compare
050d20d
to
55f172d
Compare
This turned out really nice! I added a couple of comments. |
Problem: in the near future we would like to create a PlugList from a List of Plugs. It is currently inconvenient to create such a list given the current PlugList API. Add a function pluglist_copy_from_list() function to create a PlugList from a List of Plugs.
Problem: In powerman device files the only "interpretation" that is done is the output of a power status query via the "setplugstate" statement. Other interpretations will be added in the future against other power operations. Internally anything associated with "setplugstate" is called an "interpretation" (or "interp" or "interps", etc.). This will get confusing with future interpretations. Solution: Rename any "interpretation" related code associated with "setplugstate" to "state interpretation" (i.e. "Interp" to "StateInterp", "interp" to "state_interp", etc.).
Problem: Several variables are checked for NULL before calling xfree(). This is unnecessary. Remove NULL checks before calling xfree().
Problem: Powerman device files can be scripted to "expect" a string to indicate success. In the event of a failure (i.e. an unexpected string output), the device script will ultimately time out. This is undesirable as it can slow down powerman when simple errors (e.g. "password failure") fail immediately. The lack of the ability to "fail fast" has lead to many device scripts being scripted to simply "succeed" most of the time. Powerman will complete with "Command completed successfully" regardless if the power operation was a success or not. Solution: Support a new "setresult" statement that can inform powerman that a power operation did not succeed. A regex can be used to determine what output is expected of a successful power operation. If any are not successful, powerman can immediately report to the user an error has occurred, leading to a "Command completed with errors" message and exit status 1. Some example uses: script on_all { send "on\n" foreachnode { expect "([^\n:]+): ([^\n]+\n)" setresult $1 $2 success="^ok\n" } expect "redfishpower> " } script on { send "on %s\n" expect "([^\n:]+): ([^\n]+\n)" setresult $1 $2 success="^ok\n" expect "redfishpower> " } Fixes chaos#79
Problem: The new setresult statement is not documented ni powerman.dev(5) Add information about the new device statement.
Problem: There is no way to test setresult with vpcd. Update vpc.dev to use setresult in on/off scripts. Update vpcd to output "OK" for each plug in on/off scripts so foreachplug can be used.
Problem: Sometimes it would be useful to have errors reported from vpcd, but there is no way to do that. Solution: Add a --bad-plug option to vpcd. It will ensure that power operations to that specific plug never work. Update vpc.dev to handle the new possible output of "ERROR".
Problem: The new setresult statement is not covered in the testsuite. Add tests in a new t0035-power-result.t.
55f172d
to
8241ce4
Compare
Problem: In many cases, there is no way for a power operation (i.e. power on, but not power status) to inform powerman that an error has occurred in the operation. The user will always get a "Command completed successfully" output and exit status 0 after
issuing a power operation.
Solution: Support a new "setresult" statement that can inform powerman that a power operation did not succeed. A regex can be used to determine what output is expected of a successful power operation. If any are not successful, powerman can subsequently inform the user an error has occurred, leading to a "Command completed with errors" message and exit status 1.
Some example uses:
Side note, PR #167 and this are part of the general "cray EX" PR series, but they can be reviewed independently of each other)