powerman: support new setresult directive #168

chu11 · 2024-03-26T19:59:26Z

Problem: In many cases, there is no way for a power operation (i.e. power on, but not power status) to inform powerman that an error has occurred in the operation. The user will always get a "Command completed successfully" output and exit status 0 after
issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman that a power operation did not succeed. A regex can be used to determine what output is expected of a successful power operation. If any are not successful, powerman can subsequently inform the user an error has occurred, leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Side note, PR #167 and this are part of the general "cray EX" PR series, but they can be reviewed independently of each other)

garlick

This is seriously nice!

This is hard to review so I kind of did a skim and expect that we'll shake anything that is wrong out in testing. It would be good to cover this with valgrind.

I'm not super enthusiastic about accumulating many one-off copies of that cray dev script in test. If we could generate simpler dev scripts for testing right in the test with a here-is that would be a little better IMHO.

This one is not a redfish thing so it might actually be nice to see if we could use the more generic vpcd instead of redfsihpower for the testing. Maybe just use this feature in the vpcd.dev and add in a way to simulate failure with a command line option.

Great that you pushed this through though!

garlick · 2024-03-28T22:23:15Z

src/powerman/arglist.h

@@ -17,11 +17,16 @@
 #define PM_ARGLIST_H


In the commit message:

Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation. The user will always get a
"Command completed successfully" output and exit status 0 after
issuing a power operation.

That's not really correct. Powerman was designed with the idea that you could "expect" a string that indicates success. In the absence of that, there is a timeout and an error. What this does is add a "fast fail" so you can match an error string and avoid the timeout.

I don't think this commit message was fixed?

I did tweak it, but lets try again

Problem: Powerman device files can be scripted to "expect" a string to indicate success. In the event of a failure (i.e. an unexpected string output), the device script will ultimately time out. This is undesirable as it can slow down powerman when simple errors (e.g. "password failure") fail immediately. The lack of the ability to "fail fast" has lead to many device scripts being scripted to simply "succeed" most of the time. Powerman will complete with "Command completed successfully" regardless if the power operation was a success or not.

garlick · 2024-03-28T22:25:52Z

src/powerman/device.c

+                if (arg->val)
+                    xfree(arg->val);


no need to check for non-NULL before free

garlick · 2024-03-28T22:26:04Z

src/powerman/device.c

+        if (str)
+            xfree(str);


chu11 · 2024-03-28T22:53:20Z

This one is not a redfish thing so it might actually be nice to see if we could use the more generic vpcd instead of redfsihpower for the testing. Maybe just use this feature in the vpcd.dev and add in a way to simulate failure with a command line option.

Yeah, I did look at vcpd and didn't fully grok its output format, so just defaulted to redfishpower. Lemme not be so lazy and look to put it into vcpd.

Edit: actually we'd need some vpc dev files to cover singlet, range, and all variants.

garlick · 2024-04-02T17:47:33Z

Edit: actually we'd need some vpc dev files to cover singlet, range, and all variants.

Any reason not to add that to the main vpc.dev?

chu11 · 2024-04-02T17:53:15Z

Any reason not to add that to the main vpc.dev?

i don't think powerman can cover all 3. skimming code real quick, if "ranged" is available, singlet will never be covered.

garlick · 2024-04-02T18:46:26Z

Tell me if I'm off base but could we change the powerman logic so if a device script defines singlet and ranged, singlet is preferred when there is one plug? If it would avoid needing to copy and slightly change device scripts in test, it might be a useful change?

chu11 · 2024-04-02T19:11:33Z

now that you mention, it that would be a minor and probably useful change.

chu11 · 2024-04-02T21:15:25Z

Tell me if I'm off base but could we change the powerman logic so if a device script defines singlet and ranged, singlet is preferred when there is one plug? If it would avoid needing to copy and slightly change device scripts in test, it might be a useful change?

Can you think of a super obvious way this can be tested? Because we go off of the power control output, there's no obvious way to discern if "all" vs "ranged" were used (or if 1 node was specified, "singlet" vs "ranged").

Edit: off the top of my head, all I could think of doing is have the "on_all" vs "on_ranged" vs "on" device scripts call different commands. But that seems like a lot of work for that testing.

garlick · 2024-04-02T22:25:19Z

Maybe have vpc dump a count of the number of times each script was called to stderr and capture it in sharness?

chu11 · 2024-04-04T17:51:26Z

re-pushed, this is now on top of PR #170

redid all of the tests under vpc, and found a mem-leak to boot.

@garlick could you skim the test changes? Since they are quite different / new.

garlick · 2024-04-04T19:03:54Z

This turned out really nice! I added a couple of comments.

Problem: in the near future we would like to create a PlugList from a List of Plugs. It is currently inconvenient to create such a list given the current PlugList API. Add a function pluglist_copy_from_list() function to create a PlugList from a List of Plugs.

Problem: In powerman device files the only "interpretation" that is done is the output of a power status query via the "setplugstate" statement. Other interpretations will be added in the future against other power operations. Internally anything associated with "setplugstate" is called an "interpretation" (or "interp" or "interps", etc.). This will get confusing with future interpretations. Solution: Rename any "interpretation" related code associated with "setplugstate" to "state interpretation" (i.e. "Interp" to "StateInterp", "interp" to "state_interp", etc.).

Problem: Several variables are checked for NULL before calling xfree(). This is unnecessary. Remove NULL checks before calling xfree().

Problem: Powerman device files can be scripted to "expect" a string to indicate success. In the event of a failure (i.e. an unexpected string output), the device script will ultimately time out. This is undesirable as it can slow down powerman when simple errors (e.g. "password failure") fail immediately. The lack of the ability to "fail fast" has lead to many device scripts being scripted to simply "succeed" most of the time. Powerman will complete with "Command completed successfully" regardless if the power operation was a success or not. Solution: Support a new "setresult" statement that can inform powerman that a power operation did not succeed. A regex can be used to determine what output is expected of a successful power operation. If any are not successful, powerman can immediately report to the user an error has occurred, leading to a "Command completed with errors" message and exit status 1. Some example uses: script on_all { send "on\n" foreachnode { expect "([^\n:]+): ([^\n]+\n)" setresult $1 $2 success="^ok\n" } expect "redfishpower> " } script on { send "on %s\n" expect "([^\n:]+): ([^\n]+\n)" setresult $1 $2 success="^ok\n" expect "redfishpower> " } Fixes chaos#79

Problem: The new setresult statement is not documented ni powerman.dev(5) Add information about the new device statement.

Problem: There is no way to test setresult with vpcd. Update vpc.dev to use setresult in on/off scripts. Update vpcd to output "OK" for each plug in on/off scripts so foreachplug can be used.

Problem: Sometimes it would be useful to have errors reported from vpcd, but there is no way to do that. Solution: Add a --bad-plug option to vpcd. It will ensure that power operations to that specific plug never work. Update vpc.dev to handle the new possible output of "ERROR".

Problem: The new setresult statement is not covered in the testsuite. Add tests in a new t0035-power-result.t.

chu11 mentioned this pull request Mar 27, 2024

redfishpower: fix memleaks and test under valgrind #169

Merged

garlick approved these changes Mar 28, 2024

View reviewed changes

chu11 mentioned this pull request Apr 3, 2024

powerman: use singlet script if targeting one plug #170

Merged

chu11 force-pushed the powerman_setresult branch 2 times, most recently from c486c47 to 050d20d Compare April 4, 2024 17:50

chu11 force-pushed the powerman_setresult branch from 050d20d to 55f172d Compare April 4, 2024 17:52

chu11 added 8 commits April 4, 2024 12:21

powerman: remove unnecessary NULL checks

71ea403

Problem: Several variables are checked for NULL before calling xfree(). This is unnecessary. Remove NULL checks before calling xfree().

man: document setresult in powerman.dev(5)

612d01e

Problem: The new setresult statement is not documented ni powerman.dev(5) Add information about the new device statement.

t: support setresult in vpc.dev

5183da3

Problem: There is no way to test setresult with vpcd. Update vpc.dev to use setresult in on/off scripts. Update vpcd to output "OK" for each plug in on/off scripts so foreachplug can be used.

t: cover new setresult device statement

8241ce4

Problem: The new setresult statement is not covered in the testsuite. Add tests in a new t0035-power-result.t.

chu11 force-pushed the powerman_setresult branch from 55f172d to 8241ce4 Compare April 4, 2024 19:30

chu11 mentioned this pull request Apr 4, 2024

should we add "setresult" to older powerman device files #171

Open

chu11 added the merge-when-passing label Apr 4, 2024

mergify bot merged commit 8a8a848 into chaos:master Apr 4, 2024
8 checks passed

chu11 deleted the powerman_setresult branch April 4, 2024 21:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

powerman: support new setresult directive #168

powerman: support new setresult directive #168

chu11 commented Mar 26, 2024

garlick left a comment

garlick Mar 28, 2024

garlick Apr 4, 2024

chu11 Apr 4, 2024

garlick Mar 28, 2024

garlick Mar 28, 2024

chu11 commented Mar 28, 2024 •

edited

Loading

garlick commented Apr 2, 2024

chu11 commented Apr 2, 2024

garlick commented Apr 2, 2024

chu11 commented Apr 2, 2024

chu11 commented Apr 2, 2024 •

edited

Loading

garlick commented Apr 2, 2024

chu11 commented Apr 4, 2024

garlick commented Apr 4, 2024

powerman: support new setresult directive #168

powerman: support new setresult directive #168

Conversation

chu11 commented Mar 26, 2024

garlick left a comment

Choose a reason for hiding this comment

garlick Mar 28, 2024

Choose a reason for hiding this comment

garlick Apr 4, 2024

Choose a reason for hiding this comment

chu11 Apr 4, 2024

Choose a reason for hiding this comment

garlick Mar 28, 2024

Choose a reason for hiding this comment

garlick Mar 28, 2024

Choose a reason for hiding this comment

chu11 commented Mar 28, 2024 • edited Loading

garlick commented Apr 2, 2024

chu11 commented Apr 2, 2024

garlick commented Apr 2, 2024

chu11 commented Apr 2, 2024

chu11 commented Apr 2, 2024 • edited Loading

garlick commented Apr 2, 2024

chu11 commented Apr 4, 2024

garlick commented Apr 4, 2024

chu11 commented Mar 28, 2024 •

edited

Loading

chu11 commented Apr 2, 2024 •

edited

Loading