Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

powerman: support new setresult directive #168

Merged
merged 8 commits into from
Apr 4, 2024

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Mar 26, 2024

Problem: In many cases, there is no way for a power operation (i.e. power on, but not power status) to inform powerman that an error has occurred in the operation. The user will always get a "Command completed successfully" output and exit status 0 after
issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman that a power operation did not succeed. A regex can be used to determine what output is expected of a successful power operation. If any are not successful, powerman can subsequently inform the user an error has occurred, leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}
script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Side note, PR #167 and this are part of the general "cray EX" PR series, but they can be reviewed independently of each other)

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is seriously nice!

This is hard to review so I kind of did a skim and expect that we'll shake anything that is wrong out in testing. It would be good to cover this with valgrind.

I'm not super enthusiastic about accumulating many one-off copies of that cray dev script in test. If we could generate simpler dev scripts for testing right in the test with a here-is that would be a little better IMHO.

This one is not a redfish thing so it might actually be nice to see if we could use the more generic vpcd instead of redfsihpower for the testing. Maybe just use this feature in the vpcd.dev and add in a way to simulate failure with a command line option.

Great that you pushed this through though!

@@ -17,11 +17,16 @@
#define PM_ARGLIST_H
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the commit message:

Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation. The user will always get a
"Command completed successfully" output and exit status 0 after
issuing a power operation.

That's not really correct. Powerman was designed with the idea that you could "expect" a string that indicates success. In the absence of that, there is a timeout and an error. What this does is add a "fast fail" so you can match an error string and avoid the timeout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this commit message was fixed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did tweak it, but lets try again

    Problem: Powerman device files can be scripted to "expect" a string to indicate
    success.  In the event of a failure (i.e. an unexpected string output), the device
    script will ultimately time out.  This is undesirable as it can slow down powerman
    when simple errors (e.g. "password failure") fail immediately.  The lack of the
    ability to "fail fast" has lead to many device scripts being scripted to simply "succeed"
    most of the time.  Powerman will complete with "Command completed successfully"
    regardless if the power operation was a success or not.

Comment on lines 1161 to 1162
if (arg->val)
xfree(arg->val);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to check for non-NULL before free

Comment on lines 1166 to 1167
if (str)
xfree(str);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@chu11
Copy link
Member Author

chu11 commented Mar 28, 2024

This one is not a redfish thing so it might actually be nice to see if we could use the more generic vpcd instead of redfsihpower for the testing. Maybe just use this feature in the vpcd.dev and add in a way to simulate failure with a command line option.

Yeah, I did look at vcpd and didn't fully grok its output format, so just defaulted to redfishpower. Lemme not be so lazy and look to put it into vcpd.

Edit: actually we'd need some vpc dev files to cover singlet, range, and all variants.

@garlick
Copy link
Member

garlick commented Apr 2, 2024

Edit: actually we'd need some vpc dev files to cover singlet, range, and all variants.

Any reason not to add that to the main vpc.dev?

@chu11
Copy link
Member Author

chu11 commented Apr 2, 2024

Any reason not to add that to the main vpc.dev?

i don't think powerman can cover all 3. skimming code real quick, if "ranged" is available, singlet will never be covered.

@garlick
Copy link
Member

garlick commented Apr 2, 2024

Tell me if I'm off base but could we change the powerman logic so if a device script defines singlet and ranged, singlet is preferred when there is one plug? If it would avoid needing to copy and slightly change device scripts in test, it might be a useful change?

@chu11
Copy link
Member Author

chu11 commented Apr 2, 2024

now that you mention, it that would be a minor and probably useful change.

@chu11
Copy link
Member Author

chu11 commented Apr 2, 2024

Tell me if I'm off base but could we change the powerman logic so if a device script defines singlet and ranged, singlet is preferred when there is one plug? If it would avoid needing to copy and slightly change device scripts in test, it might be a useful change?

Can you think of a super obvious way this can be tested? Because we go off of the power control output, there's no obvious way to discern if "all" vs "ranged" were used (or if 1 node was specified, "singlet" vs "ranged").

Edit: off the top of my head, all I could think of doing is have the "on_all" vs "on_ranged" vs "on" device scripts call different commands. But that seems like a lot of work for that testing.

@garlick
Copy link
Member

garlick commented Apr 2, 2024

Maybe have vpc dump a count of the number of times each script was called to stderr and capture it in sharness?

@chu11 chu11 force-pushed the powerman_setresult branch 2 times, most recently from c486c47 to 050d20d Compare April 4, 2024 17:50
@chu11
Copy link
Member Author

chu11 commented Apr 4, 2024

re-pushed, this is now on top of PR #170

redid all of the tests under vpc, and found a mem-leak to boot.

@garlick could you skim the test changes? Since they are quite different / new.

@garlick
Copy link
Member

garlick commented Apr 4, 2024

This turned out really nice! I added a couple of comments.

Problem: in the near future we would like to create a PlugList
from a List of Plugs.  It is currently inconvenient to create
such a list given the current PlugList API.

Add a function pluglist_copy_from_list() function to create a PlugList
from a List of Plugs.
Problem: In powerman device files the only "interpretation" that
is done is the output of a power status query via the
"setplugstate" statement.  Other interpretations will be added
in the future against other power operations.  Internally anything
associated with "setplugstate" is called an "interpretation" (or
"interp" or "interps", etc.).  This will get confusing with future
interpretations.

Solution:  Rename any "interpretation" related code associated
with "setplugstate" to "state interpretation" (i.e. "Interp" to
"StateInterp", "interp" to "state_interp", etc.).
Problem: Several variables are checked for NULL before calling
xfree().  This is unnecessary.

Remove NULL checks before calling xfree().
Problem: Powerman device files can be scripted to "expect" a string to indicate
success.  In the event of a failure (i.e. an unexpected string output), the device
script will ultimately time out.  This is undesirable as it can slow down powerman
when simple errors (e.g. "password failure") fail immediately.  The lack of the
ability to "fail fast" has lead to many device scripts being scripted to simply "succeed"
most of the time.  Powerman will complete with "Command completed successfully"
regardless if the power operation was a success or not.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can immediately report to the user an error has occurred,
leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
Problem: The new setresult statement is not documented ni powerman.dev(5)

Add information about the new device statement.
Problem: There is no way to test setresult with vpcd.

Update vpc.dev to use setresult in on/off scripts.  Update vpcd
to output "OK" for each plug in on/off scripts so foreachplug can
be used.
Problem: Sometimes it would be useful to have errors reported
from vpcd, but there is no way to do that.

Solution: Add a --bad-plug option to vpcd.  It will ensure that
power operations to that specific plug never work.  Update vpc.dev
to handle the new possible output of "ERROR".
Problem: The new setresult statement is not covered in the testsuite.

Add tests in a new t0035-power-result.t.
@mergify mergify bot merged commit 8a8a848 into chaos:master Apr 4, 2024
8 checks passed
@chu11 chu11 deleted the powerman_setresult branch April 4, 2024 21:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants