Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

provide a way for an individual power operation to fail without triggering powerman global device timeout #79

Closed
garlick opened this issue Jan 23, 2024 · 12 comments
Assignees

Comments

@garlick
Copy link
Member

garlick commented Jan 23, 2024

Problem: as noted in #72, there is no way for a coproc to communicate to a device script that the operation has failed, other than to not produce the expected output, then let the powerman action timeout expire. This prevents the operation from failing as fast as it could.

@garlick garlick changed the title provide a way for a device to indicate a failed power operation other than timeout provide a way for an individual power operation to fail without triggering powerman global device timeout Jan 23, 2024
@garlick
Copy link
Member Author

garlick commented Jan 24, 2024

Idea: allow there to be an optional second regex argument to the expect command.

  • The first regex works as it does now, where powerman blocks until it is matched (with timeout).
  • The second regex would apply an additional regex to the match from the first one. If this one matches, proceed. If it doesn't, abort the script and declare the operation failed.

So for example instead of

       script on_ranged {
                send "on %s\n"
                expect "redfishpower> "
        }

you might have

       script on_ranged {
                send "on %s\n"
                expect "([^\n]+)" "command successful\n"
                expect "redfishpower> "
        }

In the example, the purpose of the first regex is to block until a line has been read. The purpose of the second one is to decide if that indicated success or not. In this fictious example, "command successful" would be printed by redfishpower on success and something else would be printed on failure.

Powerman could report to the user that the action failed, without waiting for the powerman timeout, or restarting the coproc. This would let redfish produce errors that are immediately processed by powerman, rather than making everything look like a timeout.

@chu11
Copy link
Member

chu11 commented Jan 24, 2024

what would the behavior of powerman be in this scenaio:

redfishpower> on foo[1-4]
foo1: authentication failure
foo2: ok
foo3: ok
foo4: ok

would it fail on the first one? B/c we wouldn't want it to fail too quickly.

@garlick
Copy link
Member Author

garlick commented Jan 24, 2024

Why not? If something is broken with power control then it seems like you would want it to abort right away rather than take a long time to fail and still leave you in an unknown state.

@chu11
Copy link
Member

chu11 commented Jan 24, 2024

Why not? If something is broken with power control then it seems like you would want it to abort right away rather than take a long time to fail and still leave you in an unknown state.

So in my mind, I guess there are "normal ish" errors vs "bad errors". "authentication failure" is some minor config issue, maybe a new node was brought online and someone forgot to config something. In this case, "authentication failure" may return immediately and all other power control could fail as a result, even if it would have succeeded otherwise. (Edit: and when I say "fail" here, I mean do not do the power control they could have done).

And without knowing which one has this minor config issue, i guess power control simply may not work until it is resolved.

@garlick
Copy link
Member Author

garlick commented Jan 24, 2024

I think it's a fundamental assumption with powerman that if the command returns successfully, then the thing you asked for is unequivocally done.

@chu11
Copy link
Member

chu11 commented Jan 25, 2024

Thinking about this more. I think part of the issue is with ipmipower/redfishpower, passing in 100s or 1000s of nodes is pretty normal, and given the large cluster sizes, a few nodes being bad is to be expected.

ipmipower> on nodes[1-1000]
node1: ok
node3: ok
...
node587: authentication error
...
node888: busy

Perhaps something like this, stealing the pattern from stat.

       script on_ranged {
                send "on %s\n"
		foreachnode %s {
			expect "([^\n:]+): ([^\n]+\n)"
			setresult $1 $2 success="^ok\n" error="^(auth error|timeout|busy)\n"
		}
                expect "redfishpower> "
        }

@garlick
Copy link
Member Author

garlick commented Jan 25, 2024

Maybe although that would force redfishpower to produce a thousand lines of output that powerman has to individually parse, even when things are working. I'm not sure that helps?

@chu11
Copy link
Member

chu11 commented Jan 25, 2024

Maybe although that would force redfishpower to produce a thousand lines of output

It's already doing it :-) I guess in hindsight wasn't necessary, it was copying style from ipmipower.

that powerman has to individually parse, even when things are working. I'm not sure that helps?

Talking to the admins, I get the feelig that if powerman could do something like this:

$ pm --off foo[1-1000]
foo128: busy
foo972: authentication error

or whatever, that's what they really would like. A few nodes will probably always be bad and don't work, but knowing they didn't work on those few is important (and knowing when everything did work is also important). Because right now they don't know how many worked, if any failed, etc. so it's a guessing game.

(BTW, this example output and conversation is probably also bleeding into #76 issue ... dunno if needs to be merged together? Is solution to one problem the same for both?)

@chu11
Copy link
Member

chu11 commented Jan 26, 2024

@garlick and I chatted offline and think this is a good approach for the short-medium term.

Just wanted to throw out an idea I just thought of that's a tweak on the original idea above. How hard would something like this be?

       script on_ranged {
                send "on %s\n"
                summary_expect "summary: ok=$1  bad=$2\n"
                expect "redfishpower> "
        }

i.e. instead of a regex, a very specific string powerman is expecting. redfishpower would output a "summary output" at the end of all the power ons. And in this case, can inform powerman of the nodes that are ok or had errors in a pre-defined format.

if that's too difficult in the short term, perhaps

       script on_ranged {
                send "on %s\n"
                summary_expect
                expect "redfishpower> "
        }

where summary_expect is expecting something specific, possibly summary: ok or summary: error or something would be more than fine.

@garlick
Copy link
Member Author

garlick commented Jan 26, 2024

I was just having similar ideas! The regex thing is too confining and the dev script "language" is already a monstrosity that it feels lame to build on. I was wondering if we could just bypass it somehow. Maybe define a plugin that powermand could dlopen? Or even just a well defined protocol over the same type of communication channel as we have now?

Food for thought.

@chu11
Copy link
Member

chu11 commented Jan 26, 2024

Or even just a well defined protocol over the same type of communication channel as we have now?

Yeah, a "defined protocol" is more what I was thinking. Especially since redfishpower "lives" inside powerman we can change things more easily if we think it is best (vs ipmipower, which I think being around for almost 20 years, I'm not sure it's a good idea to change it ... atleast by default).

there's powerman internals i'm not super familiar with and unsure how far down the rabbit hole to go. On one end dlopen a plugin that does super special sauce or (per my example above) a few special pre-defined string formats that can communicate back to powerman what's going on. If we define just a few special string-formats, it wouldn't be that hard to add to ipmipower and have them enabled with some --powerman-special-sauce option.

Edit: oh, ii guess we could copy and build a special ipmipower inside powerman? they are both GPL, so no licensing issues I think

@garlick
Copy link
Member Author

garlick commented Jan 26, 2024

The test suite relies fairly heavily on "coproc" being able to work with the expect scripts so I think we'd just have to activate a special mode when we're using "the protocol". So there wouldn't be any urgency to update things like ipmipower. OTOH it might be helpful to do so later, especially if it suffers from the same problems as redfishpower.

@chu11 chu11 self-assigned this Mar 6, 2024
chu11 added a commit to chu11/powerman that referenced this issue Mar 7, 2024
Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation.  The user will always get a
"Command completed successfully" output after issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can subsequently inform the user an error has occurred,
leading to a "Command completed with errors" message.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
chu11 added a commit to chu11/powerman that referenced this issue Mar 7, 2024
Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation.  The user will always get a
"Command completed successfully" output and exit status 0 after
issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can subsequently inform the user an error has occurred,
leading to a "Command completed with errors" message.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
chu11 added a commit to chu11/powerman that referenced this issue Mar 7, 2024
Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation.  The user will always get a
"Command completed successfully" output and exit status 0 after
issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can subsequently inform the user an error has occurred,
leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
chu11 added a commit to chu11/powerman that referenced this issue Mar 16, 2024
Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation.  The user will always get a
"Command completed successfully" output and exit status 0 after
issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can subsequently inform the user an error has occurred,
leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
chu11 added a commit to chu11/powerman that referenced this issue Mar 19, 2024
Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation.  The user will always get a
"Command completed successfully" output and exit status 0 after
issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can subsequently inform the user an error has occurred,
leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
chu11 added a commit to chu11/powerman that referenced this issue Mar 21, 2024
Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation.  The user will always get a
"Command completed successfully" output and exit status 0 after
issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can subsequently inform the user an error has occurred,
leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
chu11 added a commit to chu11/powerman that referenced this issue Mar 26, 2024
Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation.  The user will always get a
"Command completed successfully" output and exit status 0 after
issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can subsequently inform the user an error has occurred,
leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
chu11 added a commit to chu11/powerman that referenced this issue Mar 26, 2024
Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation.  The user will always get a
"Command completed successfully" output and exit status 0 after
issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can subsequently inform the user an error has occurred,
leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
chu11 added a commit to chu11/powerman that referenced this issue Mar 26, 2024
Problem: In many cases, there is no way for a power operation (i.e.
power on, but not power status) to inform powerman that an error has
occurred in the operation.  The user will always get a
"Command completed successfully" output and exit status 0 after
issuing a power operation.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can subsequently inform the user an error has occurred,
leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
chu11 added a commit to chu11/powerman that referenced this issue Apr 4, 2024
Problem: In many cases, power operations cannot capture that an
error has occurred.  Due to how device scripts are scripted, in many
cases a user will get "Command completed successfully" regardless
if the power operation was a success or not.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can immediately report to the user an error has occurred,
leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
chu11 added a commit to chu11/powerman that referenced this issue Apr 4, 2024
Problem: In many cases, power operations cannot capture that an
error has occurred.  Due to how device scripts are scripted, in many
cases a user will get "Command completed successfully" regardless
if the power operation was a success or not.

Solution: Support a new "setresult" statement that can inform powerman
that a power operation did not succeed.  A regex can be used to determine
what output is expected of a successful power operation.  If any are not
successful, powerman can immediately report to the user an error has occurred,
leading to a "Command completed with errors" message and exit status 1.

Some example uses:

script on_all {
	send "on\n"
	foreachnode {
		expect "([^\n:]+): ([^\n]+\n)"
		setresult $1 $2 success="^ok\n"
	}
	expect "redfishpower> "
}

script on {
	send "on %s\n"
	expect "([^\n:]+): ([^\n]+\n)"
	setresult $1 $2 success="^ok\n"
	expect "redfishpower> "
}

Fixes chaos#79
@mergify mergify bot closed this as completed in cfc1972 Apr 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants