Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

redfishpower: adapt status polling interval #167

Merged
merged 1 commit into from
Apr 8, 2024

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Mar 26, 2024

Problem: The status polling interval is hard coded to 1 second long. This can result in an excessive number of polling messages being sent when it is known that some hardware takes 20-50 seconds to complete a power operation.

Solution: Support a modified "exponential backoff" of the status polling interval. The modified algorithm is based on observations of how long it typically takes to complete power operations on hardware. The status polling interval begins at one second, but it gets capped at 4 seconds.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would an exponential backoff with a slower start allow you to avoid the need for the more logic based approach you've got ? E.g.

#include <stdio.h>
int main(int argc, char **argv)
{
	for (double d = 1.0; d < 10; d *= 1.3)
		printf ("%.2f\n", d);
}

which gets you

1.00
1.30
1.69
2.20
2.86
3.71
4.83
6.27
8.16

It seems like a fixed polling interval is unlikely to be anybody's choice and we could skip that?

Thoughts?

@chu11
Copy link
Member Author

chu11 commented Mar 28, 2024

It seems like a fixed polling interval is unlikely to be anybody's choice and we could skip that?

It was half added just as an "emergency" in case some random board we come upon doesn't quite meet the adapted algorithm I came up with. That commit could be removed, it isn't super necessary./

would an exponential backoff with a slower start allow you to avoid the need for the more logic based approach you've got ? E.g.

hmmmm, I guess that could work. I guess on the boards I've seen that are 47-50 seconds, I'm just afraid of the exponential backoff growing too large around that point in time, i.e. the exponential backoff has grown to 10+ seconds. In your example it would have grown to ~13 seconds and the overall wait would be 56 seconds. With my current adapted one we'd wait 48 or 52 seconds.

I think it's sort of a draw? The adapted one was calculated specifically for the cases I've seen.

@garlick
Copy link
Member

garlick commented Mar 28, 2024

Oh I was figuring you would cap the delay at 10s or so (or whatever makes sense). If we need to make it tunable, then I would add a command that lets you sets the start, multiplicative factor, and cap instead of only allowing only a non-adaptive delay to be configured. But I'm not sure its needed.

@chu11
Copy link
Member Author

chu11 commented Mar 28, 2024

Oh I was figuring you would cap the delay at 10s or so (or whatever makes sense)

ahhh ok. Then perhaps capping it at ... ehhh 5 seconds or so should be a good balance. maybe the multiplicative would be like 1.1 or 1.2 then. I'll play around with it.

@chu11 chu11 force-pushed the redfishpower_status_polling branch 2 times, most recently from fed746e to a3a211b Compare April 2, 2024 22:26
@chu11
Copy link
Member Author

chu11 commented Apr 2, 2024

re-pushed, removing the setstatuspollinginterval configuration, which we deemed unnecessary.

I ended up just keeping the "logic based" exponential backoff, b/c doing the 1.3X (or similar) multiplier ended up not quite as logically simple as we would have thought (i.e. "if this is the first poll, it is 1 second", "if this poll is > X seconds, cap at X seconds", etc. logic added just as much logic)

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I made the same comment multiple times - we just need to document the specific hardware that this algorithm was developed for IMHO.

Comment on lines 78 to 79
* shows wait ranges from a few seconds to 20 seconds
* shows wait for on/off to complete ranges from a few seconds to 50
* seconds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this test is very hardware specific the specific hardware/firmware should be mentioned.

Comment on lines 121 to 117
* timeout - when the overall power command times out
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commit message should say what hardware the observations were based on and perhaps have a bit more detail about the observations themselves? Without that context, when some hardware that behaves differently comes a long, we can't make an informed decision about how to move forward.

Comment on lines 960 to 962
/* testing a range of hardware shows that the amount of time it
* takes to complete an on/off falls into two bands. Either it
* completes in the 3-6 second range OR it takes 20-50 seconds.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here about hardware specifics.

@chu11 chu11 force-pushed the redfishpower_status_polling branch from a3a211b to ebb1c36 Compare April 8, 2024 19:00
Problem: The status polling interval is hard coded to 1 second long.
This can result in an excessive number of polling messages being sent
when it is known that some hardware takes 20-60 seconds to complete
a power operation.

As an example, on an HPE Cray Supercomputing EX chassis, the power on
of a node takes around 50 seconds, while a power off takes around 6 seconds.

Solution: Support a modified "exponential backoff" of the status polling
interval.  The modified algorithm is based on observations of how long it
typically takes to complete power operations on hardware.  The status
polling interval begins at one second, but it gets capped at 4 seconds.
@chu11 chu11 force-pushed the redfishpower_status_polling branch from ebb1c36 to 2d295d9 Compare April 8, 2024 19:02
@chu11
Copy link
Member Author

chu11 commented Apr 8, 2024

re-pushed

  • added some big detailed notes in the comment block
+     * Some example timings from a HPE Cray Supercomputing EX Chassis
+     *
+     * - Turn switch off - 1.18 seconds
+     * - Turn switch on - 4.5 seconds
+     * - Turn blade off - 1.18 seconds
+     * - Turn blade on - 3.76 seconds
+     * - Turn node off - 6.86 seconds
+     * - Turn node on - 54.53 seconds
+     *
+     * (achu: Going off memory, the Supermicro H12DSG-O-CPU took
+     * around 20 seconds for on/off.)
  • added some small notes into the commit message
  • removed the extra commit w/ the comment tweak and squashed that into the big one since the new big comment covers everything.

@garlick
Copy link
Member

garlick commented Apr 8, 2024

Perfect - thanks!

@mergify mergify bot merged commit 47a7071 into chaos:master Apr 8, 2024
8 checks passed
@chu11 chu11 deleted the redfishpower_status_polling branch April 8, 2024 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants