Add option for minimal reboot period #904

leonnicolas · 2024-02-29T16:16:38Z

The flag --min-reboot-period can be used to define the minimal duration
between reboots of a node.


ckotzbauer

Thanks for your PR.

cmd/kured/main.go

- only warn on misconfiguration
- use `node.Annotations` not `node.GetAnnotations()`
- explicitly check `annotate-node` value before reading an annotation

Signed-off-by: leonnicolas <leonloechner@gmx.de>

leonnicolas · 2024-03-01T08:51:13Z

Thanks @ckotzbauer for the super fast review! I pushed a commit according to your comments.

leonnicolas · 2024-03-06T08:21:28Z

Can I do anything about the failing e2e tests?

jackfrancis · 2024-03-06T17:29:44Z

cmd/kured/main.go

+		}
+
+		if lastSuccessfulRebootWithinMinRebootPeriod(node) {
+			log.Infof("Last successful reboot within minimal reboot period")


I think it would be preferable to log the next time that a reboot will be allowed. So maybe we convert lastSuccessfulRebootWithinMinRebootPeriod() into a "getNextAllowedRebootTime()" func, and then do something like

    nextRebootTime := getNextAllowedRebootTime(node)
    if time.Now().Before(nextRebootTime) {
        log.Infof("Not allowed to reboot until %s", nextRebootTime)
    }

Your example confuses me. Why add minRebootPeriod to the value returned by getNextAllowedRebootTime?

Either way, returning the next allowed reboot time (without adding minRebootPeriod) or returning the last successful reboot time (and adding minRebootPeriod) sounds good to me.

But what should we do if we don't know the last reboot time? Right now we just return "you should reboot" when there is no last successful reboot time. Would the getNextAllowedRebootTime just return &time.Time{} if we don't know about the last successful reboot?

Sorry, copy/paste error in example (updated!).

In the instance where we aren't able to determine the last reboot time we can probably return time.Now(), and then update the statement to

    if time.Now().Compare(nextRebootTime) < 0 {
        log.Infof("Not allowed to reboot until %s", nextRebootTime)
    }

Are you OK with returning an error explicitly? Like

func nextRebootTime(node) (*time.Time, error)

So we don't have to return time.Now() when we don't know.

For example, we could do

    if minRebootPeriod > 0 {
        if t, err := nextAllowedReboot(node); err != nil {
            log.Warnf("Failed to determine next allowed reboot time: %s", err.Error())
        } else if diff := t.Sub(time.Now()); diff > 0 {
            log.Infof("Reboot not allowed until %s (%s)", t.String(), diff.String())
        }
    }

If we don't care about logging the duration (how long until we can reboot), we should use time.Now().Before(t) instead, I guess. I am really bad at subtracting times in my head.

I see that it would work without returning an error. It would even take less code, but the users would never see the warning about a missing annotation or anything.

I'm happy with an error, but in the case of the annotation not being found, that's not actually an error, as this is a new feature and plenty of users won't configure kured to use it. So silently ignoring that and returning time.Now() seems reasonable.

We would only check the annotation if minRebootPeriod is configured to be larger than 0.

Right, if you suddenly turn on the feature you would get one "wrong" error log entry before the first reboot. After the first reboot with the --min-reboot-period flag, a missing annotation would actually be a real error.

If we don't care so much about logging that error, I will totally agree to returning time.Now() if something unexpected happens.

> Right, if you suddenly turn on the feature you would get one "wrong" error log entry before the first reboot.

Could you clarify why this is so?

You run kured without the --min-reboot-period flag. Kured will not annotate the node with the weave.works/kured-last-successful-reboot annotation. Now you want to use the "new" feature, so you edit your manifest and roll out the new kured daemon set. Normally none of the pods will hold the lock checked in kured/cmd/kured/main.go (line 666 in d0bdc11):

    if holding(lock, &nodeMeta, concurrency > 1) {

That means they won't add the weave.works/kured-last-successful-reboot annotation to their nodes. We wouldn't want that anyway, because then restarting or rolling out the kured daemon set would falsify the last successful reboot time. Since the feature is now enabled, kured would check for the weave.works/kured-last-successful-reboot annotation, but fail since it is not there yet. So if we log an error when the annotation is missing, we would see one error log for each tick until the first reboot. A missing annotation is expected before the first reboot. After the first reboot, however, a missing annotation would be an unexpected error.

Only after the first reboot with the --min-reboot-period flag, this feature would actually work.

I am not sure what is right here. Should we never log an error on missing annotations, avoiding confusion before the first reboot at the cost of silencing real errors after the first reboot? Or the other way around? (I added a commit that logs the errors, but I am happy to change it.)

If we should not reboot because the last successful reboot is within the `min-reboot-period`, log the soonest reboot time. Note, this will also log an error if we cannot determine the last successful reboot time. However this error is expected when running kured for the first time with `--min-reboot-period`. The `last-successful-reboot` annotation will be added to the node after the first reboot with this feature enabled and then only unexpected error should be logged. Signed-off-by: leonnicolas <leonloechner@gmx.de>

ant31 · 2024-04-29T09:38:19Z

any blocker to have this merged?

github-actions · 2024-06-29T01:48:13Z

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

ant31 · 2024-06-29T14:27:07Z

How can I help?

ckotzbauer · 2024-07-08T12:28:01Z

@jackfrancis Can you maybe have a look again please?

evrardjp · 2024-08-07T14:21:04Z

I read the comments here above and the code.
Here are my $0.02:

If a reboot is not allowed because the last reboot was too recent, there is no reason to go into the rebootAsRequired function. The refactoring with the blockers was made for that. The minimal reboot period should therefore be a blocker instead. (A blocker should tell you right away that the reboot is not allowed, before going any further.)
Assuming the deployer chooses to annotate the nodes, I fail to see why anyone would consider your new annotation (of the last reboot) irrelevant. Hence, I believe that information could be exposed in all cases. Yes, it has a large impact on large clusters, but I don't see why it couldn't be covered in a release note. On top of that, it could be exposed for operational needs (prom monitoring).
If we consider that the annotation will always be present, it's okay to seed it when kured runs. I wouldn't do it every tick though. But the function can be used on the first run if the annotation does not exist yet, and the value of the annotation gets rewritten when you see fit (to be clarified for me: I don't know whether this needs to be recorded post-reboot or pre-reboot). To seed it, you have two choices. You could check and parse the content of /proc/uptime (permissions get in the way here) or set it to a "reasonable" value (the timestamp of "now", for example). The latter does not represent "real" uptime (it is a "fake" uptime) until the first successful reboot campaign, while the first approach represents real uptime. Of the two, I prefer the "fake" uptime: the code will be easier to maintain (comments would go a long way here) and less complex.
With that in mind, IMO a log at the maximum level of "warning" (info is fine) should be emitted on the first run of the daemonset ("Migrating to new kured version, adding annotation xxx"), and that's it. No errors should be logged.
Functionally, a "fake" uptime in the first cycle does not really hurt in the long run.

There are 4 cases I see (tell me if I am wrong):
A) min-reboot-period small and you want your first campaign of reboots happening soon
B) min-reboot-period large and you want your first campaign of reboots happening soon
C) min-reboot-period small but you don't want your first campaign of reboots happening soon (you rebooted recently)
D) min-reboot-period large and you don't want your first campaign of reboots happening soon

Case A: Everything is good, setting a fake value on the first kured boot is fine (value is small).
Case B: This is easily fixed by overriding the node annotation's value manually once.
Case C: Setting the annotation value cleverly would work too. It is equivalent to "don't start before".
Case D: Everything is good.

Did I miss something?

github-actions · 2024-10-07T02:01:24Z

This PR was automatically considered stale due to lack of activity. Please refresh it and/or join our slack channels to highlight it, before it automatically closes (in 7 days).

Use the `RebootBlocker` interface to implement the min-reboot-period feature. Signed-off-by: leonnicolas <leonloechner@gmx.de>

leonnicolas force-pushed the min-reboot-period branch from f398712 to ac20ae6 Compare February 29, 2024 16:18

Add option for minimal reboot period

12e3818

The flag --min-reboot-period can be used to define the minimal duration between reboots of a node. Signed-off-by: leonnicolas <leonloechner@gmx.de>

leonnicolas force-pushed the min-reboot-period branch from ac20ae6 to 12e3818 Compare February 29, 2024 16:19

leonnicolas mentioned this pull request Feb 29, 2024

Cron schedule for node reboots #891

Closed

ckotzbauer requested changes Feb 29, 2024


Add suggestions from code review

3fb674f

- only warn on misconfiguration
- use `node.Annotations` not `node.GetAnnotations()`
- explicitly check `annotate-node` value before reading an annotation

Signed-off-by: leonnicolas <leonloechner@gmx.de>

leonnicolas force-pushed the min-reboot-period branch from 9138acf to 3fb674f Compare March 1, 2024 08:49

leonnicolas requested a review from ckotzbauer March 4, 2024 16:34

jackfrancis reviewed Mar 6, 2024


github-actions bot added the no-pr-activity label Jun 29, 2024

github-actions bot removed the no-pr-activity label Jun 30, 2024

github-actions bot added the no-pr-activity label Oct 7, 2024

Merge branch 'main' into min-reboot-period

a4d71f6

leonnicolas force-pushed the min-reboot-period branch from aa627d9 to c4e56b9 Compare October 18, 2024 20:46

Use blockers

c4e56b9

Use the `RebootBlocker` interface to implement the min-reboot-period feature. Signed-off-by: leonnicolas <leonloechner@gmx.de>

github-actions bot removed the no-pr-activity label Oct 19, 2024
