Add delay and smarter verification between node restarts #12

Open
jahkeup opened this issue Nov 11, 2019 · 4 comments

Comments

@jahkeup
Member

jahkeup commented Nov 11, 2019

What I'd like:

Dogswatch should add some delay between the restarts of Nodes in a cluster. During this time, the Controller should check in with the Node that was just updated to confirm that it has come up healthy and that Workloads have returned to it. After that, the Controller should wait for a configurable duration before restarting the next Node.
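
Roughly, the recovery check plus delay could look something like the sketch below (client-go; names like `waitForNodeRecovery` and `restartDelay` are hypothetical, and the actual health signals are still up for design):

```go
// Minimal sketch of "verify the rebooted node, then wait before the next one".
// Not the dogswatch implementation; node names and the readiness criteria here
// are placeholders.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// restartDelay is the configurable pause between node restarts (hypothetical knob).
const restartDelay = 5 * time.Minute

// waitForNodeRecovery polls until the node reports Ready and at least one
// workload pod is running on it again.
func waitForNodeRecovery(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	for {
		node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
		if err != nil {
			return err
		}
		ready := false
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status == corev1.ConditionTrue {
				ready = true
			}
		}
		if ready {
			pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
				FieldSelector: "spec.nodeName=" + nodeName + ",status.phase=Running",
			})
			if err != nil {
				return err
			}
			if len(pods.Items) > 0 {
				return nil // node is Ready and workloads have returned
			}
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(15 * time.Second):
		}
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ctx := context.Background()
	for _, nodeName := range []string{"node-a", "node-b"} { // placeholder node list
		// ... trigger the update/reboot of nodeName here ...
		if err := waitForNodeRecovery(ctx, client, nodeName); err != nil {
			panic(err)
		}
		fmt.Printf("%s recovered; waiting %s before the next node\n", nodeName, restartDelay)
		time.Sleep(restartDelay)
	}
}
```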

@webern webern transferred this issue from bottlerocket-os/bottlerocket Feb 26, 2020
@jahkeup jahkeup changed the title from "dogswatch: add delay, verification between Node restarts" to "Add delay and smarter verification between node restarts" Feb 27, 2020
@samuelkarp
Contributor

This seems potentially related to before and after reboot checks.

@anguslees

anguslees commented Aug 21, 2020

Indeed. In case you want concrete use cases for before/after reboot checks, I use them (with the coreos/flatcar update operator) to delay until the rook/ceph cluster is healthy[1], and to signal to rook/ceph that the storage cluster should set the "noout" flag[2]. The after-reboot check clears the noout flag and blocks until the cluster is healthy again.

[1] e.g. data is replicated sufficiently. This signal is "global" and much more complex than what a single pod readinessProbe can represent, which is why it can't be just a PodDisruptionBudget. A better implementation might only consider the redundancy of the data on "this" node. In particular, a naive time "delay", or a generic check that pods were running again (as suggested in the issue description), would not be sufficient here.

[2] noout means the node outage that is about to happen is expected to be brief, and rook should not start frantically re-replicating "lost" data onto new nodes.

This wasn't my idea at all; the standard rook docs for this are: https://rook.io/docs/rook/v1.4/container-linux.html
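
For illustration, the shape of those hooks is roughly as below. The rook docs use shell scripts; this is just a loose Go equivalent, and it assumes the ceph CLI is reachable (e.g. via the rook toolbox):

```go
// Rough sketch of a before/after reboot hook for a rook/ceph cluster.
// Assumes `ceph` is on PATH and the hook mode is passed as the first argument.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// cephHealthy runs `ceph health` and reports whether the cluster is HEALTH_OK.
func cephHealthy() bool {
	out, err := exec.Command("ceph", "health").Output()
	return err == nil && strings.HasPrefix(strings.TrimSpace(string(out)), "HEALTH_OK")
}

// waitHealthy blocks until the ceph cluster reports HEALTH_OK.
func waitHealthy() {
	for !cephHealthy() {
		fmt.Println("ceph not healthy yet; waiting")
		time.Sleep(30 * time.Second)
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: hook before-reboot|after-reboot")
		os.Exit(1)
	}
	switch os.Args[1] {
	case "before-reboot":
		// Block until data is sufficiently replicated, then tell ceph the
		// upcoming outage is brief so it doesn't re-replicate "lost" data.
		waitHealthy()
		exec.Command("ceph", "osd", "set", "noout").Run()
	case "after-reboot":
		// Clear the flag and block until the cluster has recovered.
		exec.Command("ceph", "osd", "unset", "noout").Run()
		waitHealthy()
	}
}
```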

Having used this for a long time now, it works great. What might not be obvious at first is that the reboot script itself is deployed as a daemonset limited to nodes with the "before-reboot" label. That means it automatically "finds" and installs itself only on the relevant nodes, and only for the relevant time, which is pretty neat. Debugging the system when updates are not proceeding does require an understanding of the various state machine interactions though, of course.

I would expect very similar challenges exist for something like an elasticsearch cluster, where data replication is important and also not represented in the "health" of any specific container. I agree this probably points to a missing feature in PodDisruptionBudget, since it is still fundamentally a question of "is it ok to make $pod unavailable now".

@chancez

chancez commented Sep 2, 2020

I'm not sure about the best approach, but one of my use cases is jupyterhub notebook pods. These pods can't be interrupted, but we regularly cull inactive/idle ones. I'd like to be able to cordon the node that needs updating and wait for the notebook pods to be stopped (which could be a while) before allowing the node to be rebooted. This might be done using a tool like https://github.com/planetlabs/draino, but the update-operator would need coordination.
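
Roughly what I have in mind, as a sketch with client-go (the notebook pod label and node name here are hypothetical):

```go
// Sketch of "cordon the node, then wait until the culler has removed all
// notebook pods before the operator is allowed to reboot it".
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// cordonAndWait marks the node unschedulable, then blocks until no notebook
// pods remain on it.
func cordonAndWait(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// Cordon the node so no new notebook pods land on it.
	patch := []byte(`{"spec":{"unschedulable":true}}`)
	if _, err := client.CoreV1().Nodes().Patch(ctx, nodeName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		return err
	}

	// Wait for the (possibly long-lived) notebook pods to be culled.
	for {
		pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
			LabelSelector: "app=jupyterhub-notebook", // hypothetical label
			FieldSelector: "spec.nodeName=" + nodeName,
		})
		if err != nil {
			return err
		}
		if len(pods.Items) == 0 {
			return nil // safe to reboot
		}
		fmt.Printf("%d notebook pods still on %s; waiting\n", len(pods.Items), nodeName)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Minute):
		}
	}
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := cordonAndWait(context.Background(), client, "node-to-update"); err != nil {
		panic(err)
	}
}
```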

@jahkeup
Member Author

jahkeup commented Sep 2, 2020

Thanks for sharing your use case and laying out what your ideal operation would look like.

This might be done using a tool like planetlabs/draino, but the update-operator would need coordination.

Draino looks very closely related to this problem space. The project appears to build on the Kubernetes autoscaler in order to accomplish its task. I'm curious what other projects are integrating with the autoscaler and what they use to enhance the features provided.

We'll likely check out both of these projects as the design is sketched out.

@gthao313 gthao313 modified the milestones: brupop 1.0.0, Backlog Oct 17, 2022