The agent should clean up "lost" systemd units #180
If I understand my experience correctly, this is a significant risk, since it means that on boot the old/wrong systemd unit may happen to start first, and that unit's application would then bind to a TCP socket so the intended application cannot start correctly. I see this now, since I have ended up with:
and so the journal is very noisy with the "wrong" zookeeper starting up and failing repeatedly due to the TCP port already being bound.
@pipern Yes, this issue has a high impact. The priority label and milestone reflect the urgency but not necessarily the severity.
Related to this, will the agent be periodically checking the systemd unit files (in a 'controller loop' fashion), such that if one is removed or edited by mistake, the agent will put it back?

I'm also thinking about which controller is the one doing the actual monitoring/controlling. If a pod stops, then, as I understand it, the kubelet reports this to the controller and the controller schedules a replacement. In stackable, if the daemon (e.g., zookeeper) stops, then actually systemd will restart it, so the krustlet would not report this to k8s. Do I understand correctly?

Similarly, on boot, "normally" in k8s the scheduler would allocate a pod to a node; so without kubelet running, the server would not run any applications. But in stackable, systemd will start up the applications regardless of the state of the krustlet or k8s. If the node has been offline a while, it could start up with out-of-date configuration, since krustlet hasn't had new information from the api-server while it was offline.

This relates to https://docs.stackable.tech/home/adr/ADR005-systemd_unit_file_location.html - did you consider having the unit files on volatile storage, so that during a reboot they are intentionally lost and the operator is the one who decides whether an application should start or not?

(https://github.com/stackabletech/agent/discussions is empty, but maybe that is the place for this. Shall I make a post there?)
@pipern Discussions is probably better suited for your questions because they are outside the scope of this issue. I am looking forward to answering them there.
Moved over to #303 |
We need to introduce some identifier which the agent can use to identify all services that were created by this agent.
This might be a fixed prefix, putting all services in a specific slice, or something similar.
Currently, when the agent crashes and a pod is removed from Kubernetes, the associated systemd unit becomes orphaned and is never cleaned up.
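A minimal sketch of the kind of cleanup such an identifier would enable, assuming a hypothetical `stackable-` unit-name prefix and shelling out to `systemctl`; the real agent's identifier, unit file locations, and error handling would of course look different:

```rust
use std::collections::HashSet;
use std::process::Command;

/// Hypothetical prefix marking units created by the agent.
const UNIT_PREFIX: &str = "stackable-";

/// Return all service units whose names carry the agent prefix.
fn managed_units() -> Vec<String> {
    let output = Command::new("systemctl")
        .args(["list-units", "--type=service", "--all", "--plain", "--no-legend"])
        .output()
        .expect("failed to run systemctl");

    String::from_utf8_lossy(&output.stdout)
        .lines()
        // The unit name is the first column of the plain, legend-free output.
        .filter_map(|line| line.split_whitespace().next())
        .filter(|unit| unit.starts_with(UNIT_PREFIX))
        .map(str::to_string)
        .collect()
}

/// Stop and disable every prefixed unit that no longer belongs to a pod
/// known to the agent.
fn cleanup_orphans(known_pod_units: &HashSet<String>) {
    for unit in managed_units() {
        if !known_pod_units.contains(&unit) {
            let _ = Command::new("systemctl").args(["stop", &unit]).status();
            let _ = Command::new("systemctl").args(["disable", &unit]).status();
            // The unit file itself would also have to be deleted; its location
            // depends on where the agent writes unit files (see ADR005).
        }
    }
}
```

Grouping the services under a dedicated slice instead would serve the same purpose at the cgroup level, since the agent could then identify everything attached to that slice without relying on a naming convention.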